Chapter 4. Configuring persistent storage and inferencing the model
Configure persistent storage for AI Inference Server to store the model files before you inference the model.
Configuring persistent storage is an optional but recommended step.
Prerequisites
- You have installed a mirror registry on the bastion host.
- You have installed the Node Feature Discovery Operator and NVIDIA GPU Operator in the disconnected cluster.
Procedure
- In the disconnected OpenShift Container Platform cluster, configure persistent storage using Network File System (NFS).
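For example, the following PersistentVolume and PersistentVolumeClaim manifests are a minimal sketch that assumes an NFS server at nfs.example.com exporting /exports/models; the server address, export path, capacity, and access mode are illustrative placeholders. The claim name granite-31-w8a8 matches the claimName that the Deployment in the next step references.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: granite-31-w8a8-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: nfs.example.com   # illustrative NFS server; replace with your own
    path: /exports/models     # illustrative export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: granite-31-w8a8
  namespace: rhaiis-namespace
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""        # empty string binds to the statically provisioned PV above
  resources:
    requests:
      storage: 20Gi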
- Create a Deployment custom resource (CR). For example, the following Deployment CR uses AI Inference Server to serve a Granite model on a CUDA accelerator:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: granite
  namespace: rhaiis-namespace
  labels:
    app: granite
spec:
  replicas: 1
  selector:
    matchLabels:
      app: granite
  template:
    metadata:
      labels:
        app: granite
    spec:
      containers:
        - name: granite
          image: 'registry.redhat.io/rhaiis/vllm-cuda-rhel9@sha256:137ac606b87679c90658985ef1fc9a26a97bb11f622b988fe5125f33e6f35d78'
          imagePullPolicy: IfNotPresent
          command:
            - python
            - '-m'
            - vllm.entrypoints.openai.api_server
          args:
            - '--port=8000'
            - '--model=/mnt/models'
            - '--served-model-name=granite-3.1-2b-instruct-quantized.w8a8'
            - '--tensor-parallel-size=1'
          resources:
            limits:
              cpu: '10'
              nvidia.com/gpu: '1'
            requests:
              cpu: '2'
              memory: 6Gi
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: cache-volume
              mountPath: /mnt/models
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: cache-volume
          persistentVolumeClaim:
            claimName: granite-31-w8a8
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi
      restartPolicy: Always
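- Optional: Verify that the granite deployment rolls out and that the model server pod is running:
$ oc rollout status deployment/granite -n rhaiis-namespace
$ oc get pods -n rhaiis-namespace -l app=granite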
- Create a Service CR for the model inference. For example:
apiVersion: v1
kind: Service
metadata:
  name: granite
  namespace: rhaiis-namespace
spec:
  selector:
    app: granite
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
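- Optional: Confirm that the service endpoints resolve to the model server pod:
$ oc get endpoints granite -n rhaiis-namespace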
- Optional: Create a Route CR to enable public access to the model. For example:
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: granite
  namespace: rhaiis-namespace
spec:
  to:
    kind: Service
    name: granite
  port:
    targetPort: 80
- Get the URL for the exposed route:
$ oc get route granite -n rhaiis-namespace -o jsonpath='{.spec.host}'
Example output
granite-rhaiis-namespace.apps.example.com
- Query the model by running the following command:
$ curl -X POST http://granite-rhaiis-namespace.apps.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite-3.1-2b-instruct-quantized.w8a8",
    "messages": [{"role": "user", "content": "What is AI?"}],
    "temperature": 0.1
  }'
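The model server exposes an OpenAI-compatible API, so the response has the standard chat completion shape. The following is an illustrative sketch only; the id, message content, and token counts vary per request and are elided here:
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "granite-3.1-2b-instruct-quantized.w8a8",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "..."
      },
      "finish_reason": "stop"
    }
  ]
}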