Chapter 3. Serving inference service modelcar images with AI Inference Server on OpenShift Container Platform
Serve a modelcar container image with Red Hat AI Inference Server by deploying a language model in a modelcar container on OpenShift Container Platform, configuring a secret, persistent storage, and a `Deployment` custom resource (CR).
Prerequisites

- You have installed the OpenShift CLI (`oc`).
- You are logged in as a user with `cluster-admin` privileges.
- You have installed the Node Feature Discovery (NFD) Operator and the required GPU Operator for your underlying AI accelerator hardware.
- You have created a modelcar container image for your language model and pushed it to a container image registry.
Procedure
Create a Docker secret so that the cluster can pull the Red Hat AI Inference Server image from the container registry. For example, to create a `Secret` CR that contains the contents of your local `~/.docker/config.json` file, run the following command:

```shell
oc create secret generic docker-secret \
  --from-file=.dockercfg=$HOME/.docker/config.json \
  --type=kubernetes.io/dockercfg \
  -n rhaiis-namespace
```

Create a `PersistentVolumeClaim` (PVC) custom resource (CR) and apply it in the cluster. The following example `PVC` CR uses the default IBM VPC Block persistent volume.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: rhaiis-namespace
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: ibmc-vpc-block-10iops-tier
```

Note: Configuring cluster storage to meet your requirements is beyond the scope of this procedure. For more information, see "Configuring persistent storage".
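The 20Gi request in the example leaves headroom above the size of the model weights. As a rough sanity check before sizing your own PVC, you can estimate the on-disk weight size from the parameter count. The helper below is a sketch; the 2.5B parameter count and 2 bytes per parameter (bf16/fp16) are illustrative assumptions, not values taken from this procedure.

```python
# Rough PVC sizing check: estimate on-disk weight size from parameter count.
# The parameter count (2.5B) and bytes-per-parameter (2, for bf16/fp16)
# below are assumptions for illustration only.
def estimate_model_gib(params_billions: float, bytes_per_param: int = 2) -> float:
    """Estimate model weight size in GiB for a bf16/fp16 checkpoint."""
    return params_billions * 1e9 * bytes_per_param / 2**30

weights_gib = estimate_model_gib(2.5)
print(f"Estimated weights: {weights_gib:.1f} GiB")  # roughly 4.7 GiB
```

A PVC request of 20Gi therefore comfortably covers a model of this size plus tokenizer files, config, and any re-pull overhead.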
Create a `Deployment` custom resource (CR) that pulls the modelcar image and deploys the Red Hat AI Inference Server container. Refer to the following example `Deployment` CR, which serves a modelcar image with AI Inference Server.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rhaiis-oci-deploy
  namespace: rhaiis-namespace
  labels:
    app: granite
spec:
  replicas: 0
  selector:
    matchLabels:
      app: rhaiis-oci-deploy
  template:
    metadata:
      labels:
        app: rhaiis-oci-deploy
    spec:
      imagePullSecrets:
        - name: docker-secret
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: model-cache # 1
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "2Gi"
        - name: oci-auth
          secret:
            secretName: docker-secret
            items:
              - key: .dockercfg
                path: config.json
      initContainers: # 2
        - name: fetch-model
          image: ghcr.io/oras-project/oras:v1.2.0
          command: ["/bin/sh", "-c"]
          args:
            - |
              set -e
              # Only pull if /model is empty
              if [ -z "$(ls -A /model)" ]; then
                echo "Pulling model…"
                # Update with the modelcar container image registry URL
                oras pull <YOUR_MODELCAR_REGISTRY_URL> \
                  --output /model
              else
                echo "Model already present, skipping pull"
              fi
          volumeMounts:
            - name: model-volume
              mountPath: /model
            - name: oci-auth
              mountPath: /auth
              readOnly: true
      containers:
        - name: granite
          image: 'registry.redhat.io/rhaiis/vllm-cuda-rhel9@sha256:a6645a8e8d7928dce59542c362caf11eca94bb1b427390e78f0f8a87912041cd'
          imagePullPolicy: IfNotPresent
          env:
            - name: VLLM_SERVER_DEV_MODE
              value: '1'
          command:
            - python
            - '-m'
            - vllm.entrypoints.openai.api_server
          args:
            - '--port=8000'
            - '--model=/model'
            - '--served-model-name=ibm-granite/granite-3.1-2b-instruct' # 3
            - '--tensor-parallel-size=1'
          resources:
            limits:
              cpu: '10'
              nvidia.com/gpu: '1'
              memory: 16Gi
            requests:
              cpu: '2'
              memory: 6Gi
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: model-volume
              mountPath: /model
            - name: shm
              mountPath: /dev/shm # 4
      restartPolicy: Always
```

1. `spec.template.spec.volumes.persistentVolumeClaim.claimName` must match the name of the `PVC` that you created.
2. This example deployment uses a simple `initContainers` configuration, run before the main application container, to download the required modelcar image. If the model directory is already populated from a previous deployment, the model pull step is skipped.
3. Update the value for `--served-model-name` to match the model that you are deploying.
4. The NVIDIA Collective Communications Library (NCCL) requires the `/dev/shm` volume mount. Tensor parallel vLLM deployments fail when the `/dev/shm` volume mount is not set.
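The init container's "pull only if empty" check is what makes pod restarts cheap: because `/model` is backed by the PVC, the weights survive restarts and the pull is skipped. The following is a local sketch of that logic, with a temporary directory standing in for `/model` and a `fake_pull` stub standing in for the real `oras pull` call:

```shell
# Local sketch of the init container's idempotent pull logic.
# MODEL_DIR and fake_pull are stand-ins, not part of the deployment.
MODEL_DIR="$(mktemp -d)"

fake_pull() {
    # Stand-in for: oras pull <YOUR_MODELCAR_REGISTRY_URL> --output /model
    touch "$MODEL_DIR/weights.bin"
}

fetch_model() {
    if [ -z "$(ls -A "$MODEL_DIR")" ]; then
        echo "Pulling model..."
        fake_pull
    else
        echo "Model already present, skipping pull"
    fi
}

fetch_model   # first run: directory is empty, so it pulls
fetch_model   # second run: skips, as a restarted pod would with a warm PVC
```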
Increase the number of deployment replicas to the required count. For example, run the following command:

```shell
oc scale deployment rhaiis-oci-deploy -n rhaiis-namespace --replicas=1
```

Optional: Watch the deployment and verify that it succeeds:

```shell
$ oc get deployment -n rhaiis-namespace --watch
```

Example output

```
NAME                READY   UP-TO-DATE   AVAILABLE   AGE
rhaiis-oci-deploy   0/1     1            0           2s
rhaiis-oci-deploy   1/1     1            1           14s
```

Create a `Service` CR for model inference. For example:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: rhaiis-oci-deploy
  namespace: rhaiis-namespace
spec:
  selector:
    app: rhaiis-oci-deploy
  ports:
    - name: http
      port: 80
      targetPort: 8000
```

Optional: Create a `Route` CR to enable public access to the model. For example:

```yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: rhaiis-oci-deploy
  namespace: rhaiis-namespace
spec:
  to:
    kind: Service
    name: rhaiis-oci-deploy
  port:
    targetPort: http
```

Get the URL of the exposed route by running the following command:

```shell
$ oc get route rhaiis-oci-deploy -n rhaiis-namespace -o jsonpath='{.spec.host}'
```

Example output

```
rhaiis-oci-deploy-rhaiis-namespace.apps.example.com
```
Verification
Verify that the deployment is successful by querying the model. Run the following command, using the route URL that you retrieved:

```shell
curl -v -k http://rhaiis-oci-deploy-rhaiis-namespace.apps.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ibm-granite/granite-3.1-2b-instruct",
    "messages": [{"role": "user", "content": "Hello?"}],
    "temperature": 0.1
  }' | jq
```

Example output
```json
{
  "id": "chatcmpl-07b177360eaa40a3b311c24a8e3c7f43",
  "object": "chat.completion",
  "created": 1755189746,
  "model": "ibm-granite/granite-3.1-2b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "Hello! How can I assist you today?",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 61,
    "total_tokens": 71,
    "completion_tokens": 10,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}
```
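In a client application, the assistant's reply lives at `choices[0].message.content` of the OpenAI-compatible response. A minimal Python sketch of extracting it (the `extract_reply` helper name is my own, and the response dict below is abridged from the example output above):

```python
import json

# Abridged from the example response above; field names follow the
# OpenAI-compatible chat completions schema served by vLLM.
response = json.loads("""
{
  "model": "ibm-granite/granite-3.1-2b-instruct",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Hello! How can I assist you today?"},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 61, "total_tokens": 71, "completion_tokens": 10}
}
""")

def extract_reply(resp: dict) -> str:
    """Return the assistant text from a chat completions response."""
    return resp["choices"][0]["message"]["content"]

print(extract_reply(response))  # Hello! How can I assist you today?
```

The `usage` block is also worth checking in clients: `prompt_tokens` plus `completion_tokens` equals `total_tokens`, which is useful for tracking serving costs per request.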