2.6. Model-serving runtimes

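You can serve models on the single-model serving platform by using model-serving runtimes. The configuration of a model-serving runtime is defined by the ServingRuntime and InferenceService custom resource definitions (CRDs).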

2.6.1. ServingRuntime

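The ServingRuntime CRD creates a serving runtime, which is an environment for deploying and managing a model. It creates the templates for pods that dynamically load and unload models of various formats, and it exposes a service endpoint for inference requests.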

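The following YAML configuration is an example of the vLLM ServingRuntime for KServe model-serving runtime. The configuration includes various flags, environment variables, and command-line arguments. The numbered comments refer to the callouts that follow the listing.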

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]' # 1
    openshift.io/display-name: vLLM ServingRuntime for KServe # 2
  labels:
    opendatahub.io/dashboard: "true"
  name: vllm-runtime
spec:
  annotations:
    prometheus.io/path: /metrics # 3
    prometheus.io/port: "8080" # 4
  containers:
    - args:
        - --port=8080
        - --model=/mnt/models # 5
        - --served-model-name={{.Name}} # 6
      command: # 7
        - python
        - '-m'
        - vllm.entrypoints.openai.api_server
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      image: quay.io/modh/vllm@sha256:8a3dd8ad6e15fe7b8e5e471037519719d4d8ad3db9d69389f2beded36a6f5b21 # 8
      name: kserve-container
      ports:
        - containerPort: 8080
          protocol: TCP
  multiModel: false # 9
  supportedModelFormats: # 10
    - autoSelect: true
      name: vLLM

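1: The recommended accelerator to use with the runtime.
2: The name with which the serving runtime is displayed.
3: The endpoint that Prometheus scrapes to collect monitoring metrics.
4: The port that Prometheus scrapes to collect monitoring metrics.
5: The path where the model files are stored in the runtime container.
6: Passes the model name that is specified by the {{.Name}} template variable inside the runtime container specification to the runtime environment. The {{.Name}} variable maps to the spec.predictor.name field in the InferenceService metadata object.
7: The entrypoint command that starts the runtime container.
8: The runtime container image used by the serving runtime. This image differs depending on the type of accelerator used.
9: Specifies that the runtime is used for single-model serving.
10: Specifies the model formats supported by the runtime.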
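
The substitution described in callout 6 can be pictured with a small sketch. This is only an illustration of the idea, not how KServe implements it: the `{{.Name}}` placeholder uses Go template syntax, while the sketch below assumes a plain literal string replacement. The helper `render_args` and the model name `granite` are hypothetical and chosen for illustration.

```python
def render_args(args, name):
    """Return a copy of args with the {{.Name}} placeholder replaced by name.

    Hypothetical helper: a literal string replacement that only mimics
    the effect of the template substitution described in callout 6.
    """
    return [arg.replace("{{.Name}}", name) for arg in args]


# Container arguments as they appear in the ServingRuntime example.
runtime_args = [
    "--port=8080",
    "--model=/mnt/models",
    "--served-model-name={{.Name}}",
]

# "granite" is an assumed model name used only for this illustration.
rendered = render_args(runtime_args, "granite")
print(rendered)
```

Only the argument that contains the placeholder changes; the port and model path arguments are passed through as written.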

2.6.2. InferenceService

The InferenceService CRD creates a server or inference service that processes inference queries, passes them to the model, and then returns the inference output.

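The inference service also performs the following actions:

Specifies the location and format of the model.
Specifies the serving runtime used to serve the model.
Enables the passthrough route for gRPC or REST inference.
Defines HTTP or gRPC endpoints for the deployed model.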
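
As a rough sketch of how the first two actions map onto fields of the resource, the following Python snippet assembles a minimal InferenceService manifest as a dictionary. The helper `build_inference_service` is hypothetical, and the values passed to it are taken from the example in this section. A real manifest would normally also carry annotations, resource requests and limits, and tolerations.

```python
import json


def build_inference_service(name, model_format, runtime, storage_key, storage_path):
    """Assemble a minimal InferenceService manifest (hypothetical helper)."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                "model": {
                    # Format of the model.
                    "modelFormat": {"name": model_format},
                    # Serving runtime used to serve the model.
                    "runtime": runtime,
                    # Location of the model files.
                    "storage": {"key": storage_key, "path": storage_path},
                }
            }
        },
    }


manifest = build_inference_service(
    name="granite",
    model_format="vLLM",
    runtime="vLLM ServingRuntime for KServe",
    storage_key="aws-connection-my-storage",
    storage_path="models/granite-7b-instruct/",
)
print(json.dumps(manifest, indent=2))
```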

The following example shows an InferenceService YAML configuration file that is generated when deploying a granite model with the vLLM runtime:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: granite
    serving.knative.openshift.io/enablePassthrough: 'true'
    sidecar.istio.io/inject: 'true'
    sidecar.istio.io/rewriteAppHTTPProbers: 'true'
  name: granite
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '6'
          memory: 24Gi
          nvidia.com/gpu: '1'
        requests:
          cpu: '1'
          memory: 8Gi
          nvidia.com/gpu: '1'
      runtime: vLLM ServingRuntime for KServe
      storage:
        key: aws-connection-my-storage
        path: models/granite-7b-instruct/
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
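
Because the runtime in this section starts `vllm.entrypoints.openai.api_server`, a deployed model can typically be queried through vLLM's OpenAI-compatible REST API. The sketch below only builds a completion request and does not send it. The base URL `https://granite.example.com`, the helper `build_completion_request`, and the assumption that the served model name resolves to `granite` are all illustrative. The actual route, and any authentication token it requires, depend on your cluster.

```python
import json
import urllib.request


def build_completion_request(base_url, model_name, prompt, max_tokens=64):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {"model": model_name, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL: replace with the inference endpoint of your deployment.
request = build_completion_request(
    "https://granite.example.com", "granite", "What is KServe?"
)
print(request.get_method(), request.full_url)

# Sending the request requires a reachable endpoint, for example:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```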

Additional resources

Serving Runtimes
