3.6. Model-serving runtimes

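You can serve models on the single-model serving platform by using model-serving runtimes. The configuration of a model-serving runtime is defined by the ServingRuntime and InferenceService custom resource definitions (CRDs).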

3.6.1. ServingRuntime

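The ServingRuntime CRD creates a serving runtime, which is an environment for deploying and managing a model. It creates the templates for pods that dynamically load and unload models of various formats, and it exposes a service endpoint for inference requests.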

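The following YAML configuration is an example of the vLLM ServingRuntime for KServe model-serving runtime. The configuration includes various flags, environment variables, and command-line arguments.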

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'  # 1
    openshift.io/display-name: vLLM ServingRuntime for KServe  # 2
  labels:
    opendatahub.io/dashboard: "true"
  name: vllm-runtime
spec:
  annotations:
    prometheus.io/path: /metrics  # 3
    prometheus.io/port: "8080"  # 4
  containers:
    - args:
        - --port=8080
        - --model=/mnt/models  # 5
        - --served-model-name={{.Name}}  # 6
      command:  # 7
        - python
        - '-m'
        - vllm.entrypoints.openai.api_server
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      image: quay.io/modh/vllm@sha256:8a3dd8ad6e15fe7b8e5e471037519719d4d8ad3db9d69389f2beded36a6f5b21  # 8
      name: kserve-container
      ports:
        - containerPort: 8080
          protocol: TCP
  multiModel: false  # 9
  supportedModelFormats:  # 10
    - autoSelect: true
      name: vLLM

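1: The recommended accelerator to use with the runtime.
2: The name with which the serving runtime is displayed.
3: The endpoint that Prometheus uses to scrape metrics for monitoring.
4: The port that Prometheus uses to scrape metrics for monitoring.
5: The path where the model files are stored in the runtime container.
6: Passes the model name that is specified by the {{.Name}} template variable in the runtime container specification to the runtime environment. The {{.Name}} variable maps to the spec.predictor.name field in the InferenceService metadata object.
7: The entrypoint command that starts the runtime container.
8: The runtime container image used by the serving runtime. This image differs depending on the type of accelerator used.
9: Specifies that the runtime is used for single-model serving.
10: Specifies the model formats supported by the runtime.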
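The effect of the {{.Name}} placeholder described in callout 6 can be illustrated with a small sketch. This is not how KServe implements the substitution; it only imitates the result with plain string replacement. The helper name render_args and the model name "granite" are assumptions chosen for illustration.

```python
# Illustrative only: imitates the outcome of the {{.Name}} substitution.
# render_args is a hypothetical helper, not part of KServe or vLLM.

def render_args(args, name):
    """Replace the {{.Name}} placeholder in each argument with the model name."""
    return [arg.replace("{{.Name}}", name) for arg in args]


# Arguments copied from the ServingRuntime example above.
runtime_args = [
    "--port=8080",
    "--model=/mnt/models",
    "--served-model-name={{.Name}}",
]

# "granite" is an assumed model name used only for this sketch.
rendered = render_args(runtime_args, "granite")
print(rendered)
```

With these inputs, only the last argument changes; the port and model path contain no placeholder and pass through untouched.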

3.6.2. InferenceService

The InferenceService CRD creates a server, or inference service, that receives inference queries, passes them to the model, and then returns the inference output.

The inference service also performs the following actions:

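Specifies the location and format of the model.
Specifies the serving runtime used to serve the model.
Enables the passthrough route for gRPC or REST inference.
Defines HTTP or gRPC endpoints for the deployed model.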

The following example shows the InferenceService YAML configuration file that is generated when deploying a granite model with the vLLM runtime:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: granite
    serving.knative.openshift.io/enablePassthrough: 'true'
    sidecar.istio.io/inject: 'true'
    sidecar.istio.io/rewriteAppHTTPProbers: 'true'
  name: granite
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '6'
          memory: 24Gi
          nvidia.com/gpu: '1'
        requests:
          cpu: '1'
          memory: 8Gi
          nvidia.com/gpu: '1'
      runtime: vLLM ServingRuntime for KServe
      storage:
        key: aws-connection-my-storage
        path: models/granite-7b-instruct/
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
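Once such an inference service is running, a client can send REST requests to its HTTP endpoint. Because the runtime in section 3.6.1 starts vllm.entrypoints.openai.api_server, the sketch below assumes the OpenAI-style /v1/completions path. The base URL https://granite.example.com and the helper build_completion_request are hypothetical, and the sketch assumes that the served model name resolves to "granite". The request is only constructed, not sent, so nothing here depends on a live cluster.

```python
import json
import urllib.request


def build_completion_request(base_url, model_name, prompt, max_tokens=64):
    """Build (but do not send) a POST request for a text completion.

    base_url is a placeholder for the route of the inference service;
    model_name is assumed to match the --served-model-name value.
    """
    payload = {"model": model_name, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint and prompt, for illustration only.
request = build_completion_request(
    "https://granite.example.com", "granite", "What is KServe?"
)
print(request.full_url)
print(request.get_method())
```

To send the request against a real deployment, pass it to urllib.request.urlopen, replacing the placeholder URL with the actual route and adding whatever authentication the cluster requires.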
