2.17. 优化模型运行时

您可以选择增强 OpenShift AI 中提供的预装的模型服务运行时，以利用额外的优点和功能，如优化延迟、缩短延迟和调优资源分配。

2.17.1. 启用 speculative decoding 和 multi-modal inferencing
复制链接

您可以为 KServe 运行时配置 vLLM NVIDIA GPU ServingRuntime，以使用规范解码（并行处理技术）优化大型语言模型(LLM)的推断时间。

您还可以配置运行时来支持对版本语言模型(VLM)的推断。VLM 是多模式模型的子集，可集成视觉和文本数据。

以下流程描述了为 speculative decoding 和 multi-modal inferencing 自定义 vLLM NVIDIA GPU ServingRuntime for KServe 运行时。

先决条件

您已以具有 OpenShift AI 管理员特权的用户身份登录到 OpenShift AI。
如果您使用 vLLM 模型-serving 运行时与草案模型进行规范解码，您已将原始模型和规范模型存储在兼容 S3 对象存储中的同一文件夹中。

流程

按照以下步骤部署模型，如在单模式服务平台上部署模型中所述。
在 Serving runtime 字段中，为 KServe 运行时选择 vLLM NVIDIA GPU ServingRuntime。
要通过与提示中的 ngrams 匹配 ngrams 来配置 vLLM 模型-serving 运行时，请在 Configuration parameters 部分的 Additional serving runtime 参数 中添加以下参数：
```
--speculative-model=[ngram]
--num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
--ngram-prompt-lookup-max=<NGRAM_PROMPT_LOOKUP_MAX>
--use-v2-block-manager
```
```
--speculative-model=[ngram]
--num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
--ngram-prompt-lookup-max=<NGRAM_PROMPT_LOOKUP_MAX>
--use-v2-block-manager
```
Copy to Clipboard Toggle word wrap
1. 将 <NUM_SPECULATIVE_TOKENS&gt; 和 <NGRAM_PROMPT_LOOKUP_MAX > 替换为您自己的值。
  注意
  推断吞吐量因用于使用 n-gram 推测的模型而异。

要使用草案模型为 speculative decoding 配置 vLLM 模型- serving 运行时，请在 配置参数 部分的附加服务 运行时参数下添加以下参数：

--port=8080
--served-model-name={{.Name}}
--distributed-executor-backend=mp
--model=/mnt/models/<path_to_original_model>
--speculative-model=/mnt/models/<path_to_speculative_model>
--num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
--use-v2-block-manager

--port=8080
--served-model-name={{.Name}}
--distributed-executor-backend=mp
--model=/mnt/models/<path_to_original_model>
--speculative-model=/mnt/models/<path_to_speculative_model>
--num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
--use-v2-block-manager

Copy to Clipboard

Toggle word wrap

将 <path_to_speculative_model & gt; 和 <path_to_original_model > 替换为兼容 S3 对象存储上的 speculative 模型和原始模型的路径。
将 <NUM_SPECULATIVE_TOKENS > 替换为您自己的值。

要为多模式推断配置 vLLM 模型运行时，请在 Configuration parameters 部分中的 Additional service runtime 参数 中添加以下参数：
```
--trust-remote-code
```
```
--trust-remote-code
```
Copy to Clipboard Toggle word wrap
注意
仅使用来自可信源的模型的 --trust-remote-code 参数。
点 Deploy。

验证

如果您为 speculative decoding 配置了 vLLM 模型-serving 运行时，请使用以下示例命令验证部署的模型的 API 请求：

curl -v https://<inference_endpoint_url>:443/v1/chat/completions
-H "Content-Type: application/json"
-H "Authorization: Bearer <token>"

curl -v https://<inference_endpoint_url>:443/v1/chat/completions
-H "Content-Type: application/json"
-H "Authorization: Bearer <token>"

Copy to Clipboard

Toggle word wrap

如果您已经为多模式情况配置了 vLLM 模型定义运行时，请使用以下示例命令验证您部署的 vision-language 模型(VLM)的 API 请求：

curl -v https://<inference_endpoint_url>:443/v1/chat/completions
-H "Content-Type: application/json"
-H "Authorization: Bearer <token>"
-d '{"model":"<model_name>",
     "messages":
        [{"role":"<role>",
          "content":
             [{"type":"text", "text":"<text>"
              },
              {"type":"image_url", "image_url":"<image_url_link>"
              }
             ]
         }
        ]
    }'

curl -v https://<inference_endpoint_url>:443/v1/chat/completions
-H "Content-Type: application/json"
-H "Authorization: Bearer <token>"
-d '{"model":"<model_name>",
     "messages":
        [{"role":"<role>",
          "content":
             [{"type":"text", "text":"<text>"
              },
              {"type":"image_url", "image_url":"<image_url_link>"
              }
             ]
         }
        ]
    }'

Copy to Clipboard

Toggle word wrap

返回顶部

2.17. 优化模型运行时

学习

尝试、购买和销售

社区

关于红帽文档

让开源更具包容性

關於紅帽

Theme

Red Hat legal and privacy links

Red Hat legal and privacy links

2.17. 优化模型运行时

2.17.1. 启用 speculative decoding 和 multi-modal inferencing复制链接链接已复制到粘贴板!

学习

尝试、购买和销售

社区

关于红帽文档

让开源更具包容性

關於紅帽

Theme

Red Hat legal and privacy links

Red Hat legal and privacy links

2.17.1. 启用 speculative decoding 和 multi-modal inferencing
复制链接