2.9. inference 端点

2.9.1. Caikit TGIS ServingRuntime for KServe
复制链接

:443/api/v1/task/text-generation
:443/api/v1/task/server-streaming-text-generation

示例命令

curl --json '{"model_id": "<model_name__>", "inputs": "<text>"}' https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation -H 'Authorization: Bearer <token>'

curl --json '{"model_id": "<model_name__>", "inputs": "<text>"}' https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation -H 'Authorization: Bearer <token>'

Copy to Clipboard

Toggle word wrap

2.9.2. Caikit Standalone ServingRuntime for KServe
复制链接

如果您为多个模型提供服务，您可以查询 /info/models 或 :443 caikit.runtime.info.InfoService/GetModelsInfo 来查看服务模型列表。

REST 端点

/api/v1/task/embedding
/api/v1/task/embedding-tasks
/api/v1/task/sentence-similarity
/api/v1/task/sentence-similarity-tasks
/api/v1/task/rerank
/api/v1/task/rerank-tasks
/info/models
/info/version
/info/runtime

gRPC 端点

:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict
:443 caikit.runtime.Nlp.NlpService/EmbeddingTasksPredict
:443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTaskPredict
:443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTasksPredict
:443 caikit.runtime.Nlp.NlpService/RerankTaskPredict
:443 caikit.runtime.Nlp.NlpService/RerankTasksPredict
:443 caikit.runtime.info.InfoService/GetModelsInfo
:443 caikit.runtime.info.InfoService/GetRuntimeInfo

注意

默认情况下，Caikit 独立运行时会公开 REST 端点。要使用 gRPC 协议，请手动部署自定义 Caikit Standalone ServingRuntime。如需更多信息，请参阅为单模型服务平台添加自定义模型运行时。

caikit-tgis-serving GitHub 存储库中提供了一个示例清单。

REST

curl -H 'Content-Type: application/json' -d '{"inputs": "<text>", "model_id": "<model_id>"}' <inference_endpoint_url>/api/v1/task/embedding -H 'Authorization: Bearer <token>'

curl -H 'Content-Type: application/json' -d '{"inputs": "<text>", "model_id": "<model_id>"}' <inference_endpoint_url>/api/v1/task/embedding -H 'Authorization: Bearer <token>'

Copy to Clipboard

Toggle word wrap

gRPC

grpcurl -d '{"text": "<text>"}' -H \"mm-model-id: <model_id>\" <inference_endpoint_url>:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict -H 'Authorization: Bearer <token>'

grpcurl -d '{"text": "<text>"}' -H \"mm-model-id: <model_id>\" <inference_endpoint_url>:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict -H 'Authorization: Bearer <token>'

Copy to Clipboard

Toggle word wrap

2.9.3. TGIS Standalone ServingRuntime for KServe
复制链接

重要

KServe 文本 Generation Inference Server (TGIS) Standalone ServingRuntime 已弃用。如需更多信息，请参阅 OpenShift AI 发行注记。

:443 fmaas.GenerationService/Generate
:443 fmaas.GenerationService/GenerateStream
注意
要查询 TGIS 独立运行时的端点，还必须在 OpenShift AI text-generation-inference 存储库的 proto 目录中下载文件。

示例命令

grpcurl -proto text-generation-inference/proto/generation.proto -d '{"requests": [{"text":"<text>"}]}' -H 'Authorization: Bearer <token>' -insecure <inference_endpoint_url>:443 fmaas.GenerationService/Generate

grpcurl -proto text-generation-inference/proto/generation.proto -d '{"requests": [{"text":"<text>"}]}' -H 'Authorization: Bearer <token>' -insecure <inference_endpoint_url>:443 fmaas.GenerationService/Generate

Copy to Clipboard

Toggle word wrap

2.9.4. OpenVINO Model Server
复制链接

/v2/models/<model-name>/infer

示例命令

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'

Copy to Clipboard

Toggle word wrap

2.9.5. vLLM NVIDIA GPU ServingRuntime for KServe
复制链接

:443/version
:443/docs
:443/v1/models
:443/v1/chat/completions
:443/v1/completions
:443/v1/embeddings
:443/tokenize
:443/detokenize
注意
- vLLM 运行时与 OpenAI REST API 兼容。有关 vLLM 运行时支持的模型列表，请参阅支持的模型。
- 要在 vLLM 中使用 embeddings inference 端点，您必须使用 vLLM 支持的嵌入式模型。您不能将 embeddings 端点与 generative 模型搭配使用。如需更多信息，请参阅 vLLM 中支持的嵌入模型。
- 从 vLLM v0.5.5 开始，您必须在使用 /v1/chat/completions 端点查询模型时提供 chat 模板。如果您的模型不包含预定义的 chat 模板，您可以使用 chat-template 命令行参数在自定义 vLLM 运行时中指定 chat 模板，如示例所示。将 <CHAT_TEMPLATE > 替换为模板的路径。
  
  containers: - args: - --chat-template=<CHAT_TEMPLATE>
  
  Copy to Clipboard Toggle word wrap
  
  您可以在此处使用 .jinja 文件提供的 chat 模板，或者使用 /app/data/template 下的 vLLM 镜像。https://github.com/opendatahub-io/vllm/tree/main/examples如需更多信息，请参阅 Chat 模板。
如所示的路径所示，single-model 服务平台使用 OpenShift 路由器的 HTTPS 端口（通常是端口 443）来提供外部 API 请求。

示例命令

curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H "Content-Type: application/json" -d '{ "messages": [{ "role": "<role>", "content": "<content>" }] -H 'Authorization: Bearer <token>'

curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H "Content-Type: application/json" -d '{ "messages": [{ "role": "<role>", "content": "<content>" }] -H 'Authorization: Bearer <token>'

Copy to Clipboard

Toggle word wrap

2.9.6. vLLM Intel Gaudi Accelerator ServingRuntime for KServe
复制链接

对于 KServe，请参阅 vLLM NVIDIA GPU ServingRuntime。

2.9.7. vLLM AMD GPU ServingRuntime for KServe
复制链接

对于 KServe，请参阅 vLLM NVIDIA GPU ServingRuntime。

2.9.8. NVIDIA Triton Inference Server
复制链接

REST 端点

v2/models/[/versions/<model_version>]/infer
v2/models/<model_name>[/versions/<model_version>]
v2/health/ready
v2/health/live
v2/models/<model_name>[/versions/]/ready
v2

注意

ModelMesh 不支持以下 REST 端点：

v2/health/live
v2/health/ready
v2/models/<model_name>[/versions/]/ready

示例命令

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'

Copy to Clipboard

Toggle word wrap

gRPC 端点

:443 inference.GRPCInferenceService/ModelInfer
:443 inference.GRPCInferenceService/ModelReady
:443 inference.GRPCInferenceService/ModelMetadata
:443 inference.GRPCInferenceService/ServerReady
:443 inference.GRPCInferenceService/ServerLive
:443 inference.GRPCInferenceService/ServerMetadata

示例命令

grpcurl -cacert ./openshift_ca_istio_knative.crt -proto ./grpc_predict_v2.proto -d @ -H "Authorization: Bearer <token>" <inference_endpoint_url>:443 inference.GRPCInferenceService/ModelMetadata

grpcurl -cacert ./openshift_ca_istio_knative.crt -proto ./grpc_predict_v2.proto -d @ -H "Authorization: Bearer <token>" <inference_endpoint_url>:443 inference.GRPCInferenceService/ModelMetadata

Copy to Clipboard

Toggle word wrap

2.9.9. Seldon MLServer
复制链接

REST 端点

v2/models/[/versions/<model_version>]/infer
v2/models/<model_name>[/versions/<model_version>]
v2/health/ready
v2/health/live
v2/models/<model_name>[/versions/]/ready
v2

示例命令

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'

curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'

Copy to Clipboard

Toggle word wrap

gRPC 端点

:443 inference.GRPCInferenceService/ModelInfer
:443 inference.GRPCInferenceService/ModelReady
:443 inference.GRPCInferenceService/ModelMetadata
:443 inference.GRPCInferenceService/ServerReady
:443 inference.GRPCInferenceService/ServerLive
:443 inference.GRPCInferenceService/ServerMetadata

示例命令

grpcurl -cacert ./openshift_ca_istio_knative.crt -proto ./grpc_predict_v2.proto -d @ -H "Authorization: Bearer <token>" <inference_endpoint_url>:443 inference.GRPCInferenceService/ModelMetadata

grpcurl -cacert ./openshift_ca_istio_knative.crt -proto ./grpc_predict_v2.proto -d @ -H "Authorization: Bearer <token>" <inference_endpoint_url>:443 inference.GRPCInferenceService/ModelMetadata

Copy to Clipboard

Toggle word wrap

2.9.1. Caikit TGIS ServingRuntime for KServe
复制链接

2.9.2. Caikit Standalone ServingRuntime for KServe
复制链接

2.9.3. TGIS Standalone ServingRuntime for KServe
复制链接

2.9.4. OpenVINO Model Server
复制链接

2.9.5. vLLM NVIDIA GPU ServingRuntime for KServe
复制链接

2.9.6. vLLM Intel Gaudi Accelerator ServingRuntime for KServe
复制链接

2.9.7. vLLM AMD GPU ServingRuntime for KServe
复制链接

2.9.8. NVIDIA Triton Inference Server
复制链接

2.9.9. Seldon MLServer
复制链接

学习

尝试、购买和销售

社区

关于红帽文档

让开源更具包容性

關於紅帽

Theme

Red Hat legal and privacy links

Red Hat legal and privacy links

2.9. inference 端点

2.9.1. Caikit TGIS ServingRuntime for KServe复制链接链接已复制到粘贴板!

2.9.2. Caikit Standalone ServingRuntime for KServe复制链接链接已复制到粘贴板!

2.9.3. TGIS Standalone ServingRuntime for KServe复制链接链接已复制到粘贴板!

2.9.4. OpenVINO Model Server复制链接链接已复制到粘贴板!

2.9.5. vLLM NVIDIA GPU ServingRuntime for KServe复制链接链接已复制到粘贴板!

2.9.6. vLLM Intel Gaudi Accelerator ServingRuntime for KServe复制链接链接已复制到粘贴板!

2.9.7. vLLM AMD GPU ServingRuntime for KServe复制链接链接已复制到粘贴板!

2.9.8. NVIDIA Triton Inference Server复制链接链接已复制到粘贴板!

2.9.9. Seldon MLServer复制链接链接已复制到粘贴板!

学习

尝试、购买和销售

社区

关于红帽文档

让开源更具包容性

關於紅帽

Theme

Red Hat legal and privacy links

Red Hat legal and privacy links

2.9.1. Caikit TGIS ServingRuntime for KServe
复制链接

2.9.2. Caikit Standalone ServingRuntime for KServe
复制链接

2.9.3. TGIS Standalone ServingRuntime for KServe
复制链接

2.9.4. OpenVINO Model Server
复制链接

2.9.5. vLLM NVIDIA GPU ServingRuntime for KServe
复制链接

2.9.6. vLLM Intel Gaudi Accelerator ServingRuntime for KServe
复制链接

2.9.7. vLLM AMD GPU ServingRuntime for KServe
复制链接

2.9.8. NVIDIA Triton Inference Server
复制链接

2.9.9. Seldon MLServer
复制链接