8.7. Common problems with distributed workloads for administrators

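8.7.1. A user's Ray cluster is in a suspended state

Problem

The resource quota specified in the cluster queue configuration might be insufficient, or the resource flavor might not yet be created.

Diagnosis

The user's Ray cluster head pod or worker pods remain in a suspended state. Check the status of the Workloads resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for the suspended state, as shown in the following example: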

status:
 conditions:
   - lastTransitionTime: '2024-05-29T13:05:09Z'
     message: 'couldn''t assign flavors to pod set small-group-jobtest12: insufficient quota for nvidia.com/gpu in flavor default-flavor in ClusterQueue'

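Resolution

Check whether the resource flavor is created, as follows:
1. In the OpenShift console, select the user's project from the Project list.
2. Click Home → Search, and from the Resources list, select ResourceFlavor.
3. If necessary, create the resource flavor.
Check the cluster queue configuration in the user's code, to ensure that the resources that they requested are within the limits defined for the project.
If necessary, increase the resource quota.

For information about configuring resource flavors and quotas, see Configuring quota management for distributed workloads.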
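
The condition message can also be read programmatically. The following is a minimal Python sketch, assuming that the Workload resource has already been retrieved and parsed into a dictionary (for example, from the JSON form of the resource). The function name suspension_reason and the sample data are illustrative only and are not part of any product API.

```python
from typing import Optional


def suspension_reason(workload: dict) -> Optional[str]:
    """Return the first non-empty status.conditions message, or None."""
    conditions = workload.get("status", {}).get("conditions", [])
    for condition in conditions:
        message = condition.get("message")
        if message:
            return message
    return None


# Sample data that mirrors the status excerpt in the Diagnosis section.
workload = {
    "status": {
        "conditions": [
            {
                "lastTransitionTime": "2024-05-29T13:05:09Z",
                "message": (
                    "couldn't assign flavors to pod set small-group-jobtest12: "
                    "insufficient quota for nvidia.com/gpu in flavor "
                    "default-flavor in ClusterQueue"
                ),
            }
        ]
    }
}

print(suspension_reason(workload))
```

A message that mentions insufficient quota points to the quota steps in the resolution, whereas a message about a missing flavor points to the resource flavor check.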

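8.7.2. A user's Ray cluster is in a failed state

Problem

The user might have insufficient resources.

Diagnosis

The user's Ray cluster head pod or worker pods are not running. When a Ray cluster is created, it initially enters a failed state. This failed state usually resolves after the reconciliation process completes and the Ray cluster pods are running.

Resolution

If the failed state persists, complete the following steps:

1. In the OpenShift console, select the user's project from the Project list.
2. Click Workloads → Pods.
3. Click the user's pod name to open the pod details page.
4. Click the Events tab, and review the pod events to identify the cause of the problem.
5. Check the status of the Workloads resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for the failed state.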

8.7.3. A user receives a "failed to call webhook" error message for the CodeFlare Operator

Problem

After the user runs the cluster.up() command, the following error is shown:

ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\""}]},"code":500}

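Diagnosis

The CodeFlare Operator pod might not be running.

Resolution

1. In the OpenShift console, select the user's project from the Project list.
2. Click Workloads → Pods.
3. Verify that the CodeFlare Operator pod is running. If necessary, restart the CodeFlare Operator pod.
4. Review the logs for the CodeFlare Operator pod to verify that the webhook server is serving, as shown in the following example: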

INFO	controller-runtime.webhook	  Serving webhook server	{"host": "", "port": 9443}

8.7.4. A user receives a "failed to call webhook" error message for Kueue

Problem

After the user runs the cluster.up() command, the following error is shown:

ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\""}]},"code":500}

Diagnosis

The Kueue pod might not be running.

Resolution

1. In the OpenShift console, select the user's project from the Project list.
2. Click Workloads → Pods.
3. Verify that the Kueue pod is running. If necessary, restart the Kueue pod.
4. Review the logs for the Kueue pod to verify that the webhook server is serving, as shown in the following example:

{"level":"info","ts":"2024-06-24T14:36:24.255137871Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:242","msg":"Serving webhook server","host":"","port":9443}
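
When the logs are long, the check for the webhook server message can be scripted. The following is a minimal Python sketch, assuming that the pod logs have already been saved as text. It accepts both the JSON log lines emitted by Kueue and the tab-separated lines emitted by the CodeFlare Operator. The function name webhook_is_serving is hypothetical.

```python
import json


def webhook_is_serving(log_text: str) -> bool:
    """Return True if any log line reports 'Serving webhook server'."""
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            entry = None
        if isinstance(entry, dict):
            # Structured (JSON) log line: compare the msg field exactly.
            if entry.get("msg") == "Serving webhook server":
                return True
        elif "Serving webhook server" in line:
            # Plain-text log line: fall back to a substring match.
            return True
    return False


kueue_log = (
    '{"level":"info","logger":"controller-runtime.webhook",'
    '"msg":"Serving webhook server","host":"","port":9443}'
)
codeflare_log = (
    'INFO\tcontroller-runtime.webhook\t  Serving webhook server\t'
    '{"host": "", "port": 9443}'
)

print(webhook_is_serving(kueue_log), webhook_is_serving(codeflare_log))
```

If the function returns False for the current logs of a running pod, the webhook server has probably not started, and restarting the pod is a reasonable next step.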

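8.7.5. A user's Ray cluster does not start

Problem

After the user runs the cluster.up() command, when they run either the cluster.details() command or the cluster.status() command, the Ray cluster status remains as Starting instead of changing to Ready. No pods are created.

Diagnosis

Check the status of the Workloads resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for remaining in the Starting state. Similarly, check the status.conditions.message field for the RayCluster resource.

Resolution

1. In the OpenShift console, select the user's project from the Project list.
2. Click Workloads → Pods.
3. Verify that the KubeRay pod is running. If necessary, restart the KubeRay pod.
4. Review the logs for the KubeRay pod to identify errors.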

8.7.6. A user receives a Default Local Queue … not found error message

Problem

After the user runs the cluster.up() command, the following error is shown:

Default Local Queue with kueue.x-k8s.io/default-queue: true annotation not found please create a default Local Queue or provide the local_queue name in Cluster Configuration.

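Diagnosis

No default local queue is defined, and a local queue is not specified in the cluster configuration.

Resolution

Check whether a local queue exists in the user's project, as follows:
1. In the OpenShift console, select the user's project from the Project list.
2. Click Home → Search, and from the Resources list, select LocalQueue.
3. If no local queues are found, create a local queue.
4. Provide the user with the details of the local queues in their project, and advise them to add a local queue to their cluster configuration.
Define a default local queue.
For information about creating a local queue and defining a default local queue, see Configuring quota management for distributed workloads.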
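
As an illustration of what a default local queue looks like, the following Python sketch builds a LocalQueue manifest that carries the kueue.x-k8s.io/default-queue annotation named in the error message. It assumes that the cluster serves the kueue.x-k8s.io/v1beta1 API version and that the referenced cluster queue already exists. The queue, project, and cluster queue names are placeholders.

```python
import json


def default_local_queue(name: str, namespace: str, cluster_queue: str) -> dict:
    """Build a LocalQueue manifest that is marked as the project's default."""
    return {
        "apiVersion": "kueue.x-k8s.io/v1beta1",  # assumed API version
        "kind": "LocalQueue",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Annotation named in the error message; the value is a string.
            "annotations": {"kueue.x-k8s.io/default-queue": "true"},
        },
        # The cluster queue must already exist; this name is a placeholder.
        "spec": {"clusterQueue": cluster_queue},
    }


manifest = default_local_queue(
    "local-queue-test", "regularuser-project", "cluster-queue"
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with oc apply -f, or used as a reference when creating the resource in the console.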

8.7.7. A user receives a local_queue provided does not exist error message

Problem

After the user runs the cluster.up() command, the following error is shown:

local_queue provided does not exist or is not in this namespace. Please provide the correct local_queue name in Cluster Configuration.

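Diagnosis

An incorrect value is specified for the local queue in the cluster configuration, or an incorrect default local queue is defined. The specified local queue either does not exist, or exists in a different namespace.

Resolution

1. In the OpenShift console, select the user's project from the Project list.
2. Click Search, and from the Resources list, select LocalQueue.
3. Resolve the problem in one of the following ways:
  - If no local queues are found, create a local queue.
  - If one or more local queues are found, provide the user with the details of the local queues in their project. Advise the user to ensure that they spelled the local queue name correctly in their cluster configuration, and that the namespace value in the cluster configuration matches their project name. If the user does not specify a namespace value in the cluster configuration, the Ray cluster is created in the current project.
4. Define a default local queue.
  For information about creating a local queue and defining a default local queue, see Configuring quota management for distributed workloads.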
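
The three cases in the diagnosis (no queue at all, a queue in a different namespace, or a misspelled name) can be told apart with a small check. The following Python sketch is illustrative only: it assumes that the local queue names have already been collected per namespace, and the function name and the input shape are hypothetical.

```python
from typing import Dict, List


def diagnose_local_queue(
    requested_queue: str,
    target_namespace: str,
    queues_by_namespace: Dict[str, List[str]],
) -> str:
    """Explain why a requested local queue cannot be found in a namespace."""
    if requested_queue in queues_by_namespace.get(target_namespace, []):
        return "ok"
    # Look for the same queue name in any other namespace.
    elsewhere = sorted(
        namespace
        for namespace, queues in queues_by_namespace.items()
        if namespace != target_namespace and requested_queue in queues
    )
    if elsewhere:
        return "queue exists in a different namespace: " + ", ".join(elsewhere)
    if not queues_by_namespace.get(target_namespace):
        return "no local queues in the namespace: create one"
    return "queue name not found: check the spelling"


# Placeholder data: project names and queue names are examples only.
queues = {
    "regularuser-project": ["local-queue-test"],
    "other-project": ["team-queue"],
}

print(diagnose_local_queue("team-queue", "regularuser-project", queues))
```

Here target_namespace stands for the namespace value in the user's cluster configuration, or the current project when no namespace value is specified.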

8.7.8. A user cannot create a Ray cluster or submit jobs

Problem

After the user runs the cluster.up() command, an error similar to the following text is shown:

RuntimeError: Failed to get RayCluster CustomResourceDefinition: (403)
Reason: Forbidden
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"rayclusters.ray.io is forbidden: User \"system:serviceaccount:regularuser-project:regularuser-workbench\" cannot list resource \"rayclusters\" in API group \"ray.io\" in the namespace \"regularuser-project\"","reason":"Forbidden","details":{"group":"ray.io","kind":"rayclusters"},"code":403}

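Diagnosis

The correct OpenShift login credentials are not specified in the TokenAuthentication section of the user's notebook code.

Resolution

Advise the user to identify and specify the correct OpenShift login credentials, as follows:
1. In the OpenShift console header, click your username and then click Copy login command.
2. In the new tab that opens, log in as the user whose credentials you want to use.
3. Click Display Token.
4. From the Log in with this token section, copy the token and server values.
5. Specify the copied token and server values in the notebook code, as follows:
  from codeflare_sdk import TokenAuthentication

  # Paste the token and server values copied from the Display Token page.
  auth = TokenAuthentication(
      token="<token>",
      server="<server>",
      skip_tls=False,  # keep TLS certificate verification enabled
  )
  auth.login()
Verify that the user has the correct permissions and is part of the rhoai-users group.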
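
The 403 response body names the identity that made the request. Kubernetes reports service accounts as system:serviceaccount:<namespace>:<name>, so a request made as the workbench service account, as in the example error, may suggest that the notebook did not authenticate with the user's own token. The following Python sketch extracts that identity from the response body. The function name is hypothetical, and the sample body is a shortened copy of the error shown in the Problem section.

```python
import json
import re
from typing import Optional, Tuple

# Kubernetes names service account users as
# system:serviceaccount:<namespace>:<name>.
SERVICE_ACCOUNT = re.compile(r'User "system:serviceaccount:([^:"]+):([^:"]+)"')


def service_account_in_error(response_body: str) -> Optional[Tuple[str, str]]:
    """Return (namespace, name) if the error was raised for a service account."""
    message = json.loads(response_body).get("message", "")
    match = SERVICE_ACCOUNT.search(message)
    if match is None:
        return None
    return match.group(1), match.group(2)


body = json.dumps({
    "kind": "Status",
    "status": "Failure",
    "message": (
        'rayclusters.ray.io is forbidden: User '
        '"system:serviceaccount:regularuser-project:regularuser-workbench" '
        'cannot list resource "rayclusters" in API group "ray.io" '
        'in the namespace "regularuser-project"'
    ),
    "reason": "Forbidden",
    "code": 403,
})

print(service_account_in_error(body))
```

If the function returns None, the request was made as a regular user, and the group membership check in the last resolution step is the more likely place to look.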

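8.7.9. The user's pod provisioned by Kueue is terminated before the image is pulled

Problem

Kueue waits for a period of time before marking a workload as ready, to enable all of the workload pods to become provisioned and running. By default, Kueue waits for 5 minutes. If the pod image is very large and is still being pulled after the 5-minute waiting period elapses, Kueue fails the workload and terminates the related pods.

Diagnosis

1. In the OpenShift console, select the user's project from the Project list.
2. Click Workloads → Pods.
3. Click the user's pod name to open the pod details page.
4. Click the Events tab, and review the pod events to check whether the image pull completed successfully.

Resolution

If the pod takes more than 5 minutes to pull the image, resolve the problem in one of the following ways:

  - Add an OnFailure restart policy for resources that are managed by Kueue.
  - In the redhat-ods-applications namespace, edit the kueue-manager-config ConfigMap to set a custom timeout for the waitForPodsReady property. For more information about this configuration option, see Enabling waitForPodsReady in the Kueue documentation.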
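
To judge whether a longer timeout is needed, the image pull time can be estimated from the pod events. The following Python sketch is a rough illustration: it assumes that the events have been reduced to dictionaries with reason and timestamp keys, which is a simplified, hypothetical shape rather than the Kubernetes Event schema, and that image pulls appear with the reasons Pulling and Pulled, as the kubelet typically reports them.

```python
from datetime import datetime, timedelta
from typing import List, Optional

# Default waiting period described in the Problem section.
KUEUE_DEFAULT_WAIT = timedelta(minutes=5)


def pull_duration(events: List[dict]) -> Optional[timedelta]:
    """Return the time from the first 'Pulling' to the last 'Pulled' event."""

    def times(reason: str) -> List[datetime]:
        return [
            datetime.fromisoformat(event["timestamp"])
            for event in events
            if event["reason"] == reason
        ]

    started, finished = times("Pulling"), times("Pulled")
    if not started or not finished:
        return None  # the pull never started or has not completed
    return max(finished) - min(started)


# Sample events with made-up timestamps.
events = [
    {"reason": "Scheduled", "timestamp": "2024-06-24T14:30:00+00:00"},
    {"reason": "Pulling", "timestamp": "2024-06-24T14:30:05+00:00"},
    {"reason": "Pulled", "timestamp": "2024-06-24T14:37:35+00:00"},
]

duration = pull_duration(events)
print(duration, duration > KUEUE_DEFAULT_WAIT)
```

A duration above the default waiting period suggests that the waitForPodsReady timeout should be raised to something comfortably longer than the observed pull time.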
