
Chapter 7. Known issues


This section describes known issues in Red Hat OpenShift AI 3.4 EA1 and EA2 and any known methods of working around these issues.

7.1. Issues discovered at version 3.4 EA2

RHOAIENG-58765 - Distributed inference with llm-d prefill and decode disaggregation fails on FIPS-enabled clusters

Using distributed inference with llm-d prefill and decode disaggregation for LLM deployments on FIPS-enabled clusters causes the routing sidecar pod to enter a crash loop, preventing the LLM deployment from functioning. This issue is caused by a runtime image introduced in the 3.4 EA2 release that is not FIPS-compatible.

Workaround
Do not use prefill and decode disaggregation with llm-d in Red Hat OpenShift AI 3.4 EA2 on FIPS-enabled clusters. Other features continue to work correctly on FIPS-enabled clusters.

RHOAIENG-3816 - Encrypted PDF uploads to Llama Stack vector stores fail on FIPS-enabled clusters

On FIPS-enabled clusters, registering certain encrypted PDF files into Llama Stack vector stores fails due to a limitation in the underlying PDF parsing library. The library uses an MD5-based digest as part of its encryption handling, which is not allowed in FIPS mode and triggers an UnsupportedDigestmodError: [digital envelope routines] unsupported error during file processing. As a result, affected encrypted PDFs cannot be uploaded into Llama Stack vector stores on FIPS-enabled clusters, these documents are not indexed, and they are unavailable to retrieval-augmented generation (RAG) workflows, which can lead to incomplete or missing answers when those documents are expected to be part of the context.

Workaround
Use non-encrypted PDF files when ingesting documents into Llama Stack vector stores on FIPS-enabled clusters, or re-export or convert the original document to a non-encrypted PDF before uploading it to the vector store. Until the underlying PDF library is updated to use FIPS-compliant cryptographic primitives, encrypted PDFs that rely on digests disallowed in FIPS mode will continue to fail to upload to Llama Stack vector stores on FIPS-enabled clusters.

RHOAIENG-57224 - ROCm universal image training produces NaN on MI300X due to torch aotriton 0.11.1 regression

The ROCm universal training image (th06) produces NaN values on MI300X accelerators because of an aotriton 0.11.1 regression in the AIPCC-built PyTorch wheel.

Workaround
Use the th05 image, or set attn_implementation="flash_attention_2" in your training code.
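The second option can be sketched as follows. This is a hypothetical illustration, assuming the model is loaded with the Hugging Face transformers library; the load_model_with_flash_attention helper and the model path are placeholders, and the flash-attn package must be installed in the image:

```python
def load_model_with_flash_attention(model_path: str):
    """Load a causal LM while forcing the flash_attention_2 backend,
    bypassing the aotriton-backed SDPA path that produces NaN values.
    Requires the transformers and flash-attn packages (placeholder path)."""
    from transformers import AutoModelForCausalLM  # deferred: heavy import

    return AutoModelForCausalLM.from_pretrained(
        model_path,
        attn_implementation="flash_attention_2",  # the workaround setting
    )
```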

RHOAIENG-57427 - RAG in Gen AI Playground doesn’t work with default system prompt and model Qwen/Qwen3-14B-AWQ

In the Gen AI Playground RAG feature, the default system prompt might not reliably trigger the knowledge search (tool-calling) behavior for some models, so document retrieval is not performed. As a result, questions about uploaded documents can return answers that do not use the vector store, resulting in incomplete or incorrect responses unless the prompt is adjusted.

Workaround
Manually edit the system prompt to explicitly instruct the model to use the knowledge search tool first for document-based or factual questions, as described in the Gen AI Playground RAG documentation. After you update the system prompt, RAG retrieval works and the model can answer based on the uploaded document content.

RHOAIENG-54005 - MaaS token generation endpoint removed, breaking the Gen AI Studio Playground

The /v1/token API endpoint was removed and its functionality was merged into the new /v1/api-keys creation endpoint. As a result, the Gen AI Playground cannot generate a token on the fly for Models-as-a-Service (MaaS) and cannot communicate with MaaS models in 3.4 EA2.

Workaround
There is no workaround for this known issue. MaaS models cannot be accessed from the Gen AI Playground in 3.4 EA2.

RHOAIENG-48753 - Pipeline Name must be DNS-compliant to use “Store pipeline definitions in Kubernetes”

Elyra does not convert the pipeline name to a DNS-compliant name when using the default Kubernetes storage. As a consequence, if you start an Elyra pipeline with a name that is not DNS-compliant, the run fails with a cryptic error similar to the following: "[TIP: did you mean to set https://ds-pipeline-dspa-robert-tests.apps.test.rhoai.rh-aiservices-bu.com/pipeline as the endpoint, take care not to include s at end]".

Workaround
Use DNS-compliant naming when running Elyra pipelines.
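As a guard, a pipeline name can be sanitized programmatically before use. The following is a hypothetical sketch (the to_dns1123 helper is not part of Elyra) that converts an arbitrary name into an RFC 1123-compliant label:

```python
import re


def to_dns1123(name: str, max_len: int = 63) -> str:
    """Sanitize a pipeline name into an RFC 1123 label: lowercase
    alphanumerics and '-', starting and ending with an alphanumeric."""
    name = name.lower()
    name = re.sub(r"[^a-z0-9-]+", "-", name)  # replace invalid runs with '-'
    name = re.sub(r"-{2,}", "-", name)        # collapse repeated dashes
    return name.strip("-")[:max_len].rstrip("-")
```

For example, a name such as "My Pipeline (v2)" becomes "my-pipeline-v2".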

7.2. Issues discovered at version 3.4 EA1

RHOAIENG-54101 - Deployments not listed in Model Registry on IBM Z

When you deploy a model from the Model Registry on IBM Z, the deployment does not appear under the Deployments tab in the Model Registry.

Workaround
Access and manage the deployment from the global Deployments page in the OpenShift AI dashboard.

RHOAIENG-53206 - Spark driver pods fail to communicate due to RpcTimeoutException

After installing the Spark Operator, Spark executor pods cannot communicate with the driver pod because the redhat-ods-applications namespace defaults to a "deny-all" traffic rule. SparkApplication pods hang and fail with an RpcTimeoutException.

Workaround

Create a NetworkPolicy in the redhat-ods-applications namespace to allow communication between the pods created by the SparkApplication controller:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: spark-operator-allow-internal
spec:
  podSelector:
    matchLabels:
      sparkoperator.k8s.io/launched-by-spark-operator: "true"
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - port: 7078
          protocol: TCP
        - port: 7079
          protocol: TCP
        - port: 4040
          protocol: TCP
      from:
        - podSelector: {}
        - namespaceSelector:
            matchLabels:
              network.openshift.io/policy-group: ingress

RHOAIENG-52130 - Workbenches with Feast integration fail to start due to missing ConfigMap

Workbenches with Feast integration enabled fail to start in OpenShift AI 3.4 EA1. Pods remain stuck in ContainerCreating state with the following error:

[FailedMount] [Warning] MountVolume.SetUp failed for volume "odh-feast-config"
  configmap "jupyter-nb-kube-3aadmin-feast-config" not found

Workaround

Restart the Feast Operator after DSC deployment completes:

$ kubectl rollout restart deployment/feast-operator-controller-manager -n redhat-ods-applications

RHOAIENG-53239 - Custom ServingRuntime required for IBM Z (s390x) vLLM Spyre deployments

When deploying models using the vLLM Spyre runtime on IBM Z (s390x) systems, the default ServingRuntime cannot be used directly for KServe-based deployments. Model deployment fails if the runtime is used without modification.

Workaround

Create a custom ServingRuntime by duplicating the vllm-spyre-s390x-runtime ServingRuntime and removing the command section from the container specification. Keep all other configuration, including environment variables, ports, and volume mounts, unchanged.

The following example shows only the affected section. Your complete ServingRuntime must include all other fields from the original template:

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-spyre-s390x-runtime-copy
spec:
  containers:
    - name: kserve-container
      image: <image>
      # Remove the 'command' section that appears here in the original
      args:
        - --model=/mnt/models
        - --port=8000
        - --served-model-name={{.Name}}
      # ... keep all env, ports, volumeMounts from original ...

RHOAIENG-50523 - Unable to upload RAG documents in Gen AI Playground on disconnected clusters

On disconnected clusters, uploading documents in the Gen AI Playground RAG section fails. The progress bar never exceeds 50% because Llama Stack attempts to download the ibm-granite/granite-embedding-125m-english embedding model from HuggingFace, even though the model is already included in the Llama Stack Distribution image in OpenShift AI 3.3.

Workaround

Modify the LlamaStackDistribution custom resource to include the following environment variables:

export MY_PROJECT=my-project

oc patch llamastackdistribution lsd-genai-playground \
  -n $MY_PROJECT \
  --type='json' \
  -p='[
    {
      "op": "add",
      "path": "/spec/server/containerSpec/env/-",
      "value": {
        "name": "SENTENCE_TRANSFORMERS_HOME",
        "value": "/opt/app-root/src/.cache/huggingface/hub"
      }
    },
    {
      "op": "add",
      "path": "/spec/server/containerSpec/env/-",
      "value": {
        "name": "HF_HUB_OFFLINE",
        "value": "1"
      }
    },
    {
      "op": "add",
      "path": "/spec/server/containerSpec/env/-",
      "value": {
        "name": "TRANSFORMERS_OFFLINE",
        "value": "1"
      }
    },
    {
      "op": "add",
      "path": "/spec/server/containerSpec/env/-",
      "value": {
        "name": "HF_DATASETS_OFFLINE",
        "value": "1"
      }
    }
  ]'

The Llama Stack pod restarts automatically after applying this configuration.
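To confirm that the variables took effect inside the restarted pod, a quick check can be run in a Python shell in the Llama Stack container. This is a hypothetical helper; the expected values mirror the environment variables set by the patch:

```python
import os

# Expected offline configuration after the LlamaStackDistribution patch.
EXPECTED = {
    "SENTENCE_TRANSFORMERS_HOME": "/opt/app-root/src/.cache/huggingface/hub",
    "HF_HUB_OFFLINE": "1",
    "TRANSFORMERS_OFFLINE": "1",
    "HF_DATASETS_OFFLINE": "1",
}


def missing_offline_vars(env=os.environ) -> list[str]:
    """Return the names of expected variables that are absent or wrong."""
    return [k for k, v in EXPECTED.items() if env.get(k) != v]
```

An empty return value means the pod is fully configured for offline operation.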

RHAIENG-2827 - Unsecured routes created by older CodeFlare SDK versions

Workbenches created in OpenShift AI 2.x continue to use an older version of the CodeFlare SDK when used in OpenShift AI 3.x. The older SDK version creates unsecured OpenShift routes on behalf of the user.

Workaround
To resolve this issue, update your workbench to the latest image provided in OpenShift AI 3.x before using CodeFlare SDK.

RHOAIENG-48867 - TrainJob fails to resume after Red Hat OpenShift AI upgrade due to immutable JobSet spec

TrainJobs that are suspended (for example, queued by Kueue) before a Red Hat OpenShift AI upgrade cannot resume after the upgrade completes. The Trainer controller fails to update the immutable JobSet spec.replicatedJobs field.

Workaround
To resolve this issue, delete and recreate the affected TrainJob after the upgrade.

RHOAIENG-45142 - Dashboard URLs return 404 errors after upgrading Red Hat OpenShift AI from 2.x to 3.x

The Red Hat OpenShift AI dashboard URL subdomain changed from rhods-dashboard-redhat-ods-applications.apps.<cluster> to data-science-gateway.apps.<cluster> due to the use of Gateways in OpenShift AI version 3.x. Existing bookmarks that use the default rhods-dashboard-redhat-ods-applications.apps.<cluster> format no longer function after you upgrade to OpenShift AI version 3.0 or later. Update your bookmarks and any internal documentation to use the new URL format: data-science-gateway.apps.<cluster>.

Workaround
To resolve this issue, deploy an nginx-based redirect solution that recreates the old route name and redirects traffic to the new gateway URL. For instructions, see Dashboard URLs return 404 errors after RHOAI upgrade from 2.x to 3.x.

Note

Cluster administrators must provide the new dashboard URL to all Red Hat OpenShift AI administrators and users. In a future release, URL redirects might be supported.
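A minimal sketch of the nginx configuration such a redirect solution might use. The hostnames contain the <cluster> placeholder, and the listen port and the route wiring that sends old-hostname traffic to this server are assumptions that depend on your deployment:

```nginx
server {
    listen 8080;
    # Traffic for the old dashboard hostname is routed here.
    server_name rhods-dashboard-redhat-ods-applications.apps.<cluster>;
    # Permanently redirect every request to the new gateway URL.
    return 301 https://data-science-gateway.apps.<cluster>$request_uri;
}
```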

RHOAIENG-43686 - Red Hat build of Kueue 1.2 installation or upgrade fails with Kueue CRD reconciliation error

Installing Red Hat build of Kueue 1.2 or upgrading from Red Hat build of Kueue 1.1 to 1.2 fails if legacy Kueue CustomResourceDefinitions (CRDs) remain in the cluster from a previous Red Hat OpenShift AI 2.x installation. As a result, when the legacy v1alpha1 CRDs are present, the Kueue operator cannot reconcile successfully and the Data Science Cluster (DSC) remains in a Not Ready state.

Workaround
To resolve this issue, delete the legacy Kueue CRDs (cohorts.kueue.x-k8s.io/v1alpha1 and topologies.kueue.x-k8s.io/v1alpha1) from the cluster. For detailed instructions, see Red Hat build of Kueue 1.2 installation or upgrade fails with Kueue CRD reconciliation error.

RHOAIENG-49389 - Tier management unavailable after deleting all tiers

If you delete all service tiers from Settings > Tiers, the Create tier button is no longer displayed. You cannot create tiers through the dashboard until at least one tier exists. To avoid this issue, ensure at least one tier remains in the system at all times.

Workaround

Create a basic tier using the CLI, then configure its settings through the dashboard. You must have cluster administrator privileges for your OpenShift cluster to perform these steps:

  1. Retrieve the tier-to-group-mapping ConfigMap:

    $ oc get configmap tier-to-group-mapping -n redhat-ods-applications -o yaml > tier-config.yaml
  2. Edit the ConfigMap to add a basic tier definition:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: tier-to-group-mapping
      namespace: redhat-ods-applications
    data:
      tiers.yaml: |
        - name: basic
          displayName: Basic Tier
          level: 0
          groups:
            - system:authenticated
  3. Apply the updated ConfigMap:

    $ oc apply -f tier-config.yaml
  4. In the dashboard, navigate to Settings > Tiers to configure rate limits for the newly created tier.

RHOAIENG-47589 - Missing Kueue validation for TrainJob

A TrainJob created without a defined Kueue LocalQueue passes validation, even when the namespace is managed by Kueue. As a result, it is possible to create a TrainJob that is not managed by Kueue in a Kueue-managed namespace.

Workaround
None.
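Until validation is added, you can avoid creating unmanaged TrainJobs by always labeling them with the target LocalQueue. The following is a minimal sketch, assuming a LocalQueue named user-queue exists in the namespace; the TrainJob name and runtime reference are placeholders:

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: example-trainjob                     # placeholder name
  labels:
    kueue.x-k8s.io/queue-name: user-queue    # assumes this LocalQueue exists
spec:
  runtimeRef:
    name: torch-distributed                  # placeholder runtime
```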

RHOAIENG-49017 - Upgrade RAGAS provider to Llama Stack 0.4.z / 0.5.z

To use the Ragas provider in OpenShift AI 3.3, you must update your Llama Stack distribution to use llama-stack-provider-ragas==0.5.4, which works with Llama Stack >=0.4.2,<0.5.0. This version of the provider is a workaround release that uses the deprecated register endpoints. See the full compatibility matrix for more information.

Workaround
None.

RHOAIENG-44516 - MLflow tracking server does not accept Kubernetes service account tokens

Red Hat OpenShift AI does not accept Kubernetes service account tokens when you authenticate through the dashboard MLflow URL.

Workaround

To authenticate with a service account token, complete the following steps:

  • Create an OpenShift Route directly to the MLflow service endpoints.
  • Use the Route URL as the MLFLOW_TRACKING_URI when you authenticate.
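The two steps above can be exercised with a plain HTTP request. The following is a minimal sketch using only the Python standard library; the Route URL, the token, and the /health path are placeholders for your environment (with the MLflow Python client, the equivalent is to export MLFLOW_TRACKING_URI and MLFLOW_TRACKING_TOKEN):

```python
import urllib.request


def tracking_request(route_url: str, token: str, path: str = "/health"):
    """Build a request against the MLflow tracking Route, passing the
    service account token as a bearer token (all values are placeholders)."""
    req = urllib.request.Request(route_url.rstrip("/") + path)
    req.add_header("Authorization", f"Bearer {token}")
    return req
```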