Chapter 7. Known issues
This section describes known issues in Red Hat OpenShift AI 3.4 EA1 and any known methods of working around these issues.
RHOAIENG-54101 - Deployments not listed in Model Registry on IBM Z
When you deploy a model from the Model Registry on IBM Z, the deployment does not appear under the Deployments tab in the Model Registry.
- Workaround
- Access and manage the deployment from the global Deployments page in the OpenShift AI dashboard.
RHOAIENG-53206 - Spark driver pods fail to communicate due to RpcTimeoutException
After installing the Spark Operator, Spark executor pods cannot communicate with the driver pod because the redhat-ods-applications namespace defaults to a "deny-all" traffic rule. SparkApplication pods hang and fail with an RpcTimeoutException.
- Workaround
Create a NetworkPolicy in the `redhat-ods-applications` namespace to allow communication between the pods created by the SparkApplication controller:
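The original example was not reproduced here; the following is a minimal sketch of such a policy. The `spark-role` label used in the selectors is an assumption based on the labels the Spark Operator commonly applies to driver and executor pods; verify the actual labels on your SparkApplication pods (for example, with `oc get pods --show-labels`) before applying:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-spark-driver-executor
  namespace: redhat-ods-applications
spec:
  # Assumption: driver and executor pods carry the spark-role label;
  # adjust the selectors to match the labels on your pods.
  podSelector:
    matchExpressions:
      - key: spark-role
        operator: In
        values: ["driver", "executor"]
  ingress:
    - from:
        - podSelector:
            matchExpressions:
              - key: spark-role
                operator: In
                values: ["driver", "executor"]
  policyTypes:
    - Ingress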
RHOAIENG-52130 - Workbenches with Feast integration fail to start due to missing ConfigMap
Workbenches with Feast integration enabled fail to start in OpenShift AI 3.4 EA1. Pods remain stuck in ContainerCreating state with the following error:
[FailedMount] [Warning] MountVolume.SetUp failed for volume "odh-feast-config" configmap "jupyter-nb-kube-3aadmin-feast-config" not found
- Workaround
Restart the Feast Operator after DSC deployment completes:
$ kubectl rollout restart deployment/feast-operator-controller-manager -n redhat-ods-applications
RHOAIENG-53239 - Custom ServingRuntime required for IBM Z (s390x) vLLM Spyre deployments
When deploying models using the vLLM Spyre runtime on IBM Z (s390x) systems, the default ServingRuntime cannot be used directly for KServe-based deployments. Model deployment fails if the runtime is used without modification.
- Workaround
Create a custom ServingRuntime by duplicating the `vllm-spyre-s390x-runtime` ServingRuntime and removing the `command` section from the container specification. Keep all other configuration, including environment variables, ports, and volume mounts, unchanged. The following example shows only the affected section. Your complete ServingRuntime must include all other fields from the original template:
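The original snippet was not reproduced here; as a sketch, assuming the template follows the usual KServe ServingRuntime layout with a container named `kserve-container`, the affected section looks like this after the edit:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-spyre-s390x-runtime-custom
spec:
  containers:
    - name: kserve-container        # keep the name from the original template
      image: <vllm-spyre-image>     # keep the image from the original template
      # The `command` section from the original template is removed here.
      # Keep env, ports, and volumeMounts exactly as in the original.
```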
RHOAIENG-50523 - Unable to upload RAG documents in Gen AI Playground on disconnected clusters
On disconnected clusters, uploading documents in the Gen AI Playground RAG section fails. The progress bar never exceeds 50% because Llama Stack attempts to download the ibm-granite/granite-embedding-125m-english embedding model from HuggingFace, even though the model is already included in the Llama Stack Distribution image in OpenShift AI 3.3.
- Workaround
Modify the LlamaStackDistribution custom resource to include the following environment variables:
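The exact variables were not reproduced here; as a sketch, the standard HuggingFace offline switch would be set as follows. The API group and the `spec.server.containerSpec.env` path are assumptions based on the Llama Stack operator CRD, and `HF_HUB_OFFLINE` is the generic `huggingface_hub` offline-mode variable; confirm both against your installed CRD and the supported configuration:

```yaml
apiVersion: llamastack.io/v1alpha1   # assumption: API group of the installed CRD
kind: LlamaStackDistribution
metadata:
  name: lsd
spec:
  server:
    containerSpec:
      env:
        # HF_HUB_OFFLINE=1 tells huggingface_hub to use only local files
        # instead of attempting downloads from HuggingFace.
        - name: HF_HUB_OFFLINE
          value: "1"
```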
The Llama Stack pod restarts automatically after applying this configuration.
RHAIENG-2827 - Unsecured routes created by older CodeFlare SDK versions
Existing 2.x workbenches continue to use an older version of the CodeFlare SDK when used in OpenShift AI 3.x. The older version of the SDK creates unsecured OpenShift routes on behalf of the user.
- Workaround
- To resolve this issue, update your workbench to the latest image provided in OpenShift AI 3.x before using CodeFlare SDK.
RHOAIENG-48867 - TrainJob fails to resume after Red Hat OpenShift AI upgrade due to immutable JobSet spec
TrainJobs that are suspended (for example, queued by Kueue) before a Red Hat OpenShift AI upgrade cannot resume after the upgrade completes. The Trainer controller fails to update the immutable JobSet `spec.replicatedJobs` field.
- Workaround
- To resolve this issue, delete and recreate the affected TrainJob after the upgrade.
RHOAIENG-45142 - Dashboard URLs return 404 errors after upgrading Red Hat OpenShift AI from 2.x to 3.x
The Red Hat OpenShift AI dashboard URL subdomain changed from `rhods-dashboard-redhat-ods-applications.apps.<cluster>` to `data-science-gateway.apps.<cluster>` due to the use of Gateways in OpenShift AI version 3.x. Existing bookmarks that use the default `rhods-dashboard-redhat-ods-applications.apps.<cluster>` format no longer work after you upgrade to OpenShift AI version 3.0 or later. Update your bookmarks and any internal documentation to use the new URL format: `data-science-gateway.apps.<cluster>`.
- Workaround
- To resolve this issue, deploy an nginx-based redirect solution that recreates the old route name and redirects traffic to the new gateway URL. For instructions, see Dashboard URLs return 404 errors after RHOAI upgrade from 2.x to 3.x.
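As an illustration of the redirect approach only (not the supported configuration; follow the linked instructions for that), an nginx server fronted by a Route with the old hostname can return a permanent redirect to the new gateway URL:

```
server {
    listen 8080;
    # Old dashboard host, recreated as a Route that points at this nginx pod
    server_name rhods-dashboard-redhat-ods-applications.apps.<cluster>;
    # Redirect every request to the new gateway URL, preserving the path
    return 301 https://data-science-gateway.apps.<cluster>$request_uri;
}
```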
Cluster administrators must provide the new dashboard URL to all Red Hat OpenShift AI administrators and users. In a future release, URL redirects may be supported.
RHOAIENG-43686 - Red Hat build of Kueue 1.2 installation or upgrade fails with Kueue CRD reconciliation error
Installing Red Hat build of Kueue 1.2 or upgrading from Red Hat build of Kueue 1.1 to 1.2 fails if legacy Kueue CustomResourceDefinitions (CRDs) remain in the cluster from a previous Red Hat OpenShift AI 2.x installation. As a result, when the legacy v1alpha1 CRDs are present, the Kueue operator cannot reconcile successfully and the Data Science Cluster (DSC) remains in a Not Ready state.
- Workaround
- To resolve this issue, delete the legacy Kueue CRDs `cohorts.kueue.x-k8s.io/v1alpha1` or `topologies.kueue.x-k8s.io/v1alpha1` from the cluster. For detailed instructions, see Red Hat Build of Kueue 1.2 installation or upgrade fails with Kueue CRD reconciliation error.
RHOAIENG-49389 - Tier management unavailable after deleting all tiers
If you delete all service tiers from Settings > Tiers, the Create tier button is no longer displayed. You cannot create tiers through the dashboard until at least one tier exists. To avoid this issue, ensure at least one tier remains in the system at all times.
- Workaround
Create a basic tier using the CLI, then configure its settings through the dashboard. You must have cluster administrator privileges for your OpenShift cluster to perform these steps:
Retrieve the `tier-to-group-mapping` ConfigMap and save it to a file:
$ oc get configmap tier-to-group-mapping -n redhat-ods-applications -o yaml > tier-config.yaml
Edit the ConfigMap to add a basic tier definition:
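The original tier definition was not reproduced here; the following sketch shows one plausible shape, assuming each `data` key is a tier name whose value lists the user groups assigned to it. This schema is an assumption; copy the structure of any entries already present in your ConfigMap instead:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tier-to-group-mapping
  namespace: redhat-ods-applications
data:
  # Assumption: tier name -> assigned groups; match the shape of
  # existing entries in your cluster if any remain.
  basic: |
    groups:
      - system:authenticated
```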
Apply the updated ConfigMap:
$ oc apply -f tier-config.yaml
- In the dashboard, navigate to Settings > Tiers to configure rate limits for the newly created tier.
RHOAIENG-47589 - Missing Kueue validation for TrainJob
Creating a TrainJob without a defined Kueue LocalQueue passes validation, even when the namespace is managed by Kueue. As a result, it is possible to create a TrainJob that is not managed by Kueue in a Kueue-managed namespace.
- Workaround
- None.
RHOAIENG-49017 - Upgrade RAGAS provider to Llama Stack 0.4.z / 0.5.z
To use the Ragas provider in OpenShift AI 3.3, you must update your Llama Stack distribution to use llama-stack-provider-ragas==0.5.4, which works with Llama Stack >=0.4.2,<0.5.0. This version of the provider is a workaround release that uses the deprecated register endpoints. See the full compatibility matrix for more information.
- Workaround
- None.
RHOAIENG-44516 - MLflow tracking server does not accept Kubernetes service account tokens
Red Hat OpenShift AI does not accept Kubernetes service account tokens when you authenticate through the dashboard MLflow URL.
- Workaround
To authenticate with a service account token, complete the following steps:
- Create an OpenShift Route directly to the MLflow service endpoints.
- Use the Route URL as the `MLFLOW_TRACKING_URI` when you authenticate.
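The steps above can be sketched as environment variables for the MLflow client. The Route URL and token values below are placeholders, not real endpoints; MLflow reads `MLFLOW_TRACKING_URI` and sends `MLFLOW_TRACKING_TOKEN` as a bearer token on tracking requests:

```shell
# Placeholder values; substitute the Route URL created in the previous step
# and a token for your service account (for example, from `oc create token`).
export MLFLOW_TRACKING_URI="https://mlflow-route.apps.example.com"
export MLFLOW_TRACKING_TOKEN="sha256~example-token"
```

With both variables set, MLflow client calls in that shell session authenticate against the Route instead of the dashboard URL.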