Chapter 7. Known issues


This section describes known issues in Red Hat OpenShift AI 3.5 EA1, 3.4 EA1, 3.4 EA2, and 3.4 GA, and any known methods of working around these issues.

7.1. Issues discovered at version 3.5 EA1

RHOAIENG-64768 - AutoML and AutoRAG pipeline runs fail with image pull errors

The default pipeline definitions shipped with OpenShift AI reference container image digests that are not available in the production registry. As a consequence, AutoML and AutoRAG experiment runs remain in progress indefinitely, and pipeline task pods log ImagePullBackOff or ErrImagePull errors with messages such as manifest unknown.

Workaround

Download the updated pipeline definition for your experiment type from the rhoai-3.5-ea.1-fixed branch of the red-hat-data-services/pipelines-components repository on GitHub:

  • AutoML Tabular
  • AutoML Time Series
  • AutoRAG

    If you have already imported the pipeline, upload the updated file as a new version and re-run the experiment as a new run. For more information, see Uploading a pipeline version.

    If you have not yet imported the pipeline, import the updated file. For more information, see Importing a pipeline.

    After you upload the updated pipeline definition, experiment runs pull the correct images and complete successfully.

RHOAIENG-66859 - Evaluation jobs fail to complete with an MLflow experiment configured

When you submit an evaluation job with a configured MLflow experiment, the evaluation itself completes successfully, but the evaluation adapter then fails while saving the results to MLflow. As a consequence, the evaluation remains in a running state indefinitely and never reports completion.

Workaround
Do not specify an MLflow experiment.

RHOAIENG-66068 - The OpenShift AI dashboard only supports an EvalHub instance in the redhat-ods-applications namespace

The Backend-for-Frontend (BFF) service always looks for the MLflow multi-tenant instance in its own namespace, redhat-ods-applications, regardless of where the cluster administrator has deployed it. As a consequence, the OpenShift AI dashboard reports that evaluations are not enabled when the multi-tenant instance is hosted in a different namespace.

Workaround
Deploy the MLflow multi-tenant instance in the redhat-ods-applications namespace. As a result, the BFF service correctly detects the instance and the evaluations feature is available in the dashboard. Instances in other namespaces continue to work, but they are not discoverable from the OpenShift AI dashboard.

RHOAIENG-67534 - A new evaluation run fails in the OpenShift AI dashboard

If the MLflow custom resource (CR) is created after the Evaluations CR, the workspaces_enabled setting is set to false. As a result, creating a new evaluation run in the OpenShift AI dashboard fails with an INVALID_PARAMETER_VALUE error: "Workspace context is required for this request."

Workaround
Create the MLflow CR before you create the Evaluations CR. This ensures that the workspaces_enabled setting is set to true and that evaluation runs can be created successfully.

RHOAIENG-65203 - Model Car (OCI) deployment fails for ONNX models with external data

When you use the Model Car (OCI image) method to deploy an ONNX model split into model.onnx and model.onnx.data files, the MLServer runtime container cannot access the external data file. The Model Car sidecar container exposes files by using cross-container symlinks instead of a shared volume, so the system cannot load the model. The pod changes to a CrashLoopBackOff state with the following error:

Data of TensorProto references external data at /mnt/models/model.onnx.data, but the model directory path could not be resolved.

Single-file model formats, such as SKLEARN, XGBoost, and LightGBM, are not affected.

Workaround
To deploy ONNX models with external data files, use an S3-compatible object storage backend instead of OCI image storage.

AIPCC-18235 - Structured output (JSON Schema) generation fails on IBM Z (s390x) with llguidance backend

When you use the llguidance structured decoding backend on IBM Z (s390x), JSON schema-constrained generation may produce invalid output or become stuck generating whitespace indefinitely.

Workaround
A fix is available in llguidance version 1.7.0 and later. Update your wheel from version 1.3.0 to at least version 1.7.0 for this fix.
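
The version requirement above can be checked before you deploy. The following is a minimal sketch, assuming the wheel is installed under the distribution name llguidance in the current Python environment. The helper names are hypothetical, and pre-release or local suffixes (for example rc1 or +cpu) are ignored in the comparison.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric release segment, for example '1.7.0' -> (1, 7, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str = "1.7.0") -> bool:
    """Return True if the installed version is at least the minimum version."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.7' and '1.7.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_llguidance_ok() -> Optional[bool]:
    """Check the llguidance wheel in this environment; None if it is not installed."""
    try:
        return meets_minimum(metadata.version("llguidance"))
    except metadata.PackageNotFoundError:
        return None
```

A return value of False from installed_llguidance_ok() suggests that the environment still uses a wheel older than 1.7.0.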

AIPCC-17927 - vLLM crashes when multiple requests are inflight with structured outputs

When you send multiple inference requests in parallel to a vLLM-based inference server and at least one request includes structured output, the service stops responding, causing the pod to fail. As a result, concurrent workloads that use structured outputs do not function as expected.

Workaround

To prevent the service from failing, apply one of the following workarounds:

  • Process requests sequentially instead of sending multiple parallel requests that include structured output in the same batch.
  • Exclude structured output requests when you run concurrent workloads.
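
One way to apply either workaround on the client side is to separate structured output requests from the rest of a batch before sending it. The following is a minimal sketch, not a vLLM API. It assumes that a request payload signals structured output through one of the fields listed in STRUCTURED_KEYS, which is an assumption that you must adjust to your client.

```python
from typing import Any, Dict, List, Tuple

# Assumption: any of these request fields indicates structured output.
STRUCTURED_KEYS = ("response_format", "guided_json", "structured_outputs")


def uses_structured_output(payload: Dict[str, Any]) -> bool:
    """Return True if the payload sets any structured output field to a non-null value."""
    return any(payload.get(key) is not None for key in STRUCTURED_KEYS)


def split_batch(
    payloads: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split a batch into (plain, structured) lists, preserving the original order."""
    plain = [p for p in payloads if not uses_structured_output(p)]
    structured = [p for p in payloads if uses_structured_output(p)]
    return plain, structured
```

With this split, the plain list can be sent concurrently, and the structured list can be sent one request at a time after the concurrent work has finished.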

7.2. Issues discovered at version 3.4 GA

RHOAIENG-37916 - Deployed llm-d model shows failed in UI

Models deployed by using llm-d initially show a status of Failed in the OpenShift AI dashboard. For more information about the actual status, use the OpenShift console to follow the status of the pods in the project. When the model is fully ready, the OpenShift AI dashboard displays a status of Started.

Workaround
Wait until the status changes, or check the pod statuses.

RHOAIENG-60940 - BUG: NemoGuardrails RBAC ClusterRoles missing Kubernetes aggregation labels cause 403 for non-cluster-admin users

When you test NemoGuardrails as a user without cluster administrator privileges, a 403 Forbidden error is returned:

User "X" cannot get resource "nemoguardrails" in API group trustyai.opendatahub.io in namespace "Y"

The nemoguardrail-viewer-role and nemoguardrail-editor-role ClusterRoles on the operator side are missing the standard Kubernetes RBAC aggregation labels (aggregate-to-view, aggregate-to-edit, aggregate-to-admin). Without these labels, regular namespace users do not inherit the NemoGuardrails permissions through the default aggregated ClusterRoles, resulting in access denied errors.

Workaround

Cluster administrators can create ClusterRoleBindings that grant the nemoguardrail-editor-role or nemoguardrail-viewer-role ClusterRole to users who require edit or view permissions for NemoGuardrails resources:

$ oc create clusterrolebinding <binding-name> \
  --clusterrole=nemoguardrail-editor-role \
  --user=<username>
$ oc create clusterrolebinding <binding-name> \
  --clusterrole=nemoguardrail-viewer-role \
  --user=<username>

RHOAIENG-60292 - MLflow does not automatically run database migrations after upgrade

MLflow does not automatically apply database migration changes after an upgrade.

Workaround
During the upgrade, scale MLflow down to one replica by setting the spec.replicas parameter to 1 in the MLflow CR.

RHOAIENG-58969 - precise-prefix-cache-scorer returns score of zero due to PodIdentifier key format mismatch

In OpenShift AI 3.4, the precise prefix cache scorer was updated to identify routing endpoints using a combination of the IP address and port. Previous versions relied solely on the IP address. If the vLLM configuration is not updated to provide the port, the scorer cannot match cache entries to the correct endpoints. As a result, traffic routing ignores prefix cache locality entirely and falls back to standard load balancing, which can reduce performance efficiency.

Workaround
Update the vLLM arguments in your configuration to include the port number in the KV events topic. Change the topic from the older format, kv@${POD_IP}@<model>, to a format that includes the port, for example kv@${POD_IP}:8000@<model>. After this change, the precise scorer correctly identifies cache locations, which restores prefix cache locality and routes requests to the pods with the most relevant cached data.
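
The topic change can be sketched as a small string transformation. The following example is illustrative only: the function name is hypothetical, port 8000 is taken from the example in this workaround, and the endpoint is assumed to be an IPv4 address or a placeholder such as ${POD_IP}, because an IPv6 address would need different handling of the colon.

```python
def add_port_to_kv_topic(topic: str, port: int = 8000) -> str:
    """Rewrite 'kv@<endpoint>@<model>' to 'kv@<endpoint>:<port>@<model>'."""
    parts = topic.split("@", 2)
    if len(parts) != 3 or parts[0] != "kv":
        raise ValueError(f"unexpected KV events topic: {topic!r}")
    _, endpoint, model = parts
    if ":" in endpoint:
        # The endpoint already carries a port; leave the topic unchanged.
        return topic
    return f"kv@{endpoint}:{port}@{model}"
```

For example, add_port_to_kv_topic("kv@${POD_IP}@<model>") returns the new format shown in this workaround.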

RHOAIENG-59950 - Search space preparation fails when too many models are provided

If you start an AutoRAG experiment with more than three embedding models or more than two generation (foundation) models, the search_space_preparation component runs model pre-selection and produces an incorrect search_space_prep_report that the next component, rag_templates_optimization, cannot parse. As a consequence, the experiment fails with the following error:

SearchSpaceValueError: Missing required parameters in the search space: {'embedding_model', 'foundation_model'}

Workaround
Start the experiment with no more than two foundation models and no more than three embedding models.
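
The limits described in this issue can be checked before you start an experiment. The following is a minimal sketch with hypothetical names. It is not part of AutoRAG, and it only counts the models that you intend to pass.

```python
from typing import List, Sequence

# Limits taken from the workaround for this known issue.
MAX_EMBEDDING_MODELS = 3
MAX_FOUNDATION_MODELS = 2


def check_autorag_models(
    embedding_models: Sequence[str], foundation_models: Sequence[str]
) -> List[str]:
    """Return a list of problems; an empty list means the counts are within the limits."""
    problems = []
    if len(embedding_models) > MAX_EMBEDDING_MODELS:
        problems.append(
            f"{len(embedding_models)} embedding models given, at most {MAX_EMBEDDING_MODELS} allowed"
        )
    if len(foundation_models) > MAX_FOUNDATION_MODELS:
        problems.append(
            f"{len(foundation_models)} foundation models given, at most {MAX_FOUNDATION_MODELS} allowed"
        )
    return problems
```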

RHOAIENG-60855 - Upgrade error: OGX Operator produces invalid Deployment when storage is configured

When upgrading OpenShift AI from 3.3 to 3.4, the OGX Operator can fail to reconcile an existing OGXServer custom resource that includes a storage specification, for example storage.size: 2Gi. Due to an upgrade-strategy change, the operator may generate an invalid Deployment that specifies both spec.strategy.type: Recreate and spec.strategy.rollingUpdate, which Kubernetes rejects with an error similar to the following:

Deployment.apps "ogx-distribution-upgrade" is invalid: spec.strategy.rollingUpdate: Forbidden: may not be specified when strategy 'type' is 'Recreate'

Workaround

Delete the affected Deployment so that the operator recreates it with a valid strategy:

$ oc delete deployment <cr-name> -n <namespace>

Replace <cr-name> with the name of the OGXServer custom resource and <namespace> with its namespace. The OGX Operator recreates the Deployment, and the new pod works as expected.
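
The rule that Kubernetes enforces here can be expressed as a short check on a Deployment manifest that has been loaded into a dictionary, for example from JSON. The following sketch only illustrates the validation rule. The function name is hypothetical, and the sketch does not reproduce the operator logic.

```python
from typing import Any, Dict


def has_conflicting_strategy(deployment: Dict[str, Any]) -> bool:
    """Return True if the manifest sets type 'Recreate' together with rollingUpdate."""
    strategy = (deployment.get("spec") or {}).get("strategy") or {}
    return strategy.get("type") == "Recreate" and strategy.get("rollingUpdate") is not None
```

A manifest for which this function returns True is rejected by Kubernetes with the Forbidden error described in this issue.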

INFERENG-6962 - Distributed Inference with llm-d EndpointPicker is bypassed when multiple HTTPRoutes share the same gateway listener

When multiple HTTPRoutes are attached to the same wildcard Gateway listener, Istio aggregates them into a single autogenerated Gateway VirtualService and does not create the per-route ExtProcPerRoute override for the LLMInferenceService. This causes the EndpointPicker to be bypassed entirely. Requests fall back to round-robin routing; prefix cache scoring, load-aware scoring, and all intelligent scheduling are silently disabled.

This behavior is not specific to multiple LLMInferenceServices and is triggered by any HTTPRoute on the same wildcard Gateway listener, such as a token endpoint, echo service, or test route.

You can identify this issue by checking the EndpointPicker logs, which might show no per-request activity, even at verbosity level 6 or 7. Additionally, the gateway ext_proc filter shows cluster_name: "dummy" and request_header_mode: SKIP with no per-route override applied.

This affects Istio 1.26, deployed by openshift-ingress in OSSM 3.3.x and 3.4. The upstream fix is in Istio 1.29. The following issue is related: OSSM-12585.

Workaround
Remove or reassign any non-LLMInferenceService HTTPRoutes from the inference Gateway. Move them to a separate Gateway so the LLMInferenceService HTTPRoute is the only consumer of the wildcard listener.

7.3. Issues discovered at version 3.4 EA2

RHOAIENG-58765 - Distributed Inference with llm-d prefill and decode disaggregation fails on FIPS-enabled clusters

Using Distributed Inference with llm-d prefill and decode disaggregation for LLM deployments on FIPS-enabled clusters causes the routing sidecar pod to enter a crash loop, preventing the LLM deployment from functioning. This issue is caused by a runtime image introduced in the 3.4 EA2 release that is not FIPS-compatible.

Workaround
Do not use prefill and decode disaggregation with Distributed Inference with llm-d in Red Hat OpenShift AI 3.4 EA2 on FIPS-enabled clusters. Other features continue to work correctly on FIPS-enabled clusters.

RHOAIENG-57224 - ROCm universal image training produces NaN on MI300X due to torch aotriton 0.11.1 regression

The ROCm universal training image (th06) produces NaN values on MI300X because of a regression in aotriton 0.11.1 in the AIPCC-built PyTorch wheel.

Workaround
Use the th05 image, or set attn_implementation="flash_attention_2".

RHOAIENG-57427 - RAG in Gen AI Playground does not work with default system prompt and model Qwen/Qwen3-14B-AWQ

In Gen AI Playground RAG, the default system prompt might not reliably trigger the knowledge search or tool-calling behavior for some models, so document retrieval is not performed. As a consequence, questions about uploaded documents can return answers that do not use the vector store, resulting in incomplete or incorrect responses unless the prompt is adjusted.

Workaround
Manually edit the system prompt to explicitly instruct the model to use the knowledge search tool first for document-based or factual questions, as documented in the Gen AI Playground RAG documentation. After you update the system prompt, RAG retrieval works and the model can answer based on the uploaded document content.

RHOAIENG-54005 - Generate MaaS Token Endpoint Removed - breaks Gen AI Studio Playground

The /v1/token API was removed, and its functionality was merged into the new POST operation of the /v1/api-keys endpoint. As a result, the Gen AI Playground cannot generate a token on demand for MaaS and cannot communicate with MaaS models in 3.4 EA2.

Workaround
There is no workaround for this known issue. As a consequence, the Gen AI Playground cannot access MaaS models in 3.4 EA2.

RHOAIENG-48753 - Pipeline Name must be DNS-compliant to use “Store pipeline definitions in Kubernetes”

Elyra does not convert the pipeline name to a DNS-compliant name when using the default Kubernetes storage. As a consequence, if you start an Elyra pipeline with a name that is not DNS-compliant, the run fails with a misleading error similar to the following: "[TIP: did you mean to set https://ds-pipeline-dspa-robert-tests.apps.test.rhoai.rh-aiservices-bu.com/pipeline as the endpoint, take care not to include s at end]".

Workaround
Use DNS-compliant naming when running Elyra pipelines.
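
If pipeline names are generated programmatically, you can normalize them before starting the pipeline. The following is a minimal sketch that assumes "DNS-compliant" means an RFC 1123 label: lowercase alphanumeric characters and hyphens, starting and ending with an alphanumeric character, and at most 63 characters. The function name is hypothetical, and the exact rule that the Kubernetes storage backend applies might be less strict.

```python
import re


def to_dns_label(name: str, max_len: int = 63) -> str:
    """Convert an arbitrary pipeline name to an RFC 1123 label."""
    # Lowercase, then replace every run of disallowed characters with one hyphen.
    label = re.sub(r"[^a-z0-9-]+", "-", name.lower())
    # Collapse repeated hyphens and trim hyphens from both ends.
    label = re.sub(r"-{2,}", "-", label).strip("-")
    # Truncate, and make sure that truncation does not leave a trailing hyphen.
    label = label[:max_len].rstrip("-")
    if not label:
        raise ValueError(f"cannot derive a DNS-compliant name from {name!r}")
    return label
```

For example, to_dns_label("My Pipeline_v2!") returns "my-pipeline-v2".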

7.4. Issues discovered at version 3.4 EA1

RHOAIENG-54101 - Deployments not listed in Model Registry on IBM Z

When you deploy a model from the Model Registry on IBM Z, the deployment does not appear under the Deployments tab in the Model Registry.

Workaround
Access and manage the deployment from the global Deployments page in the OpenShift AI dashboard.

RHOAIENG-53206 - Spark driver pods fail to communicate due to RpcTimeoutException

After installing the Spark Operator, Spark executor pods cannot communicate with the driver pod because the redhat-ods-applications namespace defaults to a "deny-all" traffic rule. SparkApplication pods hang and fail with an RpcTimeoutException.

Workaround

Create a NetworkPolicy in the redhat-ods-applications namespace to allow communication between the pods created by the SparkApplication controller:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: spark-operator-allow-internal
spec:
  podSelector:
    matchLabels:
      sparkoperator.k8s.io/launched-by-spark-operator: "true"
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - port: 7078
          protocol: TCP
        - port: 7079
          protocol: TCP
        - port: 4040
          protocol: TCP
      from:
        - podSelector: {}
        - namespaceSelector:
            matchLabels:
              network.openshift.io/policy-group: ingress

RHOAIENG-52130 - Workbenches with Feast integration fail to start due to missing ConfigMap

Workbenches with Feast integration enabled fail to start in OpenShift AI 3.4 EA1. Pods remain stuck in ContainerCreating state with the following error:

[FailedMount] [Warning] MountVolume.SetUp failed for volume "odh-feast-config"
  configmap "jupyter-nb-kube-3aadmin-feast-config" not found

Workaround

Restart the Feast Operator after DSC deployment completes:

$ kubectl rollout restart deployment/feast-operator-controller-manager -n redhat-ods-applications

RHOAIENG-53239 - Custom ServingRuntime required for IBM Z (s390x) vLLM Spyre deployments

When deploying models using the vLLM Spyre runtime on IBM Z (s390x) systems, the default ServingRuntime cannot be used directly for KServe-based deployments. Model deployment fails if the runtime is used without modification.

Workaround

Create a custom ServingRuntime by duplicating the vllm-spyre-s390x-runtime ServingRuntime and removing the command section from the container specification. Keep all other configuration, including environment variables, ports, and volume mounts, unchanged.

The following example shows only the affected section. Your complete ServingRuntime must include all other fields from the original template:

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-spyre-s390x-runtime-copy
spec:
  containers:
    - name: kserve-container
      image: <image>
      # Remove the 'command' section that appears here in the original
      args:
        - --model=/mnt/models
        - --port=8000
        - --served-model-name={{.Name}}
      # ... keep all env, ports, volumeMounts from original ...

RHOAIENG-53865 - MaaS tier-based rate limiting fails when configured through the dashboard UI

When you configure Models-as-a-Service (MaaS) tier-based rate limiting through the Red Hat OpenShift AI dashboard UI, the following issues occur:

  • The system creates separate TokenRateLimitPolicy resources for each tier instead of a single, combined policy. This default configuration causes rate limiting to silently fail for most tiers, allowing users in unprotected tiers to exceed intended limits.
  • The dashboard UI does not read or display rate limits configured through the CLI.
  • When you edit tier settings through the dashboard UI, the UI-configured settings overwrite the CLI configuration.

Workaround

There are two possible workarounds to ensure that rate limiting is enforced for all tiers:

  • Manually update the TokenRateLimitPolicy resources with merged limits for each tier.
  • Create a single, combined TokenRateLimitPolicy resource with limits for all tiers.

To manually update the TokenRateLimitPolicy resources to use a merge strategy:

  1. List the existing TokenRateLimitPolicy resources:

    $ oc get tokenratelimitpolicy -n openshift-ingress
  2. Edit each tier policy to use defaults.strategy: merge instead of atomic. For example, edit the Free tier policy, tier-free-token-rate-limits:

    $ oc edit tokenratelimitpolicy tier-free-token-rate-limits -n openshift-ingress
  3. In the editor, locate the spec.defaults section and change the strategy from atomic to merge:

    spec:
      defaults:
        strategy: merge  # Change from atomic to merge
        limits:
          free-tokens:   # Ensure this has a distinct name
            when:
              - predicate: auth.identity.tier == "free"
            rates:
              - limit: 1000
                window: 1m0s
  4. Save and exit the editor.
  5. Repeat steps 2-4 for the Premium and Enterprise tier policies (tier-premium-token-rate-limits and tier-enterprise-token-rate-limits), ensuring that each limit has a distinct name, such as premium-tokens and enterprise-tokens.
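
After you edit the tier policies, you can confirm the two conditions that the steps above rely on: every policy uses the merge strategy, and no limit name is reused across policies. The following is a minimal sketch that operates on policies already loaded into dictionaries, for example from JSON output. The function name is hypothetical, and only the fields shown in this section are inspected.

```python
from typing import Any, Dict, List


def check_tier_policies(policies: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems found in the given TokenRateLimitPolicy objects."""
    problems: List[str] = []
    seen: Dict[str, str] = {}  # limit name -> policy that first used it
    for policy in policies:
        name = (policy.get("metadata") or {}).get("name", "<unnamed>")
        defaults = (policy.get("spec") or {}).get("defaults") or {}
        if defaults.get("strategy") != "merge":
            problems.append(f"{name}: strategy is not 'merge'")
        for limit_name in defaults.get("limits") or {}:
            if limit_name in seen:
                problems.append(
                    f"{name}: limit '{limit_name}' is already used by {seen[limit_name]}"
                )
            else:
                seen[limit_name] = name
    return problems
```

An empty result indicates that the inspected policies satisfy both conditions. It does not confirm that the policies are enforced on the cluster.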

To create a single, combined TokenRateLimitPolicy resource with limits for all tiers:

  • Configure rate limits using the CLI by creating a single, combined TokenRateLimitPolicy resource with limits for all tiers such as in the following example:

    Complete example configuration for Free tier

    apiVersion: kuadrant.io/v1alpha1
    kind: TokenRateLimitPolicy
    metadata:
      name: tier-free-token-rate-limits
    spec:
      targetRef:
        kind: Gateway
        name: maas-default-gateway
      defaults:
        strategy: merge
        limits:
          free-tokens:
            when:
              - predicate: auth.identity.tier == "free"
            rates:
              - limit: 1000
                window: 1m0s

Warning

Do not edit tier settings through the dashboard UI after applying CLI-configured policies, because changes in the UI overwrite any CLI configurations.

Verification

Verify that the policy is enforced:

  1. In the side navigation menu of the OpenShift AI dashboard, click Administration > CustomResourceDefinitions.
  2. In the CustomResourceDefinitions list, search for and click TokenRateLimitPolicy.
  3. Click the Instances tab to view the list of policies.
  4. In the Name column, click the name of the specific policy you want to verify. For example, gateway-default-deny.
  5. In the TokenRateLimitPolicy details page, locate the Enforced status field:

    1. True: The policy is being picked up by the controller.
    2. False or -: The policy is not being used.
  6. Diagnose why a policy is not being used by scrolling to the Conditions section to view the status details.
  7. Review the Reason column for error codes, such as TargetNotFound.
  8. Review the Message column for a detailed explanation of the issue, such as a missing target gateway.

RHOAIENG-52057 - LLMInferenceService deployment fails without Leader WorkerSet operator

When deploying an LLMInferenceService object for Distributed Inference Server, the deployment fails with the following error:

failed to reconcile multi-node main workload: failed to build the expected main LWS: failed to get expected leader worker set demo-llm/qwen-kserve-mn: no matches for kind "LeaderWorkerSet" in version "leaderworkerset.x-k8s.io/v1"

Workaround
Install the LeaderWorkerSet Operator.

RHAIENG-2827 - Unsecured routes created by older CodeFlare SDK versions

Existing 2.x workbenches continue to use an older version of the CodeFlare SDK when used in OpenShift AI 3.x. The older version of the SDK creates unsecured OpenShift routes on behalf of the user.

Workaround
To resolve this issue, update your workbench to the latest image provided in OpenShift AI 3.x before using CodeFlare SDK.

RHOAIENG-48867 - TrainJob fails to resume after Red Hat OpenShift AI upgrade due to immutable JobSet spec

TrainJobs that are suspended (for example, queued by Kueue) before a Red Hat OpenShift AI upgrade cannot resume after the upgrade completes. The Trainer controller fails to update the immutable JobSet spec.replicatedJobs field.

Workaround
To resolve this issue, delete and recreate the affected TrainJob after the upgrade.

RHOAIENG-45142 - Dashboard URLs return 404 errors after upgrading Red Hat OpenShift AI from 2.x to 3.x

The Red Hat OpenShift AI dashboard URL subdomain changed from rhods-dashboard-redhat-ods-applications.apps.<cluster> to data-science-gateway.apps.<cluster> due to the use of Gateways in OpenShift AI version 3.x. Existing bookmarks to the dashboard that use the default rhods-dashboard-redhat-ods-applications.apps.<cluster> format no longer function after you upgrade to OpenShift AI version 3.0 or later. Update your bookmarks and any internal documentation to use the new URL format: data-science-gateway.apps.<cluster>.

Workaround
To resolve this issue, deploy an nginx-based redirect solution that recreates the old route name and redirects traffic to the new gateway URL. For instructions, see Dashboard URLs return 404 errors after RHOAI upgrade from 2.x to 3.x.

Note

Cluster administrators must provide the new dashboard URL to all Red Hat OpenShift AI administrators and users. In a future release, URL redirects may be supported.
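
When you update bookmarks or internal documentation in bulk, the host name change can be applied programmatically. The following is a minimal sketch with a hypothetical function name. It rewrites only the host prefix described in this issue and leaves the path and query unchanged, which assumes that they are still valid in the 3.x dashboard.

```python
from urllib.parse import urlsplit, urlunsplit

OLD_PREFIX = "rhods-dashboard-redhat-ods-applications."
NEW_PREFIX = "data-science-gateway."


def rewrite_dashboard_url(url: str) -> str:
    """Replace the 2.x dashboard host prefix with the 3.x gateway prefix."""
    parts = urlsplit(url)
    if not parts.netloc.startswith(OLD_PREFIX):
        # Not an old-style dashboard URL; return it unchanged.
        return url
    new_host = NEW_PREFIX + parts.netloc[len(OLD_PREFIX):]
    return urlunsplit(parts._replace(netloc=new_host))
```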

RHOAIENG-43686 - Red Hat build of Kueue 1.2 installation or upgrade fails with Kueue CRD reconciliation error

Installing Red Hat build of Kueue 1.2 or upgrading from Red Hat build of Kueue 1.1 to 1.2 fails if legacy Kueue CustomResourceDefinitions (CRDs) remain in the cluster from a previous Red Hat OpenShift AI 2.x installation. As a result, when the legacy v1alpha1 CRDs are present, the Kueue operator cannot reconcile successfully and the Data Science Cluster (DSC) remains in a Not Ready state.

Workaround
To resolve this issue, delete the legacy Kueue CRDs, cohorts.kueue.x-k8s.io/v1alpha1 or topologies.kueue.x-k8s.io/v1alpha1, from the cluster. For detailed instructions, see Red Hat Build of Kueue 1.2 installation or upgrade fails with Kueue CRD reconciliation error.

RHOAIENG-49389 - Tier management unavailable after deleting all tiers

If you delete all service tiers from Settings > Tiers, the Create tier button is no longer displayed. You cannot create tiers through the dashboard until at least one tier exists. To avoid this issue, ensure at least one tier remains in the system at all times.

Workaround

Create a basic tier using the CLI, then configure its settings through the dashboard. You must have cluster administrator privileges for your OpenShift cluster to perform these steps:

  1. Retrieve the tier-to-group-mapping ConfigMap:

    $ oc get configmap tier-to-group-mapping -n redhat-ods-applications -o yaml > tier-config.yaml
  2. Edit the ConfigMap to add a basic tier definition:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: tier-to-group-mapping
      namespace: redhat-ods-applications
    data:
      tiers.yaml: |
        - name: basic
          displayName: Basic Tier
          level: 0
          groups:
            - system:authenticated
  3. Apply the updated ConfigMap:

    $ oc apply -f tier-config.yaml
  4. In the dashboard, navigate to Settings > Tiers to configure rate limits for the newly created tier.

RHOAIENG-47589 - Missing Kueue validation for TrainJob

Creating a TrainJob without a defined Kueue LocalQueue succeeds without a validation check, even when the namespace is managed by Kueue. As a result, it is possible to create a TrainJob that is not managed by Kueue in a Kueue-managed namespace.

Workaround
None.

RHOAIENG-44516 - MLflow tracking server does not accept Kubernetes service account tokens

When you authenticate through the dashboard MLflow URL without using the workbench MLflow integration, Red Hat OpenShift AI does not accept Kubernetes service account tokens.

Workaround

Use the automatic MLflow workbench integration, which configures service account token authentication through the MLFLOW_TRACKING_AUTH environment variable. Annotate your workbench notebook with opendatahub.io/mlflow-instance and restart the workbench. For more information, see Enable MLflow integration for a workbench.

If you are not using a workbench, create an OpenShift Route directly to the MLflow service endpoints and use the Route URL as the MLFLOW_TRACKING_URI when you authenticate.

RHOAIENG-45969 - MLflow artifact-serving configuration with S3 is not supported

The automatic MLflow workbench integration does not configure S3-backed artifact storage. You can log parameters, metrics, and tags, but mlflow.log_artifact() functionality that relies on an S3-backed artifact store requires additional manual configuration.

Workaround
None. Configure artifact storage manually if needed.