Chapter 7. Known issues
This section describes known issues in Red Hat OpenShift AI 3.4 EA1 and any known methods of working around these issues.
RHOAIENG-54101 - Deployments not listed in Model Registry on IBM Z
When you deploy a model from the Model Registry on IBM Z, the deployment does not appear under the Deployments tab in the Model Registry.
- Workaround
- Access and manage the deployment from the global Deployments page in the OpenShift AI dashboard.
RHOAIENG-53206 - Spark driver pods fail to communicate due to RpcTimeoutException
After installing the Spark Operator, Spark executor pods cannot communicate with the driver pod because the redhat-ods-applications namespace defaults to a "deny-all" traffic rule. SparkApplication pods hang and fail with an RpcTimeoutException.
- Workaround
Create a NetworkPolicy in the `redhat-ods-applications` namespace to allow communication between the pods created by the SparkApplication controller:
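The original example was not reproduced here; the following is a minimal sketch of such a policy. The `spark-role` label used in the selectors is an assumption based on the labels the Spark Operator commonly applies to driver and executor pods; verify the actual labels on your SparkApplication pods (for example, with `oc get pods --show-labels`) before applying:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-spark-driver-executor
  namespace: redhat-ods-applications
spec:
  # Assumption: driver and executor pods carry the spark-role label;
  # adjust the selectors to match the labels on your pods.
  podSelector:
    matchExpressions:
      - key: spark-role
        operator: In
        values: ["driver", "executor"]
  ingress:
    - from:
        - podSelector:
            matchExpressions:
              - key: spark-role
                operator: In
                values: ["driver", "executor"]
  policyTypes:
    - Ingress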
RHOAIENG-52130 - Workbenches with Feast integration fail to start due to missing ConfigMap
Workbenches with Feast integration enabled fail to start in OpenShift AI 3.4 EA1. Pods remain stuck in ContainerCreating state with the following error:
[FailedMount] [Warning] MountVolume.SetUp failed for volume "odh-feast-config" configmap "jupyter-nb-kube-3aadmin-feast-config" not found
- Workaround
Restart the Feast Operator after DSC deployment completes:
$ kubectl rollout restart deployment/feast-operator-controller-manager -n redhat-ods-applications
RHOAIENG-53239 - Custom ServingRuntime required for IBM Z (s390x) vLLM Spyre deployments
When deploying models using the vLLM Spyre runtime on IBM Z (s390x) systems, the default ServingRuntime cannot be used directly for KServe-based deployments. Model deployment fails if the runtime is used without modification.
- Workaround
Create a custom ServingRuntime by duplicating the `vllm-spyre-s390x-runtime` ServingRuntime and removing the `command` section from the container specification. Keep all other configuration, including environment variables, ports, and volume mounts, unchanged. The following example shows only the affected section. Your complete ServingRuntime must include all other fields from the original template:
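The original snippet was not reproduced here; as a sketch, assuming the template follows the usual KServe ServingRuntime layout with a container named `kserve-container`, the affected section looks like this after the edit:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-spyre-s390x-runtime-custom
spec:
  containers:
    - name: kserve-container        # keep the name from the original template
      image: <vllm-spyre-image>     # keep the image from the original template
      # The `command` section from the original template is removed here.
      # Keep env, ports, and volumeMounts exactly as in the original.
```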
RHOAIENG-50523 - Unable to upload RAG documents in Gen AI Playground on disconnected clusters
On disconnected clusters, uploading documents in the Gen AI Playground RAG section fails. The progress bar never exceeds 50% because Llama Stack attempts to download the ibm-granite/granite-embedding-125m-english embedding model from HuggingFace, even though the model is already included in the Llama Stack Distribution image in OpenShift AI 3.3.
- Workaround
Modify the LlamaStackDistribution custom resource to include the following environment variables:
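The exact variables were not reproduced here; as a sketch, the standard HuggingFace offline switch would be set as follows. The API group and the `spec.server.containerSpec.env` path are assumptions based on the Llama Stack operator CRD, and `HF_HUB_OFFLINE` is the generic `huggingface_hub` offline-mode variable; confirm both against your installed CRD and the supported configuration:

```yaml
apiVersion: llamastack.io/v1alpha1   # assumption: API group of the installed CRD
kind: LlamaStackDistribution
metadata:
  name: lsd
spec:
  server:
    containerSpec:
      env:
        # HF_HUB_OFFLINE=1 tells huggingface_hub to use only local files
        # instead of attempting downloads from HuggingFace.
        - name: HF_HUB_OFFLINE
          value: "1"
```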
The Llama Stack pod restarts automatically after applying this configuration.
RHAIENG-2827 - Unsecured routes created by older CodeFlare SDK versions
Existing 2.x workbenches continue to use an older version of the CodeFlare SDK when used in OpenShift AI 3.x. The older version of the SDK creates unsecured OpenShift routes on behalf of the user.
- Workaround
- To resolve this issue, update your workbench to the latest image provided in OpenShift AI 3.x before using CodeFlare SDK.
RHOAIENG-48867 - TrainJob fails to resume after Red Hat OpenShift AI upgrade due to immutable JobSet spec
TrainJobs that are suspended (for example, queued by Kueue) before a Red Hat OpenShift AI upgrade cannot resume after the upgrade completes. The Trainer controller fails to update the immutable JobSet `spec.replicatedJobs` field.
- Workaround
- To resolve this issue, delete and recreate the affected TrainJob after the upgrade.
RHOAIENG-45142 - Dashboard URLs return 404 errors after upgrading Red Hat OpenShift AI from 2.x to 3.x
The Red Hat OpenShift AI dashboard URL subdomain changed from `rhods-dashboard-redhat-ods-applications.apps.<cluster>` to `data-science-gateway.apps.<cluster>` due to the use of Gateways in OpenShift AI version 3.x. Existing bookmarks that use the default `rhods-dashboard-redhat-ods-applications.apps.<cluster>` format no longer work after you upgrade to OpenShift AI version 3.0 or later. Update your bookmarks and any internal documentation to use the new URL format: `data-science-gateway.apps.<cluster>`.
- Workaround
- To resolve this issue, deploy an nginx-based redirect solution that recreates the old route name and redirects traffic to the new gateway URL. For instructions, see Dashboard URLs return 404 errors after RHOAI upgrade from 2.x to 3.x.
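As an illustration of the redirect approach only (not the supported configuration; follow the linked instructions for that), an nginx server fronted by a Route with the old hostname can return a permanent redirect to the new gateway URL:

```
server {
    listen 8080;
    # Old dashboard host, recreated as a Route that points at this nginx pod
    server_name rhods-dashboard-redhat-ods-applications.apps.<cluster>;
    # Redirect every request to the new gateway URL, preserving the path
    return 301 https://data-science-gateway.apps.<cluster>$request_uri;
}
```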
Cluster administrators must provide the new dashboard URL to all Red Hat OpenShift AI administrators and users. In a future release, URL redirects may be supported.
RHOAIENG-43686 - Red Hat build of Kueue 1.2 installation or upgrade fails with Kueue CRD reconciliation error
Installing Red Hat build of Kueue 1.2 or upgrading from Red Hat build of Kueue 1.1 to 1.2 fails if legacy Kueue CustomResourceDefinitions (CRDs) remain in the cluster from a previous Red Hat OpenShift AI 2.x installation. As a result, when the legacy v1alpha1 CRDs are present, the Kueue operator cannot reconcile successfully and the Data Science Cluster (DSC) remains in a Not Ready state.
- Workaround
- To resolve this issue, delete the legacy Kueue CRDs `cohorts.kueue.x-k8s.io/v1alpha1` or `topologies.kueue.x-k8s.io/v1alpha1` from the cluster. For detailed instructions, see Red Hat Build of Kueue 1.2 installation or upgrade fails with Kueue CRD reconciliation error.
RHOAIENG-49389 - Tier management unavailable after deleting all tiers
If you delete all service tiers from Settings > Tiers, the Create tier button is no longer displayed. You cannot create tiers through the dashboard until at least one tier exists. To avoid this issue, ensure at least one tier remains in the system at all times.
- Workaround
Create a basic tier using the CLI, then configure its settings through the dashboard. You must have cluster administrator privileges for your OpenShift cluster to perform these steps:
Retrieve the `tier-to-group-mapping` ConfigMap and save it to a file:
$ oc get configmap tier-to-group-mapping -n redhat-ods-applications -o yaml > tier-config.yaml
Edit the ConfigMap to add a basic tier definition:
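The original tier definition was not reproduced here; the following sketch shows one plausible shape, assuming each `data` key is a tier name whose value lists the user groups assigned to it. This schema is an assumption; copy the structure of any entries already present in your ConfigMap instead:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tier-to-group-mapping
  namespace: redhat-ods-applications
data:
  # Assumption: tier name -> assigned groups; match the shape of
  # existing entries in your cluster if any remain.
  basic: |
    groups:
      - system:authenticated
```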
Apply the updated ConfigMap:
$ oc apply -f tier-config.yaml
- In the dashboard, navigate to Settings > Tiers to configure rate limits for the newly created tier.
RHOAIENG-47589 - Missing Kueue validation for TrainJob
Creating a TrainJob without a defined Kueue LocalQueue passes validation, even when the namespace is managed by Kueue. As a result, it is possible to create a TrainJob that is not managed by Kueue in a Kueue-managed namespace.
- Workaround
- None.
RHOAIENG-49017 - Upgrade RAGAS provider to Llama Stack 0.4.z / 0.5.z
To use the Ragas provider in OpenShift AI 3.3, you must update your Llama Stack distribution to use llama-stack-provider-ragas==0.5.4, which works with Llama Stack >=0.4.2,<0.5.0. This version of the provider is a workaround release that uses the deprecated register endpoints. See the full compatibility matrix for more information.
- Workaround
- None.
RHOAIENG-44516 - MLflow tracking server does not accept Kubernetes service account tokens
Red Hat OpenShift AI does not accept Kubernetes service account tokens when you authenticate through the dashboard MLflow URL.
- Workaround
To authenticate with a service account token, complete the following steps:
- Create an OpenShift Route directly to the MLflow service endpoints.
- Use the Route URL as the `MLFLOW_TRACKING_URI` when you authenticate.
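The steps above can be sketched as environment variables for the MLflow client. The Route URL and token values below are placeholders, not real endpoints; MLflow reads `MLFLOW_TRACKING_URI` and sends `MLFLOW_TRACKING_TOKEN` as a bearer token on tracking requests:

```shell
# Placeholder values; substitute the Route URL created in the previous step
# and a token for your service account (for example, from `oc create token`).
export MLFLOW_TRACKING_URI="https://mlflow-route.apps.example.com"
export MLFLOW_TRACKING_TOKEN="sha256~example-token"
```

With both variables set, MLflow client calls in that shell session authenticate against the Route instead of the dashboard URL.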