Chapter 1. Managing data science pipelines
1.1. Configuring a pipeline server
Before you can successfully create a pipeline in OpenShift AI, you must configure a pipeline server. This task includes configuring where your pipeline artifacts and data are stored.
You are not required to specify any storage directories when configuring a connection for your pipeline server. When you import a pipeline, the /pipelines folder is created in the root folder of the bucket, containing a YAML file for the pipeline. If you upload a new version of the same pipeline, a new YAML file with a different ID is added to the /pipelines folder.
When you run a pipeline, the artifacts are stored in the /pipeline-name folder in the root folder of the bucket.
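As an illustration, the storage layout described above can be sketched in Python. The helper functions and names below are hypothetical, not part of the product:

```python
# Illustrative sketch of the object keys that the pipeline server creates,
# based on the layout described above. Function names are hypothetical.

def pipeline_definition_key(pipeline_version_id: str) -> str:
    """Imported pipeline definitions are stored under /pipelines in the bucket root."""
    return f"pipelines/{pipeline_version_id}.yaml"

def run_artifact_key(pipeline_name: str, artifact_name: str) -> str:
    """Run artifacts are stored under a folder named after the pipeline."""
    return f"{pipeline_name}/{artifact_name}"

print(pipeline_definition_key("a1b2c3"))                  # pipelines/a1b2c3.yaml
print(run_artifact_key("fraud-detection", "model.onnx"))  # fraud-detection/model.onnx
```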
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- You have created a data science project that you can add a pipeline server to.
- You have an existing S3-compatible object storage bucket and you have configured write access to your S3 bucket on your storage account.
- If you are configuring a pipeline server for production pipeline workloads, you have an existing external MySQL or MariaDB database.
If you are configuring a pipeline server with an external MySQL database, your database must use at least MySQL version 5.x. However, Red Hat recommends that you use MySQL version 8.x.
Note: The mysql_native_password authentication plugin is required for the ML Metadata component to successfully connect to your database. mysql_native_password is disabled by default in MySQL 8.4 and later. If your database uses MySQL 8.4 or later, you must update your MySQL deployment to enable the mysql_native_password plugin. For more information about enabling the mysql_native_password plugin, see Native Pluggable Authentication in the MySQL documentation.
- If you are configuring a pipeline server with a MariaDB database, your database must use MariaDB version 10.3 or later. However, Red Hat recommends that you use at least MariaDB version 10.5.
Procedure
- From the OpenShift AI dashboard, click Data science projects. The Data science projects page opens.
- Click the name of the project that you want to configure a pipeline server for. A project details page opens.
- Click the Pipelines tab.
- Click Configure pipeline server. The Configure pipeline server dialog opens.
- In the Object storage connection section, provide values for the mandatory fields:
- In the Access key field, enter the access key ID for the S3-compatible object storage provider.
- In the Secret key field, enter the secret access key for the S3-compatible object storage account that you specified.
- In the Endpoint field, enter the endpoint of your S3-compatible object storage bucket.
- In the Region field, enter the default region of your S3-compatible object storage account.
- In the Bucket field, enter the name of your S3-compatible object storage bucket.
Important: If you specify incorrect connection settings, you cannot update these settings on the same pipeline server. Therefore, you must delete the pipeline server and configure another one.
If you want to use an existing artifact that was not generated by a task in a pipeline, you can use the kfp.dsl.importer component to import the artifact from its URI. You can only import these artifacts to the S3-compatible object storage bucket that you define in the Bucket field in your pipeline server configuration. For more information about the kfp.dsl.importer component, see Special Case: Importer Components.
- Click Advanced settings to display the Database, Pipeline definition storage, and Pipeline caching sections.
In the Database section, choose one of the following options to specify where to store your pipeline metadata and run information:
- Select Default database on the cluster to deploy a MariaDB database in your project.
  Important: The Default database on the cluster option is intended for development and testing purposes only. For production pipeline workloads, select the External MySQL database option to use an external MySQL or MariaDB database.
- Select External MySQL database to add a new connection to an external MySQL or MariaDB database that your pipeline server can access.
- In the Host field, enter the database hostname.
- In the Port field, enter the database port.
- In the Username field, enter the default user name that is connected to the database.
- In the Password field, enter the password for the default user account.
- In the Database field, enter the database name.
- Optional: By default, pipeline definitions are stored as Kubernetes resources, enabling version control, GitOps workflows, and integration with OpenShift GitOps or similar tools. To store pipeline definitions in the internal database instead, clear the Store pipeline definitions in Kubernetes checkbox in the Pipeline definition storage section.
- Optional: By default, caching is configurable at both the pipeline and task levels. To disable caching for all pipelines and tasks in the pipeline server and override any pipeline-level and task-level caching settings, clear the Allow caching to be configured per pipeline and task checkbox in the Pipeline caching section.
- Click Configure pipeline server.
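As a side note, the interaction between the server-level Pipeline caching checkbox and the pipeline- and task-level cache settings can be sketched as follows. This is illustrative logic under the assumed precedence described above, not the actual implementation:

```python
def cache_enabled(server_allows_caching: bool,
                  pipeline_setting: bool = True,
                  task_setting: bool = True) -> bool:
    """Sketch: clearing the server-level checkbox overrides everything;
    otherwise caching applies only if both the pipeline-level and
    task-level settings allow it (assumed precedence)."""
    if not server_allows_caching:
        return False
    return pipeline_setting and task_setting

print(cache_enabled(False, True, True))   # False: server-level override wins
print(cache_enabled(True, True, False))   # False: the task opted out
print(cache_enabled(True, True, True))    # True
```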
Verification
On the Pipelines tab for the project:
- The Import pipeline button is available.
- When you click the action menu (⋮) and then click Manage pipeline server configuration, the pipeline server details are displayed.
1.1.1. Configuring a pipeline server with an external Amazon RDS database
To configure a pipeline server with an external Amazon Relational Database Service (RDS) database, you must configure OpenShift AI to trust the certificates issued by its certificate authorities (CAs).
If you are configuring a pipeline server for production pipeline workloads, Red Hat recommends that you use an external MySQL or MariaDB database.
Prerequisites
- You have cluster administrator privileges for your OpenShift cluster.
- You have logged in to Red Hat OpenShift AI.
- You have created a data science project that you can add a pipeline server to.
- You have an existing S3-compatible object storage bucket, and you have configured your storage account with write access to your S3 bucket.
Procedure
- Before configuring your pipeline server, from Amazon RDS: Certificate bundles by AWS Region, download the PEM certificate bundle for the region that the database was created in. For example, if the database was created in the us-east-1 region, download us-east-1-bundle.pem.
- In a terminal window, log in to the OpenShift cluster where OpenShift AI is deployed:

  oc login api.<cluster_name>.<cluster_domain>:6443 --web

- Run the following command to fetch the current OpenShift AI trusted CA configuration and store it in a new file:

  oc get dscinitializations.dscinitialization.opendatahub.io default-dsci -o json | jq '.spec.trustedCABundle.customCABundle' > /tmp/my-custom-ca-bundles.crt

- Run the following command to append the PEM certificate bundle that you downloaded to the new custom CA configuration file:

  cat us-east-1-bundle.pem >> /tmp/my-custom-ca-bundles.crt

- Run the following command to update the OpenShift AI trusted CA configuration to trust certificates issued by the CAs included in the new custom CA configuration file:

  oc patch dscinitialization default-dsci --type='json' -p='[{"op":"replace","path":"/spec/trustedCABundle/customCABundle","value":"'"$(awk '{printf "%s\\n", $0}' /tmp/my-custom-ca-bundles.crt)"'"}]'

- Configure a pipeline server, as described in Configuring a pipeline server.
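The awk expression in the patch command flattens the multi-line PEM bundle into a single JSON string with literal \n escapes. A minimal Python sketch of the same transformation, for illustration only (the certificate content is a placeholder):

```python
import json

def escape_bundle(pem_text: str) -> str:
    """Mimic awk '{printf "%s\\n", $0}': emit each line followed by a
    literal backslash-n, so the bundle fits in a one-line JSON patch value."""
    return "".join(line + "\\n" for line in pem_text.splitlines())

bundle = "-----BEGIN CERTIFICATE-----\nMIIB...\n-----END CERTIFICATE-----"  # placeholder
patch = [{
    "op": "replace",
    "path": "/spec/trustedCABundle/customCABundle",
    "value": escape_bundle(bundle),
}]
print(json.dumps(patch)[:80])  # the escaped bundle embeds as one JSON string
```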
Verification
- The pipeline server starts successfully.
- You can import and run data science pipelines.
1.2. Defining a pipeline
The Kubeflow Pipelines SDK enables you to define end-to-end machine learning and data pipelines. Use the latest Kubeflow Pipelines 2.0 SDK to build your data science pipeline in Python code. After you have built your pipeline, use the SDK to compile it into an Intermediate Representation (IR) YAML file. For more information about compiling pipelines, see Compiling the pipeline YAML with the Kubeflow Pipelines SDK and Compiling Kubernetes-native manifests with the Kubeflow Pipelines SDK. Compiling to Kubernetes-native manifests is optional and applies only when your pipeline server is configured to use Kubernetes API storage. After defining the pipeline, you can import the YAML file to the OpenShift AI dashboard, where you can configure its execution settings.
If you are using OpenShift AI on a cluster running in FIPS mode, any custom container images for data science pipelines must be based on UBI 9 or RHEL 9. This ensures compatibility with FIPS-approved pipeline components and prevents errors related to mismatched OpenSSL or GNU C Library (glibc) versions.
You can also use the Elyra JupyterLab extension to create and run data science pipelines within JupyterLab. For more information about creating pipelines in JupyterLab, see Working with pipelines in JupyterLab. For more information about the Elyra JupyterLab extension, see Elyra Documentation.
1.2.1. Compiling the pipeline YAML with the Kubeflow Pipelines SDK
Before you can define your pipeline in the cluster, you must convert your Python-defined pipeline into YAML format. You can use the Kubeflow Pipelines (KFP) Software Development Kit (SDK) to compile your pipeline code into a deployable YAML file for declarative GitOps deployment.
Prerequisites
- You have installed Python 3.11 or later in your local environment.
- You have installed the Kubeflow Pipelines SDK package (kfp) version 2.14.3 or later.
- You have a valid Python pipeline definition file.
Procedure
Compile your pipeline by using the KFP SDK to generate the pipeline YAML file.
In the following example, replace <pipeline_file>.py with the name of your Python pipeline file and specify an output file for the compiled YAML:
$ kfp dsl compile \
--py <pipeline_file>.py \
--output <compiled_pipeline_file>.yaml
The generated <compiled_pipeline_file>.yaml file contains the compiled pipeline specification in YAML format. You can use this content as the value of the pipelineSpec field when you create a PipelineVersion custom resource (CR). You can also store the file in Git for declarative or GitOps-based deployment.
Verification
Verify that the generated file includes a pipelineSpec key followed by the compiled pipeline definition:
$ head -n 10 <compiled_pipeline_file>.yaml
Additional resources
1.2.2. Compiling Kubernetes-native manifests with the Kubeflow Pipelines SDK
If your pipeline server uses the Kubernetes native API mode, you can compile your pipeline directly to Kubernetes manifests. The output includes Pipeline and PipelineVersion custom resources with spec.pipelineSpec and, when you use Kubernetes resource configuration, an optional spec.platformSpec.
Prerequisites
- You have installed Python 3.11 or later in your local environment.
- You have installed the Kubeflow Pipelines SDK package (kfp) version 2.14.3 or later.
- You have a valid Python pipeline definition file.
Procedure
- Save the following code as a new file named compile.py in your working directory. The example uses the KubernetesManifestOptions class from the kfp.compiler.compiler_utils module to define pipeline metadata such as the name, version, and namespace.

  Example compile script

  from kfp import dsl, compiler
  from kfp.compiler.compiler_utils import KubernetesManifestOptions

  @dsl.pipeline(name="<pipeline_name>")
  def my_pipeline():
      pass  # define your tasks

  compiler.Compiler().compile(
      pipeline_func=my_pipeline,
      package_path="<output_file>.yaml",
      kubernetes_manifest_format=True,
      kubernetes_manifest_options=KubernetesManifestOptions(
          pipeline_name="<pipeline_name>",
          pipeline_version_name="<version_name>",
          namespace="<namespace>",
          include_pipeline_manifest=True,
      ),
  )

- Run the script to compile your pipeline and generate the Kubernetes manifests:

  $ python compile.py
Verification
Verify that the compiled output includes the expected resources:
apiVersion: pipelines.kubeflow.org/v2beta1
kind: Pipeline
---
apiVersion: pipelines.kubeflow.org/v2beta1
kind: PipelineVersion
spec:
pipelineSpec: ...
platformSpec: ... # present when Kubernetes resource configuration is used
Additional resources
1.2.3. Defining a pipeline by using the Kubernetes API
You can define data science pipelines and pipeline versions by using the Kubernetes API, which stores them as custom resources in the cluster instead of the internal database. This approach makes it easier to use OpenShift GitOps (Argo CD) or similar tools to manage pipelines and pipeline versions, while still allowing you to manage them through the OpenShift AI dashboard, API, and the Kubeflow Pipelines (KFP) Software Development Kit (SDK). You can generate the required manifests by using the Kubeflow Pipelines SDK; see Compiling the pipeline YAML with the Kubeflow Pipelines SDK or Compiling Kubernetes-native manifests with the Kubeflow Pipelines SDK.
If your pipeline server is already configured to use Kubernetes API storage, you can still use the OpenShift AI dashboard and REST API to view pipeline details, run pipelines, and create schedules. In this mode, the Kubernetes API acts as the storage backend, so your existing tools continue to work as expected.
Prerequisites
- You have OpenShift AI administrator privileges or you are the project owner.
- You have a data science project with a running pipeline server.
- You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster:
  - Installing the OpenShift CLI for OpenShift Container Platform
  - Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
- If you plan to create a PipelineVersion custom resource, you have done either of the following:
  - Compiled your Python pipeline to IR YAML by using the KFP SDK. See Compiling the pipeline YAML with the Kubeflow Pipelines SDK.
  - Compiled Kubernetes-native manifests by using the KFP SDK. See Compiling Kubernetes-native manifests with the Kubeflow Pipelines SDK.
Procedure
- In a terminal window, log in to your OpenShift cluster by using the OpenShift CLI (oc):

  $ oc login -u <user_name>

  When prompted, enter the OpenShift server URL, connection type, and your password.

- To configure the pipeline server to use Kubernetes API storage instead of the default database option, set the spec.apiServer.pipelineStore field to kubernetes in your project's DataSciencePipelinesApplication (DSPA) custom resource. In the following command, replace <dspa_name> with the name of your DSPA custom resource, and replace <namespace> with the name of your project:

  $ oc patch dspa <dspa_name> -n <namespace> \
      --type=merge \
      -p '{"spec": {"apiServer": {"pipelineStore": "kubernetes"}}}'

  Warning: When you switch the pipeline server from database storage to Kubernetes API storage, existing pipelines that were stored in the internal database are no longer visible in the OpenShift AI dashboard or REST API. To view or manage those pipelines again, change the spec.apiServer.pipelineStore field back to database.

- Define a Pipeline custom resource in a YAML file with the following contents:

  Example pipeline definition

  apiVersion: pipelines.kubeflow.org/v2beta1
  kind: Pipeline
  metadata:
    name: <name>
    namespace: <namespace>
  spec:
    displayName: <displayName>

  - name: The immutable Kubernetes resource name of your pipeline.
  - namespace: The name of your project.
  - displayName: The user-friendly display name of your pipeline, which is shown in the dashboard and REST API.
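For illustration, a manifest like the one above can also be built programmatically. This stdlib-only Python sketch uses hypothetical placeholder names:

```python
import json

def pipeline_manifest(name: str, namespace: str, display_name: str) -> dict:
    """Build a minimal Pipeline custom resource matching the example above.
    The name, namespace, and display name arguments are placeholders."""
    return {
        "apiVersion": "pipelines.kubeflow.org/v2beta1",
        "kind": "Pipeline",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"displayName": display_name},
    }

manifest = pipeline_manifest("iris-training", "my-ds-project", "Iris training pipeline")
print(json.dumps(manifest, indent=2))
```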
- Apply the pipeline definition to create the Pipeline custom resource in your cluster. In the following command, replace <pipeline_yaml_file> with the name of your YAML file:

  Example command

  $ oc apply -f <pipeline_yaml_file>.yaml

  Alternatively, if you compiled Kubernetes-native manifests with the KFP SDK, you can apply the generated file directly without manually creating separate YAML files:

  $ oc apply -f <output_file>.yaml

  The generated file includes both Pipeline and PipelineVersion resources. If you use this method, you can skip the following manual definition steps and proceed to the verification step.

- Define a PipelineVersion custom resource in a YAML file with the following contents:

  Example pipeline version definition

  apiVersion: pipelines.kubeflow.org/v2beta1
  kind: PipelineVersion
  metadata:
    name: <name>
    namespace: <namespace>
  spec:
    pipelineName: <pipelineName>
    displayName: <displayName>
    description: This is the first version of the pipeline.
    pipelineSpec:
      # ... YAML generated by compiling the Python pipeline with the KFP SDK ...

  - name: The name of your pipeline version.
  - namespace: The name of your project.
  - pipelineName: The immutable Kubernetes resource name of your pipeline. This value must match the metadata.name value in the Pipeline custom resource.
  - displayName: The user-friendly display name of your pipeline version, which is shown in the dashboard and REST API.
  - pipelineSpec: The YAML content that you generated by using the Kubeflow Pipelines (KFP) SDK.
- Apply the pipeline version definition to create the PipelineVersion custom resource in your cluster. In the following command, replace <pipeline_version_yaml_file> with the name of your YAML file:

  Example command

  $ oc apply -f <pipeline_version_yaml_file>.yaml

  After you create the pipeline version, the system automatically applies the following labels to the pipeline version for easier filtering:

  Example automatic labels

  pipelines.kubeflow.org/pipeline-id: <metadata.uid of the pipeline>
  pipelines.kubeflow.org/pipeline: <pipeline name>
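The automatic labels derive directly from the parent Pipeline resource; a small illustrative Python sketch (the UID and pipeline name are placeholders):

```python
def version_labels(pipeline_uid: str, pipeline_name: str) -> dict:
    """Labels that the system applies to a PipelineVersion, per the example
    above: the parent pipeline's metadata.uid and its resource name."""
    return {
        "pipelines.kubeflow.org/pipeline-id": pipeline_uid,
        "pipelines.kubeflow.org/pipeline": pipeline_name,
    }

labels = version_labels("1f2d3c4b-aaaa-bbbb-cccc-000000000000", "iris-training")
print(labels["pipelines.kubeflow.org/pipeline"])  # iris-training
```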
Verification
- Check that the Pipeline custom resource was successfully created:

  $ oc get pipeline <pipeline_name> -n <namespace>

- Check that the PipelineVersion custom resource was successfully created:

  $ oc get pipelineversion <pipeline_version_name> -n <namespace>
1.2.4. Migrating pipelines from database to Kubernetes API storage
You can migrate existing pipelines and pipeline versions from the internal database to Kubernetes custom resources. This makes it easier to use OpenShift GitOps (Argo CD) or similar tools to manage pipelines and pipeline versions, while still allowing you to manage them through the OpenShift AI dashboard, API, and the Kubeflow Pipelines (KFP) Software Development Kit (SDK).
This procedure uses a community-supported Kubeflow Pipelines migration script to export pipelines from the Data Science Pipelines API and generate corresponding Pipeline and PipelineVersion custom resources for import into your cluster.
The migration script in this procedure is maintained by the Kubeflow Pipelines community and is not supported by Red Hat. Before you use the script, review the repository and validate it in a non-production environment.
The pipeline and pipeline version IDs change during migration, so existing pipeline runs do not map to the migrated pipeline version. The original ID is stored in the pipelines.kubeflow.org/original-id label.
Prerequisites
- You have OpenShift AI administrator privileges or you are the project owner.
- You have a data science project with a running pipeline server.
- The pipeline server is configured with spec.apiServer.pipelineStore: database.
- You have Python 3.11 installed in your local environment.
- You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster:
  - Installing the OpenShift CLI for OpenShift Container Platform
  - Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
Procedure
- In a terminal window, log in to your OpenShift cluster by using the OpenShift CLI (oc):

  $ oc login -u <user_name>

  When prompted, enter the OpenShift server URL, connection type, and your password.

- Set environment variables for your data science project and get the pipeline API route. In the export command, replace <namespace> with the name of your project:

  echo "Setting the prerequisite variables"
  export NAMESPACE=<namespace>
  export DSPA_NAME=$(oc -n $NAMESPACE get dspa -o jsonpath={.items[0].metadata.name})
  export API_URL="https://$(oc -n $NAMESPACE get route "ds-pipeline-$DSPA_NAME" -o jsonpath={.spec.host})"

- Create a Python virtual environment and install the required dependencies:

  echo "Set up the Python prerequisites"
  python3.11 -m venv .venv
  ./.venv/bin/pip install kfp requests PyYAML

- Download and run the Kubeflow Pipelines community migration script.
  The script connects to the Data Science Pipelines API, exports all pipelines and versions from the specified data science project, and generates one YAML file per pipeline in a local kfp-exported-pipelines/ directory. Each file includes a Pipeline resource followed by all associated PipelineVersion resources.

  Run the following commands:

  curl -L https://raw.githubusercontent.com/kubeflow/pipelines/refs/heads/master/tools/k8s-native/migration.py -o migration.py
  ./.venv/bin/python migration.py --skip-tls-verify --kfp-server-host $API_URL --namespace $NAMESPACE --token "$(oc whoami --show-token)"

  Note: The --skip-tls-verify option disables certificate validation and should be used only in development environments or when connecting to a server with a self-signed certificate. In production environments, provide a valid certificate bundle instead.

  Additionally, passing the access token directly on the command line might expose it in shell history or process lists. To reduce this risk, store the token in an environment variable and reference it in your command:

  export KFP_TOKEN=$(oc whoami --show-token)
  ./.venv/bin/python migration.py --kfp-server-host $API_URL --namespace $NAMESPACE --token "$KFP_TOKEN"

  Alternatively, use a prompt with read -s to input the token securely at runtime.

- Optional: For more information about the script, run the following command:

  ./.venv/bin/python migration.py --help
- If you plan to create new or updated PipelineVersion custom resources after migration, you can compile your pipeline code by using the Kubeflow Pipelines SDK. For more information, see Compiling the pipeline YAML with the Kubeflow Pipelines SDK and Compiling Kubernetes-native manifests with the Kubeflow Pipelines SDK.
- Apply the exported Kubernetes custom resources to your cluster:

  oc apply -f ./kfp-exported-pipelines

- Change the pipeline server to use Kubernetes API storage:

  oc -n "$NAMESPACE" patch dspa "$DSPA_NAME" --type=merge -p '{"spec":{"apiServer":{"pipelineStore":"kubernetes"}}}'

  Note: To view pipelines that were stored in the internal database and not migrated, you can temporarily change the pipeline server back to database storage:

  oc -n "$NAMESPACE" patch dspa "$DSPA_NAME" --type=merge -p '{"spec":{"apiServer":{"pipelineStore":"database"}}}'
Repeat this procedure for each additional data science project that you want to migrate, changing
NAMESPACEto the appropriate project name. Optional: Clean up the local environment.
rm -rf .venv migration.py
Verification
- Check that the Pipeline and PipelineVersion custom resources were created in your project:

  $ oc -n <namespace> get pipelines.pipelines.kubeflow.org
  $ oc -n <namespace> get pipelineversions.pipelines.kubeflow.org

- Verify that the pipeline server is using Kubernetes API storage:

  $ oc -n <namespace> get dspa <dspa_name> -o jsonpath='{.spec.apiServer.pipelineStore}{"\n"}'

  The command returns kubernetes.
Additional resources
1.3. Importing a data science pipeline
To help you begin working with data science pipelines in OpenShift AI, you can import a YAML file containing your pipeline’s code to an active pipeline server, or you can import the YAML file from a URL. This file contains a Kubeflow pipeline compiled by using the Kubeflow compiler. After you have imported the pipeline to a pipeline server, you can execute the pipeline by creating a pipeline run.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- You have previously created a data science project that is available and contains a configured pipeline server.
- You have compiled your pipeline with the Kubeflow compiler and you have access to the resulting YAML file.
- If you are uploading your pipeline from a URL, the URL is publicly accessible.
- If your pipeline is defined in Python code instead of a YAML file, compile it first by using the KFP SDK. For more information, see Compiling the pipeline YAML with the Kubeflow Pipelines SDK.
Procedure
- From the OpenShift AI dashboard, click Data science pipelines → Pipelines.
- On the Pipelines page, from the Project drop-down list, select the project that you want to import a pipeline to.
- Click Import pipeline.
In the Import pipeline dialog, enter the details for the pipeline that you want to import.
- In the Pipeline name field, enter a name for the pipeline that you want to import.
- In the Pipeline description field, enter a description for the pipeline that you want to import.
Select where you want to import your pipeline from by performing one of the following actions:
- Select Upload a file to upload your pipeline from your local machine’s file system. Import your pipeline by clicking Upload, or by dragging and dropping a file.
- Select Import by url to upload your pipeline from a URL, and then enter the URL into the text box.
- Click Import pipeline.
Verification
- The pipeline that you imported is displayed on the Pipelines page and on the Pipelines tab on the project details page.
1.4. Deleting a data science pipeline
If you no longer require access to your data science pipeline on the dashboard, you can delete it so that it does not appear on the Data science pipelines page.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- There are active pipelines available on the Pipelines page.
- The pipeline that you want to delete does not contain any pipeline versions. For more information, see Deleting a pipeline version.
Procedure
- From the OpenShift AI dashboard, click Data science pipelines → Pipelines.
- On the Pipelines page, from the Project drop-down list, select the project that contains the pipeline that you want to delete.
- Click the action menu (⋮) beside the pipeline that you want to delete, and then click Delete pipeline.
- In the Delete pipeline dialog, enter the pipeline name in the text field to confirm that you intend to delete it.
- Click Delete pipeline.
Verification
- The data science pipeline that you deleted is no longer displayed on the Pipelines page.
1.5. Deleting a pipeline server
After you have finished running your data science pipelines, you can delete the pipeline server. Deleting a pipeline server automatically deletes all of its associated pipelines, pipeline versions, and runs. If your pipeline data is stored in a database, the database is also deleted, along with its metadata. In addition, after deleting a pipeline server, you cannot create new pipelines or pipeline runs until you create another pipeline server.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- You have previously created a data science project that is available and contains a pipeline server.
Procedure
- From the OpenShift AI dashboard, click Data science pipelines → Pipelines.
- On the Pipelines page, from the Project drop-down list, select the project that contains the pipeline server that you want to delete.
- From the Pipeline server actions list, select Delete pipeline server.
- In the Delete pipeline server dialog, enter the name of the pipeline server in the text field to confirm that you intend to delete it.
- Click Delete.
Verification
- Pipelines previously assigned to the deleted pipeline server no longer appear on the Pipelines page for the relevant data science project.
- Pipeline runs previously assigned to the deleted pipeline server no longer appear on the Runs page for the relevant data science project.
1.6. Viewing the details of a pipeline server
You can view the details of pipeline servers configured in OpenShift AI, such as the pipeline server's connection details and where its data is stored.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- You have previously created a data science project that contains an active and available pipeline server.
Procedure
- From the OpenShift AI dashboard, click Data science pipelines → Pipelines.
- On the Pipelines page, from the Project drop-down list, select the project that contains the pipeline server that you want to view.
- From the Pipeline server actions list, select Manage pipeline server configuration.
Verification
- You can view the pipeline server details in the Manage pipeline server dialog.
1.7. Viewing existing pipelines
You can view the details of pipelines that you have imported to Red Hat OpenShift AI, such as the pipeline’s last run, when it was created, the pipeline’s executed runs, and details of any associated pipeline versions.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- You have previously created a data science project that is available and contains a pipeline server.
- You have imported a pipeline to an active pipeline server.
- Existing pipelines are available.
Procedure
- From the OpenShift AI dashboard, click Data science pipelines → Pipelines.
- On the Pipelines page, from the Project drop-down list, select the project that contains the pipelines that you want to view.
- Optional: Click Expand on the row of a pipeline to view its pipeline versions.
Verification
- A list of data science pipelines is displayed on the Pipelines page.
1.8. Overview of pipeline versions
You can manage incremental changes to pipelines in OpenShift AI by using versioning. This allows you to develop and deploy pipelines iteratively, preserving a record of your changes. You can track and manage your changes on the OpenShift AI dashboard, allowing you to schedule and execute runs against all available versions of your pipeline.
1.9. Uploading a pipeline version
You can upload a YAML file to an active pipeline server that contains the latest version of your pipeline, or you can upload the YAML file from a URL. The YAML file must consist of a Kubeflow pipeline compiled by using the Kubeflow compiler. After you upload a pipeline version to a pipeline server, you can execute it by creating a pipeline run.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- You have previously created a data science project that is available and contains a configured pipeline server.
- You have a pipeline version available and ready to upload.
- If you are uploading your pipeline version from a URL, the URL is publicly accessible.
- If your pipeline version is based on Python code, compile it to YAML before uploading. For more information, see Compiling the pipeline YAML with the Kubeflow Pipelines SDK.
Procedure
- From the OpenShift AI dashboard, click Data science pipelines → Pipelines.
- On the Pipelines page, from the Project drop-down list, select the project that you want to upload a pipeline version to.
- Click the Import pipeline drop-down list, and then select Upload new version.
In the Upload new version dialog, enter the details for the pipeline version that you are uploading.
- From the Pipeline list, select the pipeline that you want to upload your pipeline version to.
- In the Pipeline version name field, confirm the name for the pipeline version, and change it if necessary.
- In the Pipeline version description field, enter a description for the pipeline version.
Select where you want to upload your pipeline version from by performing one of the following actions:
- Select Upload a file to upload your pipeline version from your local machine’s file system. Import your pipeline version by clicking Upload, or by dragging and dropping a file.
- Select Import by url to upload your pipeline version from a URL, and then enter the URL into the text box.
- Click Upload.
Verification
- The pipeline version that you uploaded is displayed on the Pipelines page. Click Expand on the row containing the pipeline to view its versions.
- The Version column on the row containing the pipeline version that you uploaded on the Pipelines page increments by one.
1.10. Deleting a pipeline version
You can delete specific versions of a pipeline when you no longer require them. Deleting a default pipeline version automatically changes the default pipeline version to the next most recent version. If no pipeline versions exist, the pipeline persists without a default version.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- You have previously created a data science project that is available and contains a pipeline server.
- You have imported a pipeline to an active pipeline server.
Procedure
- From the OpenShift AI dashboard, click Data science pipelines → Pipelines. The Pipelines page opens.
- Delete the pipeline versions that you no longer require:
To delete a single pipeline version:
- From the Project list, select the project that contains a version of a pipeline that you want to delete.
- On the row containing the pipeline, click Expand.
- Click the action menu (⋮) beside the pipeline version that you want to delete, and then click Delete pipeline version. The Delete pipeline version dialog opens.
- Enter the name of the pipeline version in the text field to confirm that you intend to delete it.
- Click Delete.
To delete multiple pipeline versions:
- On the row containing each pipeline version that you want to delete, select the checkbox.
- Click the action menu (⋮) next to the Import pipeline drop-down list, and then select Delete from the list.
Verification
- The pipeline version that you deleted is no longer displayed on the Pipelines page, or on the Pipelines tab for the data science project.
1.11. Viewing the details of a pipeline version
You can view the details of a pipeline version that you have uploaded to Red Hat OpenShift AI, such as its graph and YAML code.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- You have previously created a data science project that is available and contains a pipeline server.
- You have a pipeline available on an active and available pipeline server.
Procedure
- From the OpenShift AI dashboard, click Data science pipelines → Pipelines. The Pipelines page opens.
- From the Project drop-down list, select the project that contains the pipeline versions that you want to view details for.
- Click the pipeline name to view further details of its most recent version. Alternatively, click Expand on the row containing the pipeline, and then click the pipeline version that you want to view the details of. The pipeline version details page opens, displaying the Graph, Summary, and Pipeline spec tabs.
Verification
- On the pipeline version details page, you can view the pipeline graph, summary details, and YAML code.
1.12. Downloading a data science pipeline version
To make further changes to a data science pipeline version that you previously uploaded to OpenShift AI, you can download pipeline version code from the user interface.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- You have previously created a data science project that is available and contains a configured pipeline server.
- You have created and imported a pipeline to an active pipeline server that is available to download.
Procedure
- From the OpenShift AI dashboard, click Data science pipelines → Pipelines.
- On the Pipelines page, from the Project drop-down list, select the project that contains the version that you want to download.
- Click Expand beside the pipeline that contains the version that you want to download.
- Click the pipeline version that you want to download. The pipeline version details page opens.
- Click the Pipeline spec tab, and then click Download to download the YAML file that contains the pipeline version code to your local machine.
Verification
- The pipeline version code downloads to your browser’s default directory for downloaded files.
1.13. Overview of data science pipelines caching
You can use caching within data science pipelines to optimize execution times and improve resource efficiency. Caching reduces redundant task execution by reusing results from previous runs with identical inputs.
Caching is particularly beneficial for iterative tasks, where intermediate steps might not need to be repeated. Understanding caching can help you design more efficient pipelines and save time in model development.
Caching operates by storing the outputs of successfully completed tasks and comparing the inputs of new tasks against previously cached ones. If a match is found, OpenShift AI reuses the cached results instead of re-executing the task, reducing computation time and resource usage.
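The comparison of a new task against previously cached results can be thought of as a lookup keyed on a fingerprint of the task's definition and inputs. The following is a simplified, hypothetical illustration of that idea in plain Python; it is not the actual OpenShift AI implementation:

```python
import hashlib
import json

cache = {}  # fingerprint -> stored task output


def fingerprint(task_code: str, inputs: dict) -> str:
    # Hash the task definition together with its inputs; any change to
    # either produces a different key, and therefore a cache miss.
    payload = json.dumps({"code": task_code, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def run_task(task_code: str, inputs: dict, execute):
    key = fingerprint(task_code, inputs)
    if key in cache:
        return cache[key], True   # cache hit: reuse the stored output
    output = execute(inputs)      # cache miss: execute the task
    cache[key] = output
    return output, False


# Same code and inputs: the second call is served from the cache.
out1, hit1 = run_task("preprocess-v1", {"rows": 100}, lambda i: i["rows"] * 2)
out2, hit2 = run_task("preprocess-v1", {"rows": 100}, lambda i: i["rows"] * 2)
# Changed input: cache miss, the task runs again.
out3, hit3 = run_task("preprocess-v1", {"rows": 200}, lambda i: i["rows"] * 2)
```

This mirrors the criteria listed below: changing the input data, the task code, or anything else folded into the fingerprint invalidates the cached result.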
1.13.1. Caching criteria
For caching to be effective, the following criteria determine if a task can use previously cached results:
- Input data and parameters: If the input data and parameters for a task are unchanged from a previous run, cached results are eligible for reuse.
- Task code and configuration: Changes to the task code or configurations invalidate the cache to ensure that modifications are always reflected.
- Pipeline environment: Changes to the pipeline environment, such as dependency versions, also affect caching eligibility to maintain consistency.
1.13.2. Viewing cached steps in the OpenShift AI user interface
Cached steps in pipelines are visually indicated in the user interface (UI):
- Tasks that use cached results display a green icon, helping you quickly identify which steps were cached. The Status field in the side panel displays Cached for cached tasks.
- The UI also includes information about when the task was previously executed, allowing for easy verification of cache usage.
To check the caching status of specific tasks, navigate to the pipeline details view in the UI. Cached and non-cached tasks are clearly indicated. Cached tasks do not display execution logs because they reuse previously generated outputs and are not re-executed.
1.13.3. Controlling caching in data science pipelines
Caching is enabled by default in OpenShift AI to improve performance. However, there are instances when disabling caching might be necessary for specific tasks, an entire pipeline, or all pipelines. For example, caching might not be beneficial for tasks that rely on frequently updated data or unique computational needs. In other cases, such as debugging, development, or when deterministic re-execution is required, you might want to disable caching for all pipelines.
Disabling caching at the pipeline or pipeline server level causes all tasks to re-run, potentially increasing compute time and resource usage.
You can control caching for data science pipelines in the following ways:
- Individual task: Data scientists can disable caching for specific steps in a pipeline.
- Pipeline (submit time): Data scientists can disable caching when submitting a pipeline run.
- Pipeline (compile time): Data scientists can disable caching when compiling a pipeline.
- All pipelines (pipeline server): You can disable caching for all pipelines in the pipeline server, which overrides all pipeline and task-level caching settings.
1.13.3.1. Disabling caching for individual tasks
To disable caching for a particular task, call the set_caching_options method directly on the task in your pipeline code:
task_name.set_caching_options(False)
After applying this setting, OpenShift AI runs the task in future pipeline runs, ignoring any cached results.
You can re-enable caching for individual tasks by calling set_caching_options(True), or by omitting the set_caching_options call.
This setting is ignored if caching is disabled in the pipeline server.
1.13.3.2. Disabling caching for a pipeline at submit time
To disable caching for the entire pipeline during pipeline submission, set the enable_caching parameter to False in your pipeline code. This setting ensures that no steps are cached during pipeline execution. The enable_caching parameter is available only when you use the kfp.Client class to submit pipelines or start pipeline runs, for example with the run_pipeline method.
Example:
import kfp
client = kfp.Client()
client.run_pipeline(
experiment_id=experiment.id,
pipeline_id=pipeline.id,
job_name="no-cache-run",
params={}, # optional
enable_caching=False,
)
This setting is ignored if caching is disabled during pipeline compilation or in the pipeline server.
1.13.3.3. Disabling caching for a pipeline at compile time
To disable caching for the entire pipeline during compilation, set one of the following options in your local environment or workbench:
- Environment variable:

  export KFP_DISABLE_EXECUTION_CACHING_BY_DEFAULT=true

- CLI flag (when using kfp dsl compile):

  kfp dsl compile --disable-execution-caching-by-default
These settings are ignored if caching is disabled in the pipeline server.
1.13.3.4. Disabling caching for all pipelines (pipeline server)
To disable caching for all pipelines in the pipeline server and override all pipeline and task-level caching settings, use either of the following methods:
- Pipeline server configuration
  - From the OpenShift AI dashboard, click Data science pipelines → Pipelines.
  - On the Pipelines page, from the Project drop-down list, select the project that contains the pipeline server that you want to configure.
  - From the Pipeline server actions list, select Manage pipeline server configuration.
  - In the Pipeline caching section, clear the Allow caching to be configured per pipeline and task checkbox.
  - Click Save.
- Data Science Pipelines Application (cluster administrator)
  In the OpenShift console or CLI, set the cacheEnabled field to false in the DataSciencePipelinesApplication (DSPA) custom resource for the project. Example:

    apiVersion: datasciencepipelinesapplications.opendatahub.io/v1
    kind: DataSciencePipelinesApplication
    metadata:
      name: my-dspa
      namespace: my-namespace
    spec:
      apiServer:
        cacheEnabled: false

  To allow caching to be configured at the pipeline and task level, set the cacheEnabled field to true in the DSPA custom resource.
After applying this setting, all pipeline and task-level caching settings are ignored.
Changing this setting updates the CACHEENABLED environment variable in the pipeline server deployment.
Verification
After configuring caching settings, you can verify its behavior by using one of the following methods:
- Check the UI: Locate the green icons in the task list to identify cached steps.
- Test task re-runs: Disable caching on specific tasks or the pipeline to confirm that steps re-execute as expected.
- Validate inputs: Ensure the task inputs, parameters, and runtime settings are unchanged when caching is applied.
You can also disable caching for a single node or for your entire pipeline in JupyterLab using Elyra. For more information, see Disabling node caching in Elyra.