Chapter 1. Managing data science pipelines
1.1. Configuring a pipeline server
Before you can successfully create a pipeline in OpenShift AI, you must configure a pipeline server. This task includes configuring where your pipeline artifacts and data are stored.
					You are not required to specify any storage directories when configuring a connection for your pipeline server. When you import a pipeline, the /pipelines folder is created in the root folder of the bucket, containing a YAML file for the pipeline. If you upload a new version of the same pipeline, a new YAML file with a different ID is added to the /pipelines folder.
				
					When you run a pipeline, the artifacts are stored in the /pipeline-name folder in the root folder of the bucket.
				
If you use an external MySQL database and upgrade to OpenShift AI 2.9 or later, the database is migrated to data science pipelines 2.0 format, making it incompatible with earlier versions of OpenShift AI.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- If you are using OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
- You have created a data science project that you can add a pipeline server to.
- You have an existing S3-compatible object storage bucket and you have configured write access to your S3 bucket on your storage account.
- If you are configuring a pipeline server for production pipeline workloads, you have an existing external MySQL or MariaDB database.
- If you are configuring a pipeline server with an external MySQL database, your database must use at least MySQL version 5.x. However, Red Hat recommends that you use MySQL version 8.x.
  Note: The mysql_native_password authentication plugin is required for the ML Metadata component to successfully connect to your database. mysql_native_password is disabled by default in MySQL 8.4 and later. If your database uses MySQL 8.4 or later, you must update your MySQL deployment to enable the mysql_native_password plugin. For more information about enabling the mysql_native_password plugin, see Native Pluggable Authentication in the MySQL documentation.
- If you are configuring a pipeline server with a MariaDB database, your database must use MariaDB version 10.3 or later. However, Red Hat recommends that you use at least MariaDB version 10.5.
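The database version requirements above can be checked before you configure the pipeline server. The following is a minimal sketch, not part of OpenShift AI: the function names and message strings are hypothetical, and it assumes that you already know the version string that your database reports.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a string such as '10.5.22-MariaDB' into (10, 5, 22)."""
    numeric = text.split("-", 1)[0]  # drop any vendor suffix
    return tuple(int(part) for part in numeric.split("."))


def check_database(engine: str, version: str) -> list[str]:
    """Return notes about the version requirements listed in the prerequisites."""
    parsed = parse_version(version)
    notes = []
    if engine == "mysql":
        if parsed < (5,):
            notes.append("unsupported: MySQL 5.x is the minimum")
        elif parsed < (8,):
            notes.append("supported, but MySQL 8.x is recommended")
        if parsed >= (8, 4):
            # mysql_native_password is disabled by default in MySQL 8.4 and later.
            notes.append("enable the mysql_native_password plugin")
    elif engine == "mariadb":
        if parsed < (10, 3):
            notes.append("unsupported: MariaDB 10.3 is the minimum")
        elif parsed < (10, 5):
            notes.append("supported, but MariaDB 10.5 or later is recommended")
    else:
        raise ValueError(f"unknown engine: {engine}")
    return notes


if __name__ == "__main__":
    print(check_database("mysql", "8.4.0"))
    print(check_database("mariadb", "10.5.22-MariaDB"))
```

An empty list means that the version meets both the minimum and the recommended level.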
Procedure
- From the OpenShift AI dashboard, click Data Science Projects. The Data Science Projects page opens.
- Click the name of the project that you want to configure a pipeline server for. A project details page opens.
- Click the Pipelines tab.
- Click Configure pipeline server. The Configure pipeline server dialog appears.
- In the Object storage connection section, provide values for the mandatory fields:
  - In the Access key field, enter the access key ID for the S3-compatible object storage provider.
  - In the Secret key field, enter the secret access key for the S3-compatible object storage account that you specified.
  - In the Endpoint field, enter the endpoint of your S3-compatible object storage bucket.
  - In the Region field, enter the default region of your S3-compatible object storage account.
  - In the Bucket field, enter the name of your S3-compatible object storage bucket.
  Important: If you specify incorrect connection settings, you cannot update these settings on the same pipeline server. Therefore, you must delete the pipeline server and configure another one.
  If you want to use an existing artifact that was not generated by a task in a pipeline, you can use the kfp.dsl.importer component to import the artifact from its URI. You can only import these artifacts to the S3-compatible object storage bucket that you define in the Bucket field in your pipeline server configuration. For more information about the kfp.dsl.importer component, see Special Case: Importer Components.
- In the Database section, click Show advanced database options to specify the database to store your pipeline data, and then select one of the following sets of actions:
  - Select Use default database stored on your cluster to deploy a MariaDB database in your project.
    Important: The Use default database stored on your cluster option is intended for development and testing purposes only. For production pipeline workloads, select the Connect to external MySQL database option to use an external MySQL or MariaDB database.
  - Select Connect to external MySQL database to add a new connection to an external MySQL or MariaDB database that your pipeline server can access:
    - In the Host field, enter the database’s host name.
    - In the Port field, enter the database’s port.
    - In the Username field, enter the default user name that is connected to the database.
    - In the Password field, enter the password for the default user account.
    - In the Database field, enter the database name.
- Click Configure pipeline server.
Verification
On the Pipelines tab for the project:
- The Import pipeline button is available.
- When you click the action menu (⋮) and then click View pipeline server configuration, the pipeline server details are displayed.
1.1.1. Configuring a pipeline server with an external Amazon RDS database
To configure a pipeline server with an external Amazon Relational Database Service (RDS) database, you must configure OpenShift AI to trust the certificates issued by its certificate authorities (CA).
If you are configuring a pipeline server for production pipeline workloads, Red Hat recommends that you use an external MySQL or MariaDB database.
Prerequisites
- You have cluster administrator privileges for your OpenShift cluster.
- You have logged in to Red Hat OpenShift AI.
- You have created a data science project that you can add a pipeline server to.
- You have an existing S3-compatible object storage bucket, and you have configured your storage account with write access to your S3 bucket.
Procedure
- Before configuring your pipeline server, from Amazon RDS: Certificate bundles by AWS Region, download the PEM certificate bundle for the region that the database was created in. For example, if the database was created in the us-east-1 region, download us-east-1-bundle.pem.
- In a terminal window, log in to the OpenShift cluster where OpenShift AI is deployed:
  oc login api.<cluster_name>.<cluster_domain>:6443 --web
- Run the following command to fetch the current OpenShift AI trusted CA configuration and store it in a new file. The -r option makes jq write the raw certificate text instead of a JSON-quoted string:
  oc get dscinitializations.dscinitialization.opendatahub.io default-dsci -o json | jq -r '.spec.trustedCABundle.customCABundle' > /tmp/my-custom-ca-bundles.crt
- Run the following command to append the PEM certificate bundle that you downloaded to the new custom CA configuration file:
  cat us-east-1-bundle.pem >> /tmp/my-custom-ca-bundles.crt
- Run the following command to update the OpenShift AI trusted CA configuration to trust certificates issued by the CAs included in the new custom CA configuration file:
  oc patch dscinitialization default-dsci --type='json' -p='[{"op":"replace","path":"/spec/trustedCABundle/customCABundle","value":"'"$(awk '{printf "%s\\n", $0}' /tmp/my-custom-ca-bundles.crt)"'"}]'
- Configure a pipeline server, as described in Configuring a pipeline server.
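The patch step in this procedure escapes newlines by hand with awk. As an alternative sketch, the same JSON Patch document can be built in Python, where json.dumps handles the escaping. The build_ca_patch helper is hypothetical, and the certificate text in the example is a shortened placeholder rather than a real certificate.

```python
import json

CA_BUNDLE_PATH = "/spec/trustedCABundle/customCABundle"


def build_ca_patch(current_bundle: str, extra_pem: str) -> str:
    """Return a JSON Patch string that replaces customCABundle with both bundles."""
    # Skip empty bundles so that the result does not start with a blank line.
    parts = [part.strip() for part in (current_bundle, extra_pem) if part.strip()]
    combined = "\n".join(parts) + "\n"
    patch = [{"op": "replace", "path": CA_BUNDLE_PATH, "value": combined}]
    return json.dumps(patch)


if __name__ == "__main__":
    # Placeholder certificate body, for illustration only.
    pem = "-----BEGIN CERTIFICATE-----\nMIIB...\n-----END CERTIFICATE-----\n"
    print(build_ca_patch("", pem))
```

The printed string can be passed as the value of the -p option of the oc patch command that the procedure uses.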
Verification
- The pipeline server starts successfully.
- You can import and run data science pipelines.
1.2. Defining a pipeline
The Kubeflow Pipelines SDK enables you to define end-to-end machine learning and data pipelines. Use the latest Kubeflow Pipelines 2.0 SDK to build your data science pipeline in Python code. After you have built your pipeline, use the SDK to compile it into an Intermediate Representation (IR) YAML file. You can then import the YAML file to the OpenShift AI dashboard and configure its execution settings.
You can also use the Elyra JupyterLab extension to create and run data science pipelines within JupyterLab. For more information about creating pipelines in JupyterLab, see Working with pipelines in JupyterLab. For more information about the Elyra JupyterLab extension, see Elyra Documentation.
1.3. Importing a data science pipeline
To help you begin working with data science pipelines in OpenShift AI, you can import a YAML file containing your pipeline’s code to an active pipeline server, or you can import the YAML file from a URL. This file contains a Kubeflow pipeline compiled by using the Kubeflow compiler. After you have imported the pipeline to a pipeline server, you can execute the pipeline by creating a pipeline run.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- If you are using OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
- You have previously created a data science project that is available and contains a configured pipeline server.
- You have compiled your pipeline with the Kubeflow compiler and you have access to the resulting YAML file.
- If you are uploading your pipeline from a URL, the URL is publicly accessible.
Procedure
- From the OpenShift AI dashboard, click Data Science Pipelines.
- On the Pipelines page, from the Project drop-down list, select the project that you want to import a pipeline to.
- Click Import pipeline.
- In the Import pipeline dialog, enter the details for the pipeline that you want to import.
  - In the Pipeline name field, enter a name for the pipeline that you want to import.
  - In the Pipeline description field, enter a description for the pipeline that you want to import.
  - Select where you want to import your pipeline from by performing one of the following actions:
    - Select Upload a file to upload your pipeline from your local machine’s file system. Import your pipeline by clicking Upload, or by dragging and dropping a file.
    - Select Import by url to upload your pipeline from a URL, and then enter the URL into the text box.
  - Click Import pipeline.
Verification
- The pipeline that you imported appears on the Pipelines page and on the Pipelines tab on the project details page.
1.4. Deleting a data science pipeline
If you no longer require access to your data science pipeline on the dashboard, you can delete it so that it does not appear on the Data Science Pipelines page.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- If you are using OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
- There are active pipelines available on the Pipelines page.
- The pipeline that you want to delete does not contain any pipeline versions. For more information, see Deleting a pipeline version.
Procedure
- From the OpenShift AI dashboard, click Data Science Pipelines.
- On the Pipelines page, from the Project drop-down list, select the project that contains the pipeline that you want to delete.
- Click the action menu (⋮) beside the pipeline that you want to delete, and then click Delete pipeline.
- In the Delete pipeline dialog, enter the pipeline name in the text field to confirm that you intend to delete it.
- Click Delete pipeline.
Verification
- The data science pipeline that you deleted no longer appears on the Pipelines page.
1.5. Deleting a pipeline server
After you have finished running your data science pipelines, you can delete the pipeline server. Deleting a pipeline server automatically deletes all of its associated pipelines, pipeline versions, and runs. If your pipeline data is stored in a database, the database is also deleted along with its metadata. In addition, after deleting a pipeline server, you cannot create new pipelines or pipeline runs until you create another pipeline server.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- If you are using OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
- You have previously created a data science project that is available and contains a pipeline server.
Procedure
- From the OpenShift AI dashboard, click Data Science Pipelines.
- On the Pipelines page, from the Project drop-down list, select the project that contains the pipeline server that you want to delete.
- From the Pipeline server actions list, select Delete pipeline server.
- In the Delete pipeline server dialog, enter the name of the pipeline server in the text field to confirm that you intend to delete it.
- Click Delete.
Verification
- Pipelines previously assigned to the deleted pipeline server no longer appear on the Pipelines page for the relevant data science project.
- Pipeline runs previously assigned to the deleted pipeline server no longer appear on the Runs page for the relevant data science project.
1.6. Viewing the details of a pipeline server
You can view the details of pipeline servers configured in OpenShift AI, such as the pipeline server’s connection details and where its data is stored.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- You have previously created a data science project that contains an active and available pipeline server.
- If you are using OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
Procedure
- From the OpenShift AI dashboard, click Data Science Pipelines.
- On the Pipelines page, from the Project drop-down list, select the project that contains the pipeline server that you want to view.
- From the Pipeline server actions list, select View pipeline server configuration.
Verification
- You can view the pipeline server details in the View pipeline server dialog.
1.7. Viewing existing pipelines
You can view the details of pipelines that you have imported to Red Hat OpenShift AI, such as the pipeline’s last run, when it was created, the pipeline’s executed runs, and details of any associated pipeline versions.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- If you are using OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
- You have previously created a data science project that is available and contains a pipeline server.
- You have imported a pipeline to an active pipeline server.
- Existing pipelines are available.
Procedure
- From the OpenShift AI dashboard, click Data Science Pipelines.
- On the Pipelines page, from the Project drop-down list, select the project that contains the pipelines that you want to view.
- Optional: Click Expand on the row of a pipeline to view its pipeline versions.
Verification
- A list of data science pipelines appears on the Pipelines page.
1.8. Overview of pipeline versions
You can manage incremental changes to pipelines in OpenShift AI by using versioning. This allows you to develop and deploy pipelines iteratively, preserving a record of your changes. You can track and manage your changes on the OpenShift AI dashboard, allowing you to schedule and execute runs against all available versions of your pipeline.
1.9. Uploading a pipeline version
You can upload a YAML file to an active pipeline server that contains the latest version of your pipeline, or you can upload the YAML file from a URL. The YAML file must consist of a Kubeflow pipeline compiled by using the Kubeflow compiler. After you upload a pipeline version to a pipeline server, you can execute it by creating a pipeline run.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- If you are using OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
- You have previously created a data science project that is available and contains a configured pipeline server.
- You have a pipeline version available and ready to upload.
- If you are uploading your pipeline version from a URL, the URL is publicly accessible.
Procedure
- From the OpenShift AI dashboard, click Data Science Pipelines.
- On the Pipelines page, from the Project drop-down list, select the project that you want to upload a pipeline version to.
- Click the Import pipeline drop-down list, and then select Upload new version.
- In the Upload new version dialog, enter the details for the pipeline version that you are uploading.
  - From the Pipeline list, select the pipeline that you want to upload your pipeline version to.
  - In the Pipeline version name field, confirm the name for the pipeline version, and change it if necessary.
  - In the Pipeline version description field, enter a description for the pipeline version.
  - Select where you want to upload your pipeline version from by performing one of the following actions:
    - Select Upload a file to upload your pipeline version from your local machine’s file system. Import your pipeline version by clicking Upload, or by dragging and dropping a file.
    - Select Import by url to upload your pipeline version from a URL, and then enter the URL into the text box.
  - Click Upload.
Verification
- The pipeline version that you uploaded is displayed on the Pipelines page. Click Expand on the row containing the pipeline to view its versions.
- The Version column on the row containing the pipeline version that you uploaded on the Pipelines page increments by one.
1.10. Deleting a pipeline version
You can delete specific versions of a pipeline when you no longer require them. Deleting a default pipeline version automatically changes the default pipeline version to the next most recent version. If no pipeline versions exist, the pipeline persists without a default version.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- If you are using OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
- You have previously created a data science project that is available and contains a pipeline server.
- You have imported a pipeline to an active pipeline server.
Procedure
- From the OpenShift AI dashboard, click Data Science Pipelines. The Pipelines page opens.
- Delete the pipeline versions that you no longer require:
  - To delete a single pipeline version:
    - From the Project list, select the project that contains a version of a pipeline that you want to delete.
    - On the row containing the pipeline, click Expand.
    - Click the action menu (⋮) beside the pipeline version that you want to delete, and then click Delete pipeline version. The Delete pipeline version dialog opens.
    - Enter the name of the pipeline version in the text field to confirm that you intend to delete it.
    - Click Delete.
  - To delete multiple pipeline versions:
    - On the row containing each pipeline version that you want to delete, select the checkbox.
    - Click the action menu (⋮) next to the Import pipeline drop-down list, and then select Delete from the list.
Verification
- The pipeline version that you deleted no longer appears on the Pipelines page, or on the Pipelines tab for the data science project.
1.11. Viewing the details of a pipeline version
You can view the details of a pipeline version that you have uploaded to Red Hat OpenShift AI, such as its graph and YAML code.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- If you are using OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
- You have previously created a data science project that is available and contains a pipeline server.
- You have a pipeline available on an active and available pipeline server.
Procedure
- From the OpenShift AI dashboard, click Data Science Pipelines. - The Pipelines page opens. 
- From the Project drop-down list, select the project that contains the pipeline versions that you want to view details for.
- Click the pipeline name to view further details of its most recent version. The pipeline version details page opens, displaying the Graph, Summary, and Pipeline spec tabs.
  Alternatively, click Expand on the row containing the pipeline that you want to view versions for, and then click the pipeline version that you want to view the details of. The pipeline version details page opens, displaying the Graph, Summary, and Pipeline spec tabs.
Verification
- On the pipeline version details page, you can view the pipeline graph, summary details, and YAML code.
1.12. Downloading a data science pipeline version
To make further changes to a data science pipeline version that you previously uploaded to OpenShift AI, you can download pipeline version code from the user interface.
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- If you are using OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
- You have previously created a data science project that is available and contains a configured pipeline server.
- You have created and imported a pipeline to an active pipeline server that is available to download.
Procedure
- From the OpenShift AI dashboard, click Data Science Pipelines.
- On the Pipelines page, from the Project drop-down list, select the project that contains the version that you want to download.
- Click Expand beside the pipeline that contains the version that you want to download.
- Click the pipeline version that you want to download. The pipeline version details page opens.
- On the Pipeline spec tab, click the Download button to download the YAML file that contains the pipeline version code to your local machine.
Verification
- The pipeline version code downloads to your browser’s default directory for downloaded files.
1.13. Overview of data science pipelines caching
OpenShift AI supports caching within data science pipelines to optimize execution times and improve resource efficiency. Using caching reduces redundant task execution by reusing results from previous runs with identical inputs.
Caching is particularly beneficial for iterative tasks, where intermediate steps might not need to be repeated. Understanding caching can help you design more efficient pipelines and save time in model development.
Caching operates by storing the outputs of successfully completed tasks and comparing the inputs of new tasks against previously cached ones. If a match is found, OpenShift AI reuses the cached results instead of re-executing the task, reducing computation time and resource usage.
1.13.1. Caching criteria
The following criteria determine whether a task can use previously cached results:
- Input data and parameters: If the input data and parameters for a task are unchanged from a previous run, cached results are eligible for reuse.
- Task code and configuration: Changes to the task code or configurations invalidate the cache to ensure that modifications are always reflected.
- Pipeline environment: Changes to the pipeline environment, such as dependency versions, also affect caching eligibility to maintain consistency.
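The criteria above can be pictured as a fingerprint over everything that affects a task's result. The following sketch is a conceptual illustration only and does not reproduce how OpenShift AI or Kubeflow Pipelines compute cache keys; the cache_key function and its fields are hypothetical.

```python
import hashlib
import json


def cache_key(inputs: dict, task_source: str, base_image: str) -> str:
    """Combine inputs, task code, and environment into one digest."""
    payload = json.dumps(
        {"inputs": inputs, "source": task_source, "image": base_image},
        sort_keys=True,  # key order must not change the digest
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


baseline = cache_key({"epochs": 5}, "def train(): ...", "python:3.11")
same_inputs = cache_key({"epochs": 5}, "def train(): ...", "python:3.11")
new_parameter = cache_key({"epochs": 10}, "def train(): ...", "python:3.11")
new_image = cache_key({"epochs": 5}, "def train(): ...", "python:3.12")

print(baseline == same_inputs)    # True: the cached result is eligible for reuse
print(baseline == new_parameter)  # False: a changed parameter invalidates the cache
print(baseline == new_image)      # False: a changed environment invalidates the cache
```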
1.13.2. Viewing cached steps in the OpenShift AI user interface
Cached steps in pipelines are visually indicated in the user interface (UI):
- Tasks that use cached results display a green icon, helping you quickly identify which steps were cached. The Status field in the side panel displays Cached for cached tasks.
- The UI also includes information about when the task was previously executed, allowing for easy verification of cache usage.
To confirm caching status for specific tasks, navigate to the pipeline details view in the UI, where all cached and non-cached tasks are indicated. When a pipeline task is cached, its execution logs are not available. This is because the task uses previously generated outputs, eliminating the need for re-execution.
1.13.3. Disabling caching for specific tasks or pipelines
In OpenShift AI, caching is enabled by default, but there are situations where disabling caching for specific tasks or the entire pipeline is necessary. For example, tasks that rely on frequently updated data or unique computational needs might not benefit from caching.
1.13.3.1. Disabling caching for individual tasks
						To disable caching for a particular task, apply the set_caching_options method directly to the task in your pipeline code:
					
						task_name.set_caching_options(False)
					
After applying this setting, OpenShift AI executes the task in all future pipeline runs, ignoring any cached results.
							You can re-enable caching for individual tasks by setting set_caching_options(True).
						
1.13.3.2. Disabling caching for pipelines
						If necessary, you can disable caching for the entire pipeline during pipeline submission by setting the enable_caching parameter to False. This setting ensures that no steps are cached during pipeline execution. The enable_caching parameter is available only on the kfp.Client methods that submit pipelines or start pipeline runs, such as run_pipeline.
					
Example, where client is a kfp.Client instance:
						client.create_run_from_pipeline_func(pipeline_func, enable_caching=False)
					
When disabling caching at the pipeline level, all tasks are re-executed, potentially increasing compute time and resource usage.
1.13.4. Verification and troubleshooting
After configuring caching settings, you can verify that caching behaves as expected by using one of the following methods:
- Checking the UI: Confirm cached steps by locating the steps with the green icon in the task list.
- Testing task re-runs: Disable caching on individual tasks or the pipeline and check for re-execution to verify cache bypassing.
- Validating inputs: Ensure that the task inputs, parameters, and environment remain unchanged when you expect cached results to be reused.