Chapter 4. Working with data science pipelines
As a data scientist, you can enhance your data science projects on OpenShift Data Science by building portable machine learning (ML) workflows with data science pipelines, using Docker containers. Pipelines enable you to standardize and automate machine learning workflows so that you can develop and deploy your data science models.
For example, the steps in a machine learning workflow might include items such as data extraction, data processing, feature extraction, model training, model validation, and model serving. Automating these activities enables your organization to develop a continuous process of retraining and updating a model based on newly received data. This can help address challenges related to building an integrated machine learning deployment and continuously operating it in production.
You can also use the Elyra JupyterLab extension to create and run data science pipelines within JupyterLab. For more information, see Working with pipelines in JupyterLab.
A data science pipeline in OpenShift Data Science consists of the following components:
- Pipeline server: A server that is attached to your data science project and hosts your data science pipeline.
- Pipeline: A pipeline defines the configuration of your machine learning workflow and the relationship between each component in the workflow.
  - Pipeline code: A definition of your pipeline in a Tekton-formatted YAML file.
  - Pipeline graph: A graphical illustration of the steps executed in a pipeline run and the relationship between them.
- Pipeline run: An execution of your pipeline.
  - Triggered run: A previously executed pipeline run.
  - Scheduled run: A pipeline run scheduled to execute at least once.
This feature is based on Kubeflow Pipelines v1. Use the Kubeflow Pipelines SDK to build your data science pipeline in Python code. After you have built your pipeline, compile it into Tekton-formatted YAML code using the kfp-tekton SDK (version 1.5.x only). The OpenShift Data Science user interface enables you to track and manage pipelines and pipeline runs.
Before you can use data science pipelines, you must install the OpenShift Pipelines operator. For more information about installing a compatible version of the OpenShift Pipelines operator, see Red Hat OpenShift Pipelines release notes and Red Hat OpenShift Data Science: Supported Configurations.
You can store your pipeline artifacts in an Amazon Web Services (AWS) Simple Storage Service (S3) bucket so that you do not consume local storage. To do this, you must first configure write access to your S3 bucket on your AWS account.
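As a rough illustration of what write access to the bucket can look like, the following sketch builds an AWS IAM policy document for a single bucket. The bucket name is hypothetical, and the exact set of actions that a pipeline server needs is an assumption that you should confirm against your own storage requirements.

```python
import json


def build_s3_write_policy(bucket_name: str) -> dict:
    """Return an IAM policy document granting read/write access to one bucket.

    The list of actions is an assumption for illustration, not a documented
    requirement of the pipeline server.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Object-level actions apply to the keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # "my-pipeline-artifacts" is a hypothetical bucket name.
    print(json.dumps(build_s3_write_policy("my-pipeline-artifacts"), indent=2))
```

You can attach a policy of this shape to the AWS user whose access key you later enter in the data connection.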
4.1. Managing data science pipelines
4.1.1. Configuring a pipeline server
Before you can successfully create a pipeline in OpenShift Data Science, you must configure a pipeline server. This includes configuring where your pipeline artifacts and data are stored.
Prerequisites
- You have installed the OpenShift Pipelines operator.
- You have logged in to Red Hat OpenShift Data Science.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admin) in OpenShift.
- You have created a data science project that you can add a pipeline server to.
Procedure
From the OpenShift Data Science dashboard, click Data Science Projects.
The Data science projects page opens.
Click the name of the project that you want to configure a pipeline server for.
A project details page opens.
In the Pipelines section, click Create a pipeline server.
The Configure pipeline server dialog appears.
In the Object storage connection section, to specify the S3-compatible data connection to store your pipeline artifacts, select one of the following sets of actions:
Note: After the pipeline server is created, the /metadata and /artifacts folders are automatically created in the default root folder. Therefore, you are not required to specify any storage directories when configuring a data connection for your pipeline server.
- Select Existing data connection to use a data connection that you previously defined. If you selected this option, from the Name list, select the name of the relevant data connection and skip to the step that configures the Database section.
- Select Create new data connection to add a new data connection that your pipeline server can access.
If you selected Create new data connection, perform the following steps:
- In the Name field, enter a name for the data connection.
- In the AWS_ACCESS_KEY_ID field, enter your access key ID for Amazon Web Services.
- In the AWS_SECRET_ACCESS_KEY field, enter your secret access key for the account you specified.
- Optional: In the AWS_S3_ENDPOINT field, enter the endpoint of your AWS S3 storage.
- Optional: In the AWS_DEFAULT_REGION field, enter the default region of your AWS account.
- In the AWS_S3_BUCKET field, enter the name of the AWS S3 bucket.
Important: If you are creating a new data connection, in addition to the other designated mandatory fields, the AWS_S3_BUCKET field is mandatory. If you specify incorrect data connection settings, you cannot update these settings on the same pipeline server. Therefore, you must delete the pipeline server and configure another one.
In the Database section, click Show advanced database options to specify the database to store your pipeline data and select one of the following sets of actions:
- Select Use default database stored on your cluster to deploy a MariaDB database in your project.
- Select Connect to external MySQL database to add a new connection to an external database that your pipeline server can access.
  - In the Host field, enter the database’s host name.
  - In the Port field, enter the database’s port.
  - In the Username field, enter the default user name that is connected to the database.
  - In the Password field, enter the password for the default user account.
  - In the Database field, enter the database name.
- Click Configure.
Verification
- The pipeline server that you configured is displayed in the Pipelines section on the project details page.
- The Import pipeline button is available in the Pipelines section on the project details page.
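Because data connection settings cannot be changed after the pipeline server is created, it can help to check your values before you enter them. The following is a minimal sketch, assuming that you keep the values in a Python dictionary keyed by the field names from the procedure above. The secret key field is spelled AWS_SECRET_ACCESS_KEY here, following the standard AWS environment variable name, and the helper function is hypothetical.

```python
# Field names follow the data connection form described in the procedure.
REQUIRED_FIELDS = (
    "Name",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_S3_BUCKET",
)
OPTIONAL_FIELDS = ("AWS_S3_ENDPOINT", "AWS_DEFAULT_REGION")


def missing_required_fields(settings: dict) -> list:
    """Return the required field names that are absent or blank."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = settings.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


if __name__ == "__main__":
    # All values below are placeholders, not real credentials.
    draft = {
        "Name": "pipeline-artifacts",
        "AWS_ACCESS_KEY_ID": "EXAMPLE-KEY-ID",
        "AWS_SECRET_ACCESS_KEY": "EXAMPLE-SECRET",
        "AWS_S3_BUCKET": "",
    }
    print(missing_required_fields(draft))
```

An empty list means that every mandatory field has a value. The check does not confirm that the credentials or bucket name are correct.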
4.1.2. Defining a pipeline
The Kubeflow Pipelines SDK enables you to define end-to-end machine learning and data pipelines. Use the Kubeflow Pipelines SDK to build your data science pipeline in Python code. After you have built your pipeline, compile it into Tekton-formatted YAML code using the kfp-tekton SDK (version 1.5.x only). After defining the pipeline, you can import the YAML file to the OpenShift Data Science dashboard so that you can configure its execution settings. For more information about installing and using Kubeflow Pipelines SDK for Tekton, see Kubeflow Pipelines SDK for Tekton.
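The build-then-compile workflow described above might look like the following sketch, which assumes that the kfp (v1) and kfp-tekton 1.5.x packages are installed. The step functions, the pipeline name, the base image, and the output file name are all hypothetical placeholders. The SDK imports are placed inside the compile function only so that the plain Python step functions can be exercised without the SDKs.

```python
def add(a: float, b: float) -> float:
    """Return the sum of two numbers (a stand-in for a real pipeline step)."""
    return a + b


def multiply(a: float, b: float) -> float:
    """Return the product of two numbers (a second stand-in step)."""
    return a * b


def compile_pipeline(output_path: str = "example_pipeline.yaml") -> str:
    """Build a two-step pipeline and compile it to Tekton-formatted YAML."""
    # Imported here so that add() and multiply() work without the SDKs installed.
    import kfp.dsl as dsl
    from kfp.components import create_component_from_func
    from kfp_tekton.compiler import TektonCompiler

    # Wrap each Python function as a containerized pipeline component.
    # "python:3.9" is an assumed base image; use one available to your cluster.
    add_op = create_component_from_func(add, base_image="python:3.9")
    multiply_op = create_component_from_func(multiply, base_image="python:3.9")

    @dsl.pipeline(
        name="example-pipeline",
        description="Adds two numbers, then multiplies the result.",
    )
    def example_pipeline(x: float = 2.0, y: float = 3.0):
        total = add_op(x, y)
        # Passing total.output makes multiply_op run after add_op.
        multiply_op(total.output, y)

    TektonCompiler().compile(example_pipeline, output_path)
    return output_path


if __name__ == "__main__":
    print(compile_pipeline())
```

The resulting YAML file is the file that you import to a pipeline server from the dashboard.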
You can also use the Elyra JupyterLab extension to create and run data science pipelines within JupyterLab. For more information on creating pipelines in JupyterLab, see Working with pipelines in JupyterLab. For more information on the Elyra JupyterLab extension, see Elyra Documentation.
4.1.3. Importing a data science pipeline
To help you begin working with data science pipelines in OpenShift Data Science, you can import a YAML file containing your pipeline’s code to an active pipeline server. This file contains a Kubeflow pipeline compiled with the Tekton compiler. After you have imported the pipeline to a pipeline server, you can execute the pipeline by creating a pipeline run.
Prerequisites
- You have installed the OpenShift Pipelines operator.
- You have logged in to Red Hat OpenShift Data Science.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admin) in OpenShift.
- You have previously created a data science project that is available and contains a configured pipeline server.
Procedure
From the OpenShift Data Science dashboard, click Data Science Pipelines → Pipelines.
The Pipelines page opens.
- From the Project list, select the project that you want to import a pipeline to.
Click Import pipeline.
The Import pipeline dialog opens.
Enter the details for the pipeline that you are importing.
- In the Pipeline name field, enter a name for the pipeline that you are importing.
- In the Pipeline description field, enter a description for the pipeline that you are importing.
Click Upload.
A file browser opens.
- Navigate to the file containing the pipeline code and click Select. Alternatively, drag the file from your local machine’s file system and drop it in the designated area in the Import pipeline dialog.
- Click Import pipeline.
Verification
- The pipeline that you imported is displayed on the Pipelines page.
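Before you import a file, you might want to confirm that it was compiled with the Tekton compiler rather than another backend. The following heuristic sketch looks for a Tekton API version and a PipelineRun kind in the file's text. Both markers are assumptions about what the kfp-tekton compiler emits, and the check is not a full YAML validation.

```python
import re

# Assumed markers of a kfp-tekton compiled pipeline.
_API_VERSION = re.compile(r"^apiVersion:\s*tekton\.dev/\S+\s*$", re.MULTILINE)
_KIND = re.compile(r"^kind:\s*PipelineRun\s*$", re.MULTILINE)


def looks_like_tekton_pipeline(yaml_text: str) -> bool:
    """Return True if the text has top-level Tekton PipelineRun markers."""
    return bool(_API_VERSION.search(yaml_text)) and bool(_KIND.search(yaml_text))


if __name__ == "__main__":
    # "example_pipeline.yaml" is a hypothetical file name.
    with open("example_pipeline.yaml", encoding="utf-8") as handle:
        print(looks_like_tekton_pipeline(handle.read()))
```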
4.1.4. Downloading a data science pipeline
To make further changes to a data science pipeline that you previously uploaded to OpenShift Data Science, you can download the pipeline’s code from the user interface.
Prerequisites
- You have installed the OpenShift Pipelines operator.
- You have logged in to Red Hat OpenShift Data Science.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admin) in OpenShift.
- You have previously created a data science project that is available and contains a configured pipeline server.
- You have created and imported a pipeline to an active pipeline server that is available to download.
Procedure
From the OpenShift Data Science dashboard, click Data Science Pipelines → Pipelines.
The Pipelines page opens.
- From the Project list, select the project that contains the pipeline that you want to download.
In the Pipeline name column, click the name of the pipeline that you want to download.
The Pipeline details page opens displaying the Graph tab.
Click the YAML tab.
The page reloads to display an embedded YAML editor showing the pipeline code.
- Click the Download button to download the YAML file containing your pipeline’s code to your local machine.
Verification
- The pipeline code is downloaded to your browser’s default directory for downloaded files.
4.1.5. Deleting a data science pipeline
You can delete data science pipelines so that they do not appear on the OpenShift Data Science Pipelines page.
Prerequisites
- You have installed the OpenShift Pipelines operator.
- You have logged in to Red Hat OpenShift Data Science.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admin) in OpenShift.
- There are active pipelines available on the Pipelines page.
Procedure
From the OpenShift Data Science dashboard, click Data Science Pipelines → Pipelines.
The Pipelines page opens.
- From the Project list, select the project that contains the pipeline that you want to delete.
Click the action menu (⋮) beside the pipeline that you want to delete and click Delete pipeline.
The Delete pipeline dialog opens.
- Enter the pipeline name in the text field to confirm that you intend to delete it.
- Click Delete pipeline.
Verification
- The data science pipeline that you deleted is no longer displayed on the Pipelines page.
4.1.6. Deleting a pipeline server
After you have finished running your data science pipelines, you can delete the pipeline server. Deleting a pipeline server automatically deletes all of its associated pipelines and runs. If your pipeline data is stored in a database, the database is also deleted along with its metadata. In addition, after deleting a pipeline server, you cannot create new pipelines or pipeline runs until you create another pipeline server.
Prerequisites
- You have logged in to Red Hat OpenShift Data Science.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admin) in OpenShift.
- You have previously created a data science project that is available and contains a pipeline server.
Procedure
From the OpenShift Data Science dashboard, click Data Science Pipelines → Pipelines.
The Pipelines page opens.
- From the Project list, select the project whose pipeline server you want to delete.
- From the Pipeline server actions list, select Delete pipeline server. The Delete pipeline server dialog opens.
- Enter the pipeline server’s name in the text field to confirm that you intend to delete it.
- Click Delete.
Verification
- Pipelines previously assigned to the deleted pipeline server are no longer displayed on the Pipelines page for the relevant data science project.
- Pipeline runs previously assigned to the deleted pipeline server are no longer displayed on the Runs page for the relevant data science project.
4.1.7. Viewing the details of a pipeline server
You can view the details of pipeline servers configured in OpenShift Data Science, such as the pipeline server’s data connection details and where its data is stored.
Prerequisites
- You have installed the OpenShift Pipelines operator.
- You have logged in to Red Hat OpenShift Data Science.
- You have previously created a data science project that contains an active and available pipeline server.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admins) in OpenShift.
Procedure
From the OpenShift Data Science dashboard, click Data Science Pipelines → Pipelines.
The Pipelines page opens.
- From the Project list, select the project whose pipeline server you want to view.
- From the Pipeline server actions list, select View pipeline server configuration.
- When you have finished inspecting the pipeline server’s details, click Done.
Verification
- You can view the relevant pipeline server’s details in the View pipeline server dialog.
4.1.8. Viewing existing pipelines
You can view the details of pipelines that you have imported to Red Hat OpenShift Data Science, such as the pipeline’s last run, when it was created, and the pipeline’s executed runs.
Prerequisites
- You have installed the OpenShift Pipelines operator.
- You have logged in to Red Hat OpenShift Data Science.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admins) in OpenShift.
- You have previously created a data science project that is available and contains a pipeline server.
- You have imported a pipeline to an active and available pipeline server.
- The pipeline you imported is available, or there are other previously imported pipelines available to view.
Procedure
From the OpenShift Data Science dashboard, click Data Science Pipelines → Pipelines.
The Pipelines page opens.
- From the Project list, select the relevant project whose pipelines you want to view.
- Study the pipelines on the list.
- Optional: Click Expand on the relevant row to view the pipeline’s executed runs. If the pipeline does not contain any runs, click Create run to create one.
Verification
- A list of previously created data science pipelines is displayed on the Pipelines page.
4.2. Managing pipeline runs
4.2.1. Overview of pipeline runs
A pipeline run is a single execution of a data science pipeline. As a data scientist, you can use OpenShift Data Science to define, manage, and track executions of a data science pipeline. You can view a record of your data science project’s previously executed and scheduled runs from the Runs page in the OpenShift Data Science user interface.
Runs are intended for portability. Therefore, you can clone your pipeline runs to reproduce and scale them accordingly, or delete them when you no longer require them. You can configure a run to execute only once immediately after creation or on a recurring basis. Recurring runs consist of a copy of a pipeline with all of its parameter values and a run trigger. A run trigger indicates when a recurring run executes. You can define the following run triggers:
- Periodic: used for scheduling runs to execute in intervals.
- Cron: used for scheduling runs as a cron job.
When a run executes, you can track its progress from the run’s Details page in the OpenShift Data Science user interface. From here, you can view the run’s graph and output artifacts.
A pipeline run can be classified as the following:
- Scheduled run: A pipeline run scheduled to execute at least once.
- Triggered run: A previously executed pipeline run.
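To make the idea of a periodic run trigger more concrete, the following sketch lists the times at which a run would execute for a given interval between a start and an end date. It assumes that the first execution happens at the start time and that the end time is inclusive. The real scheduler might behave differently, so treat this only as a way to reason about a schedule.

```python
from datetime import datetime, timedelta


def periodic_run_times(start: datetime, end: datetime, interval: timedelta) -> list:
    """Return the execution times from start to end (inclusive) at a fixed interval."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    times = []
    current = start
    while current <= end:
        times.append(current)
        current += interval
    return times


if __name__ == "__main__":
    # Hypothetical schedule: every hour over a three-hour window.
    for moment in periodic_run_times(
        datetime(2024, 1, 1, 0, 0),
        datetime(2024, 1, 1, 3, 0),
        timedelta(hours=1),
    ):
        print(moment.isoformat())
```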
4.2.2. Scheduling a pipeline run
You can instantiate a single execution of a pipeline by scheduling a pipeline run. In OpenShift Data Science, you can schedule runs to occur at specific times or execute them immediately after creation.
Prerequisites
- You have installed the OpenShift Pipelines operator.
- You have logged in to Red Hat OpenShift Data Science.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admins) in OpenShift.
- You have previously created a data science project that is available and contains a configured pipeline server.
- You have imported a pipeline to an active pipeline server.
Procedure
From the OpenShift Data Science dashboard, click Data Science Pipelines → Pipelines.
The Pipelines page opens.
Click the action menu (⋮) beside the relevant pipeline and click Create run.
The Create run page opens.
- From the Project list, select the project that contains the pipeline you want to create a run for.
- In the Name field, enter a name for the run.
- In the Description field, enter a description for the run.
- From the Pipeline list, select the pipeline to create a run for. Alternatively, to upload a new pipeline, click Upload new pipeline and fill in the relevant fields in the Import pipeline dialog.
Configure the run type by performing one of the following sets of actions:
- Select Run once immediately after creation to specify that the run executes once, immediately after its creation.
Select Schedule recurring run to schedule the run to recur.
Configure the run’s trigger type.
- Select Periodic and select the execution frequency from the list.
- Select Cron to specify the execution schedule in cron format. This creates a cron job to execute the run. Click the Copy button to copy the cron job schedule to the clipboard. The field furthest to the left represents seconds. For more information about scheduling tasks using the supported cron format, see Cron Expression Format.
Configure the run’s duration.
- Select the Start date check box to specify a start date for the run. Select the run’s start date using the Calendar and the start time from the list of times.
- Select the End date check box to specify an end date for the run. Select the run’s end date using the Calendar and the end time from the list of times.
- Configure the input parameters for the run by selecting the parameters from the list.
- Click Create.
Verification
- The pipeline run that you created is shown in the Scheduled tab on the Runs page.
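Because the leftmost field of the cron schedule represents seconds, a schedule has six fields rather than the five used by a traditional crontab entry. The following sketch splits an expression into named fields so that you can check each value before you paste it into the Cron field. The order of the fields after seconds is an assumption based on the conventional layout, and the function is a hypothetical helper, not part of any SDK.

```python
# Assumed field order; only the position of "seconds" is stated in the procedure.
FIELD_NAMES = ("seconds", "minutes", "hours", "day_of_month", "month", "day_of_week")


def split_cron_expression(expression: str) -> dict:
    """Split a six-field cron expression into a mapping of field name to value."""
    fields = expression.split()
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(
            f"expected {len(FIELD_NAMES)} fields (seconds first), got {len(fields)}"
        )
    return dict(zip(FIELD_NAMES, fields))


if __name__ == "__main__":
    # Under the assumed order: second 0, minute 30, hour 2, every day.
    print(split_cron_expression("0 30 2 * * *"))
```

A five-field expression copied from a standard crontab raises an error here, which is a reminder to add the seconds field at the front.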
4.2.3. Cloning a scheduled pipeline run
To make it easier to schedule runs to execute as part of your pipeline configuration, you can duplicate existing scheduled runs by cloning them.
Prerequisites
- You have installed the OpenShift Pipelines operator.
- You have logged in to Red Hat OpenShift Data Science.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admin) in OpenShift.
- You have previously created a data science project that is available and contains a configured pipeline server.
- You have imported a pipeline to an active pipeline server.
- You have previously scheduled a run that is available to clone.
Procedure
From the OpenShift Data Science dashboard, click Data Science Pipelines → Runs.
The Runs page opens.
Click the action menu (⋮) beside the relevant run and click Clone.
The Clone page opens.
- From the Project list, select the project that contains the pipeline whose run you want to clone.
- In the Name field, enter a name for the run that you want to clone.
- In the Description field, enter a description for the run that you want to clone.
- From the Pipeline list, select the pipeline containing the run that you want to clone.
To configure the run type for the run that you are cloning, in the Run type section, perform one of the following sets of actions:
- Select Run once immediately after create to specify that the run that you are cloning executes once, immediately after its creation. If you selected this option, skip to the step that configures the input parameters in the Parameters section.
- Select Schedule recurring run to schedule the run that you are cloning to recur.
If you selected Schedule recurring run in the previous step, configure the trigger type for the run by performing one of the following actions:
- Select Periodic and select the execution frequency from the Run every list.
- Select Cron to specify the execution schedule in cron format. This creates a cron job to execute the run. Click the Copy button to copy the cron job schedule to the clipboard. The field furthest to the left represents seconds. For more information about scheduling tasks using the supported cron format, see Cron Expression Format.
If you selected Schedule recurring run as the run type, configure the duration for the run that you are cloning.
- Select the Start date check box to specify a start date for the run. Select the start date using the calendar tool and the start time from the list of times.
- Select the End date check box to specify an end date for the run. Select the end date using the calendar tool and the end time from the list of times.
- In the Parameters section, configure the input parameters for the run that you are cloning by selecting the appropriate parameters from the list.
- Click Create.
Verification
- The pipeline run that you cloned is shown in the Scheduled tab on the Runs page.
4.2.4. Stopping a triggered pipeline run
If you no longer require a triggered pipeline run to continue executing, you can stop the run before its defined end date.
Prerequisites
- You have installed the OpenShift Pipelines operator.
- You have logged in to Red Hat OpenShift Data Science.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admins) in OpenShift.
- There is a previously created data science project available that contains a pipeline server.
- You have imported a pipeline to an active and available pipeline server.
- You have previously triggered a pipeline run.
Procedure
From the OpenShift Data Science dashboard, click Data Science Pipelines → Runs.
The Runs page opens.
- From the Project list, select the project whose pipeline runs you want to stop.
- Click the Triggered tab.
In the Name column in the table, click the name of the run that you want to stop.
The Run details page opens.
From the Actions list, select Stop run.
There might be a short delay while the run stops.
Verification
- A list of previously triggered runs is displayed in the Triggered tab on the Runs page.
4.2.5. Deleting a scheduled pipeline run
To discard pipeline runs that you previously scheduled, but no longer require, you can delete them so that they do not appear on the Runs page.
Prerequisites
- You have installed the OpenShift Pipelines operator.
- You have logged in to Red Hat OpenShift Data Science.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admin) in OpenShift.
- You have previously created a data science project that is available and contains a configured pipeline server.
- You have imported a pipeline to an active pipeline server.
- You have previously scheduled a run that is available to delete.
Procedure
From the OpenShift Data Science dashboard, click Data Science Pipelines → Runs.
The Runs page opens.
From the Project list, select the project that contains the pipeline whose scheduled run you want to delete.
The page refreshes to show the pipeline’s scheduled runs on the Scheduled tab.
Click the action menu (⋮) beside the scheduled run that you want to delete and click Delete.
The Delete scheduled run dialog opens.
- Enter the run’s name in the text field to confirm that you intend to delete it.
- Click Delete scheduled run.
Verification
- The run that you deleted is no longer displayed on the Scheduled tab.
4.2.6. Deleting a triggered pipeline run
To discard pipeline runs that you previously executed, but no longer require a record of, you can delete them so that they do not appear on the Triggered tab on the Runs page.
Prerequisites
- You have installed the OpenShift Pipelines operator.
- You have logged in to Red Hat OpenShift Data Science.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admin) in OpenShift.
- You have previously created a data science project that is available and contains a configured pipeline server.
- You have imported a pipeline to an active pipeline server.
- You have previously executed a run that is available to delete.
Procedure
From the OpenShift Data Science dashboard, click Data Science Pipelines → Runs.
The Runs page opens.
From the Project list, select the project that contains the pipeline whose triggered run you want to delete.
The page refreshes to show the pipeline’s triggered runs on the Triggered tab.
Click the action menu (⋮) beside the triggered run that you want to delete and click Delete.
The Delete triggered run dialog opens.
- Enter the run’s name in the text field to confirm that you intend to delete it.
- Click Delete triggered run.
Verification
- The run that you deleted is no longer displayed on the Triggered tab.
4.2.7. Viewing scheduled pipeline runs
You can view a list of pipeline runs that are scheduled for execution in OpenShift Data Science. From this list, you can view details relating to your pipeline’s runs, such as the pipeline that the run belongs to. You can also view the run’s status, execution frequency, and schedule.
Prerequisites
- You have logged in to Red Hat OpenShift Data Science.
- You have installed the OpenShift Pipelines operator.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admin) in OpenShift.
- You have previously created a data science project that is available and contains a pipeline server.
- You have imported a pipeline to an active and available pipeline server.
- You have created and scheduled a pipeline run.
Procedure
From the OpenShift Data Science dashboard, click Data Science Pipelines → Runs.
The Runs page opens.
- From the Project list, select the project whose scheduled pipeline runs you want to view.
- Click the Scheduled tab.
Study the table showing a list of scheduled runs.
After a run has been scheduled, the run’s status is displayed in the Status column in the table, indicating whether the run is ready for execution or unavailable for execution. To enable or disable a scheduled run, click the toggle on the row containing the relevant run.
Verification
- A list of scheduled runs is displayed in the Scheduled tab on the Runs page.
4.2.8. Viewing triggered pipeline runs
You can view a list of pipeline runs that were previously executed in OpenShift Data Science. From this list, you can view details relating to your pipeline’s runs, such as the pipeline that the run belongs to, along with the run’s status, duration, and execution start time.
Prerequisites
- You have logged in to Red Hat OpenShift Data Science.
- You have installed the OpenShift Pipelines operator.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admin) in OpenShift.
- You have previously created a data science project that is available and contains a pipeline server.
- You have imported a pipeline to an active and available pipeline server.
- You have previously triggered a pipeline run.
Procedure
From the OpenShift Data Science dashboard, click Data Science Pipelines → Runs.
The Runs page opens.
From the Project list, select the project whose previously executed pipeline runs you want to view.
The Run details page opens.
Click the Triggered tab.
A table opens that shows a list of triggered runs. After a run has completed its execution, the run’s status is displayed in the Status column in the table, indicating whether the run has succeeded or failed.
Verification
- A list of previously triggered runs is displayed in the Triggered tab on the Runs page.
4.2.9. Viewing the details of a pipeline run
To gain a clearer understanding of your pipeline runs, you can view the details of a previously triggered pipeline run, such as its graph, execution details, and run output.
Prerequisites
- You have installed the OpenShift Pipelines operator.
- You have logged in to Red Hat OpenShift Data Science.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admins) in OpenShift.
- You have previously created a data science project that is available and contains a pipeline server.
- You have imported a pipeline to an active and available pipeline server.
- You have previously triggered a pipeline run.
Procedure
From the OpenShift Data Science dashboard, click Data Science Pipelines → Pipelines.
The Pipelines page opens.
- From the Project list, select the project whose pipeline runs you want to view.
- For a pipeline that you want to see run details for, click Expand.
From the Runs section, click the name of the run that you want to view the details of.
The Run details page opens.
Verification
- On the Run details page, you can view the run’s graph, execution details, input parameters, and run output.
4.3. Working with pipelines in JupyterLab
4.3.1. Overview of pipelines in JupyterLab
You can use Elyra to create visual end-to-end pipeline workflows in JupyterLab. Elyra is an extension for JupyterLab that provides you with a Pipeline Editor to create pipeline workflows that can be executed in OpenShift Data Science.
Before you can work with pipelines in JupyterLab, you must install the OpenShift Pipelines operator. For more information about installing a compatible version of the OpenShift Pipelines operator, see Red Hat OpenShift Pipelines release notes and Red Hat OpenShift Data Science: Supported Configurations.
You can access the Elyra extension within JupyterLab when you use the most recent version of one of the following notebook images:
- Standard Data Science
- PyTorch
- TensorFlow
- TrustyAI
Because you can use the Pipeline Editor to visually design your pipelines, minimal coding is required to create and run pipelines. For more information about Elyra, see Elyra Documentation. For more information on the Pipeline Editor, see Visual Pipeline Editor. After you have created your pipeline, you can run it locally in JupyterLab, or remotely using data science pipelines in OpenShift Data Science.
The pipeline creation process consists of the following tasks:
- Create a data science project that contains a workbench.
- Create a pipeline server.
- Create a new pipeline in the Pipeline Editor in JupyterLab.
- Develop your pipeline by adding Python notebooks or Python scripts and defining their runtime properties.
- Define execution dependencies.
- Run or export your pipeline.
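Defining execution dependencies amounts to describing which notebooks or scripts must finish before others start. The Pipeline Editor handles this visually, but the underlying idea can be sketched in plain Python as a dependency graph that is sorted into a valid execution order. The file names below are hypothetical, and the sketch uses the standard library graphlib module (Python 3.9 or later) rather than anything from Elyra.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies: dict) -> list:
    """Return the nodes in an order that runs every dependency first.

    The mapping has the form {node: {nodes that must run before it}}.
    A graphlib.CycleError is raised if the dependencies form a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Hypothetical three-step workflow: prepare data, train, then evaluate.
    steps = {
        "prepare_data.py": set(),
        "train_model.ipynb": {"prepare_data.py"},
        "evaluate_model.py": {"train_model.ipynb"},
    }
    print(execution_order(steps))
```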
Before you can run a pipeline in JupyterLab, your pipeline instance must contain a runtime configuration. A runtime configuration defines connectivity information for your pipeline instance and S3-compatible cloud storage.
If you create a workbench as part of a data science project, a default runtime configuration is created automatically. However, if you create a notebook from the Jupyter tile in the OpenShift Data Science dashboard, you must create a runtime configuration before you can run your pipeline in JupyterLab. For more information about runtime configurations, see Runtime Configuration. As a prerequisite, before you create a workbench, ensure that you have created and configured a pipeline server within the same data science project as your workbench.
You can use S3-compatible cloud storage to make data available to your notebooks and scripts while they are executed. Your cloud storage must be accessible from the machine in your deployment that runs JupyterLab and from the cluster that hosts Data Science Pipelines. Before you create and run pipelines in JupyterLab, ensure that you have your S3-compatible storage credentials readily available.
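As a quick way to confirm that your storage details are at hand before you start, you could run a small check in a notebook cell. The following Python sketch assumes that the details are held in environment variables; the variable names (S3_ENDPOINT, S3_BUCKET, S3_ACCESS_KEY_ID, S3_SECRET_ACCESS_KEY) are hypothetical and chosen only for illustration. The check confirms that values are present and that the endpoint looks like a URL. It does not contact the storage service.

```python
import os
from urllib.parse import urlparse

# Hypothetical variable names; substitute whatever your environment uses.
REQUIRED = ("S3_ENDPOINT", "S3_BUCKET", "S3_ACCESS_KEY_ID", "S3_SECRET_ACCESS_KEY")


def missing_storage_settings(env):
    """Return the names of required storage settings that are absent or empty."""
    return [name for name in REQUIRED if not env.get(name)]


def is_valid_endpoint(url):
    """Accept only http(s) URLs that include a host, such as https://s3.example.com."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


missing = missing_storage_settings(os.environ)
if missing:
    print("Missing storage settings:", ", ".join(missing))
elif not is_valid_endpoint(os.environ["S3_ENDPOINT"]):
    print("S3_ENDPOINT does not look like an http(s) URL")
else:
    print("Storage settings look complete")
```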
4.3.2. Accessing the pipeline editor
You can use Elyra to create visual end-to-end pipeline workflows in JupyterLab. Elyra is an extension for JupyterLab that provides you with a Pipeline Editor to create pipeline workflows that can be executed in OpenShift Data Science.
Prerequisites
- You have installed the OpenShift Pipelines operator.
- You have logged in to Red Hat OpenShift Data Science.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admins) in OpenShift.
- You have created a data science project that contains a workbench.
- You have created and configured a pipeline server within the data science project that contains your workbench.
- You have created and launched a Jupyter server from a notebook image that contains the Elyra extension (Standard Data Science, TensorFlow, TrustyAI, or PyTorch).
- You have access to S3-compatible storage.
Procedure
- After you open JupyterLab, confirm that the JupyterLab launcher is automatically displayed.
- In the Elyra section of the JupyterLab launcher, click the Pipeline Editor tile. The Pipeline Editor opens.
Verification
- You can view the Pipeline Editor in JupyterLab.
4.3.3. Creating a runtime configuration
If you create a workbench as part of a data science project, a default runtime configuration is created automatically. However, if you create a notebook from the Jupyter tile in the OpenShift Data Science dashboard, you must create a runtime configuration before you can run your pipeline in JupyterLab. This enables you to specify connectivity information for your pipeline instance and S3-compatible cloud storage.
Prerequisites
- You have installed the OpenShift Pipelines operator.
- You have logged in to Red Hat OpenShift Data Science.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admins) in OpenShift.
- You have access to S3-compatible cloud storage.
- You have created a data science project that contains a workbench.
- You have created and configured a pipeline server within the data science project that contains your workbench.
- You have created and launched a Jupyter server from a notebook image that contains the Elyra extension (Standard Data Science, TensorFlow, TrustyAI, or PyTorch).
Procedure
- In the left sidebar of JupyterLab, click Runtimes.
- Click the Create new runtime configuration button. The Add new Data Science Pipelines runtime configuration page opens.
Fill in the relevant fields to define your runtime configuration.
- In the Display Name field, enter a name for your runtime configuration.
- Optional: In the Description field, enter a description to define your runtime configuration.
- Optional: In the Tags field, click Add Tag to define a category for your pipeline instance. Enter a name for the tag and press Enter.
Define the credentials of your data science pipeline:
- In the Data Science Pipelines API Endpoint field, enter the API endpoint of your data science pipeline. Do not specify the pipelines namespace in this field.
- In the Public Data Science Pipelines API Endpoint field, enter the public API endpoint of your data science pipeline.
  Important: You can obtain the Data Science Pipelines API endpoint from the Runs page under Data Science Pipelines in the dashboard. Copy the relevant endpoint and enter it in the Public Data Science Pipelines API Endpoint field.
- Optional: In the Data Science Pipelines User Namespace field, enter the relevant user namespace to run pipelines.
- From the Data Science Pipelines engine list, select Tekton.
- From the Authentication Type list, select the authentication type required to authenticate your pipeline.
  Important: If you created a notebook directly from the Jupyter tile on the dashboard, select EXISTING_BEARER_TOKEN from the Authentication Type list.
- In the Data Science Pipelines API Endpoint Username field, enter the user name required for the authentication type.
- In the Data Science Pipelines API Endpoint Password Or Token field, enter the password or token required for the authentication type.
  Important: To obtain the Data Science Pipelines API endpoint token, in the upper-right corner of the OpenShift web console, click your user name and select Copy login command. After you have logged in, click Display token and copy the value of --token= from the Log in with this token command.
Define the connectivity information of your S3-compatible storage:
- In the Cloud Object Storage Endpoint field, enter the endpoint of your S3-compatible storage. For more information about Amazon S3 endpoints, see Amazon Simple Storage Service endpoints and quotas.
- Optional: In the Public Cloud Object Storage Endpoint field, enter the URL of your S3-compatible storage.
- In the Cloud Object Storage Bucket Name field, enter the name of the bucket where your pipeline artifacts are stored. If the bucket does not exist, it is created automatically.
- From the Cloud Object Storage Authentication Type list, select the authentication type required to access your S3-compatible cloud storage. If you use AWS S3 buckets, select KUBERNETES_SECRET from the list.
- In the Cloud Object Storage Credentials Secret field, enter the name of the secret that contains the storage user name and password. This secret is defined in the relevant user namespace, if applicable. In addition, it must be stored on the cluster that hosts your pipeline runtime.
- In the Cloud Object Storage Username field, enter the user name to connect to your S3-compatible cloud storage, if applicable. If you use AWS S3 buckets, enter your AWS Access Key ID.
- In the Cloud Object Storage Password field, enter the password to connect to your S3-compatible cloud storage, if applicable. If you use AWS S3 buckets, enter your AWS Secret Access Key.
- Click Save & Close.
Verification
- The runtime configuration that you created is shown in the Runtimes tab in the left sidebar of JupyterLab.
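It can help to think of a runtime configuration as a small structured record with two groups of values: how to reach the pipeline server and how to reach object storage. The following Python sketch models that record and runs a few sanity checks before you type the values into the form. The attribute names are illustrative labels derived from the form fields, not a documented schema, and the checks are plausible rules inferred from the field descriptions above, not the validation that the form itself performs.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class RuntimeConfig:
    # Attribute names are illustrative; they mirror the form labels only.
    display_name: str
    api_endpoint: str
    cos_endpoint: str
    cos_bucket: str
    engine: str = "Tekton"
    auth_type: str = "EXISTING_BEARER_TOKEN"
    api_password_or_token: str = ""
    cos_auth_type: str = "KUBERNETES_SECRET"
    cos_secret: str = ""
    cos_username: str = ""
    cos_password: str = ""

    def problems(self):
        """Return human-readable problems; an empty list means the values look complete."""
        found = []
        for name in ("display_name", "api_endpoint", "cos_endpoint", "cos_bucket"):
            if not getattr(self, name).strip():
                found.append(f"{name} is required")
        for name in ("api_endpoint", "cos_endpoint"):
            value = getattr(self, name)
            if value and urlparse(value).scheme not in ("http", "https"):
                found.append(f"{name} should be an http(s) URL")
        # Assumed rules: each authentication type needs its matching credential.
        if self.auth_type == "EXISTING_BEARER_TOKEN" and not self.api_password_or_token:
            found.append("a token is required for EXISTING_BEARER_TOKEN")
        if self.cos_auth_type == "KUBERNETES_SECRET" and not self.cos_secret:
            found.append("cos_secret is required for KUBERNETES_SECRET")
        return found


# All values below are placeholders.
config = RuntimeConfig(
    display_name="my-runtime",
    api_endpoint="https://ds-pipeline.example.com",
    cos_endpoint="https://s3.example.com",
    cos_bucket="pipeline-artifacts",
    api_password_or_token="sha256~example",
    cos_secret="storage-credentials",
)
print(config.problems())
```

Collecting every problem in a list, instead of stopping at the first one, lets you correct all of the fields in a single pass.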
4.3.4. Updating a runtime configuration
To ensure that your runtime configuration is accurate and up to date, you can change the settings of an existing runtime configuration.
Prerequisites
- You have installed the OpenShift Pipelines operator.
- You have logged in to Red Hat OpenShift Data Science.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admins) in OpenShift.
- You have access to S3-compatible storage.
- You have created a data science project that contains a workbench.
- You have created and configured a pipeline server within the data science project that contains your workbench.
- A previously created runtime configuration is available in the JupyterLab interface.
- You have created and launched a Jupyter server from a notebook image that contains the Elyra extension (Standard Data Science, TensorFlow, TrustyAI, or PyTorch).
Procedure
- In the left sidebar of JupyterLab, click Runtimes.
- Hover the cursor over the runtime configuration that you want to update and click the Edit button. The Data Science Pipelines runtime configuration page opens.
Fill in the relevant fields to update your runtime configuration.
- In the Display Name field, update the name for your runtime configuration, if applicable.
- Optional: In the Description field, update the description of your runtime configuration, if applicable.
- Optional: In the Tags field, click Add Tag to define a category for your pipeline instance. Enter a name for the tag and press Enter.
Define the credentials of your data science pipeline:
- In the Data Science Pipelines API Endpoint field, update the API endpoint of your data science pipeline, if applicable. Do not specify the pipelines namespace in this field.
- In the Public Data Science Pipelines API Endpoint field, update the API endpoint of your data science pipeline, if applicable.
- Optional: In the Data Science Pipelines User Namespace field, update the relevant user namespace to run pipelines, if applicable.
- From the Data Science Pipelines engine list, select Tekton.
- From the Authentication Type list, select a new authentication type required to authenticate your pipeline, if applicable.
  Important: If you created a notebook directly from the Jupyter tile on the dashboard, select EXISTING_BEARER_TOKEN from the Authentication Type list.
- In the Data Science Pipelines API Endpoint Username field, update the user name required for the authentication type, if applicable.
- In the Data Science Pipelines API Endpoint Password Or Token field, update the password or token required for the authentication type, if applicable.
  Important: To obtain the Data Science Pipelines API endpoint token, in the upper-right corner of the OpenShift web console, click your user name and select Copy login command. After you have logged in, click Display token and copy the value of --token= from the Log in with this token command.
Define the connectivity information of your S3-compatible storage:
- In the Cloud Object Storage Endpoint field, update the endpoint of your S3-compatible storage, if applicable. For more information about Amazon S3 endpoints, see Amazon Simple Storage Service endpoints and quotas.
- Optional: In the Public Cloud Object Storage Endpoint field, update the URL of your S3-compatible storage, if applicable.
- In the Cloud Object Storage Bucket Name field, update the name of the bucket where your pipeline artifacts are stored, if applicable. If the bucket does not exist, it is created automatically.
- From the Cloud Object Storage Authentication Type list, update the authentication type required to access your S3-compatible cloud storage, if applicable. If you use AWS S3 buckets, you must select USER_CREDENTIALS from the list.
- Optional: In the Cloud Object Storage Credentials Secret field, update the name of the secret that contains the storage user name and password, if applicable. This secret is defined in the relevant user namespace. You must save the secret on the cluster that hosts your pipeline runtime.
- Optional: In the Cloud Object Storage Username field, update the user name to connect to your S3-compatible cloud storage, if applicable. If you use AWS S3 buckets, update your AWS Access Key ID.
- Optional: In the Cloud Object Storage Password field, update the password to connect to your S3-compatible cloud storage, if applicable. If you use AWS S3 buckets, update your AWS Secret Access Key.
- Click Save & Close.
Verification
- The runtime configuration that you updated is shown in the Runtimes tab in the left sidebar of JupyterLab.
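When you refresh the token in a runtime configuration, the value that you need is the part of the copied login command that follows --token=. The following Python sketch shows one way to pull that value out of the command text so that you do not paste surrounding characters by mistake. The token and server values in the example are made up, and the helper name extract_token is hypothetical.

```python
import shlex


def extract_token(login_command):
    """Return the value of --token from an `oc login` command line, or None if absent."""
    for part in shlex.split(login_command):
        if part.startswith("--token="):
            return part[len("--token="):]
    return None


# Placeholder command; the token and server values are made up.
command = "oc login --token=sha256~EXAMPLE-TOKEN --server=https://api.cluster.example.com:6443"
print(extract_token(command))
```

Treat the token like a password: avoid saving it in notebooks or committing it to version control.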
4.3.5. Deleting a runtime configuration
After you have finished using your runtime configuration, you can delete it from the JupyterLab interface. After deleting a runtime configuration, you cannot run pipelines in JupyterLab until you create another runtime configuration.
Prerequisites
- You have installed the OpenShift Pipelines operator.
- You have logged in to Red Hat OpenShift Data Science.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admins) in OpenShift.
- You have created a data science project that contains a workbench.
- You have created and configured a pipeline server within the data science project that contains your workbench.
- A previously created runtime configuration is visible in the JupyterLab interface.
- You have created and launched a Jupyter server from a notebook image that contains the Elyra extension (Standard Data Science, TensorFlow, TrustyAI, or PyTorch).
Procedure
- In the left sidebar of JupyterLab, click Runtimes.
- Hover the cursor over the runtime configuration that you want to delete and click the Delete Item button. A dialog box appears prompting you to confirm the deletion of your runtime configuration.
- Click OK.
Verification
- The runtime configuration that you deleted is no longer shown in the Runtimes tab in the left sidebar of JupyterLab.
4.3.6. Duplicating a runtime configuration
To avoid re-creating a runtime configuration from scratch when you need one with similar values, you can duplicate an existing runtime configuration in the JupyterLab interface.
Prerequisites
- You have installed the OpenShift Pipelines operator.
- You have logged in to Red Hat OpenShift Data Science.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admins) in OpenShift.
- You have created a data science project that contains a workbench.
- You have created and configured a pipeline server within the data science project that contains your workbench.
- A previously created runtime configuration is visible in the JupyterLab interface.
- You have created and launched a Jupyter server from a notebook image that contains the Elyra extension (Standard Data Science, TensorFlow, TrustyAI, or PyTorch).
Procedure
- In the left sidebar of JupyterLab, click Runtimes.
- Hover the cursor over the runtime configuration that you want to duplicate and click the Duplicate button.
Verification
- The runtime configuration that you duplicated is shown in the Runtimes tab in the left sidebar of JupyterLab.
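Conceptually, duplicating a configuration means copying every value and then changing only what differs, such as the bucket. The following Python sketch illustrates that idea on a plain dictionary. The dictionary keys and the "-copy" naming scheme are hypothetical; the sketch does not describe how JupyterLab names or stores the duplicate.

```python
import copy


def duplicate_config(config, existing_names):
    """Deep-copy a configuration and give the copy a display name that is not in use."""
    clone = copy.deepcopy(config)
    base = f"{config['display_name']}-copy"
    name, counter = base, 2
    while name in existing_names:
        name = f"{base}-{counter}"
        counter += 1
    clone["display_name"] = name
    return clone


# Hypothetical configuration values.
original = {"display_name": "dev-runtime", "cos_bucket": "dev-artifacts", "tags": ["dev"]}
staging = duplicate_config(original, existing_names={"dev-runtime"})
staging["cos_bucket"] = "staging-artifacts"
staging["tags"].append("staging")
print(original["tags"], staging["display_name"])
```

The deep copy matters here: with a shallow copy, appending a tag to the duplicate would also change the tag list of the original.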
4.3.7. Running a pipeline in JupyterLab
You can run pipelines that you have created in JupyterLab from the Pipeline Editor user interface. Before you can run a pipeline, you must create a data science project and a pipeline server. After you create a pipeline server, you must create a workbench within the same project as your pipeline server. Your pipeline instance in JupyterLab must contain a runtime configuration. If you create a workbench as part of a data science project, a default runtime configuration is created automatically. However, if you create a notebook from the Jupyter tile in the OpenShift Data Science dashboard, you must create a runtime configuration before you can run your pipeline in JupyterLab. A runtime configuration defines connectivity information for your pipeline instance and S3-compatible cloud storage.
Prerequisites
- You have installed the OpenShift Pipelines operator.
- You have logged in to Red Hat OpenShift Data Science.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admins) in OpenShift.
- You have access to S3-compatible storage.
- You have created a pipeline in JupyterLab.
- You have opened your pipeline in the Pipeline Editor in JupyterLab.
- Your pipeline instance contains a runtime configuration.
- You have created and configured a pipeline server within the data science project that contains your workbench.
- You have created and launched a Jupyter server from a notebook image that contains the Elyra extension (Standard Data Science, TensorFlow, TrustyAI, or PyTorch).
Procedure
- In the Pipeline Editor user interface, click Run Pipeline. The Run Pipeline dialog appears. The Pipeline Name field is automatically populated with the pipeline file name.
  Important: You must enter a unique pipeline name. The pipeline name that you enter must not match the name of any previously executed pipeline.
Define the settings for your pipeline run.
- From the Runtime Configuration list, select the relevant runtime configuration to run your pipeline.
- Optional: Configure your pipeline parameters, if applicable. If your pipeline contains nodes that reference pipeline parameters, you can change the default parameter values. If a parameter is required and has no default value, you must enter a value.
- Click OK.
Verification
- You can view the output artifacts of your pipeline run. The artifacts are stored in your designated object storage bucket.
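Because each run needs a pipeline name that has not been used before, a timestamp suffix is one simple convention for keeping names unique. The following Python sketch shows that convention. The function name unique_run_name and the naming pattern are assumptions for illustration; the Run Pipeline dialog does not generate names this way, so you would enter the result in the Pipeline Name field yourself.

```python
from datetime import datetime, timezone


def unique_run_name(base_name, previous_names, now=None):
    """Append a UTC timestamp, and a counter if needed, so that the name is unused."""
    now = now or datetime.now(timezone.utc)
    candidate = f"{base_name}-{now:%Y%m%d-%H%M%S}"
    name, counter = candidate, 2
    while name in previous_names:
        name = f"{candidate}-{counter}"
        counter += 1
    return name


# Hypothetical names of earlier runs.
previous = {"train-model"}
print(unique_run_name("train-model", previous))
```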
4.3.8. Exporting a pipeline in JupyterLab
You can export pipelines that you have created in JupyterLab. When you export a pipeline, the pipeline is prepared for later execution, but is not uploaded or executed immediately. During the export process, any package dependencies are uploaded to S3-compatible storage. Also, pipeline code is generated for the target runtime.
Before you can export a pipeline, you must create a data science project and a pipeline server. After you create a pipeline server, you must create a workbench within the same project as your pipeline server. In addition, your pipeline instance in JupyterLab must contain a runtime configuration. If you create a workbench as part of a data science project, a default runtime configuration is created automatically. However, if you create a notebook from the Jupyter tile in the OpenShift Data Science dashboard, you must create a runtime configuration before you can export your pipeline in JupyterLab. A runtime configuration defines connectivity information for your pipeline instance and S3-compatible cloud storage.
Prerequisites
- You have installed the OpenShift Pipelines operator.
- You have logged in to Red Hat OpenShift Data Science.
- If you are using specialized OpenShift Data Science groups, you are part of the user group or admin group (for example, rhods-users or rhods-admins) in OpenShift.
- You have created a data science project that contains a workbench.
- You have created and configured a pipeline server within the data science project that contains your workbench.
- You have access to S3-compatible storage.
- You have created a pipeline in JupyterLab.
- You have opened your pipeline in the Pipeline Editor in JupyterLab.
- Your pipeline instance contains a runtime configuration.
- You have created and launched a Jupyter server from a notebook image that contains the Elyra extension (Standard Data Science, TensorFlow, TrustyAI, or PyTorch).
Procedure
- In the Pipeline Editor user interface, click Export Pipeline. The Export Pipeline dialog appears. The Pipeline Name field is automatically populated with the pipeline file name.
Define the settings to export your pipeline.
- From the Runtime Configuration list, select the relevant runtime configuration to export your pipeline.
- From the Export Pipeline as list, select an appropriate file format.
- In the Export Filename field, enter a file name for the exported pipeline.
- Select the Replace if file already exists check box to replace an existing file of the same name as the pipeline you are exporting.
- Optional: Configure your pipeline parameters, if applicable. If your pipeline contains nodes that reference pipeline parameters, you can change the default parameter values. If a parameter is required and has no default value, you must enter a value.
- Click OK.
Verification
- You can view the file containing the pipeline that you exported in your designated object storage bucket.
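The parameter rule in the procedure (defaults can be overridden, and a required parameter without a default must be given a value) can be expressed as a small merge step. The following Python sketch illustrates that rule. The parameter names and the shape of the declared dictionary are hypothetical; they do not reflect how the Pipeline Editor stores parameters.

```python
def resolve_parameters(declared, overrides):
    """Merge supplied values over defaults; fail if a required parameter has no value.

    `declared` maps a parameter name to {"default": value_or_None, "required": bool}.
    """
    unknown = sorted(set(overrides) - set(declared))
    if unknown:
        raise KeyError(f"unknown parameters: {unknown}")

    resolved, missing = {}, []
    for name, spec in declared.items():
        value = overrides.get(name, spec.get("default"))
        if value is None and spec.get("required", False):
            missing.append(name)
        else:
            resolved[name] = value
    if missing:
        raise ValueError(f"required parameters without a value: {missing}")
    return resolved


# Hypothetical parameter declarations.
declared = {
    "epochs": {"default": 10, "required": False},
    "dataset_path": {"default": None, "required": True},
}
resolved = resolve_parameters(declared, {"dataset_path": "data/train.csv"})
print(resolved)
```

Here epochs keeps its default of 10 and dataset_path takes the supplied value. Calling the function without a value for dataset_path raises an error, which mirrors the dialog requiring you to enter a value.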