Preface


As a data scientist, you can enhance your data science projects on OpenShift AI by building portable machine learning (ML) workflows with data science pipelines, using Docker containers. Data science pipelines enable you to standardize and automate ML workflows so that you can develop and deploy your data science models.

For example, the steps in a machine learning workflow might include data extraction, data processing, feature extraction, model training, model validation, and model serving. Automating these activities enables your organization to develop a continuous process of retraining and updating a model based on newly received data. This can help address challenges related to building an integrated machine learning deployment and continuously operating it in production.
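As an illustration only, the following sketch expresses a few of these steps as components of a single pipeline with the Kubeflow Pipelines (KFP) 2.0 SDK. The component names, base image, and placeholder logic are assumptions for the example, not part of OpenShift AI itself:

```python
from kfp import dsl


@dsl.component(base_image="python:3.11")
def extract_data(raw_data: dsl.Output[dsl.Dataset]):
    # Placeholder extraction step: write raw data to the output artifact path.
    with open(raw_data.path, "w") as f:
        f.write("feature,label\n1,0\n2,1\n")


@dsl.component(base_image="python:3.11")
def process_data(raw_data: dsl.Input[dsl.Dataset], features: dsl.Output[dsl.Dataset]):
    # Placeholder processing step: transform raw data into model-ready features.
    with open(raw_data.path) as src, open(features.path, "w") as dst:
        dst.write(src.read())


@dsl.component(base_image="python:3.11")
def train_model(features: dsl.Input[dsl.Dataset], model: dsl.Output[dsl.Model]):
    # Placeholder training step: train a model and save it as an output artifact.
    with open(model.path, "w") as f:
        f.write("trained-model")


@dsl.pipeline(name="example-training-pipeline")
def training_pipeline():
    # Chain the steps; each task consumes the previous task's output artifact.
    extract_task = extract_data()
    process_task = process_data(raw_data=extract_task.outputs["raw_data"])
    train_model(features=process_task.outputs["features"])
```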

You can also use the Elyra JupyterLab extension to create and run data science pipelines within JupyterLab. For more information, see Working with pipelines in JupyterLab.

From OpenShift AI version 2.9, data science pipelines are based on Kubeflow Pipelines (KFP) 2.0. For more information, see Enabling data science pipelines 2.0.

A data science pipeline in OpenShift AI consists of the following components:

  • Pipeline server: A server that is attached to your data science project and hosts your data science pipeline.
  • Pipeline: A pipeline defines the configuration of your machine learning workflow and the relationship between each component in the workflow.

    • Pipeline code: A definition of your pipeline in a YAML file.
    • Pipeline graph: A graphical illustration of the steps executed in a pipeline run and the relationship between them.
  • Pipeline experiment: A workspace where you can try different configurations of your pipelines. You can use experiments to organize your runs into logical groups.

    • Archived pipeline experiment: A pipeline experiment that has been archived.
    • Pipeline artifact: An output artifact produced by a pipeline component.
    • Pipeline execution: The execution of a task in a pipeline.
  • Pipeline run: An execution of your pipeline.

    • Active run: A pipeline run that is currently executing, or is stopped.
    • Scheduled run: A pipeline run that is scheduled to execute at least once.
    • Archived run: A pipeline run that has been archived.

This feature is based on Kubeflow Pipelines 2.0. Use the latest Kubeflow Pipelines 2.0 SDK to build your data science pipeline in Python code. After you have built your pipeline, use the SDK to compile it into an Intermediate Representation (IR) YAML file. The OpenShift AI user interface enables you to track and manage pipelines and pipeline runs. You can manage incremental changes to pipelines in OpenShift AI by using versioning. This allows you to develop and deploy pipelines iteratively, preserving a record of your changes.
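As a minimal sketch of the compilation step, assuming a trivial single-component pipeline, the KFP 2.0 SDK compiler produces the IR YAML file as follows; the pipeline name and output file name are arbitrary placeholders:

```python
from kfp import compiler, dsl


@dsl.component
def say_hello() -> str:
    # Trivial placeholder component used only to make the pipeline compilable.
    return "hello"


@dsl.pipeline(name="minimal-pipeline")
def minimal_pipeline():
    say_hello()


# Compile the pipeline into an Intermediate Representation (IR) YAML file.
compiler.Compiler().compile(
    pipeline_func=minimal_pipeline,
    package_path="minimal_pipeline.yaml",
)
```

You can then import the resulting YAML file when you configure a pipeline in the OpenShift AI user interface.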

You can store your pipeline artifacts in an S3-compatible object storage bucket so that you do not consume local storage. To do this, you must first configure write access to your S3 bucket on your storage account.
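For example, assuming your artifacts go to an S3-compatible bucket, one way to confirm write access is a small check like the following sketch using the boto3 library; the endpoint URL, credentials, and bucket name are placeholders that you must replace with your own values:

```python
import boto3

# Hypothetical connection details for your S3-compatible object storage.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.com",       # assumption: your storage endpoint
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

bucket = "my-pipeline-artifacts"                 # assumption: an existing bucket

# Write a small test object and read it back to confirm write access.
s3.put_object(Bucket=bucket, Key="write-check.txt", Body=b"ok")
response = s3.get_object(Bucket=bucket, Key="write-check.txt")
print(response["Body"].read())                   # b"ok" confirms the round trip
s3.delete_object(Bucket=bucket, Key="write-check.txt")
```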
