Chapter 2. Setting up a project and storage


2.2. Setting up your data science project

Before you begin, make sure that you are logged in to Red Hat OpenShift AI.

Procedure

  1. On the navigation menu, select Data Science Projects. This page lists any existing projects that you have access to. From this page, you can select an existing project (if any) or create a new one.

    Launch Jupyter link

    Note that it is possible to start a Jupyter notebook by clicking the Launch standalone notebook server link, selecting a notebook image, and clicking Start server. However, this would be a one-off Jupyter notebook that runs in isolation. To implement a data science workflow, you must create a data science project (as described in the following procedure). Projects allow you and your team to organize and collaborate on resources within separate namespaces. From a project, you can create multiple workbenches, each with its own IDE environment (for example, JupyterLab) and its own connections and cluster storage. In addition, workbenches can share models and data with pipelines and model servers.

  2. If you are using your own OpenShift cluster, click Create project.

    Note

    If you are using the Red Hat Developer Sandbox, you are provided with a default data science project (for example, myname-dev). Select it, skip the next step, and continue to the Verification section.

  3. Enter a display name and description.

    New data science project form

Verification

You can see your project’s initial state. Individual tabs provide more information about the project components and project access permissions:

New data science project
  • Workbenches are instances of your development and experimentation environment. They typically contain IDEs, such as JupyterLab, RStudio, and Visual Studio Code.
  • Pipelines contain the data science pipelines that are executed within the project.
  • Models allow you to quickly serve a trained model for real-time inference. You can have multiple model servers per data science project. One model server can host multiple models.
  • Cluster storage is a persistent volume that retains the files and data you’re working on within a workbench. A workbench has access to one or more cluster storage instances.
  • Connections contain configuration parameters that are required to connect to a data source, such as an S3 object bucket.
  • Permissions define which users and groups can access the project.

2.3. Storing data with connections

Add connections to workbenches if you want to connect your project to data inputs and object storage buckets. A connection is a resource that contains the configuration parameters needed to connect to a data source or data sink, such as an AWS S3 object storage bucket.

For this tutorial, you need two S3-compatible object storage buckets, hosted on a service such as Ceph, Minio, or AWS S3. You can use your own storage buckets or run a provided script that creates the following local Minio storage buckets for you:

  • My Storage - Use this bucket for storing your models and data. You can reuse this bucket and its connection for your notebooks and model servers.
  • Pipelines Artifacts - Use this bucket as storage for your pipeline artifacts. A pipeline artifacts bucket is required when you create a pipeline server. For this tutorial, create this bucket to separate it from the first storage bucket for clarity.

You must also create a connection to each storage bucket. For this tutorial, you have two options, depending on whether you want to use your own storage buckets or run a script that creates local Minio storage buckets for you:

Note

While it is possible for you to use one storage bucket for both purposes (storing models and data as well as storing pipeline artifacts), this tutorial follows best practice and uses separate storage buckets for each purpose.
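
When you later attach a connection to a workbench, the connection's parameters are exposed to your code as environment variables. As a rough sketch of how a notebook might consume them (the variable names below, such as AWS_S3_ENDPOINT, follow the common convention for S3-compatible connections and should be treated as assumptions, not guarantees):

```python
import os

def s3_settings():
    """Collect S3 connection parameters from the environment.

    The variable names follow the usual S3-compatible connection
    convention (an assumption for this sketch).
    """
    required = [
        "AWS_S3_ENDPOINT",
        "AWS_ACCESS_KEY_ID",
        "AWS_SECRET_ACCESS_KEY",
        "AWS_S3_BUCKET",
    ]
    missing = [name for name in required if name not in os.environ]
    if missing:
        # Typically means no connection is attached to the workbench.
        raise RuntimeError(f"Missing connection variables: {missing}")
    return {name: os.environ[name] for name in required}
```

A notebook would typically pass these values to an S3 client library to read and write objects in the bucket.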

2.3.1. Creating connections to your own S3-compatible object storage

If you have existing S3-compatible storage buckets that you want to use for this tutorial, you must create a connection to one storage bucket for saving your data and models. If you want to complete the pipelines section of this tutorial, create another connection to a different storage bucket for saving pipeline artifacts.

Note

If you do not have your own S3-compatible storage, or if you want to use a disposable local Minio instance instead, skip this section and follow the steps in Running a script to install local object storage buckets and create connections. The provided script automatically creates a Minio instance in your project, creates two storage buckets in that Minio instance, creates two connections in your project (one for each bucket, both using the same credentials), and installs the required network policies for service mesh functionality.

Prerequisites

To create connections to your existing S3-compatible storage buckets, you need the following credential information for the storage buckets:

  • Endpoint URL
  • Access key
  • Secret key
  • Region
  • Bucket name

If you don’t have this information, contact your storage administrator.
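
In your project, a connection is stored as a Kubernetes Secret, with each value base64-encoded under the Secret's data field (standard Kubernetes behavior). The following sketch illustrates how the credential fields above end up in that shape; the exact labels and annotations that OpenShift AI adds to the Secret are omitted here as a simplification:

```python
import base64

def connection_secret(name, params):
    """Build a minimal Secret manifest from connection form fields.

    Kubernetes stores each value base64-encoded under .data; the
    OpenShift AI-specific labels/annotations are intentionally
    left out of this sketch.
    """
    def encode(value):
        return base64.b64encode(value.encode()).decode()

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "data": {key: encode(value) for key, value in params.items()},
    }
```

Note that base64 is an encoding, not encryption: anyone who can read the Secret can recover the credentials.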

Procedure

  1. Create a connection for saving your data and models:

    1. In the OpenShift AI dashboard, navigate to the page for your data science project.
    2. Click the Connections tab, and then click Add connection.

      Add connection
    3. In the Add connection modal, for the Connection type select S3 compatible object storage - v1.
    4. Complete the Add connection form and name your connection My Storage. This connection is for saving your personal work, including data and models.

      Note

      Skip the Connected workbench item. You add connections to a workbench in a later section.

      Add my storage form
    5. Click Add connection.
  2. Create a connection for saving pipeline artifacts:

    Note

    If you do not intend to complete the pipelines section of the tutorial, you can skip this step.

    1. Click Add connection.
    2. Complete the form and name your connection Pipeline Artifacts.

      Note

      Skip the Connected workbench item. You add connections to a workbench in a later section.

      Add pipeline artifacts form
    3. Click Add connection.

Verification

On the Connections tab for the project, verify that your connections are listed.

List of project connections

Next steps

If you want to complete the pipelines section of this tutorial, go to Enabling data science pipelines.

Otherwise, skip to Creating a workbench.

2.3.2. Running a script to install local object storage buckets and create connections

For convenience, run a script (provided in the following procedure) that automatically completes these tasks:

  • Creates a Minio instance in your project.
  • Creates two storage buckets in that Minio instance.
  • Generates a random user ID and password for your Minio instance.
  • Creates two connections in your project, one for each bucket and both using the same credentials.
  • Installs required network policies for service mesh functionality.

The script is based on a guide for deploying Minio.

Important

The Minio-based Object Storage that the script creates is not meant for production usage.

Note

If you want to connect to your own storage, see Creating connections to your own S3-compatible object storage.

Prerequisites

You must know the OpenShift resource name for your data science project so that you can run the provided script in the correct project. To get the project’s resource name:

In the OpenShift AI dashboard, select Data Science Projects and then click the ? icon next to the project name. A text box appears with information about the project, including its resource name:

Project list resource name
Note

The following procedure describes how to run the script from the OpenShift console. If you are knowledgeable in OpenShift and can access the cluster from the command line, instead of following the steps in this procedure, you can use the following command to run the script:

oc apply -n <your-project-name> -f https://github.com/rh-aiservices-bu/fraud-detection/raw/main/setup/setup-s3.yaml

Procedure

  1. In the OpenShift AI dashboard, click the application launcher icon and then select the OpenShift Console option.

    OpenShift Console Link
  2. In the OpenShift console, click + in the top navigation bar.

    Add resources Icon
  3. Select your project from the list of projects.

    Select a project
  4. Verify that you selected the correct project.

    Selected project
  5. Copy the following code and paste it into the Import YAML editor.

    Note

    This code gets and applies the setup-s3-no-sa.yaml file.

    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: demo-setup
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: demo-setup-edit
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: edit
    subjects:
      - kind: ServiceAccount
        name: demo-setup
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: create-s3-storage
    spec:
      selector: {}
      template:
        spec:
          containers:
            - args:
                - -ec
                - |-
                  echo -n 'Setting up Minio instance and connections'
                  oc apply -f https://github.com/rh-aiservices-bu/fraud-detection/raw/main/setup/setup-s3-no-sa.yaml
              command:
                - /bin/bash
              image: image-registry.openshift-image-registry.svc:5000/openshift/tools:latest
              imagePullPolicy: IfNotPresent
              name: create-s3-storage
          restartPolicy: Never
          serviceAccount: demo-setup
          serviceAccountName: demo-setup
  6. Click Create.

Verification

  1. In the OpenShift console, you should see a "Resources successfully created" message and the following resources listed:

    • demo-setup
    • demo-setup-edit
    • create-s3-storage
  2. In the OpenShift AI dashboard:

    1. Select Data Science Projects and then click the name of your project, Fraud detection.
    2. Click Connections. You should see two connections listed: My Storage and Pipeline Artifacts.

      Connections for Fraud Detection

Next steps

If you want to complete the pipelines section of this tutorial, go to Enabling data science pipelines.

Otherwise, skip to Creating a workbench.

2.4. Enabling data science pipelines

Note

If you do not intend to complete the pipelines section of this tutorial, you can skip this section and move on to the next section, Creating a workbench.

In this section, you prepare your tutorial environment so that you can use data science pipelines.

In this tutorial, you implement an example pipeline by using the JupyterLab Elyra extension. With Elyra, you can create a visual end-to-end pipeline workflow that can be executed in OpenShift AI.
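
Each node in an Elyra pipeline is an ordinary notebook or Python script, and Elyra passes declared output files between nodes. As a hypothetical illustration of what one node's script might look like (the file name and the toy scaling step are invented for this sketch, not part of the tutorial's actual pipeline):

```python
import json
from pathlib import Path

def preprocess(raw, out_path="features.json"):
    """Toy preprocessing step: scale values to [0, 1] and write the
    result to a file that Elyra would declare as this node's output,
    making it available to downstream nodes."""
    hi = max(raw)
    features = [value / hi for value in raw]
    Path(out_path).write_text(json.dumps(features))
    return features

if __name__ == "__main__":
    preprocess([2.0, 4.0, 8.0])
```

In the Elyra visual editor, you would list features.json as an output file of this node so that the next node in the pipeline can read it.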

Prerequisites

You have created a connection named Pipeline Artifacts, as described in Storing data with connections.

Procedure

  1. In the OpenShift AI dashboard, on the Fraud Detection page, click the Pipelines tab.
  2. Click Configure pipeline server.

    Create pipeline server button
  3. In the Configure pipeline server form, in the Access key field next to the key icon, click the dropdown menu and then click Pipeline Artifacts to populate the Configure pipeline server form with credentials for the connection.

    Selecting the Pipeline Artifacts connection
  4. Leave the database configuration as the default.
  5. Click Configure pipeline server.
  6. Wait until the loading spinner disappears and Start by importing a pipeline is displayed.

    Important

    You must wait until the pipeline configuration is complete before you continue and create your workbench. If you create your workbench before the pipeline server is ready, your workbench will not be able to submit pipelines to it.

    If you have waited more than 5 minutes, and the pipeline server configuration does not complete, you can try to delete the pipeline server and create it again.

    Delete pipeline server

    You can also ask your OpenShift AI administrator to verify that self-signed certificates are added to your cluster as described in Working with certificates.
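
If you prefer to script the wait rather than watch for the spinner, the logic is a standard poll-with-deadline loop. In this sketch, the actual readiness check (for example, querying the pipeline server's status through the OpenShift API) is cluster-specific and is therefore left as an injected callable:

```python
import time

def wait_until(check, timeout_s=300, interval_s=5):
    """Poll `check` until it returns True or the deadline passes.

    `check` is any zero-argument callable returning a bool; in
    practice it would query the pipeline server's readiness.
    Returns True if the check succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False
```

The 5-minute guidance above corresponds to timeout_s=300 here.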

Verification

  1. Navigate to the Pipelines tab for the project.
  2. Next to Import pipeline, click the action menu (⋮) and then select View pipeline server configuration.

    View pipeline server configuration menu

    An information box opens and displays the object storage connection information for the pipeline server.


© 2024 Red Hat, Inc.