
Chapter 5. Tuning a model by using the Training Operator


To tune a model by using the Kubeflow Training Operator, you configure and run a training job.

Optionally, you can use Low-Rank Adaptation (LoRA) to efficiently fine-tune large language models, such as Llama 3. The integration reduces compute and memory requirements, allowing fine-tuning on consumer-grade GPUs. The approach combines PyTorch Fully Sharded Data Parallel (FSDP) with LoRA to enable scalable, cost-effective model training and inference, enhancing the flexibility and performance of AI workloads in OpenShift environments.

5.1. Configuring the training job

Before you can use a training job to tune a model, you must configure the training job. The example training job in this section is based on the IBM and Hugging Face tuning example provided on GitHub.

Prerequisites

  • You have logged in to OpenShift.
  • You have access to a data science cluster that is configured to run distributed workloads as described in Configuring distributed workloads.
  • You have created a data science project. For information about how to create a project, see Creating a data science project.
  • You have Admin access for the data science project.

    • If you created the project, you automatically have Admin access.
    • If you did not create the project, your cluster administrator must give you Admin access.
  • You have access to a model.
  • You have access to data that you can use to train the model.

Procedure

  1. In a terminal window, if you are not already logged in to your OpenShift cluster, log in to the OpenShift CLI as shown in the following example:

    $ oc login <openshift_cluster_url> -u <username> -p <password>
  2. Configure a training job, as follows:

    1. Create a YAML file named config_trainingjob.yaml.
    2. Add the ConfigMap object definition as follows:

      Example training-job configuration

      kind: ConfigMap
      apiVersion: v1
      metadata:
        name: training-config
        namespace: kfto
      data:
        config.json: |
          {
            "model_name_or_path": "bigscience/bloom-560m",
            "training_data_path": "/data/input/twitter_complaints.json",
            "output_dir": "/data/output/tuning/bloom-twitter",
            "num_train_epochs": 10.0,
            "per_device_train_batch_size": 4,
            "gradient_accumulation_steps": 4,
            "learning_rate": 1e-05,
            "response_template": "\n### Label:",
            "dataset_text_field": "output",
            "use_flash_attn": false
          }

    3. Optional: To fine-tune with Low-Rank Adaptation (LoRA), update the config.json section as follows:

      1. Set the peft_method parameter to "lora".
      2. Add the lora_r, lora_alpha, lora_dropout, bias, and target_modules parameters.

        Example LoRA configuration

              ...
              "use_flash_attn": false,
              "peft_method": "lora",
              "lora_r": 8,
              "lora_alpha": 8,
              "lora_dropout": 0.1,
              "bias": "none",
              "target_modules": ["all-linear"]
            }

    4. Edit the metadata of the training-job configuration as shown in the following table.

      Table 5.1. Training-job configuration metadata

      Parameter   Value
      name        Name of the training-job configuration
      namespace   Name of your project

    5. Edit the parameters of the training-job configuration as shown in the following table.

      Table 5.2. Training-job configuration parameters

      model_name_or_path
        Name of the pre-trained model, or the path to the model in the training-job container; in this example, the model name is taken from the Hugging Face web page.

      training_data_path
        Path to the training data that you set in the training_data.yaml ConfigMap.

      output_dir
        Output directory for the model.

      num_train_epochs
        Number of epochs for training; in this example, the training job is set to run 10 times.

      per_device_train_batch_size
        Batch size: the number of dataset examples to process together; in this example, the training job processes 4 examples at a time.

      gradient_accumulation_steps
        Number of gradient accumulation steps.

      learning_rate
        Learning rate for the training.

      response_template
        Template formatting for the response.

      dataset_text_field
        Dataset field for training output, as set in the training_data.yaml ConfigMap.

      use_flash_attn
        Whether to use flash attention.

      peft_method
        Tuning method: for full fine-tuning, omit this parameter; for LoRA, set to "lora"; for prompt tuning, set to "pt".

      lora_r
        LoRA: rank of the low-rank decomposition.

      lora_alpha
        LoRA: scaling factor for the low-rank matrices, which controls their influence on the model's adaptations.

      lora_dropout
        LoRA: dropout rate applied to the LoRA layers, a regularization technique to prevent overfitting.

      bias
        LoRA: whether to adapt bias terms in the model; setting the bias to "none" indicates that no bias terms are adapted.

      target_modules
        LoRA: names of the modules to apply LoRA to; to include all linear layers, set to "all-linear"; optional parameter for some models.

    6. Save your changes in the config_trainingjob.yaml file.
    7. Apply the configuration to create the training-config object:

      $ oc apply -f config_trainingjob.yaml
  3. Create the training data.

    Note

    The training data in this simple example is for demonstration purposes only, and is not suitable for production use. The usual method for providing training data is to use persistent volumes.

    1. Create a YAML file named training_data.yaml.
    2. Add the following ConfigMap object definition:

      kind: ConfigMap
      apiVersion: v1
      metadata:
        name: twitter-complaints
        namespace: kfto
      data:
        twitter_complaints.json: |
          [
              {"Tweet text":"@HMRCcustomers No this is my first job","ID":0,"Label":2,"text_label":"no complaint","output":"### Text: @HMRCcustomers No this is my first job\n\n### Label: no complaint"},
              {"Tweet text":"@KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.","ID":1,"Label":2,"text_label":"no complaint","output":"### Text: @KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.\n\n### Label: no complaint"},
              {"Tweet text":"@EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.","ID":3,"Label":1,"text_label":"complaint","output":"### Text: @EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.\n\n### Label: complaint"},
              {"Tweet text":"Couples wallpaper, so cute. :) #BrothersAtHome","ID":4,"Label":2,"text_label":"no complaint","output":"### Text: Couples wallpaper, so cute. :) #BrothersAtHome\n\n### Label: no complaint"},
              {"Tweet text":"@mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp… https:\/\/t.co\/WRtNsokblG","ID":5,"Label":2,"text_label":"no complaint","output":"### Text: @mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp… https:\/\/t.co\/WRtNsokblG\n\n### Label: no complaint"},
              {"Tweet text":"@Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?","ID":6,"Label":2,"text_label":"no complaint","output":"### Text: @Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?\n\n### Label: no complaint"},
              {"Tweet text":"@nationalgridus I have no water and the bill is current and paid. Can you do something about this?","ID":7,"Label":1,"text_label":"complaint","output":"### Text: @nationalgridus I have no water and the bill is current and paid. Can you do something about this?\n\n### Label: complaint"},
              {"Tweet text":"@JenniferTilly Merry Christmas to as well. You get more stunning every year ��","ID":9,"Label":2,"text_label":"no complaint","output":"### Text: @JenniferTilly Merry Christmas to as well. You get more stunning every year ��\n\n### Label: no complaint"}
          ]
    3. Replace the example namespace value kfto with the name of your project.
    4. Replace the example training data with your training data.
    5. Save your changes in the training_data.yaml file.
    6. Apply the configuration to create the training data:

      $ oc apply -f training_data.yaml
  4. Create a persistent volume claim (PVC), as follows:

    1. Create a YAML file named trainedmodelpvc.yaml.
    2. Add the following PersistentVolumeClaim object definition:

      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: trained-model
        namespace: kfto
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
    3. Replace the example namespace value kfto with the name of your project, and update the other parameters to suit your environment. To calculate the storage value, multiply the model size by the number of epochs, and add a little extra as a buffer; for example, a model of about 5 GiB trained for 10 epochs needs at least 50 Gi of storage.
    4. Save your changes in the trainedmodelpvc.yaml file.
    5. Apply the configuration to create a Persistent Volume Claim (PVC) for the training job:

      $ oc apply -f trainedmodelpvc.yaml

Verification

  1. In the OpenShift console, select your project from the Project list.
  2. Click ConfigMaps and verify that the training-config and twitter-complaints ConfigMaps are listed.
  3. Click Search. From the Resources list, select PersistentVolumeClaim and verify that the trained-model PVC is listed.
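
You can also confirm the same resources from the OpenShift CLI. The following commands are a minimal sketch that assumes the example resource names and the kfto namespace used in this section; substitute the name of your project:

  $ oc get configmap training-config twitter-complaints -n kfto
  $ oc get pvc trained-model -n kfto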

5.2. Running the training job

You can run a training job to tune a model. The example training job in this section is based on the IBM and Hugging Face tuning example provided on GitHub.

Prerequisites

  • You have access to a data science cluster that is configured to run distributed workloads as described in Configuring distributed workloads.
  • You have created a data science project. For information about how to create a project, see Creating a data science project.
  • You have Admin access for the data science project.

    • If you created the project, you automatically have Admin access.
    • If you did not create the project, your cluster administrator must give you Admin access.
  • You have access to a model.
  • You have access to data that you can use to train the model.
  • You have configured the training job as described in Configuring the training job.

Procedure

  1. In a terminal window, log in to the OpenShift CLI as shown in the following example:

    $ oc login <openshift_cluster_url> -u <username> -p <password>
  2. Create a PyTorch training job, as follows:

    1. Create a YAML file named pytorchjob.yaml.
    2. Add the following PyTorchJob object definition:

      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      metadata:
        name: kfto-demo
        namespace: kfto
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: Never
            template:
              spec:
                containers:
                  - env:
                      - name: SFT_TRAINER_CONFIG_JSON_PATH
                        value: /etc/config/config.json
                    image: 'quay.io/modh/fms-hf-tuning:release'
                    imagePullPolicy: IfNotPresent
                    name: pytorch
                    volumeMounts:
                      - mountPath: /etc/config
                        name: config-volume
                      - mountPath: /data/input
                        name: dataset-volume
                      - mountPath: /data/output
                        name: model-volume
                volumes:
                  - configMap:
                      items:
                        - key: config.json
                          path: config.json
                      name: training-config
                    name: config-volume
                  - configMap:
                      name: twitter-complaints
                    name: dataset-volume
                  - name: model-volume
                    persistentVolumeClaim:
                      claimName: trained-model
        runPolicy:
          suspend: false
    3. Replace the example namespace value kfto with the name of your project, and update the other parameters to suit your environment.
    4. Edit the parameters of the PyTorch training job, to provide the details for your training job and environment.
    5. Save your changes in the pytorchjob.yaml file.
    6. Apply the configuration to run the PyTorch training job:

      $ oc apply -f pytorchjob.yaml
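
In the example PyTorchJob, runPolicy.suspend is set to false, so the job starts as soon as you apply the file. If you instead create the job with suspend set to true, it remains queued until you resume it. The following command is a minimal sketch of resuming a suspended job from the CLI, assuming the example kfto-demo job name and kfto namespace:

  $ oc patch pytorchjob kfto-demo -n kfto --type=merge -p '{"spec":{"runPolicy":{"suspend":false}}}'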

Verification

  1. In the OpenShift console, select your project from the Project list.
  2. Click Workloads → Pods and verify that the <training-job-name>-master-0 pod is listed.
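
Alternatively, you can verify the job and its pod from the OpenShift CLI. A minimal sketch, assuming the example kfto-demo job name and kfto namespace:

  $ oc get pytorchjob kfto-demo -n kfto
  $ oc get pod kfto-demo-master-0 -n kfto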

5.3. Monitoring the training job

When you run a training job to tune a model, you can monitor the progress of the job. The example training job in this section is based on the IBM and Hugging Face tuning example provided on GitHub.

Prerequisites

  • You have access to a data science cluster that is configured to run distributed workloads as described in Configuring distributed workloads.
  • You have created a data science project. For information about how to create a project, see Creating a data science project.
  • You have Admin access for the data science project.

    • If you created the project, you automatically have Admin access.
    • If you did not create the project, your cluster administrator must give you Admin access.
  • You have access to a model.
  • You have access to data that you can use to train the model.
  • You are running the training job as described in Running the training job.

Procedure

  1. In the OpenShift console, select your project from the Project list.
  2. Click Workloads → Pods.
  3. Search for the pod that corresponds to the PyTorch job, that is, <training-job-name>-master-0.

    For example, if the training job name is kfto-demo, the pod name is kfto-demo-master-0.

  4. Click the pod name to open the pod details page.
  5. Click the Logs tab to monitor the progress of the job and view status updates, as shown in the following example output:

    0%| | 0/10 [00:00<?, ?it/s] 10%|█ | 1/10 [01:10<10:32, 70.32s/it] {'loss': 6.9531, 'grad_norm': 1104.0, 'learning_rate': 9e-06, 'epoch': 1.0}
    10%|█ | 1/10 [01:10<10:32, 70.32s/it] 20%|██ | 2/10 [01:40<06:13, 46.71s/it] 30%|███ | 3/10 [02:26<05:25, 46.55s/it] {'loss': 2.4609, 'grad_norm': 736.0, 'learning_rate': 7e-06, 'epoch': 2.0}
    30%|███ | 3/10 [02:26<05:25, 46.55s/it] 40%|████ | 4/10 [03:23<05:02, 50.48s/it] 50%|█████ | 5/10 [03:41<03:13, 38.66s/it] {'loss': 1.7617, 'grad_norm': 328.0, 'learning_rate': 5e-06, 'epoch': 3.0}
    50%|█████ | 5/10 [03:41<03:13, 38.66s/it] 60%|██████ | 6/10 [04:54<03:22, 50.58s/it] {'loss': 3.1797, 'grad_norm': 1016.0, 'learning_rate': 4.000000000000001e-06, 'epoch': 4.0}
    60%|██████ | 6/10 [04:54<03:22, 50.58s/it] 70%|███████ | 7/10 [06:03<02:49, 56.59s/it] {'loss': 2.9297, 'grad_norm': 984.0, 'learning_rate': 3e-06, 'epoch': 5.0}
    70%|███████ | 7/10 [06:03<02:49, 56.59s/it] 80%|████████ | 8/10 [06:38<01:39, 49.57s/it] 90%|█████████ | 9/10 [07:22<00:48, 48.03s/it] {'loss': 1.4219, 'grad_norm': 684.0, 'learning_rate': 1.0000000000000002e-06, 'epoch': 6.0}
    90%|█████████ | 9/10 [07:22<00:48, 48.03s/it]100%|██████████| 10/10 [08:25<00:00, 52.53s/it] {'loss': 1.9609, 'grad_norm': 648.0, 'learning_rate': 0.0, 'epoch': 6.67}
    100%|██████████| 10/10 [08:25<00:00, 52.53s/it] {'train_runtime': 508.0444, 'train_samples_per_second': 0.197, 'train_steps_per_second': 0.02, 'train_loss': 2.63125, 'epoch': 6.67}
    100%|██████████| 10/10 [08:28<00:00, 52.53s/it]100%|██████████| 10/10 [08:28<00:00, 50.80s/it]

    In the example output, the solid blocks indicate progress bars.
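
If you prefer the command line, you can follow the same progress output by streaming the pod logs. A minimal sketch, assuming the example kfto-demo job name and kfto namespace:

  $ oc logs -f kfto-demo-master-0 -n kfto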

Verification

  1. The <training-job-name>-master-0 pod is running.
  2. The Logs tab provides information about the job progress and job status.
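
When the job finishes, you can optionally confirm its final state from the CLI. This sketch assumes the example kfto-demo job name and kfto namespace, and that the Training Operator records a Succeeded condition in the job status:

  $ oc get pytorchjob kfto-demo -n kfto -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].status}'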