Chapter 5. Tuning a model by using the Training Operator
To tune a model by using the Kubeflow Training Operator, you configure and run a training job.
Optionally, you can use Low-Rank Adaptation (LoRA) to efficiently fine-tune large language models, such as Llama 3. The integration optimizes computational requirements and reduces memory footprint, allowing fine-tuning on consumer-grade GPUs. The solution combines PyTorch Fully Sharded Data Parallel (FSDP) and LoRA to enable scalable, cost-effective model training and inference, enhancing the flexibility and performance of AI workloads within OpenShift environments.
5.1. Configuring the training job
Before you can use a training job to tune a model, you must configure the training job. The example training job in this section is based on the IBM and Hugging Face tuning example provided in GitHub.
Prerequisites
- You have logged in to OpenShift.
- You have access to a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.
- You have created a data science project. For information about how to create a project, see Creating a data science project.
- You have Admin access for the data science project.
  - If you created the project, you automatically have Admin access.
  - If you did not create the project, your cluster administrator must give you Admin access.
- You have access to a model.
- You have access to data that you can use to train the model.
Procedure
In a terminal window, if you are not already logged in to your OpenShift cluster, log in to the OpenShift CLI as shown in the following example:
$ oc login <openshift_cluster_url> -u <username> -p <password>

Configure a training job, as follows:
- Create a YAML file named config_trainingjob.yaml. Add the ConfigMap object definition as follows:

  Example training-job configuration

  kind: ConfigMap
  apiVersion: v1
  metadata:
    name: training-config
    namespace: kfto
  data:
    config.json: |
      {
        "model_name_or_path": "bigscience/bloom-560m",
        "training_data_path": "/data/input/twitter_complaints.json",
        "output_dir": "/data/output/tuning/bloom-twitter",
        "save_model_dir": "/mnt/output/model",
        "num_train_epochs": 10.0,
        "per_device_train_batch_size": 4,
        "per_device_eval_batch_size": 4,
        "gradient_accumulation_steps": 4,
        "save_strategy": "no",
        "learning_rate": 1e-05,
        "weight_decay": 0.0,
        "lr_scheduler_type": "cosine",
        "include_tokens_per_second": true,
        "response_template": "\n### Label:",
        "dataset_text_field": "output",
        "padding_free": ["huggingface"],
        "multipack": [16],
        "use_flash_attn": false
      }

- Optional: To fine-tune with Low-Rank Adaptation (LoRA), update the config.json section as follows:
  - Set the peft_method parameter to "lora".
  - Add the lora_r, lora_alpha, lora_dropout, bias, and target_modules parameters.

  Example LoRA configuration

  ...
  "peft_method": "lora",
  "lora_r": 8,
  "lora_alpha": 8,
  "lora_dropout": 0.1,
  "bias": "none",
  "target_modules": ["all-linear"]
  }
- Optional: To fine-tune with Quantized Low-Rank Adaptation (QLoRA), update the config.json section as follows:
  - Set the use_flash_attn parameter to true.
  - Set the peft_method parameter to "lora".
  - Add the LoRA parameters: lora_r, lora_alpha, lora_dropout, bias, and target_modules.
  - Add the QLoRA mandatory parameters: auto_gptq, torch_dtype, and fp16. If required, add the QLoRA optional parameters: fused_lora and fast_kernels.

  Example QLoRA configuration

  ...
  "use_flash_attn": true,
  "peft_method": "lora",
  "lora_r": 8,
  "lora_alpha": 8,
  "lora_dropout": 0.1,
  "bias": "none",
  "target_modules": ["all-linear"],
  "auto_gptq": ["triton_v2"],
  "torch_dtype": "float16",
  "fp16": true,
  "fused_lora": ["auto_gptq", true],
  "fast_kernels": [true, true, true]
  }
- Edit the metadata of the training-job configuration as shown in the following table.

  Table 5.1. Training-job configuration metadata

  Parameter | Value
  name      | Name of the training-job configuration
  namespace | Name of your project

- Edit the parameters of the training-job configuration as shown in the following table.

  Table 5.2. Training-job configuration parameters

  Parameter | Value
  model_name_or_path | Name of the pre-trained model, or the path to the model in the training-job container; in this example, the model name is taken from the Hugging Face web page
  training_data_path | Path to the training data that you set in the training_data.yaml ConfigMap
  output_dir | Output directory for the model
  save_model_dir | Directory where the tuned model is saved
  num_train_epochs | Number of epochs for training; in this example, the training job is set to run 10 times
  per_device_train_batch_size | Batch size, the number of dataset examples to process together; in this example, the training job processes 4 examples at a time
  per_device_eval_batch_size | Batch size, the number of dataset examples to process together per GPU, TPU core, or CPU; in this example, the training job processes 4 examples at a time
  gradient_accumulation_steps | Number of gradient accumulation steps
  save_strategy | How often model checkpoints are saved; the default value is "epoch" (save a checkpoint every epoch); other possible values are "steps" (save a checkpoint for every training step) and "no" (do not save checkpoints)
  save_total_limit | Number of model checkpoints to save; omit if save_strategy is set to "no" (no model checkpoints saved)
  learning_rate | Learning rate for the training
  weight_decay | Weight decay to apply
  lr_scheduler_type | Optional: Scheduler type to use; the default value is "linear"; other possible values are "cosine", "cosine_with_restarts", "polynomial", "constant", and "constant_with_warmup"
  include_tokens_per_second | Optional: Whether to compute the number of tokens per second per device, for training speed metrics
  response_template | Template formatting for the response
  dataset_text_field | Dataset field for training output, as set in the training_data.yaml ConfigMap
  padding_free | Whether to process multiple examples in a single batch without adding padding tokens that waste compute resources; if used, this parameter must be set to ["huggingface"]
  multipack | Whether to balance the number of tokens processed on each device during multi-GPU training, to minimize waiting time; you can experiment with different values to find the optimum value for your training job
  use_flash_attn | Whether to use flash attention
  peft_method | Tuning method: for full fine-tuning, omit this parameter; for LoRA and QLoRA, set to "lora"; for prompt tuning, set to "pt" (see the sketch after these steps)
  lora_r | LoRA: Rank of the low-rank decomposition
  lora_alpha | LoRA: Scales the low-rank matrices to control their influence on the model's adaptations
  lora_dropout | LoRA: Dropout rate applied to the LoRA layers, a regularization technique to prevent overfitting
  bias | LoRA: Whether to adapt bias terms in the model; setting the bias to "none" indicates that no bias terms are adapted
  target_modules | LoRA: Names of the modules to apply LoRA to; to include all linear layers, set to ["all-linear"]; optional parameter for some models
  auto_gptq | QLoRA: Sets 4-bit GPTQ-LoRA with AutoGPTQ; when used, this parameter must be set to ["triton_v2"]
  torch_dtype | QLoRA: Tensor datatype; when used, this parameter must be set to "float16"
  fp16 | QLoRA: Whether to use half-precision floating-point format; when used, this parameter must be set to true
  fused_lora | QLoRA: Whether to use fused LoRA for more efficient LoRA training; if used, this parameter must be set to ["auto_gptq", true]
  fast_kernels | QLoRA: Whether to use fast cross-entropy, rope, and rms loss kernels; if used, this parameter must be set to [true, true, true]
- Save your changes in the config_trainingjob.yaml file.
- Apply the configuration to create the training-config object:

  $ oc apply -f config_trainingjob.yaml
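The LoRA and QLoRA examples above cover two of the peft_method values described in Table 5.2. For prompt tuning, a minimal sketch of the corresponding config.json change (only the relevant line is shown; whether your model needs additional prompt-tuning parameters is not covered by this example, so verify against the fms-hf-tuning documentation):

  ...
  "peft_method": "pt"
  }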
Create the training data.
Note: The training data in this simple example is for demonstration purposes only, and is not suitable for production use. The usual method for providing training data is to use persistent volumes, as sketched below.
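For illustration, a minimal sketch of the persistent-volume approach, assuming a PVC named training-data that already contains your dataset (the claim name is a placeholder, not part of this example):

  # In the PyTorchJob pod template, mount the PVC in place of the dataset ConfigMap:
  volumeMounts:
    - mountPath: /data/input
      name: dataset-volume
  volumes:
    - name: dataset-volume
      persistentVolumeClaim:
        claimName: training-data  # hypothetical PVC that holds your dataset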
- Create a YAML file named training_data.yaml. Add the following ConfigMap object definition:

  kind: ConfigMap
  apiVersion: v1
  metadata:
    name: twitter-complaints
    namespace: kfto
  data:
    twitter_complaints.json: |
      [
        {"Tweet text":"@HMRCcustomers No this is my first job","ID":0,"Label":2,"text_label":"no complaint","output":"### Text: @HMRCcustomers No this is my first job\n\n### Label: no complaint"},
        {"Tweet text":"@KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.","ID":1,"Label":2,"text_label":"no complaint","output":"### Text: @KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.\n\n### Label: no complaint"},
        {"Tweet text":"@EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.","ID":3,"Label":1,"text_label":"complaint","output":"### Text: @EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.\n\n### Label: complaint"},
        {"Tweet text":"Couples wallpaper, so cute. :) #BrothersAtHome","ID":4,"Label":2,"text_label":"no complaint","output":"### Text: Couples wallpaper, so cute. :) #BrothersAtHome\n\n### Label: no complaint"},
        {"Tweet text":"@mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp… https:\/\/t.co\/WRtNsokblG","ID":5,"Label":2,"text_label":"no complaint","output":"### Text: @mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp… https:\/\/t.co\/WRtNsokblG\n\n### Label: no complaint"},
        {"Tweet text":"@Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?","ID":6,"Label":2,"text_label":"no complaint","output":"### Text: @Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?\n\n### Label: no complaint"},
        {"Tweet text":"@nationalgridus I have no water and the bill is current and paid. Can you do something about this?","ID":7,"Label":1,"text_label":"complaint","output":"### Text: @nationalgridus I have no water and the bill is current and paid. Can you do something about this?\n\n### Label: complaint"},
        {"Tweet text":"@JenniferTilly Merry Christmas to as well. You get more stunning every year ��","ID":9,"Label":2,"text_label":"no complaint","output":"### Text: @JenniferTilly Merry Christmas to as well. You get more stunning every year ��\n\n### Label: no complaint"}
      ]

- Replace the example namespace value kfto with the name of your project.
- Replace the example training data with your training data.
- Save your changes in the training_data.yaml file.
- Apply the configuration to create the training data:

  $ oc apply -f training_data.yaml
Create a persistent volume claim (PVC), as follows:
- Create a YAML file named trainedmodelpvc.yaml. Add the following PersistentVolumeClaim object definition:

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: trained-model
    namespace: kfto
  spec:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 50Gi

- Replace the example namespace value kfto with the name of your project, and update the other parameters to suit your environment. To calculate the storage value, multiply the model size by the number of epochs, and add a little extra as a buffer (see the example calculation after these steps).
Save your changes in the
trainedmodelpvc.yamlfile. Apply the configuration to create a Persistent Volume Claim (PVC) for the training job:
$ oc apply -f trainedmodelpvc.yaml
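As an illustration of the storage calculation, assuming a hypothetical 4 GiB model trained for the 10 epochs used in this example:

  # 4 GiB per saved checkpoint x 10 epochs = 40 GiB, plus a buffer,
  # so a request like the example's 50Gi leaves reasonable headroom:
  storage: 50Gi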
Verification
- In the OpenShift console, select your project from the Project list.
- Click ConfigMaps and verify that the training-config and twitter-complaints ConfigMaps are listed.
- Click Search. From the Resources list, select PersistentVolumeClaim and verify that the trained-model PVC is listed.
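If you prefer the command line, an equivalent check with the OpenShift CLI (assuming the example kfto namespace; substitute your project name):

  $ oc get configmaps training-config twitter-complaints -n kfto
  $ oc get pvc trained-model -n kfto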
5.2. Running the training job
You can run a training job to tune a model. The example training job in this section is based on the IBM and Hugging Face tuning example provided in GitHub.
Prerequisites
- You have access to a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.
- You have created a data science project. For information about how to create a project, see Creating a data science project.
- You have Admin access for the data science project.
  - If you created the project, you automatically have Admin access.
  - If you did not create the project, your cluster administrator must give you Admin access.
- You have access to a model.
- You have access to data that you can use to train the model.
- You have configured the training job as described in Configuring the training job.
Procedure
In a terminal window, log in to the OpenShift CLI as shown in the following example:
$ oc login <openshift_cluster_url> -u <username> -p <password>

Create a PyTorch training job, as follows:
- Create a YAML file named pytorchjob.yaml. Add the following PyTorchJob object definition:

  apiVersion: kubeflow.org/v1
  kind: PyTorchJob
  metadata:
    name: kfto-demo
    namespace: kfto
  spec:
    pytorchReplicaSpecs:
      Master:
        replicas: 1
        restartPolicy: Never
        template:
          spec:
            containers:
              - env:
                  - name: SFT_TRAINER_CONFIG_JSON_PATH
                    value: /etc/config/config.json
                image: 'quay.io/modh/fms-hf-tuning:release'
                imagePullPolicy: IfNotPresent
                name: pytorch
                volumeMounts:
                  - mountPath: /etc/config
                    name: config-volume
                  - mountPath: /data/input
                    name: dataset-volume
                  - mountPath: /data/output
                    name: model-volume
            volumes:
              - configMap:
                  items:
                    - key: config.json
                      path: config.json
                  name: training-config
                name: config-volume
              - configMap:
                  name: twitter-complaints
                name: dataset-volume
              - name: model-volume
                persistentVolumeClaim:
                  claimName: trained-model
    runPolicy:
      suspend: false

- Replace the example namespace value kfto with the name of your project, and update the other parameters to suit your environment.
- Edit the parameters of the PyTorch training job to provide the details for your training job and environment.
- Save your changes in the pytorchjob.yaml file.
- Apply the configuration to run the PyTorch training job:

  $ oc apply -f pytorchjob.yaml
Verification
- In the OpenShift console, select your project from the Project list.
- Click Workloads → Pods and verify that the <training-job-name>-master-0 pod is listed.
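Alternatively, you can check from the CLI (assuming the example kfto namespace and kfto-demo job name):

  $ oc get pytorchjobs -n kfto
  $ oc get pods -n kfto | grep kfto-demo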
5.3. Monitoring the training job
When you run a training job to tune a model, you can monitor the progress of the job. The example training job in this section is based on the IBM and Hugging Face tuning example provided in GitHub.
Prerequisites
- You have access to a data science cluster that is configured to run distributed workloads as described in Managing distributed workloads.
- You have created a data science project. For information about how to create a project, see Creating a data science project.
- You have Admin access for the data science project.
  - If you created the project, you automatically have Admin access.
  - If you did not create the project, your cluster administrator must give you Admin access.
- You have access to a model.
- You have access to data that you can use to train the model.
- You are running the training job as described in Running the training job.
Procedure
- In the OpenShift console, select your project from the Project list.
- Click Workloads → Pods. Search for the pod that corresponds to the PyTorch job, that is, <training-job-name>-master-0. For example, if the training job name is kfto-demo, the pod name is kfto-demo-master-0.
- Click the pod name to open the pod details page.
- Click the Logs tab to monitor the progress of the job and view status updates, as shown in the following example output:

  0%| | 0/10 [00:00<?, ?it/s]
  10%|█ | 1/10 [01:10<10:32, 70.32s/it]
  {'loss': 6.9531, 'grad_norm': 1104.0, 'learning_rate': 9e-06, 'epoch': 1.0}
  10%|█ | 1/10 [01:10<10:32, 70.32s/it]
  20%|██ | 2/10 [01:40<06:13, 46.71s/it]
  30%|███ | 3/10 [02:26<05:25, 46.55s/it]
  {'loss': 2.4609, 'grad_norm': 736.0, 'learning_rate': 7e-06, 'epoch': 2.0}
  30%|███ | 3/10 [02:26<05:25, 46.55s/it]
  40%|████ | 4/10 [03:23<05:02, 50.48s/it]
  50%|█████ | 5/10 [03:41<03:13, 38.66s/it]
  {'loss': 1.7617, 'grad_norm': 328.0, 'learning_rate': 5e-06, 'epoch': 3.0}
  50%|█████ | 5/10 [03:41<03:13, 38.66s/it]
  60%|██████ | 6/10 [04:54<03:22, 50.58s/it]
  {'loss': 3.1797, 'grad_norm': 1016.0, 'learning_rate': 4.000000000000001e-06, 'epoch': 4.0}
  60%|██████ | 6/10 [04:54<03:22, 50.58s/it]
  70%|███████ | 7/10 [06:03<02:49, 56.59s/it]
  {'loss': 2.9297, 'grad_norm': 984.0, 'learning_rate': 3e-06, 'epoch': 5.0}
  70%|███████ | 7/10 [06:03<02:49, 56.59s/it]
  80%|████████ | 8/10 [06:38<01:39, 49.57s/it]
  90%|█████████ | 9/10 [07:22<00:48, 48.03s/it]
  {'loss': 1.4219, 'grad_norm': 684.0, 'learning_rate': 1.0000000000000002e-06, 'epoch': 6.0}
  90%|█████████ | 9/10 [07:22<00:48, 48.03s/it]
  100%|██████████| 10/10 [08:25<00:00, 52.53s/it]
  {'loss': 1.9609, 'grad_norm': 648.0, 'learning_rate': 0.0, 'epoch': 6.67}
  100%|██████████| 10/10 [08:25<00:00, 52.53s/it]
  {'train_runtime': 508.0444, 'train_samples_per_second': 0.197, 'train_steps_per_second': 0.02, 'train_loss': 2.63125, 'epoch': 6.67}
  100%|██████████| 10/10 [08:28<00:00, 52.53s/it]
  100%|██████████| 10/10 [08:28<00:00, 50.80s/it]

  In the example output, the solid blocks indicate progress bars.
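You can also stream the same logs from the CLI (assuming the example kfto-demo job in the kfto namespace):

  $ oc logs -f kfto-demo-master-0 -n kfto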
Verification
- The <training-job-name>-master-0 pod is running.
- The Logs tab provides information about the job progress and job status.