Chapter 5. Tuning a model by using the Training Operator
To tune a model by using the Kubeflow Training Operator, you configure and run a training job.
Optionally, you can use Low-Rank Adaptation (LoRA) to efficiently fine-tune large language models, such as Llama 3. The integration optimizes computational requirements and reduces memory footprint, allowing fine-tuning on consumer-grade GPUs. The solution combines PyTorch Fully Sharded Data Parallel (FSDP) and LoRA to enable scalable, cost-effective model training and inference, enhancing the flexibility and performance of AI workloads within OpenShift environments.
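To illustrate why LoRA reduces the memory footprint, compare the trainable parameter counts for a single linear layer: full fine-tuning updates the entire weight matrix, whereas LoRA trains only two low-rank factors and freezes the base weights. This is a generic back-of-the-envelope sketch; the 4096 hidden size is an illustrative value, not taken from any model named above.

```python
def full_params(d_in: int, d_out: int) -> int:
    # Full fine-tuning updates the whole (d_out x d_in) weight matrix.
    return d_in * d_out

def lora_params(d_in: int, d_out: int, r: int) -> int:
    # LoRA trains two factors: A (r x d_in) and B (d_out x r).
    return r * (d_in + d_out)

d_in = d_out = 4096  # illustrative hidden size for a large language model
r = 8                # lora_r from the example configuration

print(full_params(d_in, d_out))       # 16777216
print(lora_params(d_in, d_out, r))    # 65536
print(lora_params(d_in, d_out, r) / full_params(d_in, d_out))  # 0.00390625
```

At rank 8, the layer trains roughly 0.4% of the parameters that full fine-tuning would, which is what makes consumer-grade GPUs viable for this workload.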
5.1. Configuring the training job
Before you can use a training job to tune a model, you must configure the training job. The example training job in this section is based on the IBM and Hugging Face tuning example provided in GitHub.
Prerequisites
- You have logged in to OpenShift.
- You have access to a data science cluster that is configured to run distributed workloads as described in Configuring distributed workloads.
- You have created a data science project. For information about how to create a project, see Creating a data science project.
- You have Admin access for the data science project.
- If you created the project, you automatically have Admin access.
- If you did not create the project, your cluster administrator must give you Admin access.
- You have access to a model.
- You have access to data that you can use to train the model.
Procedure
In a terminal window, if you are not already logged in to your OpenShift cluster, log in to the OpenShift CLI as shown in the following example:
$ oc login <openshift_cluster_url> -u <username> -p <password>
Configure a training job, as follows:
- Create a YAML file named config_trainingjob.yaml. Add the ConfigMap object definition as follows:

Example training-job configuration

kind: ConfigMap
apiVersion: v1
metadata:
  name: training-config
  namespace: kfto
data:
  config.json: |
    {
      "model_name_or_path": "bigscience/bloom-560m",
      "training_data_path": "/data/input/twitter_complaints.json",
      "output_dir": "/data/output/tuning/bloom-twitter",
      "num_train_epochs": 10.0,
      "per_device_train_batch_size": 4,
      "gradient_accumulation_steps": 4,
      "learning_rate": 1e-05,
      "response_template": "\n### Label:",
      "dataset_text_field": "output",
      "use_flash_attn": false
    }
- Optional: To fine-tune with Low-Rank Adaptation (LoRA), update the config.json section as follows:
  - Set the peft_method parameter to "lora".
  - Add the lora_r, lora_alpha, lora_dropout, bias, and target_modules parameters.

Example LoRA configuration

...
"use_flash_attn": false,
"peft_method": "lora",
"lora_r": 8,
"lora_alpha": 8,
"lora_dropout": 0.1,
"bias": "none",
"target_modules": ["all-linear"]
}
- Edit the metadata of the training-job configuration as shown in the following table.

Table 5.1. Training-job configuration metadata

- name: Name of the training-job configuration
- namespace: Name of your project
- Edit the parameters of the training-job configuration as shown in the following table.

Table 5.2. Training-job configuration parameters

- model_name_or_path: Name of the pre-trained model, or the path to the model in the training-job container; in this example, the model name is taken from the Hugging Face web page
- training_data_path: Path to the training data that you set in the training_data.yaml ConfigMap
- output_dir: Output directory for the model
- num_train_epochs: Number of epochs for training; in this example, the training job runs for 10 epochs
- per_device_train_batch_size: Batch size, the number of data set examples to process together; in this example, the training job processes 4 examples at a time
- gradient_accumulation_steps: Number of gradient accumulation steps
- learning_rate: Learning rate for the training
- response_template: Template formatting for the response
- dataset_text_field: Dataset field for training output, as set in the training_data.yaml ConfigMap
- use_flash_attn: Whether to use flash attention
- peft_method: Tuning method: for full fine-tuning, omit this parameter; for LoRA, set to "lora"; for prompt tuning, set to "pt"
- lora_r: LoRA: rank of the low-rank decomposition
- lora_alpha: LoRA: scaling factor that controls the influence of the low-rank matrices on the model's adaptations
- lora_dropout: LoRA: dropout rate applied to the LoRA layers, a regularization technique to prevent overfitting
- bias: LoRA: whether to adapt bias terms in the model; setting the bias to "none" indicates that no bias terms are adapted
- target_modules: LoRA: names of the modules to apply LoRA to; to include all linear layers, set to ["all-linear"]; optional parameter for some models
- Save your changes in the config_trainingjob.yaml file.
- Apply the configuration to create the training-config object:

$ oc apply -f config_trainingjob.yaml
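Two of the parameters in the configuration interact: the effective batch size seen by the optimizer is per_device_train_batch_size multiplied by gradient_accumulation_steps, per device. With the example values:

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4

# Gradients are accumulated over 4 micro-batches of 4 examples each,
# so each optimizer step effectively uses 16 examples per device.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 16
```

If you lower per_device_train_batch_size to fit GPU memory, you can raise gradient_accumulation_steps to keep the effective batch size, and therefore the training dynamics, roughly unchanged.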
Create the training data.

Note: The training data in this simple example is for demonstration purposes only, and is not suitable for production use. The usual method for providing training data is to use persistent volumes.
- Create a YAML file named training_data.yaml. Add the following ConfigMap object definition:

kind: ConfigMap
apiVersion: v1
metadata:
  name: twitter-complaints
  namespace: kfto
data:
  twitter_complaints.json: |
    [
      {"Tweet text":"@HMRCcustomers No this is my first job","ID":0,"Label":2,"text_label":"no complaint","output":"### Text: @HMRCcustomers No this is my first job\n\n### Label: no complaint"},
      {"Tweet text":"@KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.","ID":1,"Label":2,"text_label":"no complaint","output":"### Text: @KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.\n\n### Label: no complaint"},
      {"Tweet text":"@EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.","ID":3,"Label":1,"text_label":"complaint","output":"### Text: @EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.\n\n### Label: complaint"},
      {"Tweet text":"Couples wallpaper, so cute. :) #BrothersAtHome","ID":4,"Label":2,"text_label":"no complaint","output":"### Text: Couples wallpaper, so cute. :) #BrothersAtHome\n\n### Label: no complaint"},
      {"Tweet text":"@mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp… https:\/\/t.co\/WRtNsokblG","ID":5,"Label":2,"text_label":"no complaint","output":"### Text: @mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp… https:\/\/t.co\/WRtNsokblG\n\n### Label: no complaint"},
      {"Tweet text":"@Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?","ID":6,"Label":2,"text_label":"no complaint","output":"### Text: @Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?\n\n### Label: no complaint"},
      {"Tweet text":"@nationalgridus I have no water and the bill is current and paid. Can you do something about this?","ID":7,"Label":1,"text_label":"complaint","output":"### Text: @nationalgridus I have no water and the bill is current and paid. Can you do something about this?\n\n### Label: complaint"},
      {"Tweet text":"@JenniferTilly Merry Christmas to as well. You get more stunning every year ��","ID":9,"Label":2,"text_label":"no complaint","output":"### Text: @JenniferTilly Merry Christmas to as well. You get more stunning every year ��\n\n### Label: no complaint"}
    ]
- Replace the example namespace value kfto with the name of your project.
- Replace the example training data with your training data.
- Save your changes in the training_data.yaml file.
- Apply the configuration to create the training data:

$ oc apply -f training_data.yaml
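Each record's output field in the training data combines the tweet text with its label, terminated by the same "### Label:" marker that the response_template parameter declares in the training-job configuration. The helper below is a hypothetical sketch of how such records could be generated for your own data; it is not part of the tuning stack.

```python
def make_record(tweet: str, record_id: int, label: int, text_label: str) -> dict:
    """Build one training record in the format used by the example ConfigMap."""
    return {
        "Tweet text": tweet,
        "ID": record_id,
        "Label": label,
        "text_label": text_label,
        # The "\n### Label:" portion must match response_template exactly,
        # so the trainer can locate where the response begins.
        "output": f"### Text: {tweet}\n\n### Label: {text_label}",
    }

record = make_record("@HMRCcustomers No this is my first job", 0, 2, "no complaint")
print(record["output"])
```

If you change response_template in config.json, regenerate the output fields to match, or the trainer cannot mask the prompt portion of each example.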
Create a persistent volume claim (PVC), as follows:
- Create a YAML file named trainedmodelpvc.yaml. Add the following PersistentVolumeClaim object definition:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: trained-model
  namespace: kfto
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
- Replace the example namespace value kfto with the name of your project, and update the other parameters to suit your environment. To calculate the storage value, multiply the model size by the number of epochs, and add a little extra as a buffer.
Save your changes in the
trainedmodelpvc.yaml
file. Apply the configuration to create a Persistent Volume Claim (PVC) for the training job:
$ oc apply -f trainedmodelpvc.yaml
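The storage sizing rule described above (model size multiplied by the number of epochs, plus a buffer) can be sketched as a quick calculation. The 4 GiB checkpoint size and 10 GiB buffer below are illustrative assumptions, chosen only to show how a figure like the example's 50Gi could be reached:

```python
def pvc_storage_gi(model_size_gi: int, num_epochs: int, buffer_gi: int) -> int:
    """Estimate PVC size: one saved checkpoint per epoch plus headroom."""
    return model_size_gi * num_epochs + buffer_gi

# An assumed 4 GiB checkpoint over the example's 10 epochs, with 10 GiB
# of headroom, gives 50 -- in line with the 50Gi in the example manifest.
print(pvc_storage_gi(4, 10, 10))  # 50
```

Measure your model's actual on-disk checkpoint size before sizing the PVC; an undersized volume causes the job to fail partway through training when a checkpoint write runs out of space.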
Verification
- In the OpenShift console, select your project from the Project list.
- Click ConfigMaps and verify that the training-config and twitter-complaints ConfigMaps are listed.
- Click Search. From the Resources list, select PersistentVolumeClaim and verify that the trained-model PVC is listed.
5.2. Running the training job
You can run a training job to tune a model. The example training job in this section is based on the IBM and Hugging Face tuning example provided in GitHub.
Prerequisites
- You have access to a data science cluster that is configured to run distributed workloads as described in Configuring distributed workloads.
- You have created a data science project. For information about how to create a project, see Creating a data science project.
- You have Admin access for the data science project.
- If you created the project, you automatically have Admin access.
- If you did not create the project, your cluster administrator must give you Admin access.
- You have access to a model.
- You have access to data that you can use to train the model.
- You have configured the training job as described in Configuring the training job.
Procedure
In a terminal window, log in to the OpenShift CLI as shown in the following example:
$ oc login <openshift_cluster_url> -u <username> -p <password>
Create a PyTorch training job, as follows:
- Create a YAML file named pytorchjob.yaml. Add the following PyTorchJob object definition:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: kfto-demo
  namespace: kfto
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - env:
                - name: SFT_TRAINER_CONFIG_JSON_PATH
                  value: /etc/config/config.json
              image: 'quay.io/modh/fms-hf-tuning:release'
              imagePullPolicy: IfNotPresent
              name: pytorch
              volumeMounts:
                - mountPath: /etc/config
                  name: config-volume
                - mountPath: /data/input
                  name: dataset-volume
                - mountPath: /data/output
                  name: model-volume
          volumes:
            - configMap:
                items:
                  - key: config.json
                    path: config.json
                name: training-config
              name: config-volume
            - configMap:
                name: twitter-complaints
              name: dataset-volume
            - name: model-volume
              persistentVolumeClaim:
                claimName: trained-model
  runPolicy:
    suspend: false
- Replace the example namespace value kfto with the name of your project, and update the other parameters to suit your environment.
- Edit the parameters of the PyTorch training job to provide the details for your training job and environment.
- Save your changes in the pytorchjob.yaml file.
- Apply the configuration to run the PyTorch training job:

$ oc apply -f pytorchjob.yaml
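One easy mistake when editing the manifest is to rename a volume without updating the corresponding mount: every name under volumeMounts must match a name under volumes, or the pod fails to start. A small sanity check over the names in the example (hand-copied here, not read from the file):

```python
# Mount names and paths copied from the example PyTorchJob manifest.
volume_mounts = {
    "config-volume": "/etc/config",
    "dataset-volume": "/data/input",
    "model-volume": "/data/output",
}
declared_volumes = {"config-volume", "dataset-volume", "model-volume"}

# Any mount whose name has no matching volume prevents the pod from starting.
unmatched = set(volume_mounts) - declared_volumes
print(sorted(unmatched))  # []
```

The same check applies in reverse if you remove a mount: a declared but unmounted volume is harmless, but a mount without a volume is a hard error.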
Verification
- In the OpenShift console, select your project from the Project list.
- Click Workloads → Pods and verify that the <training-job-name>-master-0 pod is listed.
5.3. Monitoring the training job
When you run a training job to tune a model, you can monitor the progress of the job. The example training job in this section is based on the IBM and Hugging Face tuning example provided in GitHub.
Prerequisites
- You have access to a data science cluster that is configured to run distributed workloads as described in Configuring distributed workloads.
- You have created a data science project. For information about how to create a project, see Creating a data science project.
- You have Admin access for the data science project.
- If you created the project, you automatically have Admin access.
- If you did not create the project, your cluster administrator must give you Admin access.
- You have access to a model.
- You have access to data that you can use to train the model.
- You are running the training job as described in Running the training job.
Procedure
- In the OpenShift console, select your project from the Project list.
- Click Workloads → Pods. Search for the pod that corresponds to the PyTorch job, that is, <training-job-name>-master-0.
  For example, if the training job name is kfto-demo, the pod name is kfto-demo-master-0.
- Click the pod name to open the pod details page.
Click the Logs tab to monitor the progress of the job and view status updates, as shown in the following example output:
0%|          | 0/10 [00:00<?, ?it/s]
10%|█         | 1/10 [01:10<10:32, 70.32s/it]
{'loss': 6.9531, 'grad_norm': 1104.0, 'learning_rate': 9e-06, 'epoch': 1.0}
10%|█         | 1/10 [01:10<10:32, 70.32s/it]
20%|██        | 2/10 [01:40<06:13, 46.71s/it]
30%|███       | 3/10 [02:26<05:25, 46.55s/it]
{'loss': 2.4609, 'grad_norm': 736.0, 'learning_rate': 7e-06, 'epoch': 2.0}
30%|███       | 3/10 [02:26<05:25, 46.55s/it]
40%|████      | 4/10 [03:23<05:02, 50.48s/it]
50%|█████     | 5/10 [03:41<03:13, 38.66s/it]
{'loss': 1.7617, 'grad_norm': 328.0, 'learning_rate': 5e-06, 'epoch': 3.0}
50%|█████     | 5/10 [03:41<03:13, 38.66s/it]
60%|██████    | 6/10 [04:54<03:22, 50.58s/it]
{'loss': 3.1797, 'grad_norm': 1016.0, 'learning_rate': 4.000000000000001e-06, 'epoch': 4.0}
60%|██████    | 6/10 [04:54<03:22, 50.58s/it]
70%|███████   | 7/10 [06:03<02:49, 56.59s/it]
{'loss': 2.9297, 'grad_norm': 984.0, 'learning_rate': 3e-06, 'epoch': 5.0}
70%|███████   | 7/10 [06:03<02:49, 56.59s/it]
80%|████████  | 8/10 [06:38<01:39, 49.57s/it]
90%|█████████ | 9/10 [07:22<00:48, 48.03s/it]
{'loss': 1.4219, 'grad_norm': 684.0, 'learning_rate': 1.0000000000000002e-06, 'epoch': 6.0}
90%|█████████ | 9/10 [07:22<00:48, 48.03s/it]
100%|██████████| 10/10 [08:25<00:00, 52.53s/it]
{'loss': 1.9609, 'grad_norm': 648.0, 'learning_rate': 0.0, 'epoch': 6.67}
100%|██████████| 10/10 [08:25<00:00, 52.53s/it]
{'train_runtime': 508.0444, 'train_samples_per_second': 0.197, 'train_steps_per_second': 0.02, 'train_loss': 2.63125, 'epoch': 6.67}
100%|██████████| 10/10 [08:28<00:00, 52.53s/it]
100%|██████████| 10/10 [08:28<00:00, 50.80s/it]
In the example output, the solid blocks indicate progress bars.
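The per-epoch metric lines in that output are Python dict literals, so you can extract the training progress from the pod logs programmatically. A sketch using only the standard library; the regex targets the {'loss': ...} lines shown above, and the truncated sample here is for demonstration:

```python
import ast
import re

sample_log = """
10%|█         | 1/10 [01:10<10:32, 70.32s/it]
{'loss': 6.9531, 'grad_norm': 1104.0, 'learning_rate': 9e-06, 'epoch': 1.0}
30%|███       | 3/10 [02:26<05:25, 46.55s/it]
{'loss': 2.4609, 'grad_norm': 736.0, 'learning_rate': 7e-06, 'epoch': 2.0}
"""

# Each metrics line is a dict literal; ast.literal_eval parses it safely
# without executing arbitrary code.
metrics = [
    ast.literal_eval(line)
    for line in sample_log.splitlines()
    if re.match(r"^\{'loss':", line)
]
losses = [m["loss"] for m in metrics]
print(losses)  # [6.9531, 2.4609]
```

You could feed this parser the output of `oc logs <training-job-name>-master-0` to plot the loss curve and confirm that training is converging.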
Verification
- The <training-job-name>-master-0 pod is running.
- The Logs tab provides information about the job progress and job status.