Chapter 2. Training a model
RHEL AI can use your taxonomy tree and synthetic data to create a newly trained model with your domain-specific knowledge or skills by using multi-phase training and evaluation. You can run the full training and evaluation process with the synthetic dataset you generated. The LAB-optimized technique of multi-phase training takes an LLM through multiple stages of training and evaluation. In each stage, RHEL AI runs the training process, produces model checkpoints, and selects the best checkpoint for the next phase. The best-scoring checkpoint from the final phase is your newly trained LLM.
The entire process creates a newly generated model that is trained and evaluated using the synthetic data from your taxonomy tree.
Red Hat Enterprise Linux AI 1.5 includes training on long-context data. As a result, training times may be longer compared to previous releases. You can reduce the training time by setting max_seq_len: 10000 in your config.yaml file.
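A minimal sketch of that override, assuming max_seq_len lives under the train section of your generated config.yaml (check your own file for the exact location of the key):

```yaml
# Sketch: only the key being changed is shown; all other train settings
# in your generated config.yaml stay as they are.
train:
  max_seq_len: 10000  # lower sequence-length cap to shorten long-context training
```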
2.1. Training the model on your data
You can use Red Hat Enterprise Linux AI to train a model with your synthetically generated data. The following procedures show how to do this using the LAB multi-phase training strategy.
Red Hat Enterprise Linux AI general availability does not support training and inference serving at the same time. If you have an inference server running, you must close it before you start the training process.
Prerequisites
- You installed RHEL AI with the bootable container image.
- You downloaded the granite-7b-starter model.
- You created a custom qna.yaml file with knowledge data.
- You ran the synthetic data generation (SDG) process.
- You downloaded the prometheus-8x7b-v2-0 judge model.
- You have root user access on your machine.
Procedure
You can run multi-phase training and evaluation by running the following command with the data files generated from SDG.
Note: You can use the --enable-serving-output flag with the ilab model train command to display the training logs.

$ ilab model train --strategy lab-multiphase \
    --phased-phase1-data ~/.local/share/instructlab/datasets/<generation-date>/<knowledge-train-messages-jsonl-file> \
    --phased-phase2-data ~/.local/share/instructlab/datasets/<generation-date>/<skills-train-messages-jsonl-file>

where
- <generation-date>
  The date when you ran synthetic data generation (SDG).
- <knowledge-train-messages-file>
  The location of the knowledge_messages.jsonl file generated during SDG. RHEL AI trains the student model granite-7b-starter using the data from this .jsonl file. Example path: ~/.local/share/instructlab/datasets/2024-09-07_194933/knowledge_train_msgs_2024-09-07T20_54_21.jsonl.
- <skills-train-messages-file>
  The location of the skills_messages.jsonl file generated during SDG. RHEL AI trains the student model granite-7b-starter using the data from this .jsonl file. Example path: ~/.local/share/instructlab/datasets/2024-09-07_194933/skills_train_msgs_2024-09-07T20_54_21.jsonl.
You can use the --strategy lab-skills-only value to train a model on skills only.

Example skills-only training command:

$ ilab model train --strategy lab-skills-only --phased-phase2-data ~/.local/share/instructlab/datasets/<skills-train-messages-jsonl-file>
The first phase trains the model using the synthetic data from your knowledge contribution.
Example output of training knowledge
Training Phase 1/2...
TrainingArgs for current phase: TrainingArgs(model_path='/opt/app-root/src/.cache/instructlab/models/granite-7b-starter', chat_tmpl_path='/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py', data_path='/tmp/jul19-knowledge-26k.jsonl', ckpt_output_dir='/tmp/e2e/phase1/checkpoints', data_output_dir='/opt/app-root/src/.local/share/instructlab/internal', max_seq_len=4096, max_batch_len=55000, num_epochs=2, effective_batch_size=128, save_samples=0, learning_rate=2e-05, warmup_steps=25, is_padding_free=True, random_seed=42, checkpoint_at_epoch=True, mock_data=False, mock_data_len=0, deepspeed_options=DeepSpeedOptions(cpu_offload_optimizer=False, cpu_offload_optimizer_ratio=1.0, cpu_offload_optimizer_pin_memory=False, save_samples=None), disable_flash_attn=False, lora=LoraOptions(rank=0, alpha=32, dropout=0.1, target_modules=('q_proj', 'k_proj', 'v_proj', 'o_proj'), quantize_data_type=<QuantizeDataType.NONE: None>))

Then, RHEL AI selects the best checkpoint to use for the next phase.
The next phase trains the model using the synthetic data from the skills data.
Example output of training skills
Training Phase 2/2...
TrainingArgs for current phase: TrainingArgs(model_path='/tmp/e2e/phase1/checkpoints/hf_format/samples_52096', chat_tmpl_path='/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py', data_path='/usr/share/instructlab/sdg/datasets/skills.jsonl', ckpt_output_dir='/tmp/e2e/phase2/checkpoints', data_output_dir='/opt/app-root/src/.local/share/instructlab/internal', max_seq_len=4096, max_batch_len=55000, num_epochs=2, effective_batch_size=3840, save_samples=0, learning_rate=2e-05, warmup_steps=25, is_padding_free=True, random_seed=42, checkpoint_at_epoch=True, mock_data=False, mock_data_len=0, deepspeed_options=DeepSpeedOptions(cpu_offload_optimizer=False, cpu_offload_optimizer_ratio=1.0, cpu_offload_optimizer_pin_memory=False, save_samples=None), disable_flash_attn=False, lora=LoraOptions(rank=0, alpha=32, dropout=0.1, target_modules=('q_proj', 'k_proj', 'v_proj', 'o_proj'), quantize_data_type=<QuantizeDataType.NONE: None>))

Then, RHEL AI evaluates all of the checkpoints from phase 2 model training using the Multi-turn Benchmark (MT-Bench) and returns the best-performing checkpoint as the fully trained output model.
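The selection step amounts to "highest MT-Bench score wins". As a shell sketch with hypothetical scores (RHEL AI performs this internally; the snippet only illustrates the idea):

```shell
# Illustrative only: pair phase-2 checkpoint names with hypothetical
# MT-Bench scores, then pick the highest-scoring checkpoint.
BEST=$(printf '%s\n' \
    "samples_1711 6.51" \
    "samples_1903 6.70" \
    "samples_1945 6.81" \
  | sort -k2 -n | tail -n 1 | cut -d' ' -f1)
echo "$BEST"
```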
Example output of evaluating skills
After training is complete, a confirmation appears and displays your best-performing checkpoint.
Example output of a complete multi-phase training run
Training finished! Best final checkpoint: samples_1945 with score: 6.813759384

Make a note of this checkpoint because the path is necessary for evaluation and serving.
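If you want to record the checkpoint name and score programmatically, a sed sketch like the following can extract them from the confirmation line (the line format here matches the example above; adjust the patterns if your output differs):

```shell
# Sketch: pull the checkpoint name and score out of the confirmation line
# so they can be recorded for the later evaluation and serving steps.
LINE="Training finished! Best final checkpoint: samples_1945 with score: 6.813759384"
CKPT=$(echo "$LINE" | sed -n 's/.*checkpoint: \([^ ]*\) with score:.*/\1/p')
SCORE=$(echo "$LINE" | sed -n 's/.*with score: //p')
echo "$CKPT $SCORE"
```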
Verification
When training a model with ilab model train, multiple checkpoints are saved with the samples_ prefix based on how many data points they have been trained on. They are saved to the ~/.local/share/instructlab/phase/ directory.

$ ls ~/.local/share/instructlab/phase/<phase1-or-phase2>/checkpoints/

Example output of the new models

samples_1711 samples_1945 samples_1456 samples_1462 samples_1903
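Because the names encode the number of training samples, you can sort them numerically to see which checkpoint was trained on the most data. A small sketch using the example names above:

```shell
# Sketch: sort the samples_<N> checkpoint names by their numeric suffix.
# The last entry is the checkpoint trained on the most data points.
SORTED=$(printf '%s\n' \
    samples_1711 samples_1945 samples_1456 samples_1462 samples_1903 \
  | sort -t_ -k2 -n)
echo "$SORTED"
```

Note that the most-trained checkpoint is not necessarily the best one; RHEL AI selects the final checkpoint by MT-Bench score, as described above.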
2.1.1. Continuing or restarting a training run
RHEL AI allows you to continue a training run that may have failed during multi-phase training. There are a few ways a training run can fail:
- The vLLM server may not start correctly.
- An accelerator or GPU may freeze, causing training to abort.
- There may be an error in your InstructLab config.yaml file.
When you run multi-phase training for the first time, the initial training data is saved to a journalfile.yaml file. If necessary, the metadata in this file can be used to restart a failed training run.
You can also restart a training run, which clears the training data, by following the CLI prompts when running multi-phase training.
Prerequisites
- You ran multi-phase training with your synthetic data and the run failed.
Procedure
Run the multi-phase training command again:

$ ilab model train --strategy lab-multiphase \
    --phased-phase1-data ~/.local/share/instructlab/datasets/<generation-date>/<knowledge-train-messages-jsonl-file> \
    --phased-phase2-data ~/.local/share/instructlab/datasets/<generation-date>/<skills-train-messages-jsonl-file>

The Red Hat Enterprise Linux AI CLI checks whether the journalfile.yaml file exists and continues the training run from that point. The CLI prompts you to continue the previous training run, or to start from the beginning.
Type n in your shell to continue from your previous training run.

Metadata (checkpoints, the training journal) may have been saved from a previous training run. By default, training will resume from this metadata if it exists Alternatively, the metadata can be cleared, and training can start from scratch Would you like to START TRAINING FROM THE BEGINNING? n
Type y into the terminal to restart a training run.

Metadata (checkpoints, the training journal) may have been saved from a previous training run. By default, training will resume from this metadata if it exists Alternatively, the metadata can be cleared, and training can start from scratch Would you like to START TRAINING FROM THE BEGINNING? y

Restarting also clears your system's cache of previous checkpoints, journal files, and other training data.