Chapter 2. Training a model
RHEL AI can use your taxonomy tree and synthetic data to create a newly trained model with your domain-specific knowledge or skills by using multi-phase training and evaluation. You can run the full training and evaluation process on the synthetic dataset you generated. The LAB optimized technique of multi-phase training is a type of LLM training that goes through multiple stages of training and evaluation. In each stage, RHEL AI runs the training process, produces model checkpoints, and selects the best checkpoint for the next phase. The best scoring checkpoint from the final phase is your newly trained LLM.
The entire process creates a newly generated model that is trained and evaluated using the synthetic data from your taxonomy tree.
Red Hat Enterprise Linux AI 1.5 includes training on long-context data. As a result, training times may be longer than in previous releases. You can reduce the training time by setting max_seq_len: 10000 in your config.yaml file.
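As a minimal sketch, the setting might look like the following in config.yaml. The surrounding train section shown here is an assumption; check your generated configuration file for the exact structure.

```yaml
# Hypothetical excerpt of an InstructLab config.yaml -- the surrounding
# keys are assumptions; only the max_seq_len value comes from this chapter.
train:
  max_seq_len: 10000  # cap sequence length to reduce long-context training time
```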
2.1. Training the model on your data
You can use Red Hat Enterprise Linux AI to train a model with your synthetically generated data. The following procedures show how to do this using the LAB multi-phase training strategy.
Red Hat Enterprise Linux AI general availability does not support training and inference serving at the same time. If you have an inference server running, you must close it before you start the training process.
Prerequisites
- You installed RHEL AI with the bootable container image.
- You downloaded the granite-7b-starter model.
- You created a custom qna.yaml file with knowledge data.
- You ran the synthetic data generation (SDG) process.
- You downloaded the prometheus-8x7b-v2-0 judge model.
- You have root user access on your machine.
Procedure
You can run multi-phase training and evaluation by running the following command with the data files generated from SDG.
Note: You can use the --enable-serving-output flag with the ilab model train command to display the training logs.

$ ilab model train --strategy lab-multiphase \
    --phased-phase1-data ~/.local/share/instructlab/datasets/<generation-date>/<knowledge-train-messages-jsonl-file> \
    --phased-phase2-data ~/.local/share/instructlab/datasets/<generation-date>/<skills-train-messages-jsonl-file>

where
- <generation-date>
- The date when you ran Synthetic Data Generation (SDG).
- <knowledge-train-messages-file>
- The location of the knowledge_messages.jsonl file generated during SDG. RHEL AI trains the student model granite-7b-starter using the data from this .jsonl file. Example path: ~/.local/share/instructlab/datasets/2024-09-07_194933/knowledge_train_msgs_2024-09-07T20_54_21.jsonl.
- <skills-train-messages-file>
- The location of the skills_messages.jsonl file generated during SDG. RHEL AI trains the student model granite-7b-starter using the data from this .jsonl file. Example path: ~/.local/share/instructlab/datasets/2024-09-07_194933/skills_train_msgs_2024-09-07T20_54_21.jsonl.
You can use the --strategy lab-skills-only value to train a model on skills only.
Example skills only training command:
$ ilab model train --strategy lab-skills-only --phased-phase2-data ~/.local/share/instructlab/datasets/<skills-train-messages-jsonl-file>
The first phase trains the model using the synthetic data from your knowledge contribution.
Example output of training knowledge
Training Phase 1/2...
TrainingArgs for current phase: TrainingArgs(model_path='/opt/app-root/src/.cache/instructlab/models/granite-7b-starter', chat_tmpl_path='/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py', data_path='/tmp/jul19-knowledge-26k.jsonl', ckpt_output_dir='/tmp/e2e/phase1/checkpoints', data_output_dir='/opt/app-root/src/.local/share/instructlab/internal', max_seq_len=4096, max_batch_len=55000, num_epochs=2, effective_batch_size=128, save_samples=0, learning_rate=2e-05, warmup_steps=25, is_padding_free=True, random_seed=42, checkpoint_at_epoch=True, mock_data=False, mock_data_len=0, deepspeed_options=DeepSpeedOptions(cpu_offload_optimizer=False, cpu_offload_optimizer_ratio=1.0, cpu_offload_optimizer_pin_memory=False, save_samples=None), disable_flash_attn=False, lora=LoraOptions(rank=0, alpha=32, dropout=0.1, target_modules=('q_proj', 'k_proj', 'v_proj', 'o_proj'), quantize_data_type=<QuantizeDataType.NONE: None>))

Then, RHEL AI selects the best checkpoint to use for the next phase.
The next phase trains the model using the synthetic data from the skills data.
Example output of training skills
Training Phase 2/2...
TrainingArgs for current phase: TrainingArgs(model_path='/tmp/e2e/phase1/checkpoints/hf_format/samples_52096', chat_tmpl_path='/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py', data_path='/usr/share/instructlab/sdg/datasets/skills.jsonl', ckpt_output_dir='/tmp/e2e/phase2/checkpoints', data_output_dir='/opt/app-root/src/.local/share/instructlab/internal', max_seq_len=4096, max_batch_len=55000, num_epochs=2, effective_batch_size=3840, save_samples=0, learning_rate=2e-05, warmup_steps=25, is_padding_free=True, random_seed=42, checkpoint_at_epoch=True, mock_data=False, mock_data_len=0, deepspeed_options=DeepSpeedOptions(cpu_offload_optimizer=False, cpu_offload_optimizer_ratio=1.0, cpu_offload_optimizer_pin_memory=False, save_samples=None), disable_flash_attn=False, lora=LoraOptions(rank=0, alpha=32, dropout=0.1, target_modules=('q_proj', 'k_proj', 'v_proj', 'o_proj'), quantize_data_type=<QuantizeDataType.NONE: None>))

Then, RHEL AI evaluates all of the checkpoints from phase 2 model training using the Multi-turn Benchmark (MT-Bench) and returns the best performing checkpoint as the fully trained output model.
Example output of evaluating skills
MT-Bench evaluation for Phase 2...
Using gpus from --gpus or evaluate config and ignoring --tensor-parallel-size configured in serve vllm_args
INFO 2024-08-15 10:04:51,065 instructlab.model.backends.backends:437: Trying to connect to model server at http://127.0.0.1:8000/v1
INFO 2024-08-15 10:04:53,580 instructlab.model.backends.vllm:208: vLLM starting up on pid 79388 at http://127.0.0.1:54265/v1
INFO 2024-08-15 10:04:53,580 instructlab.model.backends.backends:450: Starting a temporary vLLM server at http://127.0.0.1:54265/v1
INFO 2024-08-15 10:04:53,580 instructlab.model.backends.backends:465: Waiting for the vLLM server to start at http://127.0.0.1:54265/v1, this might take a moment... Attempt: 1/300
INFO 2024-08-15 10:04:58,003 instructlab.model.backends.backends:465: Waiting for the vLLM server to start at http://127.0.0.1:54265/v1, this might take a moment... Attempt: 2/300
INFO 2024-08-15 10:05:02,314 instructlab.model.backends.backends:465: Waiting for the vLLM server to start at http://127.0.0.1:54265/v1, this might take a moment... Attempt: 3/300
INFO 2024-08-15 10:06:07,611 instructlab.model.backends.backends:472: vLLM engine successfully started at http://127.0.0.1:54265/v1

After training is complete, a confirmation appears and displays your best performing checkpoint.
Example output of a complete multi-phase training run
Training finished! Best final checkpoint: samples_1945 with score: 6.813759384

Make a note of this checkpoint because the path is necessary for evaluation and serving.
Verification
When training a model with ilab model train, multiple checkpoints are saved with the samples_ prefix based on how many data points they have been trained on. These are saved to the ~/.local/share/instructlab/phase/ directory.

$ ls ~/.local/share/instructlab/phase/<phase1-or-phase2>/checkpoints/

Example output of the new models
samples_1711 samples_1945 samples_1456 samples_1462 samples_1903
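The directory listing above is unordered. As a minimal sketch, you can sort the checkpoint names numerically by their samples_ suffix; note that the best checkpoint is the one reported at the end of training by MT-Bench score, not necessarily the one trained on the most samples.

```shell
# Sort checkpoint directory names numerically by the samples_ count.
# The printf stand-in reproduces the example listing above; on a real
# system you would pipe in the output of:
#   ls ~/.local/share/instructlab/phase/phase2/checkpoints/
printf '%s\n' samples_1711 samples_1945 samples_1456 samples_1462 samples_1903 \
  | sort -t_ -k2 -n
```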
2.1.1. Continuing or restarting a training run
RHEL AI allows you to continue a training run that may have failed during multi-phase training. There are a few ways a training run can fail:
- The vLLM server may not start correctly.
- An accelerator or GPU may freeze, causing training to abort.
- There may be an error in your InstructLab config.yaml file.
When you run multi-phase training for the first time, the initial training data is saved into a journalfile.yaml file. If necessary, the metadata in this file can be used to resume a failed training run.
You can also restart a training run, which clears the training data, by following the CLI prompts when running multi-phase training.
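As a sketch, you can check whether a journal from a previous run exists before rerunning training. The journalfile.yaml location used below is an assumption; verify the actual path on your system.

```shell
# Assumption: the training journal lives under the phase directory;
# confirm the real journalfile.yaml location on your system.
journal="$HOME/.local/share/instructlab/phase/journalfile.yaml"
if [ -f "$journal" ]; then
  echo "Journal found: rerunning ilab model train will offer to resume."
else
  echo "No journal found: training will start from the beginning."
fi
```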
Prerequisites
- You ran multi-phase training with your synthetic data and the run failed.
Procedure
Run the multi-phase training command again.
$ ilab model train --strategy lab-multiphase \
    --phased-phase1-data ~/.local/share/instructlab/datasets/<generation-date>/<knowledge-train-messages-jsonl-file> \
    --phased-phase2-data ~/.local/share/instructlab/datasets/<generation-date>/<skills-train-messages-jsonl-file>

The Red Hat Enterprise Linux AI CLI checks whether the journalfile.yaml file exists and continues the training run from that point. The CLI prompts you to continue the previous training run or to start from the beginning.
Type n in your shell to continue from your previous training run.

Metadata (checkpoints, the training journal) may have been saved from a previous training run. By default, training will resume from this metadata if it exists. Alternatively, the metadata can be cleared, and training can start from scratch. Would you like to START TRAINING FROM THE BEGINNING? n

Type y into the terminal to restart a training run.

Metadata (checkpoints, the training journal) may have been saved from a previous training run. By default, training will resume from this metadata if it exists. Alternatively, the metadata can be cleared, and training can start from scratch. Would you like to START TRAINING FROM THE BEGINNING? y

Restarting also clears your system's cache of previous checkpoints, journal files, and other training data.