
Chapter 2. Training a model


RHEL AI can use your taxonomy tree and synthetic data to create a newly trained model with your domain-specific knowledge or skills using multi-phase training and evaluation. You can run the full training and evaluation process using the synthetic dataset you generated. Multi-phase training, a LAB-optimized technique, is a type of LLM training that runs through multiple stages of training and evaluation. In each stage, RHEL AI runs the training process, produces model checkpoints, and selects the best-scoring checkpoint for the next phase. The best-scoring checkpoint from the final phase is your newly trained LLM.

The entire process creates a newly generated model that is trained and evaluated using the synthetic data from your taxonomy tree.

Note

Red Hat Enterprise Linux AI 1.5 includes training on long-context data. As a result, training times may be longer than in previous releases. You can reduce the training time by setting max_seq_len: 10000 in your config.yaml file.
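For reference, a config.yaml fragment showing this setting might look like the following. The exact file location and nesting are assumptions based on InstructLab-style configuration and may differ in your release; verify them against your installed version:

```yaml
# Assumed location: ~/.config/instructlab/config.yaml (may vary by release)
train:
  # Cap the maximum sequence length to shorten long-context training runs
  max_seq_len: 10000
```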

2.1. Training the model on your data

You can use Red Hat Enterprise Linux AI to train a model with your synthetically generated data. The following procedures show how to do this using the LAB multi-phase training strategy.

Important

Red Hat Enterprise Linux AI general availability does not support training and inference serving at the same time. If you have an inference server running, you must close it before you start the training process.

Prerequisites

  • You installed RHEL AI with the bootable container image.
  • You downloaded the granite-7b-starter model.
  • You created a custom qna.yaml file with knowledge data.
  • You ran the synthetic data generation (SDG) process.
  • You downloaded the prometheus-8x7b-v2-0 judge model.
  • You have root user access on your machine.

Procedure

  1. Run multi-phase training and evaluation by entering the following command with the data files generated from SDG.

    Note

    You can use the --enable-serving-output flag with the ilab model train command to display the training logs.

    $ ilab model train --strategy lab-multiphase \
            --phased-phase1-data ~/.local/share/instructlab/datasets/<generation-date>/<knowledge-train-messages-jsonl-file> \
            --phased-phase2-data ~/.local/share/instructlab/datasets/<generation-date>/<skills-train-messages-jsonl-file>

    where

    <generation-date>
    The date when you ran synthetic data generation (SDG).
    <knowledge-train-messages-jsonl-file>
    The location of the knowledge_messages.jsonl file generated during SDG. RHEL AI trains the student model granite-7b-starter using the data from this .jsonl file. Example path: ~/.local/share/instructlab/datasets/2024-09-07_194933/knowledge_train_msgs_2024-09-07T20_54_21.jsonl.
    <skills-train-messages-jsonl-file>
    The location of the skills_messages.jsonl file generated during SDG. RHEL AI trains the student model granite-7b-starter using the data from this .jsonl file. Example path: ~/.local/share/instructlab/datasets/2024-09-07_194933/skills_train_msgs_2024-09-07T20_54_21.jsonl.
Note

You can use the --strategy lab-skills-only value to train a model on skills only.

Example skills only training command:

$ ilab model train --strategy lab-skills-only --phased-phase2-data ~/.local/share/instructlab/datasets/<skills-train-messages-jsonl-file>

  2. The first phase trains the model using the synthetic data from your knowledge contribution.

    Example output of training knowledge

    Training Phase 1/2...
    TrainingArgs for current phase: TrainingArgs(model_path='/opt/app-root/src/.cache/instructlab/models/granite-7b-starter', chat_tmpl_path='/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py', data_path='/tmp/jul19-knowledge-26k.jsonl', ckpt_output_dir='/tmp/e2e/phase1/checkpoints', data_output_dir='/opt/app-root/src/.local/share/instructlab/internal', max_seq_len=4096, max_batch_len=55000, num_epochs=2, effective_batch_size=128, save_samples=0, learning_rate=2e-05, warmup_steps=25, is_padding_free=True, random_seed=42, checkpoint_at_epoch=True, mock_data=False, mock_data_len=0, deepspeed_options=DeepSpeedOptions(cpu_offload_optimizer=False, cpu_offload_optimizer_ratio=1.0, cpu_offload_optimizer_pin_memory=False, save_samples=None), disable_flash_attn=False, lora=LoraOptions(rank=0, alpha=32, dropout=0.1, target_modules=('q_proj', 'k_proj', 'v_proj', 'o_proj'), quantize_data_type=<QuantizeDataType.NONE: None>))

  3. Then, RHEL AI selects the best checkpoint to use for the next phase.
  4. The next phase trains the model using the synthetic data from your skills contribution.

    Example output of training skills

    Training Phase 2/2...
    TrainingArgs for current phase: TrainingArgs(model_path='/tmp/e2e/phase1/checkpoints/hf_format/samples_52096', chat_tmpl_path='/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py', data_path='/usr/share/instructlab/sdg/datasets/skills.jsonl', ckpt_output_dir='/tmp/e2e/phase2/checkpoints', data_output_dir='/opt/app-root/src/.local/share/instructlab/internal', max_seq_len=4096, max_batch_len=55000, num_epochs=2, effective_batch_size=3840, save_samples=0, learning_rate=2e-05, warmup_steps=25, is_padding_free=True, random_seed=42, checkpoint_at_epoch=True, mock_data=False, mock_data_len=0, deepspeed_options=DeepSpeedOptions(cpu_offload_optimizer=False, cpu_offload_optimizer_ratio=1.0, cpu_offload_optimizer_pin_memory=False, save_samples=None), disable_flash_attn=False, lora=LoraOptions(rank=0, alpha=32, dropout=0.1, target_modules=('q_proj', 'k_proj', 'v_proj', 'o_proj'), quantize_data_type=<QuantizeDataType.NONE: None>))

  5. Then, RHEL AI evaluates all of the checkpoints from phase 2 model training using the Multi-turn Benchmark (MT-Bench) and returns the best-performing checkpoint as the fully trained output model.

    Example output of evaluating skills

    MT-Bench evaluation for Phase 2...
    Using gpus from --gpus or evaluate config and ignoring --tensor-parallel-size configured in serve vllm_args
    INFO 2024-08-15 10:04:51,065 instructlab.model.backends.backends:437: Trying to connect to model server at http://127.0.0.1:8000/v1
    INFO 2024-08-15 10:04:53,580 instructlab.model.backends.vllm:208: vLLM starting up on pid 79388 at http://127.0.0.1:54265/v1
    INFO 2024-08-15 10:04:53,580 instructlab.model.backends.backends:450: Starting a temporary vLLM server at http://127.0.0.1:54265/v1
    INFO 2024-08-15 10:04:53,580 instructlab.model.backends.backends:465: Waiting for the vLLM server to start at http://127.0.0.1:54265/v1, this might take a moment... Attempt: 1/300
    INFO 2024-08-15 10:04:58,003 instructlab.model.backends.backends:465: Waiting for the vLLM server to start at http://127.0.0.1:54265/v1, this might take a moment... Attempt: 2/300
    INFO 2024-08-15 10:05:02,314 instructlab.model.backends.backends:465: Waiting for the vLLM server to start at http://127.0.0.1:54265/v1, this might take a moment... Attempt: 3/300
    INFO 2024-08-15 10:06:07,611 instructlab.model.backends.backends:472: vLLM engine successfully started at http://127.0.0.1:54265/v1

  6. After training is complete, a confirmation appears that displays your best-performing checkpoint.

    Example output of a complete multi-phase training run

    Training finished! Best final checkpoint: samples_1945 with score: 6.813759384

    Make a note of this checkpoint because the path is necessary for evaluation and serving.
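If you script around the CLI, the checkpoint name and score can be recovered from that final confirmation line. A minimal sketch; the log format is taken from the example output above and may change between releases:

```python
import re

def parse_training_summary(line: str):
    """Extract (checkpoint_name, score) from the final training log line.

    Returns None if the line does not match the expected format.
    """
    match = re.search(r"Best final checkpoint: (\S+) with score: ([0-9.]+)", line)
    if match is None:
        return None
    return match.group(1), float(match.group(2))

summary = parse_training_summary(
    "Training finished! Best final checkpoint: samples_1945 with score: 6.813759384"
)
print(summary)  # ('samples_1945', 6.813759384)
```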

Verification

  • When you train a model with ilab model train, multiple checkpoints are saved with the samples_ prefix, named for the number of data points they have been trained on. They are saved to the ~/.local/share/instructlab/phase/ directory.

    $ ls ~/.local/share/instructlab/phase/<phase1-or-phase2>/checkpoints/

    Example output of the new models

    samples_1711 samples_1945 samples_1456 samples_1462 samples_1903
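Because the checkpoint names encode the number of training samples, a plain ls lists them in lexical rather than numeric order (samples_1711 before samples_1945 but also before samples_1456 in some listings). A hypothetical helper that sorts them numerically, assuming only the samples_<N> naming shown above:

```python
from pathlib import Path

def sorted_checkpoints(checkpoint_dir: str) -> list[str]:
    """Return samples_<N> checkpoint names ordered by training-sample count."""
    names = [p.name for p in Path(checkpoint_dir).expanduser().glob("samples_*")]
    # Sort on the integer suffix so samples_1456 comes before samples_1711
    return sorted(names, key=lambda name: int(name.rsplit("_", 1)[1]))
```

For example, sorted_checkpoints("~/.local/share/instructlab/phase/phase2/checkpoints") would list the phase 2 checkpoints from fewest to most training samples.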

2.1.1. Continuing or restarting a training run

RHEL AI allows you to continue a training run that failed during multi-phase training. A training run can fail in a few ways:

  • The vLLM server may not start correctly.
  • An accelerator or GPU may freeze, causing training to abort.
  • There may be an error in your InstructLab config.yaml file.

When you run multi-phase training for the first time, metadata about the run is saved to a journalfile.yaml file. If necessary, RHEL AI can use this metadata to resume a failed training run.

You can also restart a training run from the beginning, which clears the saved training data, by following the CLI prompts during multi-phase training.

Prerequisites

  • You ran multi-phase training with your synthetic data and the run failed.

Procedure

  1. Run the multi-phase training command again.

    $ ilab model train --strategy lab-multiphase \
            --phased-phase1-data ~/.local/share/instructlab/datasets/<generation-date>/<knowledge-train-messages-jsonl-file> \
            --phased-phase2-data ~/.local/share/instructlab/datasets/<generation-date>/<skills-train-messages-jsonl-file>

    The Red Hat Enterprise Linux AI CLI checks whether the journalfile.yaml file exists and, if it does, continues the training run from that point.

  2. The CLI prompts you to either continue the previous training run or start from the beginning.

    • Type n in your shell to continue from your previous training run.

      Metadata (checkpoints, the training journal) may have been saved from a previous training run.
      By default, training will resume from this metadata if it exists
      Alternatively, the metadata can be cleared, and training can start from scratch
      Would you like to START TRAINING FROM THE BEGINNING? n
    • Type y into the terminal to restart the training run from the beginning.

      Metadata (checkpoints, the training journal) may have been saved from a previous training run.
      By default, training will resume from this metadata if it exists
      Alternatively, the metadata can be cleared, and training can start from scratch
      Would you like to START TRAINING FROM THE BEGINNING? y

      Restarting also clears your system's cache of previous checkpoints, journal files, and other training data.
