
Chapter 2. Training a model


RHEL AI can use your taxonomy tree and synthetic data to create a newly trained model with your domain-specific knowledge or skills by using multi-phase training and evaluation. You can run the full training and evaluation process with the synthetic dataset that you generated. Multi-phase training, a LAB-optimized technique, runs an LLM through multiple stages of training and evaluation. In each phase, RHEL AI runs the training process, produces model checkpoints, and selects the best-scoring checkpoint for the next phase. The best-scoring checkpoint from the final phase is your newly trained LLM.

The entire process creates a newly generated model that is trained and evaluated using the synthetic data from your taxonomy tree.

Note

Red Hat Enterprise Linux AI 1.5 includes training on long-context data. As a result, training times might be longer than in previous releases. You can reduce the training time by setting max_seq_len: 10000 in your config.yaml files.
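For reference, the setting from the note above can be sketched as a config.yaml fragment. The nesting under a train section is an assumption about the InstructLab configuration layout; only the max_seq_len: 10000 value itself comes from the note:

```yaml
# Sketch of a config.yaml fragment; the "train" key and its nesting are
# assumptions. Only the max_seq_len value comes from the note above.
train:
  max_seq_len: 10000
```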

2.1. Training the model on your data

You can use Red Hat Enterprise Linux AI to train a model with your synthetically generated data. The following procedures show how to do this using the LAB multi-phase training strategy.

Important

Red Hat Enterprise Linux AI general availability does not support training and inference serving at the same time. If you have an inference server running, you must close it before you start the training process.

Prerequisites

  • You installed RHEL AI with the bootable container image.
  • You downloaded the granite-7b-starter model.
  • You created a custom qna.yaml file with knowledge data.
  • You ran the synthetic data generation (SDG) process.
  • You downloaded the prometheus-8x7b-v2-0 judge model.
  • You have root user access on your machine.

Procedure

  1. Run multi-phase training and evaluation with the following command, using the data files generated from SDG.

    Note

    You can use the --enable-serving-output flag with the ilab model train command to display the training logs.

    $ ilab model train --strategy lab-multiphase \
            --phased-phase1-data ~/.local/share/instructlab/datasets/<generation-date>/<knowledge-train-messages-jsonl-file> \
            --phased-phase2-data ~/.local/share/instructlab/datasets/<generation-date>/<skills-train-messages-jsonl-file>

    where

    <generation-date>
    The date when you ran Synthetic Data Generation (SDG).
    <knowledge-train-messages-jsonl-file>
    The location of the knowledge_messages.jsonl file generated during SDG. RHEL AI trains the student model granite-7b-starter using the data from this .jsonl file. Example path: ~/.local/share/instructlab/datasets/2024-09-07_194933/knowledge_train_msgs_2024-09-07T20_54_21.jsonl.
    <skills-train-messages-jsonl-file>
    The location of the skills_messages.jsonl file generated during SDG. RHEL AI trains the student model granite-7b-starter using the data from this .jsonl file. Example path: ~/.local/share/instructlab/datasets/2024-09-07_194933/skills_train_msgs_2024-09-07T20_54_21.jsonl.
Note

You can use the --strategy lab-skills-only value to train a model on skills only.

Example skills only training command:

$ ilab model train --strategy lab-skills-only --phased-phase2-data ~/.local/share/instructlab/datasets/<skills-train-messages-jsonl-file>

  2. The first phase trains the model using the synthetic data from your knowledge contribution.

    Example output of training knowledge

    Training Phase 1/2...
    TrainingArgs for current phase: TrainingArgs(model_path='/opt/app-root/src/.cache/instructlab/models/granite-7b-starter', chat_tmpl_path='/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py', data_path='/tmp/jul19-knowledge-26k.jsonl', ckpt_output_dir='/tmp/e2e/phase1/checkpoints', data_output_dir='/opt/app-root/src/.local/share/instructlab/internal', max_seq_len=4096, max_batch_len=55000, num_epochs=2, effective_batch_size=128, save_samples=0, learning_rate=2e-05, warmup_steps=25, is_padding_free=True, random_seed=42, checkpoint_at_epoch=True, mock_data=False, mock_data_len=0, deepspeed_options=DeepSpeedOptions(cpu_offload_optimizer=False, cpu_offload_optimizer_ratio=1.0, cpu_offload_optimizer_pin_memory=False, save_samples=None), disable_flash_attn=False, lora=LoraOptions(rank=0, alpha=32, dropout=0.1, target_modules=('q_proj', 'k_proj', 'v_proj', 'o_proj'), quantize_data_type=<QuantizeDataType.NONE: None>))

  3. Then, RHEL AI selects the best checkpoint to use for the next phase.
  4. The next phase trains the model using the synthetic data from your skills contribution.

    Example output of training skills

    Training Phase 2/2...
    TrainingArgs for current phase: TrainingArgs(model_path='/tmp/e2e/phase1/checkpoints/hf_format/samples_52096', chat_tmpl_path='/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py', data_path='/usr/share/instructlab/sdg/datasets/skills.jsonl', ckpt_output_dir='/tmp/e2e/phase2/checkpoints', data_output_dir='/opt/app-root/src/.local/share/instructlab/internal', max_seq_len=4096, max_batch_len=55000, num_epochs=2, effective_batch_size=3840, save_samples=0, learning_rate=2e-05, warmup_steps=25, is_padding_free=True, random_seed=42, checkpoint_at_epoch=True, mock_data=False, mock_data_len=0, deepspeed_options=DeepSpeedOptions(cpu_offload_optimizer=False, cpu_offload_optimizer_ratio=1.0, cpu_offload_optimizer_pin_memory=False, save_samples=None), disable_flash_attn=False, lora=LoraOptions(rank=0, alpha=32, dropout=0.1, target_modules=('q_proj', 'k_proj', 'v_proj', 'o_proj'), quantize_data_type=<QuantizeDataType.NONE: None>))

  5. Then, RHEL AI evaluates all of the checkpoints from phase 2 model training using the Multi-turn Benchmark (MT-Bench) and returns the best-performing checkpoint as the fully trained output model.

    Example output of evaluating skills

    MT-Bench evaluation for Phase 2...
    Using gpus from --gpus or evaluate config and ignoring --tensor-parallel-size configured in serve vllm_args
    INFO 2024-08-15 10:04:51,065 instructlab.model.backends.backends:437: Trying to connect to model server at http://127.0.0.1:8000/v1
    INFO 2024-08-15 10:04:53,580 instructlab.model.backends.vllm:208: vLLM starting up on pid 79388 at http://127.0.0.1:54265/v1
    INFO 2024-08-15 10:04:53,580 instructlab.model.backends.backends:450: Starting a temporary vLLM server at http://127.0.0.1:54265/v1
    INFO 2024-08-15 10:04:53,580 instructlab.model.backends.backends:465: Waiting for the vLLM server to start at http://127.0.0.1:54265/v1, this might take a moment... Attempt: 1/300
    INFO 2024-08-15 10:04:58,003 instructlab.model.backends.backends:465: Waiting for the vLLM server to start at http://127.0.0.1:54265/v1, this might take a moment... Attempt: 2/300
    INFO 2024-08-15 10:05:02,314 instructlab.model.backends.backends:465: Waiting for the vLLM server to start at http://127.0.0.1:54265/v1, this might take a moment... Attempt: 3/300
    INFO 2024-08-15 10:06:07,611 instructlab.model.backends.backends:472: vLLM engine successfully started at http://127.0.0.1:54265/v1

    1. After training is complete, a confirmation appears and displays your best-performing checkpoint.

      Example output of a complete multi-phase training run

      Training finished! Best final checkpoint: samples_1945 with score: 6.813759384

      Make a note of this checkpoint because the path is necessary for evaluation and serving.
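The checkpoint selection described above amounts to picking the checkpoint with the highest evaluation score. A minimal shell sketch with made-up checkpoint names and MT-Bench-style scores (illustrative only, not real output):

```shell
# Hypothetical checkpoint/score pairs; the names and scores are
# illustrative only, not real MT-Bench results.
# Sort numerically on the score column (descending) and keep the top line.
printf '%s\n' 'samples_1711 6.21' 'samples_1945 6.81' 'samples_1456 5.97' \
  | sort -k2 -rn | head -n1
```

In this sketch the best checkpoint is samples_1945 with a score of 6.81, mirroring the "Best final checkpoint" line in the example output above.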

Verification

  • When you train a model with ilab model train, multiple checkpoints are saved with the samples_ prefix and a number that indicates how many data points they have been trained on. These checkpoints are saved to the ~/.local/share/instructlab/phase/ directory.

    $ ls ~/.local/share/instructlab/phase/<phase1-or-phase2>/checkpoints/

    Example output of the new models

    samples_1711 samples_1945 samples_1456 samples_1462 samples_1903
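Because each checkpoint name ends with its sample count, a listing like the one above can be ordered numerically. A small shell sketch using the example names from the output above:

```shell
# Sort the example checkpoint names numerically by the count that follows
# the "samples_" prefix; the names come from the example output above.
printf '%s\n' samples_1711 samples_1945 samples_1456 samples_1462 samples_1903 \
  | sort -t_ -k2 -n
```

This prints the names in ascending order of sample count, from samples_1456 up to samples_1945. Note that the checkpoint trained on the most samples is not necessarily the best one; the best checkpoint is chosen by evaluation score.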

2.1.1. Continuing or restarting a training run

RHEL AI allows you to continue a training run that may have failed during multi-phase training. There are a few ways a training run can fail:

  • The vLLM server may not start correctly.
  • An accelerator or GPU may freeze, causing training to abort.
  • There may be an error in your InstructLab config.yaml file.

When you run multi-phase training for the first time, metadata about the training run is saved to a journalfile.yaml file. If a run fails, RHEL AI can use the metadata in this file to resume the training.

You can also restart a training run, which clears the saved training data, by following the CLI prompts when running multi-phase training.

Prerequisites

  • You ran multi-phase training with your synthetic data and the run failed.

Procedure

  1. Run the multi-phase training command again.

    $ ilab model train --strategy lab-multiphase \
            --phased-phase1-data ~/.local/share/instructlab/datasets/<generation-date>/<knowledge-train-messages-jsonl-file> \
            --phased-phase2-data ~/.local/share/instructlab/datasets/<generation-date>/<skills-train-messages-jsonl-file>

    The Red Hat Enterprise Linux AI CLI checks whether the journalfile.yaml file exists and, if it does, continues the training run from that point.

  2. The CLI prompts you to either continue the previous training run or start from the beginning.

    • Type n in your shell to continue from your previous training run.

      Metadata (checkpoints, the training journal) may have been saved from a previous training run.
      By default, training will resume from this metadata if it exists
      Alternatively, the metadata can be cleared, and training can start from scratch
      Would you like to START TRAINING FROM THE BEGINNING? n
    • Type y into the terminal to restart a training run.

      Metadata (checkpoints, the training journal) may have been saved from a previous training run.
      By default, training will resume from this metadata if it exists
      Alternatively, the metadata can be cleared, and training can start from scratch
      Would you like to START TRAINING FROM THE BEGINNING? y

      Restarting also clears your system's cache of previous checkpoints, journal files, and other training data.
