Generating a custom LLM using RHEL AI
Using SDG, training, and evaluation to create a custom LLM
Abstract
Chapter 1. Generating a new dataset with SDG
After customizing your taxonomy tree, you can generate a synthetic dataset using the Synthetic Data Generation (SDG) process on Red Hat Enterprise Linux AI. SDG is a process that creates an artificially generated dataset that mimics real data based on provided examples. SDG uses a YAML file containing question-and-answer pairs as input data. With these examples, SDG utilizes the mixtral-8x7b-instruct-v0-1 LLM as a teacher model to generate similar question-and-answer pairs. In the SDG pipeline, many questions are generated and scored based on quality, where the mixtral-8x7b-instruct-v0-1 teacher model assesses their relevance and coherence. The pipeline then applies a filtering mechanism to select the highest-scoring questions, generates corresponding answers, and further evaluates their accuracy based on the original example question. The final set of high-quality question-and-answer pairs is then included in the synthetic dataset used for training.
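The score-and-filter step described above can be illustrated with a minimal sketch. This is not the RHEL AI implementation: the `Candidate` class, the score scale, the `min_score` threshold, and the `answer_fn` callback are all hypothetical stand-ins for work that the teacher model performs inside the SDG pipeline. The sketch also leaves out the final accuracy check on the generated answers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Candidate:
    question: str
    score: float  # hypothetical quality score assigned by the teacher model


def filter_top_candidates(
    candidates: List[Candidate], min_score: float, keep: int
) -> List[Candidate]:
    """Keep at most `keep` candidates scoring at least `min_score`, best first."""
    passing = [c for c in candidates if c.score >= min_score]
    passing.sort(key=lambda c: c.score, reverse=True)
    return passing[:keep]


def build_pairs(
    candidates: List[Candidate], answer_fn: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Generate an answer for each surviving question (answer_fn is a stand-in)."""
    return [{"question": c.question, "answer": answer_fn(c.question)} for c in candidates]


candidates = [
    Candidate("low quality question", 0.2),
    Candidate("strong question", 0.9),
    Candidate("decent question", 0.7),
]
top = filter_top_candidates(candidates, min_score=0.5, keep=2)
pairs = build_pairs(top, answer_fn=lambda q: f"answer to: {q}")
```

Ordering by score before truncating means that the `keep` limit always drops the weakest questions first.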
1.1. Creating a synthetic dataset using your examples
You can use your examples and run the SDG process to create a synthetic dataset.
If you are running SDG on a system with 4xL40s, you must use the following parameters for SDG to run properly.
$ ilab data generate --num-cpus 4
Prerequisites
- You installed RHEL AI with the bootable container image.
- You created a custom qna.yaml file with knowledge data.
- You downloaded the mixtral-8x7b-instruct-v0-1 teacher model for SDG.
- You downloaded the skills-adapter-v3:1.5 and knowledge-adapter-v3:1.5 LoRA layered skills and knowledge adapter.
- You have root user access on your machine.
Procedure
To generate a new synthetic dataset, based on your custom taxonomy with knowledge, run the following command:
$ ilab data generate
Note: You can use the --enable-serving-output flag when running the ilab data generate command to display the vLLM startup logs.
At the start of the SDG process, vLLM attempts to start a server for hosting the mixtral-8x7B-instruct teacher model.
Example output of vLLM attempting to start a server
Starting a temporary vLLM server at http://127.0.0.1:47825/v1
INFO 2024-08-22 17:01:09,461 instructlab.model.backends.backends:480: Waiting for the vLLM server to start at http://127.0.0.1:47825/v1, this might take a moment... Attempt: 1/120
INFO 2024-08-22 17:01:14,213 instructlab.model.backends.backends:480: Waiting for the vLLM server to start at http://127.0.0.1:47825/v1, this might take a moment... Attempt: 2/120
Once vLLM connects, the SDG process starts creating synthetic data based on your seed examples in the qna.yaml file.
Example output of vLLM connecting and SDG generating
INFO 2024-08-22 15:16:43,497 instructlab.model.backends.backends:480: Waiting for the vLLM server to start at http://127.0.0.1:49311/v1, this might take a moment... Attempt: 74/120
INFO 2024-08-22 15:16:45,949 instructlab.model.backends.backends:487: vLLM engine successfully started at http://127.0.0.1:49311/v1
Generating synthetic data using '/usr/share/instructlab/sdg/pipelines/agentic' pipeline, '/var/home/cloud-user/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1' model, '/var/home/cloud-user/.local/share/instructlab/taxonomy' taxonomy, against http://127.0.0.1:49311/v1 server
INFO 2024-08-22 15:16:46,594 instructlab.sdg:375: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
The SDG process completes when the CLI displays the location of your new data set.
Example output of a successful SDG run
INFO 2024-08-16 17:12:46,548 instructlab.sdg.datamixing:200: Mixed Dataset saved to /home/example-user/.local/share/instructlab/datasets/skills_train_msgs_2024-08-16T16_50_11.jsonl
INFO 2024-08-16 17:12:46,549 instructlab.sdg:438: Generation took 1355.74s
Note: This process can be time consuming depending on your hardware specifications.
Verification
To verify that the SDG files were created, navigate to your ~/.local/share/instructlab/datasets/ directory and list the files corresponding to the date when the data was generated. For example:
$ ls 2024-03-24_194933
Example output
knowledge_recipe_2024-03-24T20_54_21.yaml
skills_recipe_2024-03-24T20_54_21.yaml
knowledge_train_msgs_2024-03-24T20_54_21.jsonl
skills_train_msgs_2024-03-24T20_54_21.jsonl
messages_granite-7b-lab-Q4_K_M_2024-03-24T20_54_21.jsonl
node_datasets_2024-03-24T15_12_12/
Important: Make a note of your most recent knowledge_train_msgs.jsonl and skills_train_msgs.jsonl files. You need to specify these files during multi-phase training. Each JSONL file name includes a time stamp, for example knowledge_train_msgs_2024-08-08T20_04_28.jsonl. Use the most recent files when training.
Optional: You can view the output of SDG by navigating to the ~/.local/share/instructlab/datasets/<generation-date> directory and opening the JSONL file.
$ cat ~/.local/share/instructlab/datasets/<generation-date>/<jsonl-dataset>
Example output of an SDG JSONL file
{"messages":[{"content":"I am, Red Hat\u00ae Instruct Model based on Granite 7B, an AI language model developed by Red Hat and IBM Research, based on the Granite-7b-base language model. My primary function is to be a chat assistant.","role":"system"},{"content":"<|user|>\n### Deep-sky objects\n\nThe constellation does not lie on the [galactic\nplane](galactic_plane \"wikilink\") of the Milky Way, and there are no\nprominent star clusters. [NGC 625](NGC_625 \"wikilink\") is a dwarf\n[irregular galaxy](irregular_galaxy \"wikilink\") of apparent magnitude\n11.0 and lying some 12.7 million light years distant. Only 24000 light\nyears in diameter, it is an outlying member of the [Sculptor\nGroup](Sculptor_Group \"wikilink\"). NGC 625 is thought to have been\ninvolved in a collision and is experiencing a burst of [active star\nformation](Active_galactic_nucleus \"wikilink\"). [NGC\n37](NGC_37 \"wikilink\") is a [lenticular\ngalaxy](lenticular_galaxy \"wikilink\") of apparent magnitude 14.66. It is\napproximately 42 [kiloparsecs](kiloparsecs \"wikilink\") (137,000\n[light-years](light-years \"wikilink\")) in diameter and about 12.9\nbillion years old. [Robert's Quartet](Robert's_Quartet \"wikilink\")\n(composed of the irregular galaxy [NGC 87](NGC_87 \"wikilink\"), and three\nspiral galaxies [NGC 88](NGC_88 \"wikilink\"), [NGC 89](NGC_89 \"wikilink\")\nand [NGC 92](NGC_92 \"wikilink\")) is a group of four galaxies located\naround 160 million light-years away which are in the process of\ncolliding and merging. They are within a circle of radius of 1.6 arcmin,\ncorresponding to about 75,000 light-years. Located in the galaxy ESO\n243-49 is [HLX-1](HLX-1 \"wikilink\"), an [intermediate-mass black\nhole](intermediate-mass_black_hole \"wikilink\")\u2014the first one of its kind\nidentified. It is thought to be a remnant of a dwarf galaxy that was\nabsorbed in a [collision](Interacting_galaxy \"wikilink\") with ESO\n243-49. 
Before its discovery, this class of black hole was only\nhypothesized.\n\nLying within the bounds of the constellation is the gigantic [Phoenix\ncluster](Phoenix_cluster \"wikilink\"), which is around 7.3 million light\nyears wide and 5.7 billion light years away, making it one of the most\nmassive [galaxy clusters](galaxy_cluster \"wikilink\"). It was first\ndiscovered in 2010, and the central galaxy is producing an estimated 740\nnew stars a year. Larger still is [El\nGordo](El_Gordo_(galaxy_cluster) \"wikilink\"), or officially ACT-CL\nJ0102-4915, whose discovery was announced in 2012.
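Instead of reading raw JSONL with cat, you can summarize a generated file programmatically. The following is a minimal sketch that assumes each line holds one JSON object with a "messages" list whose entries carry "role" and "content" keys, as in the example output above. The helper names `load_jsonl` and `count_roles` are illustrative and not part of the ilab CLI.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List


def load_jsonl(lines: Iterable[str]) -> List[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def count_roles(records: List[dict]) -> Dict[str, int]:
    """Count how many messages each role contributes across all records."""
    counts: Counter = Counter()
    for record in records:
        for message in record.get("messages", []):
            counts[message.get("role", "unknown")] += 1
    return dict(counts)


# Small in-memory sample standing in for a generated file.
sample_lines = [
    '{"messages": [{"content": "system prompt", "role": "system"}, '
    '{"content": "a question", "role": "user"}]}',
    "",
    '{"messages": [{"content": "an answer", "role": "assistant"}]}',
]
records = load_jsonl(sample_lines)
role_counts = count_roles(records)
```

To inspect a real dataset, open the file and pass the file object to `load_jsonl`, because a file object iterates over its lines.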
1.2. Running Synthetic Data Generation (SDG) in the background
There are various ways you can manage and interact with the SDG process.
Running SDG in the background allows you to continue using your terminal while SDG is still running.
Prerequisites
- You installed RHEL AI with the bootable container image.
- You created a custom qna.yaml file with knowledge data.
- You downloaded the mixtral-8x7b-instruct-v0-1 teacher model for SDG.
- You downloaded the skills-adapter-v3:1.5 and knowledge-adapter-v3:1.5 LoRA layered skills and knowledge adapter.
- You have root user access on your machine.
Procedure
To start an SDG process in the background, run the following command:
$ ilab data generate -dt
Example output of a successful start
INFO 2025-01-15 11:36:47,557 instructlab.process.process:236: Started subprocess with PID 68289. Logs are being written to /Users/<user-name>/.local/share/instructlab/logs/generation/generation-e85623ac-d35e-11ef-bc70-2a1c6126d703.log.
There are a few ways you can manage, view, and interact with a detached SDG process.
You can view all the current running processes and their status by entering the following command:
$ ilab process list
Example output of the listed processes
+------------+-------+--------------------------------------+----------------------------------------------------------------------------------------------------------------+----------+---------+
| Type       | PID   | UUID                                 | Log File                                                                                                       | Runtime  | Status  |
+------------+-------+--------------------------------------+----------------------------------------------------------------------------------------------------------------+----------+---------+
| Generation | 30334 | f2623406-de55-11ef-b684-2a1c6126d703 | /Users/<user-name>/.local/share/instructlab/logs/generation/generation-f2623406-de55-11ef-b684-2a1c6126d703.log| 00:08:30 | Running |
+------------+-------+--------------------------------------+----------------------------------------------------------------------------------------------------------------+----------+---------+
You can join and view the latest process by running the following command:
Warning: You cannot detach from an SDG process once already attached.
$ ilab process attach --latest
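If you want to check on detached processes from a script, the table printed by ilab process list can be turned into structured records. The following is a minimal sketch that assumes the table keeps the layout shown in the example output (pipe-separated cells between +---+ border lines), which might differ between releases. The function name `parse_process_table` is illustrative and not part of the ilab CLI.

```python
from typing import Dict, List


def parse_process_table(text: str) -> List[Dict[str, str]]:
    """Convert a pipe-delimited ASCII table into one dict per data row."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines such as +-----+ and blank lines
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if not rows:
        return []
    header, *data = rows  # the first pipe-delimited row holds the column names
    return [dict(zip(header, row)) for row in data]


# Sample text shaped like the example output, with shortened border lines.
sample = """\
+------------+-------+
| Type | PID | UUID | Log File | Runtime | Status |
+------------+-------+
| Generation | 30334 | f2623406-de55-11ef-b684-2a1c6126d703 | /Users/<user-name>/.local/share/instructlab/logs/generation/generation-f2623406-de55-11ef-b684-2a1c6126d703.log| 00:08:30 | Running |
+------------+-------+
"""
processes = parse_process_table(sample)
running = [p for p in processes if p["Status"] == "Running"]
```

The "Log File" value of a record is the path you can open or follow to read the output of that detached run.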
1.3. Using the llama-3.3-70B-Instruct model as a teacher model (Technology Preview)
RHEL AI version 1.5 supports using llama-3.3-70b-Instruct as a teacher model when running Synthetic Data Generation (SDG). For more information on how the teacher model is used in SDG, see Generating a new dataset with SDG. A larger-parameter teacher model, such as llama-3.3-70b-Instruct, can assess the synthetically generated question-and-answer pairs more effectively, which can result in a higher-quality and more accurate dataset aligned with your original seed file.
Using `llama-3.3-70b-Instruct` as a teacher model is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
Prerequisites
- You installed RHEL AI with the bootable container image.
- You created a custom qna.yaml file with knowledge or skills data.
- You downloaded the skills-adapter-v3:1.5 and knowledge-adapter-v3:1.5 LoRA layered skills and knowledge adapter.
Procedure
- Download the llama-3.3-70b-Instruct model to your RHEL AI system by running the following command:
$ ilab model download --repository docker://registry.redhat.io/rhelai1/llama-3.3-70b-Instruct --release latest
- Run SDG with the llama-3.3-70b-Instruct model as the teacher model by running the following command:
$ ilab data generate --pipeline llama
Chapter 2. Training a model
RHEL AI can use your taxonomy tree and synthetic data to create a newly trained model with your domain-specific knowledge or skills using multi-phase training and evaluation. You can run the full training and evaluation process using the synthetic dataset you generated. The LAB optimized technique of multi-phase training is a type of LLM training that goes through multiple stages of training and evaluation. In these various stages, RHEL AI runs the training process and produces model checkpoints. The best checkpoint is selected for the next phase. This process creates many checkpoints and selects the best scored checkpoint. This best scored checkpoint is your newly trained LLM.
The entire process creates a newly generated model that is trained and evaluated using the synthetic data from your taxonomy tree.
Red Hat Enterprise Linux AI 1.5 includes training modes on long-context data. As a result, training times may be longer compared to previous releases. You can reduce the training time by setting max_seq_len: 10000 in your config.yaml files.
2.1. Training the model on your data
You can use Red Hat Enterprise Linux AI to train a model with your synthetically generated data. The following procedures show how to do this using the LAB multi-phase training strategy.
Red Hat Enterprise Linux AI general availability does not support training and inference serving at the same time. If you have an inference server running, you must close it before you start the training process.
Prerequisites
- You installed RHEL AI with the bootable container image.
- You downloaded the granite-7b-starter model.
- You created a custom qna.yaml file with knowledge data.
- You ran the synthetic data generation (SDG) process.
- You downloaded the prometheus-8x7b-v2-0 judge model.
- You have root user access on your machine.
Procedure
You can run multi-phase training and evaluation by running the following command with the data files generated from SDG.
Note: You can use the --enable-serving-output flag with the ilab model train command to display the training logs.
$ ilab model train --strategy lab-multiphase \
    --phased-phase1-data ~/.local/share/instructlab/datasets/<generation-date>/<knowledge-train-messages-jsonl-file> \
    --phased-phase2-data ~/.local/share/instructlab/datasets/<generation-date>/<skills-train-messages-jsonl-file>
where
- <generation-date>: The date when you ran Synthetic Data Generation (SDG).
- <knowledge-train-messages-jsonl-file>: The location of the knowledge_train_msgs.jsonl file generated during SDG. RHEL AI trains the student model granite-7b-starter using the data from this .jsonl file. Example path: ~/.local/share/instructlab/datasets/2024-09-07_194933/knowledge_train_msgs_2024-09-07T20_54_21.jsonl.
- <skills-train-messages-jsonl-file>: The location of the skills_train_msgs.jsonl file generated during SDG. RHEL AI trains the student model granite-7b-starter using the data from this .jsonl file. Example path: ~/.local/share/instructlab/datasets/2024-09-07_194933/skills_train_msgs_2024-09-07T20_54_21.jsonl.
You can use the --strategy lab-skills-only value to train a model on skills only.
Example skills only training command:
$ ilab model train --strategy lab-skills-only --phased-phase2-data ~/.local/share/instructlab/datasets/<skills-train-messages-jsonl-file>
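Because several SDG runs can leave multiple time-stamped files in the same directory, it can help to select the newest knowledge and skills files automatically before training. The following is a minimal sketch that assumes the file names follow the knowledge_train_msgs_<timestamp>.jsonl and skills_train_msgs_<timestamp>.jsonl pattern shown in this document. The helper names are illustrative, and only the ilab flags that appear in the multi-phase command above are used.

```python
import re
from typing import List, Optional

# Time stamp as it appears in SDG file names, for example 2024-08-08T20_04_28.
_STAMP = re.compile(r"_(\d{4}-\d{2}-\d{2}T\d{2}_\d{2}_\d{2})\.jsonl$")


def latest_file(names: List[str], prefix: str) -> Optional[str]:
    """Return the name with the given prefix and the newest embedded time stamp."""
    stamped = []
    for name in names:
        match = _STAMP.search(name)
        if name.startswith(prefix) and match:
            stamped.append((match.group(1), name))
    # This time stamp format sorts chronologically when compared as text.
    return max(stamped)[1] if stamped else None


def build_train_command(dataset_dir: str, names: List[str]) -> List[str]:
    """Assemble the multi-phase training command as an argument list."""
    knowledge = latest_file(names, "knowledge_train_msgs_")
    skills = latest_file(names, "skills_train_msgs_")
    if knowledge is None or skills is None:
        raise FileNotFoundError("missing knowledge or skills training file")
    return [
        "ilab", "model", "train",
        "--strategy", "lab-multiphase",
        "--phased-phase1-data", f"{dataset_dir}/{knowledge}",
        "--phased-phase2-data", f"{dataset_dir}/{skills}",
    ]


names = [
    "knowledge_train_msgs_2024-03-24T20_54_21.jsonl",
    "knowledge_train_msgs_2024-08-08T20_04_28.jsonl",
    "skills_train_msgs_2024-03-24T20_54_21.jsonl",
    "skills_recipe_2024-03-24T20_54_21.yaml",
]
command = build_train_command("/data/datasets", names)
```

In practice, `names` would come from `os.listdir` on the dataset directory, and the resulting list can be passed to `subprocess.run(command, check=True)`. Expand `~` with `os.path.expanduser` first, because the tilde is not expanded when a command is run without a shell.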
The first phase trains the model using the synthetic data from your knowledge contribution.
Example output of training knowledge
Training Phase 1/2...
TrainingArgs for current phase: TrainingArgs(model_path='/opt/app-root/src/.cache/instructlab/models/granite-7b-starter', chat_tmpl_path='/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py', data_path='/tmp/jul19-knowledge-26k.jsonl', ckpt_output_dir='/tmp/e2e/phase1/checkpoints', data_output_dir='/opt/app-root/src/.local/share/instructlab/internal', max_seq_len=4096, max_batch_len=55000, num_epochs=2, effective_batch_size=128, save_samples=0, learning_rate=2e-05, warmup_steps=25, is_padding_free=True, random_seed=42, checkpoint_at_epoch=True, mock_data=False, mock_data_len=0, deepspeed_options=DeepSpeedOptions(cpu_offload_optimizer=False, cpu_offload_optimizer_ratio=1.0, cpu_offload_optimizer_pin_memory=False, save_samples=None), disable_flash_attn=False, lora=LoraOptions(rank=0, alpha=32, dropout=0.1, target_modules=('q_proj', 'k_proj', 'v_proj', 'o_proj'), quantize_data_type=<QuantizeDataType.NONE: None>))
Then, RHEL AI selects the best checkpoint to use for the next phase.
The next phase trains the model using the synthetic data from the skills data.
Example output of training skills
Training Phase 2/2...
TrainingArgs for current phase: TrainingArgs(model_path='/tmp/e2e/phase1/checkpoints/hf_format/samples_52096', chat_tmpl_path='/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py', data_path='/usr/share/instructlab/sdg/datasets/skills.jsonl', ckpt_output_dir='/tmp/e2e/phase2/checkpoints', data_output_dir='/opt/app-root/src/.local/share/instructlab/internal', max_seq_len=4096, max_batch_len=55000, num_epochs=2, effective_batch_size=3840, save_samples=0, learning_rate=2e-05, warmup_steps=25, is_padding_free=True, random_seed=42, checkpoint_at_epoch=True, mock_data=False, mock_data_len=0, deepspeed_options=DeepSpeedOptions(cpu_offload_optimizer=False, cpu_offload_optimizer_ratio=1.0, cpu_offload_optimizer_pin_memory=False, save_samples=None), disable_flash_attn=False, lora=LoraOptions(rank=0, alpha=32, dropout=0.1, target_modules=('q_proj', 'k_proj', 'v_proj', 'o_proj'), quantize_data_type=<QuantizeDataType.NONE: None>))
Then, RHEL AI evaluates all of the checkpoints from phase 2 model training using the Multi-turn Benchmark (MT-Bench) and returns the best performing checkpoint as the fully trained output model.
After training is complete, a confirmation appears and displays your best-performing checkpoint.
Example output of a complete multi-phase training run
Training finished! Best final checkpoint: samples_1945 with score: 6.813759384
Make a note of this checkpoint because the path is necessary for evaluation and serving.
Verification
When training a model with ilab model train, multiple checkpoints are saved with the samples_ prefix based on how many data points they have been trained on. These are saved to the ~/.local/share/instructlab/phased/ directory.
$ ls ~/.local/share/instructlab/phased/<phase1-or-phase2>/checkpoints/
Example output of the new models
samples_1711 samples_1945 samples_1456 samples_1462 samples_1903
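To avoid copying the best checkpoint name by hand, you can extract it from the final training message and order the checkpoint directories by sample count. The following is a minimal sketch that assumes the confirmation line keeps the exact wording shown in the example output, which might change between releases. The helper names are illustrative and not part of the ilab CLI.

```python
import re
from typing import List, Optional, Tuple

_BEST = re.compile(r"Best final checkpoint: (samples_\d+) with score: (\d+(?:\.\d+)?)")


def parse_best_checkpoint(log_text: str) -> Optional[Tuple[str, float]]:
    """Return (checkpoint_name, score) from the training confirmation, if present."""
    match = _BEST.search(log_text)
    if match is None:
        return None
    return match.group(1), float(match.group(2))


def sort_checkpoints(names: List[str]) -> List[str]:
    """Order samples_<N> directory names by N, ascending; ignore other names."""
    checkpoints = [n for n in names if re.fullmatch(r"samples_\d+", n)]
    return sorted(checkpoints, key=lambda n: int(n.split("_")[1]))


best = parse_best_checkpoint(
    "Training finished! Best final checkpoint: samples_1945 with score: 6.813759384"
)
ordered = sort_checkpoints(
    ["samples_1711", "samples_1945", "samples_1456", "samples_1462", "samples_1903"]
)
```

Sorting numerically instead of alphabetically keeps names such as samples_999 ahead of samples_1000.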
2.1.1. Continuing or restarting a training run
RHEL AI allows you to continue a training run that may have failed during multi-phase training. There are a few ways a training run can fail:
- The vLLM server may not start correctly.
- An accelerator or GPU may freeze, causing training to abort.
- There may be an error in your InstructLab config.yaml file.
When you run multi-phase training for the first time, the initial training data gets saved into a journalfile.yaml file. If necessary, this metadata in the file can be used to restart a failed training.
You can also restart a training run which clears the training data by following the CLI prompts when running multi-phase training.
Prerequisites
- You ran multi-phase training with your synthetic data and the run failed.
Procedure
Run the multi-phase training command again.
$ ilab model train --strategy lab-multiphase \
    --phased-phase1-data ~/.local/share/instructlab/datasets/<generation-date>/<knowledge-train-messages-jsonl-file> \
    --phased-phase2-data ~/.local/share/instructlab/datasets/<generation-date>/<skills-train-messages-jsonl-file>
The Red Hat Enterprise Linux AI CLI checks whether the journalfile.yaml file exists and continues the training run from that point.
The CLI prompts you to continue the previous training run or start from the beginning.
Type n in your shell to continue from your previous training run.
Metadata (checkpoints, the training journal) may have been saved from a previous training run.
By default, training will resume from this metadata if it exists
Alternatively, the metadata can be cleared, and training can start from scratch
Would you like to START TRAINING FROM THE BEGINNING? n
Type y into the terminal to restart a training run.
Metadata (checkpoints, the training journal) may have been saved from a previous training run.
By default, training will resume from this metadata if it exists
Alternatively, the metadata can be cleared, and training can start from scratch
Would you like to START TRAINING FROM THE BEGINNING? y
Restarting also clears your system's cache of previous checkpoints, journal files, and other training data.
Chapter 3. Evaluating the model
If you want to measure the improvements of your new model, you can compare its performance to the base model with the evaluation process. You can also chat with the model directly to qualitatively identify whether the new model has learned the knowledge you created. If you want more quantitative results of the model improvements, you can run the evaluation process in the RHEL AI CLI.
3.1. Evaluating your new model
You can run the evaluation process in the RHEL AI CLI with the following procedure.
Prerequisites
- You installed RHEL AI with the bootable container image.
- You created a custom qna.yaml file with skills or knowledge.
- You ran the synthetic data generation process.
- You trained the model using the RHEL AI training process.
- You downloaded the prometheus-8x7b-v2-0 judge model.
- You have root user access on your machine.
Procedure
- Navigate to your working Git branch where you created your qna.yaml file.
You can now run the evaluation process on different benchmarks. Each command needs the path to the trained samples model to evaluate; you can access these checkpoints in your ~/.local/share/instructlab/checkpoints folder.
MMLU_BRANCH benchmark - If you want to measure how your knowledge contributions have impacted your model, run the mmlu_branch benchmark by executing the following command:
$ ilab model evaluate --benchmark mmlu_branch --model ~/.local/share/instructlab/phased/phase2/checkpoints/hf_format/<checkpoint> \
    --tasks-dir ~/.local/share/instructlab/datasets/<generation-date>/<node-dataset> \
    --base-model ~/.cache/instructlab/models/granite-7b-starter
where
- <checkpoint>: Specify the best-scoring checkpoint file generated during multi-phase training.
- <node-dataset>: Specify the node_datasets directory that was generated during SDG, in the ~/.local/share/instructlab/datasets/<generation-date> directory, with the same timestamps as the .jsonl files used for training the model.
MT_BENCH_BRANCH benchmark - If you want to measure how your skills contributions have impacted your model, run the mt_bench_branch benchmark with the ilab model evaluate command, using the following values:
- <checkpoint>: Specify the best-scoring checkpoint file generated during multi-phase training.
- <worker-branch>: Specify the branch you used when adding data to your taxonomy tree.
- <num-gpus>: Specify the number of GPUs you want to use for evaluation.
Optional: You can manually evaluate each checkpoint using the MMLU and MT_BENCH benchmarks. These benchmarks evaluate any model against a standardized set of knowledge or skills, so you can compare the scores of your own model against other LLMs.
MMLU - If you want to see the evaluation score of your new model against a standardized set of knowledge data, run the mmlu benchmark by executing the following command:
$ ilab model evaluate --benchmark mmlu --model ~/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_665 --skip-server
where
- <checkpoint>: Specify one of the checkpoint files generated during multi-phase training. The command above uses samples_665.
MT_BENCH - If you want to see the evaluation score of your new model against a standardized set of skills, run the mt_bench benchmark by executing the following command:
$ ilab model evaluate --benchmark mt_bench --model ~/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_665
where
- <checkpoint>: Specify one of the checkpoint files generated during multi-phase training. The command above uses samples_665.
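Each evaluation command takes the path to a checkpoint directory. As a minimal sketch, the following Python helper lists the checkpoints that are available to pass to --model. It assumes that checkpoints are directories named samples_<N> under the hf_format path used in the example commands; the function name list_checkpoints is hypothetical and is not part of the ilab CLI.

```python
import re
from pathlib import Path

# Assumed location, copied from the example commands in this section.
CHECKPOINT_ROOT = (
    Path.home() / ".local/share/instructlab/phased/phase2/checkpoints/hf_format"
)


def list_checkpoints(root: Path) -> list[Path]:
    """Return the samples_<N> directories under root, ordered by N."""
    found = []
    if not root.is_dir():
        return found
    for entry in root.iterdir():
        match = re.fullmatch(r"samples_(\d+)", entry.name)
        if match and entry.is_dir():
            found.append((int(match.group(1)), entry))
    # Sort numerically so that samples_665 comes before samples_1945.
    return [path for _, path in sorted(found, key=lambda item: item[0])]


if __name__ == "__main__":
    for checkpoint in list_checkpoints(CHECKPOINT_ROOT):
        print(checkpoint)
```

The listing only shows which checkpoints exist. It does not indicate which checkpoint scored best; that information comes from the training output.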
3.1.1. Domain-Knowledge benchmark evaluation
Domain-Knowledge benchmark evaluation is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
The current knowledge evaluation benchmarks in RHEL AI, MMLU and MMLU_BRANCH, evaluate models on their ability to answer multiple-choice questions. They provide no way to give the model partial credit for answers that are only moderately correct.
The Domain-Knowledge benchmark (DK-bench) evaluation lets you bring custom evaluation questions and scores the model's answers on a scale.
The judge model compares each response to the reference answer and grades it on the following scale:
| Score | Criteria |
|---|---|
| 1 | The response is entirely incorrect, irrelevant, or does not align with the reference in any meaningful way. |
| 2 | The response partially matches the reference but contains major errors, significant omissions, or irrelevant information. |
| 3 | The response aligns with the reference overall but lacks sufficient detail, clarity, or contains minor inaccuracies. |
| 4 | The response is mostly accurate, aligns closely with the reference, and contains only minor issues or omissions. |
| 5 | The response is fully accurate, completely aligns with the reference, and is clear, thorough, and detailed. |
Prerequisites
- You installed RHEL AI with the bootable container image.
- You trained the model using the RHEL AI training process.
- You downloaded the prometheus-8x7b-v2-0 judge model.
- You have root user access on your machine.
Procedure
To use custom evaluation, you must create a jsonl file that includes every question you want the model to answer and be evaluated on.
Example DK-bench jsonl file
{"user_input":"What is the capital of Canada?","reference":"The capital of Canada is Ottawa."}
where
- user_input: Contains the question for the model.
- reference: Contains the answer to the question.
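If you keep your questions in another form, you can generate the jsonl file programmatically. The following is a minimal Python sketch that turns question and reference pairs into one JSON object per line, using the user_input and reference keys described above. The function name dk_bench_jsonl is hypothetical and is not part of RHEL AI.

```python
import json


def dk_bench_jsonl(pairs):
    """Serialize (question, reference answer) pairs as JSON Lines text."""
    lines = []
    for question, reference in pairs:
        record = {"user_input": question, "reference": reference}
        # Compact separators reproduce the layout of the example file.
        lines.append(json.dumps(record, ensure_ascii=False, separators=(",", ":")))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example_pairs = [
        ("What is the capital of Canada?", "The capital of Canada is Ottawa."),
    ]
    print(dk_bench_jsonl(example_pairs), end="")
```

To create the file, write the returned text to a path of your choice, for example with pathlib.Path("questions.jsonl").write_text(text, encoding="utf-8").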
To run the DK-bench benchmark with your custom evaluation questions, run the following command:
$ ilab model evaluate --benchmark dk_bench --input-questions <path-to-jsonl-file> --model <path-to-model>
where
- <path-to-jsonl-file>: Specify the path to your jsonl file that contains your questions and answers.
- <path-to-model>: Specify the path to the model you want to evaluate.
Example command
$ ilab model evaluate --benchmark dk_bench --input-questions /home/use/path/to/questions.jsonl --model ~/.cache/instructlab/models/instructlab/granite-7b-lab
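Before starting an evaluation, it can help to confirm that every line of the questions file is well formed. The following Python sketch checks jsonl text for the two keys described earlier and assembles the argument list for the DK-bench command, using only the flags shown above. The helper names check_questions and dk_bench_command are hypothetical, and the checks are an assumption about what a valid file needs, not a description of the validation that ilab performs.

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"user_input", "reference"}


def check_questions(text):
    """Return a list of problems found in DK-bench jsonl text (empty if none)."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append(f"line {number}: not valid JSON ({error.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        missing = REQUIRED_KEYS - set(record)
        if missing:
            problems.append(f"line {number}: missing {sorted(missing)}")
    return problems


def dk_bench_command(questions_path, model_path):
    """Build the argument list for the DK-bench command shown above."""
    return [
        "ilab", "model", "evaluate",
        "--benchmark", "dk_bench",
        # expanduser() resolves a leading ~ because no shell is involved.
        "--input-questions", str(Path(questions_path).expanduser()),
        "--model", str(Path(model_path).expanduser()),
    ]


if __name__ == "__main__":
    sample = '{"user_input":"What is the capital of Canada?","reference":"The capital of Canada is Ottawa."}\n'
    print(check_questions(sample))
    print(" ".join(dk_bench_command("questions.jsonl", "/path/to/model")))
```

The argument list can be passed to subprocess.run to start the evaluation on a system where ilab is installed.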
Chapter 4. Serving and chatting with your new model
To use your new model, you must deploy it on your machine by serving it. Serving makes the model available for interaction and chat.
4.1. Serving the new model
To interact with your new model, you must activate the model on a machine by serving it. The ilab model serve command starts a vLLM server that allows you to chat with the model.
Prerequisites
- You installed RHEL AI with the bootable container image.
- You initialized InstructLab.
- You customized your taxonomy tree, ran synthetic data generation, trained, and evaluated your new model.
- You have root user access on your machine.
Procedure
You can serve the model by running the following command:
$ ilab model serve --model-path <path-to-best-performed-checkpoint>
where:
- <path-to-best-performed-checkpoint>: Specify the full path to the checkpoint you built after training. Your new model is the best-performing checkpoint; its file path is displayed after training.
Example command:
$ ilab model serve --model-path ~/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_1945/
Important: Ensure you have a slash / at the end of your model path.
Example output of the ilab model serve command
$ ilab model serve --model-path ~/.local/share/instructlab/phased/phase2/checkpoints/hf_format/<checkpoint>
INFO 2024-03-02 02:21:11,352 lab.py:201 Using model /home/example-user/.local/share/instructlab/checkpoints/hf_format/checkpoint_1945 with -1 gpu-layers and 4096 max context size.
Starting server process
After application startup complete see http://127.0.0.1:8000/docs for API.
Press CTRL+C to shut down the server.
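Once the server is running, you can also send it requests from a script. The following Python sketch builds a chat request for the address shown in the example output. It assumes that the server exposes an OpenAI-compatible /v1/chat/completions endpoint, as vLLM's API server typically does; confirm the available routes at http://127.0.0.1:8000/docs. The helper names and the model value you pass are hypothetical; the model name must match whatever the server reports.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # address from the example serve output


def build_chat_request(model, prompt, base_url=BASE_URL):
    """Build a POST request for an OpenAI-style chat completion (assumed API)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model, prompt, base_url=BASE_URL):
    """Send the request and return the text of the first reply."""
    request = build_chat_request(model, prompt, base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling ask() requires the server to be running in another terminal; build_chat_request() alone makes no network connection.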
4.2. Chatting with the new model
You can chat with the model that you trained with your data.
Prerequisites
- You installed RHEL AI with the bootable container image.
- You initialized InstructLab.
- You customized your taxonomy tree, ran synthetic data generation, trained, and evaluated your new model.
- You served your checkpoint model.
- You have root user access on your machine.
Procedure
- Since you are serving the model in one terminal window, you must open a new terminal window to chat with the model.
To chat with your new model, run the following command:
$ ilab model chat --model <path-to-best-performed-checkpoint-file>
where:
- <path-to-best-performed-checkpoint-file>: Specify the new model checkpoint file you built after training. Your new model is the best-performing checkpoint; its file path is displayed after training.
Example command:
$ ilab model chat --model ~/.local/share/instructlab/phased/phase2/checkpoints/hf_format/samples_1945
Example output of the InstructLab chatbot
$ ilab model chat
╭────────────────────────────── system ──────────────────────────────╮
│ Welcome to InstructLab Chat w/ CHECKPOINT_1945 (type /h for help)  │
╰────────────────────────────────────────────────────────────────────╯
>>> [S][default]
Type exit to leave the chatbot.