
Chapter 1. Generating a new dataset with SDG

After customizing your taxonomy tree, you can generate a synthetic dataset using the Synthetic Data Generation (SDG) process on Red Hat Enterprise Linux AI. SDG creates an artificial dataset that mimics real data, based on examples that you provide. As input, SDG uses a YAML file containing question-and-answer pairs. From these examples, SDG uses the mixtral-8x7b-instruct-v0-1 LLM as a teacher model to generate similar question-and-answer pairs.

The SDG pipeline first generates many questions, and the mixtral-8x7b-instruct-v0-1 teacher model scores each one for relevance and coherence. The pipeline then filters for the highest-scoring questions, generates corresponding answers, and evaluates the accuracy of those answers against the original example question. The resulting high-quality question-and-answer pairs are included in the synthetic dataset used for training.
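The score-and-filter step described above can be illustrated with a toy sketch. This is not the InstructLab implementation: the function names `select_top_questions` and `toy_score` are hypothetical, and the scoring heuristic is only a stand-in for the teacher model's judgment of relevance and coherence.

```python
from typing import Callable, List, Tuple


def select_top_questions(
    questions: List[str],
    score: Callable[[str], float],
    keep: int,
) -> List[Tuple[str, float]]:
    """Score every candidate question and keep the `keep` highest-scoring ones."""
    scored = [(q, score(q)) for q in questions]
    # Sort by score, highest first; list.sort() is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]


def toy_score(question: str) -> float:
    """Hypothetical stand-in for the teacher model's quality score."""
    # Toy heuristic: more words and a trailing "?" give a higher score.
    return len(question.split()) + (5.0 if question.endswith("?") else 0.0)


candidates = [
    "What is the Phoenix cluster?",
    "Phoenix",
    "How far away is the Phoenix cluster from Earth?",
]
for question, value in select_top_questions(candidates, toy_score, keep=2):
    print(f"{value:5.1f}  {question}")
```

In the real pipeline the scoring is done by the teacher model rather than a local heuristic, so treat the sketch only as a picture of the control flow.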

1.1. Creating a synthetic dataset using your examples

You can use your examples and run the SDG process to create a synthetic dataset.

Important

If you are running SDG on a system with 4xL40s, you must use the following parameter for SDG to run properly.

$ ilab data generate --num-cpus 4

Prerequisites

You installed RHEL AI with the bootable container image.
You created a custom qna.yaml file with knowledge data.
You downloaded the mixtral-8x7b-instruct-v0-1 teacher model for SDG.
You downloaded the skills-adapter-v3:1.5 and knowledge-adapter-v3:1.5 LoRA layered skills and knowledge adapters.
You have root user access on your machine.

Procedure

To generate a new synthetic dataset based on your custom taxonomy with knowledge, run the following command:

$ ilab data generate

Note

You can use the --enable-serving-output flag when running the ilab data generate command to display the vLLM startup logs.

At the start of the SDG process, vLLM attempts to start a server for hosting the mixtral-8x7b-instruct-v0-1 teacher model.

Example output of vLLM attempting to start a server

Starting a temporary vLLM server at http://127.0.0.1:47825/v1
INFO 2024-08-22 17:01:09,461 instructlab.model.backends.backends:480: Waiting for the vLLM server to start at http://127.0.0.1:47825/v1, this might take a moment... Attempt: 1/120
INFO 2024-08-22 17:01:14,213 instructlab.model.backends.backends:480: Waiting for the vLLM server to start at http://127.0.0.1:47825/v1, this might take a moment... Attempt: 2/120

Once vLLM connects, the SDG process starts creating synthetic data based on your seed examples in the qna.yaml file.

Example output of vLLM connecting and SDG generating

INFO 2024-08-22 15:16:43,497 instructlab.model.backends.backends:480: Waiting for the vLLM server to start at http://127.0.0.1:49311/v1, this might take a moment... Attempt: 74/120
INFO 2024-08-22 15:16:45,949 instructlab.model.backends.backends:487: vLLM engine successfully started at http://127.0.0.1:49311/v1
Generating synthetic data using '/usr/share/instructlab/sdg/pipelines/agentic' pipeline, '/var/home/cloud-user/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1' model, '/var/home/cloud-user/.local/share/instructlab/taxonomy' taxonomy, against http://127.0.0.1:49311/v1 server
INFO 2024-08-22 15:16:46,594 instructlab.sdg:375: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.

The SDG process completes when the CLI displays the location of your new dataset.

Example output of a successful SDG run

INFO 2024-08-16 17:12:46,548 instructlab.sdg.datamixing:200: Mixed Dataset saved to /home/example-user/.local/share/instructlab/datasets/skills_train_msgs_2024-08-16T16_50_11.jsonl
INFO 2024-08-16 17:12:46,549 instructlab.sdg:438: Generation took 1355.74s

Note

This process can be time-consuming, depending on your hardware specifications.

Verification

To verify that the SDG files were created, navigate to your ~/.local/share/instructlab/datasets/ directory and list the files corresponding to the date when the data was generated. For example:
```
$ ls 2024-03-24_194933
```
Example output
```
knowledge_recipe_2024-03-24T20_54_21.yaml                   skills_recipe_2024-03-24T20_54_21.yaml
knowledge_train_msgs_2024-03-24T20_54_21.jsonl              skills_train_msgs_2024-03-24T20_54_21.jsonl
messages_granite-7b-lab-Q4_K_M_2024-03-24T20_54_21.jsonl    node_datasets_2024-03-24T15_12_12/
```
Important
Make a note of your most recent knowledge_train_msgs.jsonl and skills_train_msgs.jsonl files. You need to specify these files during multi-phase training. Each JSONL file name includes a time stamp, for example knowledge_train_msgs_2024-08-08T20_04_28.jsonl. Use the most recent files when training.
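Picking the most recent files can be scripted. The following is a minimal sketch, assuming the files follow the `<prefix>_<timestamp>.jsonl` naming shown in the example output and live somewhere under the datasets directory; the helper name `latest_dataset` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def latest_dataset(datasets_dir: Path, prefix: str) -> Optional[Path]:
    """Return the newest `<prefix>_<timestamp>.jsonl` file, or None if absent."""
    # Time stamps such as 2024-08-08T20_04_28 sort chronologically as plain
    # strings, so the lexicographically largest file name is the most recent.
    matches = sorted(datasets_dir.rglob(f"{prefix}_*.jsonl"), key=lambda p: p.name)
    return matches[-1] if matches else None


if __name__ == "__main__":
    # Assumed location, taken from the verification step above.
    datasets = Path.home() / ".local/share/instructlab/datasets"
    if datasets.is_dir():
        for prefix in ("knowledge_train_msgs", "skills_train_msgs"):
            print(prefix, "->", latest_dataset(datasets, prefix))
```

The sketch searches subdirectories as well, because the example listing above shows a per-run directory inside the datasets directory.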

Optional: You can view the output of SDG by navigating to the ~/.local/share/instructlab/datasets/<generation-date> directory and opening the JSONL file.

$ cat ~/.local/share/instructlab/datasets/<generation-date>/<jsonl-dataset>

Example output of an SDG JSONL file

{"messages":[{"content":"I am, Red Hat\u00ae Instruct Model based on Granite 7B, an AI language model developed by Red Hat and IBM Research, based on the Granite-7b-base language model. My primary function is to be a chat assistant.","role":"system"},{"content":"<|user|>\n### Deep-sky objects\n\nThe constellation does not lie on the [galactic\nplane](galactic_plane \"wikilink\") of the Milky Way, and there are no\nprominent star clusters. [NGC 625](NGC_625 \"wikilink\") is a dwarf\n[irregular galaxy](irregular_galaxy \"wikilink\") of apparent magnitude\n11.0 and lying some 12.7 million light years distant.
Only 24000 light\nyears in diameter, it is an outlying member of the [Sculptor\nGroup](Sculptor_Group \"wikilink\"). NGC 625 is thought to have been\ninvolved in a collision and is experiencing a burst of [active star\nformation](Active_galactic_nucleus \"wikilink\"). [NGC\n37](NGC_37 \"wikilink\") is a [lenticular\ngalaxy](lenticular_galaxy \"wikilink\") of apparent magnitude 14.66. It is\napproximately 42 [kiloparsecs](kiloparsecs \"wikilink\") (137,000\n[light-years](light-years \"wikilink\")) in diameter and about 12.9\nbillion years old. [Robert's Quartet](Robert's_Quartet \"wikilink\")\n(composed of the
irregular galaxy [NGC 87](NGC_87 \"wikilink\"), and three\nspiral galaxies [NGC 88](NGC_88 \"wikilink\"), [NGC 89](NGC_89 \"wikilink\")\nand [NGC 92](NGC_92 \"wikilink\")) is a group of four galaxies located\naround 160 million light-years away which are in the process of\ncolliding and merging. They are within a circle of radius of 1.6 arcmin,\ncorresponding to about 75,000 light-years. Located in the galaxy ESO\n243-49 is [HLX-1](HLX-1 \"wikilink\"), an [intermediate-mass black\nhole](intermediate-mass_black_hole \"wikilink\")\u2014the first one of its kind\nidentified. It is thought to be a remnant of a dwarf
galaxy that was\nabsorbed in a [collision](Interacting_galaxy \"wikilink\") with ESO\n243-49. Before its discovery, this class of black hole was only\nhypothesized.\n\nLying within the bounds of the constellation is the gigantic [Phoenix\ncluster](Phoenix_cluster \"wikilink\"), which is around 7.3 million light\nyears wide and 5.7 billion light years away, making it one of the most\nmassive [galaxy clusters](galaxy_cluster \"wikilink\"). It was first\ndiscovered in 2010, and the central galaxy is producing an estimated 740\nnew stars a year. Larger still is [El\nGordo](El_Gordo_(galaxy_cluster) \"wikilink\"),
or officially ACT-CL\nJ0102-4915, whose discovery was announced in 2012.
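Because each line of the file is one JSON object with a `messages` list, as in the output above, a short script can be easier to read than raw `cat` output. The following is a sketch only: the helper names are hypothetical, and the sample record is a shortened, made-up stand-in in the same shape as the example.

```python
import json
from typing import Dict, Iterable, Iterator, List


def iter_records(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(record: Dict, width: int = 60) -> List[str]:
    """Return one 'role: preview' string per message in a record."""
    return [
        f"{m['role']}: {m['content'][:width]!r}"
        for m in record.get("messages", [])
    ]


# Shortened, made-up sample in the same shape as the output above.
sample = [
    '{"messages": [{"content": "You are a chat assistant.", "role": "system"}, '
    '{"content": "What is NGC 625?", "role": "user"}]}',
]
for record in iter_records(sample):
    print("\n".join(summarize(record)))
```

To use it on a real dataset, pass an open file handle to `iter_records` instead of the sample list.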

1.2. Running Synthetic Data Generation (SDG) in the background

There are various ways you can manage and interact with the SDG process.

Running SDG in the background allows you to continue using your terminal while SDG is still running.

Prerequisites

You installed RHEL AI with the bootable container image.
You created a custom qna.yaml file with knowledge data.
You downloaded the mixtral-8x7b-instruct-v0-1 teacher model for SDG.
You downloaded the skills-adapter-v3:1.5 and knowledge-adapter-v3:1.5 LoRA layered skills and knowledge adapters.
You have root user access on your machine.

Procedure

To start an SDG process in the background, run the following command:

$ ilab data generate -dt

Example output of a successful start

INFO 2025-01-15 11:36:47,557 instructlab.process.process:236: Started subprocess with PID 68289. Logs are being written to /Users/<user-name>/.local/share/instructlab/logs/generation/generation-e85623ac-d35e-11ef-bc70-2a1c6126d703.log.
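Because the detached process writes its logs to the file named in this message, you can check progress by reading the end of that file. The following is a minimal sketch; the helper name `tail` is hypothetical, and you need to supply the log path printed by your own run.

```python
from collections import deque
from pathlib import Path
from typing import List


def tail(log_path: Path, count: int = 10) -> List[str]:
    """Return the last `count` lines of a text file."""
    with log_path.open("r", encoding="utf-8", errors="replace") as handle:
        # A bounded deque keeps only the newest `count` lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]
```

For example, `print("\n".join(tail(Path("<path-to-log-file>"), 20)))` prints the last 20 lines, where the path is a placeholder for the log file reported at startup.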

There are a few ways you can manage, view, and interact with a detached SDG process.

You can view all the currently running processes and their status by entering the following command:

$ ilab process list

Example output of the listed processes

+------------+-------+--------------------------------------+----------------------------------------------------------------------------------------------------------------+----------+---------+
| Type       | PID   | UUID                                 | Log File                                                                                                       | Runtime  | Status  |
+------------+-------+--------------------------------------+----------------------------------------------------------------------------------------------------------------+----------+---------+
| Generation | 30334 | f2623406-de55-11ef-b684-2a1c6126d703 | /Users/<user-name>/.local/share/instructlab/logs/generation/generation-f2623406-de55-11ef-b684-2a1c6126d703.log| 00:08:30 | Running |
+------------+-------+--------------------------------------+----------------------------------------------------------------------------------------------------------------+----------+---------+
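If you want to use this listing in a script, the table can be parsed into rows. The following is a sketch that assumes the output keeps the bordered layout shown above, which might change between versions; the function name `parse_process_table` is hypothetical, and the sample is shortened to three columns for readability.

```python
from typing import Dict, List


def parse_process_table(text: str) -> List[Dict[str, str]]:
    """Turn the ASCII table into a list of {column name: value} rows."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only "| ... |" lines; the +----+ border lines are skipped.
        if line.startswith("|") and line.endswith("|"):
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Shortened sample with the same layout as the listing above.
sample = """\
+------------+-------+---------+
| Type       | PID   | Status  |
+------------+-------+---------+
| Generation | 30334 | Running |
+------------+-------+---------+
"""
for row in parse_process_table(sample):
    print(row["PID"], row["Status"])
```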

You can attach to the latest process and view its output by running the following command:
```
$ ilab process attach --latest
```
Warning
After you attach to an SDG process, you cannot detach from it.

1.3. Using the llama-3.3-70B-Instruct model as a teacher model (Technology Preview)

RHEL AI version 1.5 supports using llama-3.3-70b-Instruct as a teacher model when running Synthetic Data Generation (SDG). For more information on how the teacher model is used in SDG, see Generating a new dataset with SDG. A teacher model with more parameters, such as the llama-3.3-70b-Instruct model, can assess the synthetically generated question-and-answer pairs more effectively, resulting in a higher-quality and more accurate dataset aligned with your original seed file.

Important

Using llama-3.3-70b-Instruct as a teacher model is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Prerequisites

You installed RHEL AI with the bootable container image.
You created a custom qna.yaml file with knowledge or skills data.
You downloaded the skills-adapter-v3:1.5 and knowledge-adapter-v3:1.5 LoRA layered skills and knowledge adapters.

Procedure

Download the llama-3.3-70b-Instruct model to your RHEL AI system by running the following command:

$ ilab model download --repository docker://registry.redhat.io/rhelai1/llama-3.3-70b-Instruct --release latest

Run SDG with the llama-3.3-70b-Instruct model as the teacher model by entering the following command:

$ ilab data generate --pipeline llama
