
Chapter 2. Generating a new dataset with Synthetic data generation (SDG)


After customizing your taxonomy tree, you can generate a synthetic dataset using the Synthetic Data Generation (SDG) process on Red Hat Enterprise Linux AI. SDG creates an artificially generated dataset that mimics real data, based on examples that you provide. It takes a YAML file containing question-and-answer pairs as input and uses the mixtral-8x7b-instruct-v0-1 LLM as a teacher model to generate similar question-and-answer pairs. The SDG pipeline generates many candidate questions, and the mixtral-8x7b-instruct-v0-1 model scores each one for quality. The pipeline then selects the highest-scoring questions, generates corresponding answers, and includes these pairs in the synthetic dataset.
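The generate, score, select, and answer flow described above can be illustrated with a minimal sketch. This is not the InstructLab implementation: the `QAPair` type, the `run_sdg_sketch` function, and the three callables are hypothetical stand-ins for calls to the teacher model, and the word-count scorer is only a toy.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    question: str
    answer: str


def run_sdg_sketch(
    seed_examples: List[QAPair],
    generate_questions: Callable[[List[QAPair]], List[str]],
    score_question: Callable[[str], float],
    generate_answer: Callable[[str], str],
    keep: int,
) -> List[QAPair]:
    """Generate candidate questions, rank them by score, and answer the best ones."""
    candidates = generate_questions(seed_examples)
    ranked = sorted(candidates, key=score_question, reverse=True)
    return [QAPair(question, generate_answer(question)) for question in ranked[:keep]]


# Toy stand-ins for the teacher model (assumptions for illustration only).
seeds = [QAPair("How wide is the Phoenix cluster?", "Around 7.3 million light years.")]
toy_candidates = [
    "How far away is the Phoenix cluster?",
    "Phoenix?",
    "When was the Phoenix cluster discovered?",
]

dataset = run_sdg_sketch(
    seeds,
    generate_questions=lambda examples: toy_candidates,
    score_question=lambda question: float(len(question.split())),  # toy score: word count
    generate_answer=lambda question: f"(teacher answer to: {question})",
    keep=2,
)

for pair in dataset:
    print(pair.question, "->", pair.answer)
```

Passing the model calls in as functions keeps the selection logic separate from the model itself, so the sketch runs without a GPU or a served model.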

2.1. Creating a synthetic dataset using your examples

You can use your examples and run the SDG process to create a synthetic dataset.

Prerequisites

  • You installed RHEL AI with the bootable container image.
  • You created a custom qna.yaml file with knowledge data.
  • You downloaded the mixtral-8x7b-instruct-v0-1 teacher model for SDG.
  • You downloaded the skills-adapter-v3:1.1 and knowledge-adapter-v3:1.1 LoRA layered skills and knowledge adapters.
  • You have root user access on your machine.

Procedure

  1. To generate a new synthetic dataset based on your custom taxonomy with knowledge, run the following command:

    $ ilab data generate

    This command runs SDG with mixtral-8x7b-instruct-v0-1 as the teacher model.

    Note

    You can use the --enable-serving-output flag when running the ilab data generate command to display the vLLM startup logs.

    1. At the start of the SDG process, vLLM attempts to start a server.

      Example output of vLLM attempting to start a server

      Starting a temporary vLLM server at http://127.0.0.1:47825/v1
      INFO 2024-08-22 17:01:09,461 instructlab.model.backends.backends:480: Waiting for the vLLM server to start at http://127.0.0.1:47825/v1, this might take a moment... Attempt: 1/120
      INFO 2024-08-22 17:01:14,213 instructlab.model.backends.backends:480: Waiting for the vLLM server to start at http://127.0.0.1:47825/v1, this might take a moment... Attempt: 2/120
      INFO 2024-08-22 17:01:19,142 instructlab.model.backends.backends:480: Waiting for the vLLM server to start at http://127.0.0.1:47825/v1, this might take a moment... Attempt: 3/120

    2. Once vLLM connects, the SDG process starts creating synthetic data from your examples.

      Example output of vLLM connecting and SDG generating

      INFO 2024-08-22 15:16:38,933 instructlab.model.backends.backends:480: Waiting for the vLLM server to start at http://127.0.0.1:49311/v1, this might take a moment... Attempt: 73/120
      INFO 2024-08-22 15:16:43,497 instructlab.model.backends.backends:480: Waiting for the vLLM server to start at http://127.0.0.1:49311/v1, this might take a moment... Attempt: 74/120
      INFO 2024-08-22 15:16:45,949 instructlab.model.backends.backends:487: vLLM engine successfully started at http://127.0.0.1:49311/v1
      Generating synthetic data using '/usr/share/instructlab/sdg/pipelines/agentic' pipeline, '/var/home/cloud-user/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1' model, '/var/home/cloud-user/.local/share/instructlab/taxonomy' taxonomy, against http://127.0.0.1:49311/v1 server
      INFO 2024-08-22 15:16:46,594 instructlab.sdg:375: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
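The "Attempt: N/120" lines in the output above reflect a bounded polling loop: the CLI keeps checking whether the server is up and gives up after a fixed number of tries. The following is a minimal sketch of that pattern, not the InstructLab code. The `wait_for_server` function and its `is_ready` probe are hypothetical, and the five-second delay is an assumption inferred from the log time stamps.

```python
import time
from typing import Callable


def wait_for_server(
    is_ready: Callable[[], bool],
    max_attempts: int = 120,
    delay_seconds: float = 5.0,  # assumption: roughly the spacing seen in the logs
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll is_ready() until it returns True and return the attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        if is_ready():
            return attempt
        print(f"Waiting for the server to start... Attempt: {attempt}/{max_attempts}")
        sleep(delay_seconds)
    raise TimeoutError(f"server not ready after {max_attempts} attempts")


# Simulated probe that reports ready on the third check; sleeping is disabled
# so the example finishes immediately.
responses = iter([False, False, True])
succeeded_on = wait_for_server(lambda: next(responses), sleep=lambda seconds: None)
print(f"Server ready on attempt {succeeded_on}")
```

In a real check, `is_ready` would send a request to the server URL shown in the log and return whether it responded.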

  2. The SDG process completes when the CLI displays the location of your new dataset.

    Example output of a successful SDG run

    INFO 2024-08-16 17:12:46,548 instructlab.sdg.datamixing:200: Mixed Dataset saved to /home/example-user/.local/share/instructlab/datasets/skills_train_msgs_2024-08-16T16_50_11.jsonl
    INFO 2024-08-16 17:12:46,549 instructlab.sdg:438: Generation took 1355.74s

    Note

    This process can be time-consuming, depending on your hardware specifications.

  3. Verify that the files were created by running the following command:

    $ ls ~/.local/share/instructlab/datasets/

    Example output

    knowledge_recipe_2024-08-13T20_54_21.yaml                   skills_recipe_2024-08-13T20_54_21.yaml
    knowledge_train_msgs_2024-08-13T20_54_21.jsonl              skills_train_msgs_2024-08-13T20_54_21.jsonl
    messages_granite-7b-lab-Q4_K_M_2024-08-13T20_54_21.jsonl    node_datasets_2024-08-13T15_12_12/

    Important

    Make a note of your most recent knowledge_train_msgs.jsonl and skills_train_msgs.jsonl files. You need to specify these files during multi-phase training. Each JSONL file name includes a time stamp, for example knowledge_train_msgs_2024-08-08T20_04_28.jsonl. Use the most recent files when training.
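Picking the most recent files by eye is error-prone when the directory holds many runs. The following is a minimal sketch that selects the newest file for each prefix by parsing the time stamp in the file name. The `latest_dataset` helper is hypothetical, and the time stamp format is an assumption based on the example file names shown above.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

# Assumed from names such as knowledge_train_msgs_2024-08-13T20_54_21.jsonl
TIMESTAMP_FORMAT = "%Y-%m-%dT%H_%M_%S"


def latest_dataset(directory: Path, prefix: str) -> Optional[Path]:
    """Return the newest '<prefix>_<timestamp>.jsonl' file, or None if none exists."""
    if not directory.is_dir():
        return None
    dated = []
    for path in directory.glob(f"{prefix}_*.jsonl"):
        stamp = path.stem[len(prefix) + 1:]
        try:
            dated.append((datetime.strptime(stamp, TIMESTAMP_FORMAT), path))
        except ValueError:
            continue  # skip files whose suffix is not a time stamp
    if not dated:
        return None
    return max(dated, key=lambda item: item[0])[1]


datasets_dir = Path.home() / ".local/share/instructlab/datasets"
for prefix in ("knowledge_train_msgs", "skills_train_msgs"):
    print(prefix, "->", latest_dataset(datasets_dir, prefix))
```

The sketch parses the time stamp instead of relying on file modification times, which can change when files are copied between machines.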

  4. Optional: You can view the output of SDG by navigating to the ~/.local/share/instructlab/datasets/ directory and opening the JSONL file.

    $ cat ~/.local/share/instructlab/datasets/<jsonl-dataset>

    Example output of an SDG JSONL file

    {"messages":[{"content":"I am, Red Hat\u00ae Instruct Model based on Granite 7B, an AI language model developed by Red Hat and IBM Research, based on the Granite-7b-base language model. My primary function is to be a chat assistant.","role":"system"},{"content":"<|user|>\n### Deep-sky objects\n\nThe constellation does not lie on the [galactic\nplane](galactic_plane \"wikilink\") of the Milky Way, and there are no\nprominent star clusters. [NGC 625](NGC_625 \"wikilink\") is a dwarf\n[irregular galaxy](irregular_galaxy \"wikilink\") of apparent magnitude\n11.0 and lying some 12.7 million light years distant.
    Only 24000 light\nyears in diameter, it is an outlying member of the [Sculptor\nGroup](Sculptor_Group \"wikilink\"). NGC 625 is thought to have been\ninvolved in a collision and is experiencing a burst of [active star\nformation](Active_galactic_nucleus \"wikilink\"). [NGC\n37](NGC_37 \"wikilink\") is a [lenticular\ngalaxy](lenticular_galaxy \"wikilink\") of apparent magnitude 14.66. It is\napproximately 42 [kiloparsecs](kiloparsecs \"wikilink\") (137,000\n[light-years](light-years \"wikilink\")) in diameter and about 12.9\nbillion years old. [Robert's Quartet](Robert's_Quartet \"wikilink\")\n(composed of the
    irregular galaxy [NGC 87](NGC_87 \"wikilink\"), and three\nspiral galaxies [NGC 88](NGC_88 \"wikilink\"), [NGC 89](NGC_89 \"wikilink\")\nand [NGC 92](NGC_92 \"wikilink\")) is a group of four galaxies located\naround 160 million light-years away which are in the process of\ncolliding and merging. They are within a circle of radius of 1.6 arcmin,\ncorresponding to about 75,000 light-years. Located in the galaxy ESO\n243-49 is [HLX-1](HLX-1 \"wikilink\"), an [intermediate-mass black\nhole](intermediate-mass_black_hole \"wikilink\")\u2014the first one of its kind\nidentified. It is thought to be a remnant of a dwarf
    galaxy that was\nabsorbed in a [collision](Interacting_galaxy \"wikilink\") with ESO\n243-49. Before its discovery, this class of black hole was only\nhypothesized.\n\nLying within the bounds of the constellation is the gigantic [Phoenix\ncluster](Phoenix_cluster \"wikilink\"), which is around 7.3 million light\nyears wide and 5.7 billion light years away, making it one of the most\nmassive [galaxy clusters](galaxy_cluster \"wikilink\"). It was first\ndiscovered in 2010, and the central galaxy is producing an estimated 740\nnew stars a year. Larger still is [El\nGordo](El_Gordo_(galaxy_cluster) \"wikilink\"),
    or officially ACT-CL\nJ0102-4915, whose discovery was announced in 2012. Located around\n7.2 billion light years away, it is composed of two subclusters in the\nprocess of colliding, resulting in the spewing out of hot gas, seen in\nX-rays and infrared images.\n\n### Meteor showers\n\nPhoenix is the [radiant](radiant_(meteor_shower) \"wikilink\") of two\nannual [meteor showers](meteor_shower \"wikilink\"). The\n[Phoenicids](Phoenicids \"wikilink\"), also known as the December\nPhoenicids, were first observed on 3 December 1887. The shower was\nparticularly intense in December 1956, and is thought related to the\nbreakup
    of the [short-period comet](short-period_comet \"wikilink\")\n[289P\/Blanpain](289P\/Blanpain \"wikilink\"). It peaks around 4\u20135 December,\nthough is not seen every year. A very minor meteor shower peaks\naround July 14 with around one meteor an hour, though meteors can be\nseen anytime from July 3 to 18; this shower is referred to as the July\nPhoenicids.\n\nHow many light years wide is the Phoenix cluster?\n<|assistant|>\n' 'The Phoenix cluster is around 7.3 million light years wide.'","role":"pretraining"}],"metadata":"{\"sdg_document\": \"### Deep-sky objects\\n\\nThe constellation does not lie on the [galactic\\nplane](galactic_plane \\\"wikilink\\\") of the Milky Way,
    and there are no\\nprominent star clusters. [NGC 625](NGC_625 \\\"wikilink\\\") is a dwarf\\n[irregular galaxy](irregular_galaxy \\\"wikilink\\\") of apparent magnitude\\n11.0 and lying some 12.7 million light years distant. Only 24000 light\\nyears in diameter, it is an outlying member of the [Sculptor\\nGroup](Sculptor_Group \\\"wikilink\\\"). NGC 625 is thought to have been\\ninvolved in a collision and is experiencing a burst of [active star\\nformation](Active_galactic_nucleus \\\"wikilink\\\"). [NGC\\n37](NGC_37 \\\"wikilink\\\") is a [lenticular\\ngalaxy](lenticular_galaxy \\\"wikilink\\\") of apparent magnitude 14.66. It is\\napproximately 42
    [kiloparsecs](kiloparsecs \\\"wikilink\\\") (137,000\\n[light-years](light-years \\\"wikilink\\\")) in diameter and about 12.9\\nbillion years old. [Robert's Quartet](Robert's_Quartet \\\"wikilink\\\")\\n(composed of the irregular galaxy [NGC 87](NGC_87 \\\"wikilink\\\"), and three\\nspiral galaxies [NGC 88](NGC_88 \\\"wikilink\\\"), [NGC 89](NGC_89 \\\"wikilink\\\")\\nand [NGC 92](NGC_92 \\\"wikilink\\\")) is a group of four galaxies located\\naround 160 million light-years away which are in the process of\\ncolliding and merging. They are within a circle of radius of 1.6 arcmin,\\ncorresponding to about 75,000 light-years.
    Located in the galaxy ESO\\n243-49 is [HLX-1](HLX-1 \\\"wikilink\\\"), an [intermediate-mass black\\nhole](intermediate-mass_black_hole \\\"wikilink\\\")\\u2014the first one of its kind\\nidentified. It is thought to be a remnant of a dwarf galaxy that was\\nabsorbed in a [collision](Interacting_galaxy \\\"wikilink\\\") with ESO\\n243-49. Before its discovery, this class of black hole was only\\nhypothesized.\\n\\nLying within the bounds of the constellation is the gigantic [Phoenix\\ncluster](Phoenix_cluster \\\"wikilink\\\"), which is around 7.3 million light\\nyears wide and 5.7 billion light years away, making it one of the most\\nmassive [galaxy clusters](galaxy_cluster \\\"wikilink\\\").
    It was first\\ndiscovered in 2010, and the central galaxy is producing an estimated 740\\nnew stars a year. Larger still is [El\\nGordo](El_Gordo_(galaxy_cluster) \\\"wikilink\\\"), or officially ACT-CL\\nJ0102-4915, whose discovery was announced in 2012. Located around\\n7.2 billion light years away, it is composed of two subclusters in the\\nprocess of colliding, resulting in the spewing out of hot gas, seen in\\nX-rays and infrared images.\\n\\n### Meteor showers\\n\\nPhoenix is the [radiant](radiant_(meteor_shower) \\\"wikilink\\\") of two\\nannual [meteor showers](meteor_shower \\\"wikilink\\\"). The\\n[Phoenicids](Phoenicids \\\"wikilink\\\"),
    also known as the December\\nPhoenicids, were first observed on 3 December 1887. The shower was\\nparticularly intense in December 1956, and is thought related to the\\nbreakup of the [short-period comet](short-period_comet \\\"wikilink\\\")\\n[289P\/Blanpain](289P\/Blanpain \\\"wikilink\\\"). It peaks around 4\\u20135 December,\\nthough is not seen every year. A very minor meteor shower peaks\\naround July 14 with around one meteor an hour, though meteors can be\\nseen anytime from July 3 to 18; this shower is referred to as the July\\nPhoenicids.\", \"domain\": \"astronomy\", \"dataset\": \"document_knowledge_qa\"}","id":"1df7c219-a062-4511-8bae-f55c88927dc1"}
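Raw `cat` output is hard to read because each record is one long line. The following is a minimal sketch for inspecting a dataset file, assuming each line has the shape of the sample record above: a `messages` list, an `id`, and a `metadata` value that is itself a JSON document stored as a string. The `read_sdg_records` and `summarize` helpers are hypothetical names, not part of the `ilab` CLI.

```python
import json
from pathlib import Path
from typing import Iterator


def read_sdg_records(path: Path) -> Iterator[dict]:
    """Yield one decoded record per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            record = json.loads(line)
            # In the sample record, "metadata" is a JSON string and needs
            # a second decoding pass.
            if isinstance(record.get("metadata"), str):
                record["metadata"] = json.loads(record["metadata"])
            yield record


def summarize(record: dict) -> str:
    """Build a one-line summary: record id, message roles, domain, and dataset."""
    roles = [message["role"] for message in record["messages"]]
    meta = record.get("metadata")
    if not isinstance(meta, dict):
        meta = {}
    return (
        f"{record.get('id')}: roles={roles} "
        f"domain={meta.get('domain')} dataset={meta.get('dataset')}"
    )
```

To use it, pass the path of a generated file to `read_sdg_records` and print `summarize(record)` for each record it yields.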

© 2024 Red Hat, Inc.