Chapter 1. Generating a new dataset with SDG
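After customizing your taxonomy tree, you can generate a synthetic dataset using the Synthetic Data Generation (SDG) process on Red Hat Enterprise Linux AI. SDG is a process that creates an artificially generated dataset that mimics real data, based on examples you provide. SDG takes YAML files containing question-and-answer pairs as its input data. Working from these examples, SDG uses the mixtral-8x7b-instruct-v0-1 LLM as a teacher model to generate similar question-and-answer pairs. In the SDG pipeline, many questions are generated and scored for quality, with the mixtral-8x7b-instruct-v0-1 teacher model assessing their relevance and coherence. The pipeline then applies filtering mechanisms to select the highest-scoring questions, generates corresponding answers, and further evaluates them for accuracy against the original seed questions. The final set of high-quality question-and-answer pairs is included in the synthetic dataset used for training.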
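The flow described above (generate candidate questions, score them, keep the best, generate answers, check accuracy) can be pictured with a small Python sketch. This is an illustration only and not the InstructLab implementation: the names QAPair and run_sdg_sketch and every callable parameter are hypothetical stand-ins for work that the teacher model does in the real pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    """A question-and-answer pair (hypothetical type for this sketch)."""
    question: str
    answer: str


def run_sdg_sketch(
    seeds: List[QAPair],
    generate_questions: Callable[[QAPair], List[str]],
    score_question: Callable[[str], float],
    generate_answer: Callable[[str], str],
    is_accurate: Callable[[QAPair, List[QAPair]], bool],
    keep_top: int,
) -> List[QAPair]:
    """Toy version of the generate, score, filter, answer, evaluate flow."""
    # 1. Generate many candidate questions from the seed examples.
    candidates = [q for seed in seeds for q in generate_questions(seed)]
    # 2. Score every candidate (stand-in for the teacher model's rating)
    #    and order them from best to worst.
    ranked = sorted(candidates, key=score_question, reverse=True)
    # 3. Keep only the highest-scoring questions.
    selected = ranked[:keep_top]
    # 4. Generate an answer for each selected question.
    pairs = [QAPair(q, generate_answer(q)) for q in selected]
    # 5. Keep the pairs that pass the accuracy check against the seeds.
    return [p for p in pairs if is_accurate(p, seeds)]
```

In practice each callable would be a prompt sent to the teacher model. Passing them in as parameters keeps the sketch runnable without a model server.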
1.1. Creating a synthetic dataset using your examples
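You can use your examples and run the SDG process to create a synthetic dataset.
If you are running SDG on a system with 4xL40s, you must use the following parameter for SDG to run properly: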
ilab data generate --num-cpus 4
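Prerequisites
- You installed RHEL AI with the bootable container image.
- You created a custom qna.yaml file with knowledge data.
- You downloaded the mixtral-8x7b-instruct-v0-1 teacher model for SDG.
- You downloaded the skills-adapter-v3:1.5 and knowledge-adapter-v3:1.5 LoRA layered skills and knowledge adapters.
- You have root user access on your machine.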
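Procedure
To generate a new synthetic dataset based on your custom taxonomy, run the following command:
$ ilab data generate
Note: When running the ilab data generate command, you can use the --enable-serving-output flag to display the vLLM startup logs.
At the start of the SDG process, vLLM attempts to start a server to host the mixtral-8x7B-instruct teacher model.
Example output of vLLM attempting to start a server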
Starting a temporary vLLM server at http://127.0.0.1:47825/v1
INFO 2024-08-22 17:01:09,461 instructlab.model.backends.backends:480: Waiting for the vLLM server to start at http://127.0.0.1:47825/v1, this might take a moment... Attempt: 1/120
INFO 2024-08-22 17:01:14,213 instructlab.model.backends.backends:480: Waiting for the vLLM server to start at http://127.0.0.1:47825/v1, this might take a moment... Attempt: 2/120
Once vLLM connects, the SDG process starts creating synthetic data from the seed examples in your qna.yaml files.
Example output of vLLM connecting and SDG generating
INFO 2024-08-22 15:16:43,497 instructlab.model.backends.backends:480: Waiting for the vLLM server to start at http://127.0.0.1:49311/v1, this might take a moment... Attempt: 74/120
INFO 2024-08-22 15:16:45,949 instructlab.model.backends.backends:487: vLLM engine successfully started at http://127.0.0.1:49311/v1
Generating synthetic data using '/usr/share/instructlab/sdg/pipelines/agentic' pipeline, '/var/home/cloud-user/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1' model, '/var/home/cloud-user/.local/share/instructlab/taxonomy' taxonomy, against http://127.0.0.1:49311/v1 server
INFO 2024-08-22 15:16:46,594 instructlab.sdg:375: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
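If you capture this startup output to a file or a string, a few lines of Python can report how far the wait loop has progressed and whether the server came up. The sketch below assumes the log wording matches the example output shown here; the function names last_attempt and server_url are hypothetical helpers and are not part of the ilab CLI.

```python
import re
from typing import Optional, Tuple

# Patterns taken from the example log lines; real output may differ by version.
ATTEMPT_RE = re.compile(r"Attempt: (\d+)/(\d+)")
STARTED_RE = re.compile(r"vLLM engine successfully started at (\S+)")


def last_attempt(log_text: str) -> Optional[Tuple[int, int]]:
    """Return (current, maximum) from the last 'Attempt: N/M' entry, if any."""
    matches = ATTEMPT_RE.findall(log_text)
    if not matches:
        return None
    current, maximum = matches[-1]
    return int(current), int(maximum)


def server_url(log_text: str) -> Optional[str]:
    """Return the URL from the 'successfully started' entry, or None."""
    match = STARTED_RE.search(log_text)
    return match.group(1) if match else None
```

A result of None from server_url together with a rising attempt count means that vLLM is still loading the teacher model.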
The SDG process completes when the CLI displays the location of your new dataset.
Example output of a successful SDG run
INFO 2024-08-16 17:12:46,548 instructlab.sdg.datamixing:200: Mixed Dataset saved to /home/example-user/.local/share/instructlab/datasets/skills_train_msgs_2024-08-16T16_50_11.jsonl
INFO 2024-08-16 17:12:46,549 instructlab.sdg:438: Generation took 1355.74s
Note: This process can be time consuming depending on your hardware specifications.
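Dataset file names such as skills_train_msgs_2024-08-16T16_50_11.jsonl carry a timestamp, so the newest file of a given kind can be chosen from the name alone. The following is a minimal sketch that assumes the naming pattern seen in the example output; file_timestamp and newest_dataset are hypothetical helper names.

```python
import re
from datetime import datetime
from pathlib import Path
from typing import Iterable, Optional

# Matches a trailing stamp such as _2024-08-16T16_50_11.jsonl
STAMP_RE = re.compile(r"_(\d{4}-\d{2}-\d{2}T\d{2}_\d{2}_\d{2})\.jsonl$")


def file_timestamp(name: str) -> Optional[datetime]:
    """Parse the timestamp embedded in a dataset file name, if present."""
    match = STAMP_RE.search(name)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H_%M_%S")


def newest_dataset(names: Iterable[str], prefix: str) -> Optional[str]:
    """Return the name with the latest embedded timestamp for one prefix."""
    stamped = []
    for name in names:
        # Compare against the base name so full paths also work.
        if not Path(name).name.startswith(prefix):
            continue
        stamp = file_timestamp(name)
        if stamp is not None:
            stamped.append((stamp, name))
    return max(stamped)[1] if stamped else None
```

For example, calling newest_dataset with the names in a datasets directory and the prefix "knowledge_train_msgs_" returns the knowledge file with the latest stamp, or None when no file matches.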
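Verification
To verify that the SDG files were created, navigate to your ~/.local/share/instructlab/datasets/ directory and list the files that correspond to the date when the data was generated. For example:
$ ls 2024-03-24_194933
Example output
knowledge_recipe_2024-03-24T20_54_21.yaml
skills_recipe_2024-03-24T20_54_21.yaml
knowledge_train_msgs_2024-03-24T20_54_21.jsonl
skills_train_msgs_2024-03-24T20_54_21.jsonl
messages_granite-7b-lab-Q4_K_M_2024-03-24T20_54_21.jsonl
node_datasets_2024-03-24T15_12_12/
Important: Make a note of your most recent knowledge_train_msgs.jsonl and skills_train_msgs.jsonl files. You need to specify these files during multi-phase training. Each JSONL file name includes a timestamp, for example knowledge_train_msgs_2024-08-08T20_04_28.jsonl; use the most recent files when training.
Optional: You can view the output of SDG by navigating to your ~/.local/share/instructlab/datasets/ directory and opening the JSONL file.
$ cat ~/.local/share/instructlab/datasets/<generation-date>/<jsonl-dataset>
Example output of an SDG JSONL file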
{"messages":[{"content":"I am, Red Hat\u00ae Instruct Model based on Granite 7B, an AI language model developed by Red Hat and IBM Research, based on the Granite-7b-base language model. My primary function is to be a chat assistant.","role":"system"},{"content":"<|user|>\n### Deep-sky objects\n\nThe constellation does not lie on the [galactic\nplane](galactic_plane \"wikilink\") of the Milky Way, and there are no\nprominent star clusters. [NGC 625](NGC_625 \"wikilink\") is a dwarf\n[irregular galaxy](irregular_galaxy \"wikilink\") of apparent magnitude\n11.0 and lying some 12.7 million light years distant. Only 24000 light\nyears in diameter, it is an outlying member of the [Sculptor\nGroup](Sculptor_Group \"wikilink\"). NGC 625 is thought to have been\ninvolved in a collision and is experiencing a burst of [active star\nformation](Active_galactic_nucleus \"wikilink\"). [NGC\n37](NGC_37 \"wikilink\") is a [lenticular\ngalaxy](lenticular_galaxy \"wikilink\") of apparent magnitude 14.66. It is\napproximately 42 [kiloparsecs](kiloparsecs \"wikilink\") (137,000\n[light-years](light-years \"wikilink\")) in diameter and about 12.9\nbillion years old. [Robert's Quartet](Robert's_Quartet \"wikilink\")\n(composed of the irregular galaxy [NGC 87](NGC_87 \"wikilink\"), and three\nspiral galaxies [NGC 88](NGC_88 \"wikilink\"), [NGC 89](NGC_89 \"wikilink\")\nand [NGC 92](NGC_92 \"wikilink\")) is a group of four galaxies located\naround 160 million light-years away which are in the process of\ncolliding and merging. They are within a circle of radius of 1.6 arcmin,\ncorresponding to about 75,000 light-years. Located in the galaxy ESO\n243-49 is [HLX-1](HLX-1 \"wikilink\"), an [intermediate-mass black\nhole](intermediate-mass_black_hole \"wikilink\")\u2014the first one of its kind\nidentified. It is thought to be a remnant of a dwarf galaxy that was\nabsorbed in a [collision](Interacting_galaxy \"wikilink\") with ESO\n243-49. 
Before its discovery, this class of black hole was only\nhypothesized.\n\nLying within the bounds of the constellation is the gigantic [Phoenix\ncluster](Phoenix_cluster \"wikilink\"), which is around 7.3 million light\nyears wide and 5.7 billion light years away, making it one of the most\nmassive [galaxy clusters](galaxy_cluster \"wikilink\"). It was first\ndiscovered in 2010, and the central galaxy is producing an estimated 740\nnew stars a year. Larger still is [El\nGordo](El_Gordo_(galaxy_cluster) \"wikilink\"), or officially ACT-CL\nJ0102-4915, whose discovery was announced in 2012.