Chapter 4. Generate synthetic data
When you customize a model for your enterprise, you must generate high-quality synthetic data to augment your dataset, improve model robustness, and cover edge cases.
Red Hat provides the Synthetic Data Generation (SDG) Hub, a modular Python framework for building synthetic data generation pipelines by using composable blocks and flows. Each block performs a specific task, such as LLM chat, text parsing, evaluation, or data transformation. Flows chain blocks together to create complex data generation pipelines that include validation and parameter management. A flow (data generation pipeline) is a YAML specification that defines an instance of a data generation algorithm.
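As a conceptual illustration of the block-and-flow pattern described above (this is not the sdg_hub API; all names and structure here are illustrative only), a flow can be thought of as a chain of single-purpose transformations over data records:

```python
# Conceptual sketch of the composable block/flow pattern described above.
# This is NOT the sdg_hub API; names and structure are illustrative only.
from typing import Callable, Dict

Block = Callable[[Dict], Dict]  # a block transforms one data record

def make_flow(*blocks: Block) -> Block:
    """Chain blocks so that the output of each block feeds the next."""
    def run(record: Dict) -> Dict:
        for block in blocks:
            record = block(record)
        return record
    return run

# Hypothetical blocks: normalize raw text, then attach a validity flag.
def parse_text(record: Dict) -> Dict:
    return {**record, "text": record["raw"].strip()}

def evaluate(record: Dict) -> Dict:
    return {**record, "valid": len(record["text"]) > 0}

flow = make_flow(parse_text, evaluate)
result = flow({"raw": "  Quarterly revenue grew 4%.  "})
# result["text"] == "Quarterly revenue grew 4%." and result["valid"] is True
```

In SDG Hub, the equivalent chaining and parameterization is declared in the flow's YAML specification rather than in Python code.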
4.1. Explore the SDG Hub examples
To get started with SDG Hub, explore the provided examples.
Prerequisites
- Install the Synthetic Data Generation (SDG) Hub library as described in Set up your working environment.
Procedure
To access the SDG Hub examples, clone the SDG Hub Git repository by using one of the following methods:
- To clone the https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub.git repository from JupyterLab, follow the steps in Clone an example Git repository.
- To create a local clone of the repository, run the following command:

  git clone https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub
Go to the examples directory to view the notebooks and YAML files for these use cases:
- Knowledge tuning - Generate data to fine-tune a model on enterprise documents so that the resulting trained model can accurately recall relevant content and facts in response to user queries. This example provides a complete walkthrough of data generation and preparation for training.
- Text analysis - Generate data for teaching models to extract meaningful insights from text in a structured format. Create custom blocks and extend existing flows for new applications.
Each use case directory includes a README file that provides instructions, performance notes, and configuration tips.
When you run the example notebooks, consider the following information:
- Data generation time and statistics: The total time to generate data depends on both the maximum concurrency supported by your endpoint and the complexity of the running flow. Longer flows, such as the flows in the Knowledge Generation notebooks, take more time to complete because they produce a large number of summaries and Q&A pairs, each of which undergoes verification within the pipeline.
- LLM endpoint requirements: For running flows in the Knowledge Generation notebooks, Red Hat recommends the following settings:
  - Set NUMBER_OF_SUMMARIES to a minimum of 10.
  - To achieve reasonable data generation times and avoid timeouts, use an endpoint that supports a maximum concurrency of at least 50.
  - To extend LiteLLM's request timeout, set the LITELLM_REQUEST_TIMEOUT environment variable.
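These recommendations could be applied as environment variables before you launch a notebook session. In this sketch, the timeout value of 600 seconds is only an illustration, not a documented default:

```shell
# Example values only: 10 matches the recommended minimum above,
# and 600 is an illustrative timeout in seconds, not a documented default.
export NUMBER_OF_SUMMARIES=10
export LITELLM_REQUEST_TIMEOUT=600
```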
Additional resources
- SDG community documentation: https://github.com/instructlab/sdg/tree/main/docs
- SDG GitHub repository: https://github.com/instructlab/sdg
4.2. Performance benchmarks for knowledge tuning
To estimate the total time that a flow will take, run the dry_run function and set enable_time_estimation to true.
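The idea behind such an estimate can be sketched with a small, self-contained helper. This is not the sdg_hub implementation, and the sample numbers are hypothetical; it only shows the linear scaling from a timed dry run to a full dataset:

```python
# Hypothetical helper that mirrors the idea behind dry-run time estimation:
# time a small dry run, then scale linearly to the full dataset.
def estimate_total_hours(dry_run_seconds: float, sample_size: int,
                         total_size: int) -> float:
    seconds_per_doc = dry_run_seconds / sample_size
    return seconds_per_doc * total_size / 3600

# Example: a 5-document dry run that took 1024 seconds, scaled to a
# 266-article dataset (numbers chosen for illustration only).
estimate_total_hours(1024, 5, 266)  # ≈ 15.1 hours
```

Real flows are not perfectly linear in the number of documents, so treat such an estimate as an order-of-magnitude guide rather than a guarantee.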
For example, tests that use the gpt-oss-120b LLM on 4x H100 GPUs with the QuALITY dataset (266 articles) showed significant variance between flows.
- The estimated generation times for the full dataset were approximately 15.12 hours for Extractive Summary and 12.99 hours for Detailed Summary, both of which were evaluated with 50 completions per summary (N=50).
- In contrast, the Key Facts and Document Based flows, which generated only a single summary per document, completed in approximately 0.35 and 0.44 hours, respectively.
- Additionally, analysis of the Extractive Summary flow shows that the steepest time reductions occurred between concurrency levels 10 and 30, with significantly diminishing returns beyond a concurrency of 50 in this specific configuration.
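The reported totals can be broken down into a rough per-article cost with simple arithmetic (266 articles; hours taken from the benchmark figures above):

```python
# Per-article cost implied by the reported benchmark totals
# (266 QuALITY articles; hours taken from the text above).
articles = 266
total_hours = {
    "Extractive Summary": 15.12,
    "Detailed Summary": 12.99,
    "Key Facts": 0.35,
    "Document Based": 0.44,
}
minutes_per_article = {
    flow: hours * 60 / articles for flow, hours in total_hours.items()
}
# Extractive Summary works out to roughly 3.4 minutes per article,
# while Key Facts takes well under 5 seconds per article.
```

The large spread reflects the number of generations per document: the summary flows produce many verified completions per article, whereas Key Facts and Document Based produce only one.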
To view a graph that illustrates the accuracy on QuALITY Benchmark (4,609 Evaluation QA), go to: https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/blob/main/examples/knowledge_tuning/enhanced_summary_knowledge_tuning/imgs/quality_benchmark_accuracy.png.
4.3. Guided example - Build a KFP pipeline for SDG
You can generate synthetic data for domain-specific model customization by using a Kubeflow Pipeline (KFP) on Red Hat OpenShift AI. The Domain Customization Data Generation using Kubeflow Pipelines (KFP) example provides a guided walkthrough.
Prerequisites
- Install the Synthetic Data Generation (SDG) Hub library as described in Set up your working environment.
Procedure
Run the following command to clone the Red Hat AI examples repository, which includes the KFP pipeline for knowledge tuning example:

  git clone https://github.com/red-hat-data-services/red-hat-ai-examples
Navigate to the examples/domain_customization_kfp_pipeline directory and follow the instructions in the README file to run the example:
- Configure an environment variable (.env) file, provide your model endpoint, and store the file as a Kubernetes secret. The KFP pipeline consumes the secret as environment variables.
- Generate the KFP pipeline YAML file.
- Upload the YAML file to OpenShift AI and deploy the pipeline.
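A .env file for the first step might look like the following sketch. The variable names here are hypothetical; use the names that the example's README specifies for your model endpoint:

```
# Hypothetical variable names; use the names specified in the example README.
MODEL_ENDPOINT_URL=https://your-model-endpoint.example.com/v1
MODEL_API_KEY=<your-api-key>
```

You can then store the file as a Kubernetes secret, for example with `oc create secret generic <secret-name> --from-env-file=.env`, so that the pipeline can consume its keys as environment variables.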
Verification
The example pipeline generates three types of document augmentations, and then generates four types of QA pairs for each of the three augmentations and the original document. It stores the generated data in the Cloud Object Storage (COS) bucket that is linked through the pipeline server.