Chapter 4. Generate synthetic data


When you customize a model for your enterprise, you can generate high-quality synthetic data to augment your dataset, improve model robustness, and cover edge cases.

Red Hat provides the Synthetic Data Generation (SDG) Hub, a modular Python framework for building synthetic data generation pipelines from composable blocks and flows. Each block performs a specific task, such as LLM chat, text parsing, evaluation, or data transformation. Flows chain blocks together to create complex data generation pipelines that include validation and parameter management. A flow (data generation pipeline) is a YAML specification that defines an instance of a data generation algorithm.

4.1. Explore the SDG Hub examples

To get started with SDG Hub, explore the provided examples.

Prerequisites

Procedure

  1. To access the SDG Hub examples, clone the SDG Hub Git repository:

  2. Go to the examples directory to view the notebooks and YAML files for these use cases:

    • Knowledge tuning - Generate data to fine-tune a model on enterprise documents so that the resulting trained model can accurately recall relevant content and facts in response to user queries. This example provides a complete walkthrough of data generation and preparation for training.
    • Text analysis - Generate data for teaching models to extract meaningful insights from text in a structured format. Create custom blocks and extend existing flows for new applications.

      Each use case directory includes a README file with details such as instructions, performance notes, and configuration tips.

  3. When you run the example notebooks, consider the following information:

    • Data generation time and statistics: The total time to generate data depends on both the maximum concurrency supported by your endpoint and the complexity of the running flow. Longer flows, such as the flows in the Knowledge Generation notebooks, take more time to complete because they produce a large number of summaries and Q&A pairs, each of which undergoes verification within the pipeline.
    • LLM endpoint requirements: For running flows in the Knowledge Generation notebooks, Red Hat recommends that you set the following values:

      • Set NUMBER_OF_SUMMARIES to a minimum of 10.
      • To achieve reasonable data generation times and avoid timeouts, use an endpoint that supports a maximum concurrency of at least 50.
      • Extend LiteLLM’s request timeout by setting the environment variable LITELLM_REQUEST_TIMEOUT.
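      For example, you can export the timeout variable in your shell before starting the notebook (the 600-second value is illustrative; tune it to your endpoint):

```shell
# Extend the LiteLLM request timeout, in seconds. 600 is an example value.
export LITELLM_REQUEST_TIMEOUT=600
```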

Additional resources

4.2. Performance benchmarks for knowledge tuning

To estimate the total time that a flow will take, run the dry_run function with enable_time_estimation set to True.

For example, tests that use the gpt-oss-120b LLM on 4x H100 GPUs with the QuALITY dataset (266 articles) showed significant variance between flows.

  • The estimated generation times for the full dataset were approximately 15.12 hours for Extractive Summary and 12.99 hours for Detailed Summary, both of which were evaluated with 50 completions per summary (N=50).
  • In contrast, the Key Facts and Document Based flows, which generated only a single summary per document, completed in approximately 0.35 and 0.44 hours, respectively.
  • Additionally, analysis of the Extractive Summary flow shows that the steepest time reductions occurred between concurrency levels 10 and 30, with diminishing returns beyond a concurrency of 50 in this configuration.
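As a rough sanity check on these totals (a back-of-the-envelope calculation, not part of the published benchmark), you can derive the per-article generation cost that the full-dataset estimates imply:

```python
# Per-article time implied by the benchmark estimates above,
# for the 266-article QuALITY dataset.
articles = 266
extractive_hours = 15.12  # Extractive Summary flow, N=50
detailed_hours = 12.99    # Detailed Summary flow, N=50

extractive_min_per_article = extractive_hours * 60 / articles
detailed_min_per_article = detailed_hours * 60 / articles

print(f"Extractive Summary: {extractive_min_per_article:.1f} min/article")
print(f"Detailed Summary:   {detailed_min_per_article:.1f} min/article")
```

This works out to roughly 3.4 and 2.9 minutes per article, which illustrates why the single-summary flows (Key Facts and Document Based) finish in well under an hour.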

To view a graph that illustrates accuracy on the QuALITY benchmark (4,609 evaluation QA pairs), go to: https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/blob/main/examples/knowledge_tuning/enhanced_summary_knowledge_tuning/imgs/quality_benchmark_accuracy.png.

4.3. Guided example - Build a KFP pipeline for SDG

You can generate synthetic data for domain-specific model customization by using a Kubeflow Pipelines (KFP) pipeline on Red Hat OpenShift AI. The Domain Customization Data Generation using Kubeflow Pipelines (KFP) example provides a guided walkthrough.

Prerequisites

Procedure

  1. Run the following command to clone the (org-name) AI examples repository, which includes the knowledge tuning KFP pipeline example:

    git clone https://github.com/red-hat-data-services/red-hat-ai-examples
  2. Go to the examples/domain_customization_kfp_pipeline directory.
  3. Follow the instructions in the README file to run the example:

    1. Configure an environment variable (.env) file, provide your model endpoint, and store the file as a Kubernetes secret. The KFP pipeline consumes the secret as environment variables.
    2. Generate the KFP pipeline YAML file.
    3. Upload the YAML file to OpenShift AI and deploy the pipeline.
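
    For example, step 3.a, which stores the .env file as a Kubernetes secret, might look like the following command (the secret name sdg-env is a placeholder; check the README for the exact name that the pipeline expects):

```shell
# Create a secret from the .env file; the pipeline consumes its keys
# as environment variables. "sdg-env" is a placeholder name.
oc create secret generic sdg-env --from-env-file=.env
```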

Verification

The example pipeline generates three types of document augmentations, and then generates four types of QA pairs for each of the three augmented documents and the original document. It stores the generated data in the Cloud Object Storage (COS) bucket that is linked through the pipeline server.
