Chapter 3. Prepare your data for AI consumption


To prepare your data, use Docling to transform unstructured data (such as text documents, images, and audio files) into structured formats that models can consume.

To automate data processing tasks, you can build pipelines with Kubeflow Pipelines (KFP). For examples of pre-built pipelines for unstructured data processing with Docling, see https://github.com/opendatahub-io/data-processing.

3.1. Process data by using Docling

Docling is a Python library that prepares unstructured data (such as PDFs and images) for consumption by large language models.

3.2. Explore the data processing examples

To get started with Docling data processing, explore the provided examples.

Prerequisites

Procedure

  1. To access the data processing examples, clone the data processing Git repository:

    • To clone the https://github.com/opendatahub-io/data-processing.git repository from JupyterLab, follow the steps in Clone an example Git repository and specify the 3.0 branch.
    • To create a local clone of the repository, run the following command:

      git clone https://github.com/opendatahub-io/data-processing -b stable-3.0
  2. Go to the notebooks directory to learn how to use Docling for the following tasks:

    Use cases

    • Conversion - Convert unstructured documents (PDF files) to a structured format (Markdown), with and without a vision-language model (VLM).
    • Chunking - Split documents into smaller, semantically meaningful pieces.
    • Information extraction - Use template formats to extract specific data fields from documents such as invoices.
    • Subset selection - Use the provided script or notebook to reduce the size of your dataset. The algorithm analyzes an input dataset and reduces its size while preserving data diversity and coverage.

    Tutorials - An example notebook that provides a complete, end-to-end workflow for preparing a dataset of documents for a RAG (Retrieval-Augmented Generation) system.
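To illustrate the chunking use case above, the following is a naive, illustrative splitter written in plain Python; it is not Docling's own chunking API, which the notebooks demonstrate.

```python
# Illustrative only: a naive paragraph chunker, not Docling's chunking API.
# Splits text on blank lines and packs paragraphs into chunks no longer
# than max_chars, keeping paragraph boundaries intact.
def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Docling's own chunkers go further, using document structure and token counts rather than raw character lengths.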

3.3. Automate data processing with Kubeflow Pipelines

With Kubeflow Pipelines (KFP), you can automate complex, multi-step Docling data processing tasks into scalable workflows.

With the KFP Software Development Kit (SDK), you can define custom components and stitch them together into a complete pipeline. The SDK allows you to fully control and automate Docling conversion tasks with specific parameters.

Note: You can build a custom runtime image to ensure that all required Docling dependencies are present for pipeline execution. For information on how to run a Docling pipeline with a custom image, see the Docling Pipeline documentation.
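A custom runtime image can be a small layer on top of a Python base image. The following Containerfile is a sketch; the base image and the unpinned package version are assumptions, and production images should pin versions.

```dockerfile
# Sketch of a custom runtime image; base image choice is an assumption.
FROM registry.access.redhat.com/ubi9/python-311
RUN pip install --no-cache-dir docling
```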

3.4. Explore the Kubeflow pipeline examples

To get started with Kubeflow pipelines, explore the provided examples. You can download and modify the example code to quickly create a Docling data processing or model training pipeline.

Prerequisites

Procedure

  1. To access the Kubeflow pipeline examples, run the following command to clone the data processing Git repository:

    git clone https://github.com/opendatahub-io/data-processing -b stable-3.0
  2. Go to the kubeflow-pipelines directory, which contains the following tested examples for running Docling as a scalable pipeline. For instructions on how to import, configure, and run the examples, see the README file and the Red Hat AI Working with AI pipelines guide.

    • Standard Pipeline: For converting standard documents that contain text and structured elements. For more information, see the Standard Conversion Pipelines documentation.
    • VLM (Vision Language Model): For converting highly complex or difficult-to-parse documents, such as those with custom instructions or complex layouts, or to add image descriptors. For more information, see the VLM Pipelines documentation.