
Chapter 3. Prepare your data for AI consumption


To prepare your data, use Docling to transform unstructured data (such as text documents, images, and audio files) into structured formats that models can consume.

To automate data processing tasks, you can build Kubeflow Pipelines (KFP). For examples of pre-built pipelines for unstructured data processing with Docling, see https://github.com/opendatahub-io/data-processing.

3.1. Process data by using Docling

Docling is a Python library that you use to prepare unstructured data, such as PDFs and images, for consumption by large language models.
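As an illustration, a minimal conversion might look like the following (a sketch assuming the `docling` package is installed; the file name is a placeholder):

```python
from docling.document_converter import DocumentConverter

# Convert a PDF (local path or URL) into Docling's structured document model.
converter = DocumentConverter()
result = converter.convert("report.pdf")  # placeholder file name

# Export the structured document as Markdown for LLM consumption.
markdown = result.document.export_to_markdown()
print(markdown[:200])
```

The same `export_to_*` pattern applies to other output formats; see the Docling documentation for the full list.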

3.2. Explore the data processing examples

To get started with data processing with Docling, explore the provided examples.

Prerequisites

Procedure

  1. To access the data processing examples, clone the data processing Git repository:

    git clone https://github.com/opendatahub-io/data-processing -b stable-3.0

  2. Go to the notebooks directory to learn how to use Docling for the following tasks:

    Use cases

    • Conversion - Convert unstructured documents (PDF files) to a structured format (Markdown), with and without a vision language model (VLM).
    • Chunking - Split documents into smaller, semantically meaningful pieces.
    • Information extraction - Use template formats to extract specific data fields from documents such as invoices.
    • Subset selection - Use this script or notebook to reduce the size of your dataset. The algorithm analyzes an input dataset and reduces its size while preserving data diversity and coverage.
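The subset-selection idea above can be sketched as a greedy farthest-point sampler over document embeddings (a toy illustration of diversity-preserving reduction; the repository's actual algorithm may differ):

```python
import math

def farthest_point_subset(vectors, k):
    """Greedily pick k vectors that maximize pairwise diversity:
    each step adds the point farthest from everything already chosen.
    A toy stand-in for dataset subset selection."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    selected = [0]  # seed with the first item
    while len(selected) < k:
        best, best_d = None, -1.0
        for i in range(len(vectors)):
            if i in selected:
                continue
            # Distance from candidate i to its nearest selected point.
            d = min(dist(vectors[i], vectors[j]) for j in selected)
            if d > best_d:
                best, best_d = i, d
        selected.append(best)
    return selected

# Five toy 2-D "embeddings"; three form a tight cluster near the origin.
docs = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.0, 0.1), (5.0, 0.0)]
print(farthest_point_subset(docs, 3))  # → [0, 2, 4]
```

The sampler keeps one representative of the near-duplicate cluster and both outliers, which is the coverage-plus-diversity behavior the bullet describes.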

    Tutorials - An example notebook that provides a complete, end-to-end workflow for preparing a dataset of documents for a retrieval-augmented generation (RAG) system.
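The chunking step of such a dataset-preparation workflow can be sketched in plain Python (a naive illustration only; Docling's own chunkers are semantically aware, while this toy version merely packs whole paragraphs into size-bounded chunks):

```python
def chunk_paragraphs(text, max_chars=500):
    """Greedily pack paragraphs into chunks of at most max_chars,
    keeping paragraph boundaries intact. A paragraph longer than
    max_chars becomes its own (oversized) chunk."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = current + "\n\n" + para if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks

doc = "First paragraph.\n\nSecond paragraph.\n\n" + "X" * 60
for chunk in chunk_paragraphs(doc, max_chars=40):
    print(repr(chunk))
```

Each chunk can then be embedded and indexed for retrieval; real pipelines would chunk on document structure (headings, tables) rather than raw character counts.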

Additional resources

3.3. Automate data processing steps by building AI pipelines

With Kubeflow Pipelines (KFP), you can automate complex, multi-step Docling data processing tasks into scalable workflows.

With the KFP Software Development Kit (SDK), you can define custom components and stitch them together into a complete pipeline. The SDK allows you to fully control and automate Docling conversion tasks with specific parameters.
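A minimal pipeline definition with the KFP SDK might look like the following (a sketch; the component and pipeline names are hypothetical, and the component body stands in for a real Docling call):

```python
from kfp import dsl, compiler

@dsl.component(base_image="python:3.11")
def convert_documents(source_uri: str) -> str:
    # Hypothetical component: a real implementation would invoke
    # Docling here to convert the documents at source_uri.
    return source_uri

@dsl.pipeline(name="docling-conversion")  # hypothetical pipeline name
def docling_pipeline(source_uri: str):
    convert_documents(source_uri=source_uri)

# Compile the pipeline to a YAML definition that you can import and run.
compiler.Compiler().compile(docling_pipeline, "docling_pipeline.yaml")
```

Additional components (chunking, embedding, upload) can be chained in the pipeline function, with KFP handling the data passing between steps.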

Note: You can build a custom runtime image to ensure that all required Docling dependencies are present for pipeline execution. For information on how to run a Docling pipeline with a custom image, see the Docling Pipeline documentation.

3.4. Explore the Kubeflow pipeline examples

To get started with Kubeflow pipelines, explore the provided examples. You can download and modify the example code to quickly create a Docling data processing or model training pipeline.

Prerequisites

Procedure

  1. To access the Kubeflow pipeline examples, run the following command to clone the data processing Git repository:

    git clone https://github.com/opendatahub-io/data-processing -b stable-3.0
  2. Go to the kubeflow-pipelines directory, which contains the following tested examples for running Docling as a scalable pipeline. For instructions on how to import, configure, and run the examples, see the README file and the Red Hat AI Working with AI pipelines guide.

    • Standard Pipeline: For converting standard documents that contain text and structured elements. For more information, see the Standard Conversion Pipelines documentation.
    • VLM (Vision Language Model): For converting highly complex or difficult-to-parse documents, such as those with custom instructions or complex layouts, or to add image descriptors. For more information, see the VLM Pipelines documentation.