Chapter 3. Prepare your data for AI consumption
To prepare your data, use Docling to transform unstructured data (such as text documents, images, and audio files) into structured formats that models can consume.
To automate data processing tasks, you can build pipelines with Kubeflow Pipelines (KFP). For examples of pre-built pipelines for unstructured data processing with Docling, see https://github.com/opendatahub-io/data-processing.
3.1. Process data by using Docling
Docling is the Python library that you use to prepare unstructured data (such as PDFs and images) for consumption by large language models.
3.2. Explore the data processing examples
To get started with data processing with Docling, explore the provided examples.
Prerequisites
- Install the data processing library as described in Set up your working environment.
Procedure
To access the data processing examples, clone the data processing Git repository:
- To clone the https://github.com/opendatahub-io/data-processing.git repository from JupyterLab, follow the steps in Clone an example Git repository and specify the stable-3.0 branch.
- To create a local clone of the repository, run the following command:

git clone https://github.com/opendatahub-io/data-processing -b stable-3.0
Go to the notebooks directory to learn how to use Docling for the following tasks:
- Conversion - Convert unstructured documents (PDF files) to a structured format (Markdown), with and without a vision-language model (VLM).
- Chunking - Split documents into smaller, semantically meaningful pieces.
- Information extraction - Use template formats to extract specific data fields from documents like invoices.
- Subset selection - Use the provided script or notebook to reduce the size of your dataset. The algorithm analyzes an input dataset and reduces it in size while preserving data diversity and coverage.
- Tutorials - An example notebook that provides a complete, end-to-end workflow for preparing a dataset of documents for a retrieval-augmented generation (RAG) system.
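Docling ships its own chunkers; the following stdlib-only sketch is not the Docling API, but it illustrates the idea behind size-bounded chunking, which splits along paragraph boundaries so that each chunk stays semantically coherent:

```python
def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Split text into chunks along paragraph boundaries.

    Paragraphs are kept whole when possible so that each chunk
    remains a semantically coherent unit.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would
        # exceed the size budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

The Docling chunkers additionally use document structure (headings, tables, captions) rather than raw character counts; see the notebooks for the real implementations.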
Additional resources
- Docling community project: https://docling-project.github.io/docling/
- GitHub Repository for the Docling project source code: https://github.com/docling-project/docling
3.3. Automate data processing steps by building AI pipelines
With Kubeflow Pipelines (KFP), you can automate complex, multi-step Docling data processing tasks into scalable workflows.
With the KFP Software Development Kit (SDK), you can define custom components and stitch them together into a complete pipeline. The SDK allows you to fully control and automate Docling conversion tasks with specific parameters.
Note: You can build a custom runtime image to ensure that all required Docling dependencies are present for pipeline execution. For information about how to run a Docling pipeline with a custom image, see the Docling Pipeline documentation.
3.4. Explore the Kubeflow pipeline examples
To get started with Kubeflow pipelines, explore the provided examples. You can download and modify the example code to quickly create a Docling data processing or model training pipeline.
Prerequisites
- Install the data processing library as described in Set up your working environment.
Procedure
To access the Kubeflow pipeline examples, run the following command to clone the data processing Git repository:
git clone https://github.com/opendatahub-io/data-processing -b stable-3.0

Go to the kubeflow-pipelines directory, which contains the following tested examples for running Docling as a scalable pipeline. For instructions on how to import, configure, and run the examples, see the README file and the Red Hat AI Working with AI pipelines guide.
- Standard Pipeline: For converting standard documents that contain text and structured elements. For more information, see the Standard Conversion Pipelines documentation.
- VLM (Vision Language Model): For converting highly complex or difficult-to-parse documents, such as those with custom instructions or complex layouts, or to add image descriptors. For more information, see the VLM Pipelines documentation.