
Chapter 3. Prepare your data for AI consumption


To prepare your data, use Docling to transform unstructured data (such as text documents, images, and audio files) into structured formats that models can consume.

To automate data processing tasks, you can build Kubeflow Pipelines (KFP). For examples of pre-built pipelines for unstructured data processing with Docling, see https://github.com/opendatahub-io/data-processing.

3.1. Process data by using Docling

Docling is a Python library that you use to prepare unstructured data, such as PDFs and images, for consumption by large language models.
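As an illustration, a minimal conversion might look like the following (a sketch assuming the `docling` package is installed; the file name is a placeholder):

```python
from docling.document_converter import DocumentConverter

# Convert a PDF (local path or URL) into Docling's structured document model.
converter = DocumentConverter()
result = converter.convert("report.pdf")  # placeholder file name

# Export the structured document as Markdown for LLM consumption.
markdown = result.document.export_to_markdown()
print(markdown[:200])
```

The same `export_to_*` pattern applies to other output formats; see the Docling documentation for the full list.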

3.2. Explore the data processing examples

To get started with data processing with Docling, explore the provided examples.

Prerequisites

Procedure

  1. To access the data processing examples, clone the data processing Git repository:

    git clone https://github.com/opendatahub-io/data-processing -b stable-3.0

  2. Go to the notebooks directory to learn how to use Docling for the following tasks:

    Use cases

    • Conversion - Convert unstructured documents (PDF files) to a structured format (Markdown), with and without a vision language model (VLM).
    • Chunking - Split documents into smaller, semantically meaningful pieces.
    • Information extraction - Use template formats to extract specific data fields from documents such as invoices.
    • Subset selection - Use this script or notebook to reduce the size of your dataset. The algorithm analyzes an input dataset and reduces its size while preserving data diversity and coverage.
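The subset-selection idea above can be sketched as a greedy farthest-point sampler over document embeddings (a toy illustration of diversity-preserving reduction; the repository's actual algorithm may differ):

```python
import math

def farthest_point_subset(vectors, k):
    """Greedily pick k vectors that maximize pairwise diversity:
    each step adds the point farthest from everything already chosen.
    A toy stand-in for dataset subset selection."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    selected = [0]  # seed with the first item
    while len(selected) < k:
        best, best_d = None, -1.0
        for i in range(len(vectors)):
            if i in selected:
                continue
            # Distance from candidate i to its nearest selected point.
            d = min(dist(vectors[i], vectors[j]) for j in selected)
            if d > best_d:
                best, best_d = i, d
        selected.append(best)
    return selected

# Five toy 2-D "embeddings"; three form a tight cluster near the origin.
docs = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.0, 0.1), (5.0, 0.0)]
print(farthest_point_subset(docs, 3))  # → [0, 2, 4]
```

The sampler keeps one representative of the near-duplicate cluster and both outliers, which is the coverage-plus-diversity behavior the bullet describes.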

    Tutorials - An example notebook that provides a complete, end-to-end workflow for preparing a dataset of documents for a retrieval-augmented generation (RAG) system.
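The chunking step of such a dataset-preparation workflow can be sketched in plain Python (a naive illustration only; Docling's own chunkers are semantically aware, while this toy version merely packs whole paragraphs into size-bounded chunks):

```python
def chunk_paragraphs(text, max_chars=500):
    """Greedily pack paragraphs into chunks of at most max_chars,
    keeping paragraph boundaries intact. A paragraph longer than
    max_chars becomes its own (oversized) chunk."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = current + "\n\n" + para if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks

doc = "First paragraph.\n\nSecond paragraph.\n\n" + "X" * 60
for chunk in chunk_paragraphs(doc, max_chars=40):
    print(repr(chunk))
```

Each chunk can then be embedded and indexed for retrieval; real pipelines would chunk on document structure (headings, tables) rather than raw character counts.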

Additional resources

3.3. Automate data processing steps by building AI pipelines

With Kubeflow Pipelines (KFP), you can automate complex, multi-step Docling data processing tasks into scalable workflows.

With the KFP Software Development Kit (SDK), you can define custom components and stitch them together into a complete pipeline. The SDK allows you to fully control and automate Docling conversion tasks with specific parameters.
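A minimal pipeline definition with the KFP SDK might look like the following (a sketch; the component and pipeline names are hypothetical, and the component body stands in for a real Docling call):

```python
from kfp import dsl, compiler

@dsl.component(base_image="python:3.11")
def convert_documents(source_uri: str) -> str:
    # Hypothetical component: a real implementation would invoke
    # Docling here to convert the documents at source_uri.
    return source_uri

@dsl.pipeline(name="docling-conversion")  # hypothetical pipeline name
def docling_pipeline(source_uri: str):
    convert_documents(source_uri=source_uri)

# Compile the pipeline to a YAML definition that you can import and run.
compiler.Compiler().compile(docling_pipeline, "docling_pipeline.yaml")
```

Additional components (chunking, embedding, upload) can be chained in the pipeline function, with KFP handling the data passing between steps.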

Note: You can build a custom runtime image to ensure that all required Docling dependencies are present for pipeline execution. For information on how to run a Docling pipeline with a custom image, see the Docling Pipeline documentation.

3.4. Explore the Kubeflow pipeline examples

To get started with Kubeflow pipelines, explore the provided examples. You can download and modify the example code to quickly create a Docling data processing or model training pipeline.

Prerequisites

Procedure

  1. To access the Kubeflow pipeline examples, run the following command to clone the data processing Git repository:

    git clone https://github.com/opendatahub-io/data-processing -b stable-3.0
  2. Go to the kubeflow-pipelines directory, which contains the following tested examples for running Docling as a scalable pipeline. For instructions on how to import, configure, and run the examples, see the README file and the Red Hat AI Working with AI pipelines guide.

    • Standard Pipeline: For converting standard documents that contain text and structured elements. For more information, see the Standard Conversion Pipelines documentation.
    • VLM (Vision Language Model): For converting highly complex or difficult-to-parse documents, such as those with custom instructions or complex layouts, or to add image descriptors. For more information, see the VLM Pipelines documentation.