Chapter 6. Configuring model checkpointing for distributed training with Kubeflow Trainer v2
Kubeflow Trainer v2 provides model checkpointing for distributed training jobs on Red Hat OpenShift AI. Checkpointing saves the training state (model weights, optimizer state, learning rate scheduler, and current training step) at regular intervals. If a training job is interrupted due to pod preemption, eviction, or node maintenance, training can resume from the latest checkpoint rather than restarting from scratch.
Kubeflow Trainer v2 supports two checkpoint storage backends:
- PersistentVolumeClaim (PVC)
- Checkpoints are saved directly to a persistent volume mounted to the training pods.
- S3-compatible object storage
- Checkpoints are saved to fast local storage first, then uploaded to S3 in the background without blocking GPU training. Supported providers include AWS S3, MinIO, Ceph RGW, and IBM Cloud Object Storage.
With both storage backends, the SDK provides Just-In-Time (JIT) checkpointing, which automatically saves the training state when a termination signal (SIGTERM) is received, ensuring that in-progress training steps are not lost during interruptions.
6.1. Configuring checkpointing with a PersistentVolumeClaim
Configure Kubeflow Trainer v2 to save model checkpoints to a PersistentVolumeClaim (PVC). When you specify a PVC as the checkpoint destination, all training pods share the same storage, and the SDK handles mounting automatically.
Prerequisites
- You have a PVC with the ReadWriteMany (RWX) access mode in the same namespace as the training job.
- The PVC has enough capacity to store your model checkpoints.
Example
from kubeflow.trainer.rhai.transformers import TransformersTrainer

trainer = TransformersTrainer(
    func=train_fn,
    num_nodes=2,
    resources_per_node={
        "nvidia.com/gpu": 2,
        "memory": "128Gi",
        "cpu": "8",
    },
    output_dir="pvc://my-checkpoints-pvc/llama3-fine-tune",
)
In this configuration, all training pods read checkpoints from and write checkpoints to the shared PVC. No additional storage provisioning is required beyond the PVC itself.
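The examples in this chapter pass a train_fn training function that is not shown. The following minimal sketch illustrates what such a function might look like with Hugging Face Transformers. The model, dataset, and local output path are illustrative assumptions, not values required by the SDK.

# Minimal illustrative training function. The model, dataset, and
# output path below are hypothetical placeholders, not SDK requirements.
def train_fn():
    from datasets import load_dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    model_name = "gpt2"  # placeholder; substitute your base model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Placeholder dataset; replace with your own fine-tuning data.
    dataset = load_dataset("imdb", split="train[:1%]")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = dataset.map(tokenize, batched=True, remove_columns=["text", "label"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="/mnt/kubeflow-checkpoints",  # assumed local checkpoint path
            per_device_train_batch_size=1,
            num_train_epochs=1,
        ),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()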
6.2. Configuring checkpointing with S3-compatible object storage
Configure Kubeflow Trainer v2 to save model checkpoints to S3-compatible object storage. This approach uploads checkpoints asynchronously so that checkpoint writes do not block GPU training.
Prerequisites
- You have an S3-compatible object storage bucket.
Procedure
Create a data connection in the Red Hat OpenShift AI dashboard.
- In the Red Hat OpenShift AI dashboard, navigate to Data Science Projects and select your project.
- Go to the Connections tab.
- Click Add connection and select S3-compatible object storage.
Fill in the connection details:
- Name: A descriptive name for the connection (for example, my-s3-checkpoint-storage).
- Access key: Your S3 access key ID.
- Secret key: Your S3 secret access key.
- Endpoint: Your S3 endpoint URL (for example, https://s3.amazonaws.com for AWS S3, or your MinIO or Ceph endpoint).
- Region: The S3 region (for example, us-east-1).
- Bucket: The S3 bucket name.
- Click Add connection. This creates a Kubernetes secret in your project namespace.
Note the resource name of the connection. You can find this on the Connections tab.
Important: If you rename a connection after creating it, the underlying Kubernetes secret retains its original name. For example, if you create a connection named s3-storage-connection and later rename it to s3-storage-connection-old, the secret is still named s3-storage-connection.
Configure the training job.
Specify the data connection resource name as data_connection_name in your TransformersTrainer configuration, as shown in the following example:

from kubeflow.trainer.rhai.transformers import TransformersTrainer

trainer = TransformersTrainer(
    func=train_fn,
    num_nodes=2,
    resources_per_node={
        "nvidia.com/gpu": 2,
        "memory": "128Gi",
        "cpu": "8",
    },
    output_dir="s3://my-bucket/llama3-fine-tune",
    data_connection_name="my-s3-checkpoint-storage",
)
The SDK reads the S3 credentials from the Kubernetes secret and exposes them as environment variables to the training pods. You do not need to pass credentials in the env parameter.
Disable SSL verification only for endpoints with self-signed certificates in non-production environments. Disabling SSL verification in production exposes training data and credentials to potential interception.
6.3. S3 checkpointing workflow
When you configure S3 as the checkpoint storage, the SDK uses a local-first architecture. Checkpoints are saved to local storage on each pod first, then uploaded to S3 in the background. This design avoids blocking GPU training during upload operations.
The SDK automatically provisions an emptyDir volume on each training pod for local checkpoint staging. The volume uses the local disk of the node. Kubernetes creates the volume when the pod starts and deletes it when the pod terminates.
The checkpointing lifecycle follows these phases:
- Training start (resume): If a previous checkpoint exists in S3, the SDK downloads it to local storage and automatically resumes training from the latest valid checkpoint.
- During training (periodic save): Hugging Face Transformers saves checkpoints to local storage at intervals configured by save_steps. The SDK moves completed checkpoints to a staging directory and uploads them to S3 using a background thread. Training continues immediately without waiting for the upload to finish.
- Preemption or termination (JIT save): If the pod receives a SIGTERM signal, the SDK saves the current training state at the next safe synchronization point before the job exits. The SDK then uploads the JIT checkpoint to S3.
- Training end (final upload): The SDK waits for any pending uploads to complete, then uploads the final trained model artifacts to S3.
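Conceptually, the JIT save phase works like a signal handler that sets a flag, with the training loop checking the flag at step boundaries. The following sketch illustrates the pattern only; it is not the SDK's implementation, and train_step and save_checkpoint are hypothetical placeholders.

import signal

stop_requested = False

def _handle_sigterm(signum, frame):
    # Record that a termination signal arrived; defer the save to a safe point.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, _handle_sigterm)

def train_step(step):
    pass  # hypothetical placeholder for one training step

def save_checkpoint(step):
    print(f"JIT checkpoint saved at step {step}")  # hypothetical placeholder

for step in range(1000):
    train_step(step)
    if stop_requested:  # next safe synchronization point: a step boundary
        save_checkpoint(step)
        break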
6.4. PVC and S3 checkpoint storage comparison
Compare PVC and S3 checkpoint storage backends across setup complexity, multi-node support, capacity, portability, cost, and JIT checkpointing to choose the option that best fits your environment.
| Consideration | PVC | S3 |
|---|---|---|
| Setup complexity | Simpler. Mount a PersistentVolumeClaim directly to the training pods. | Complex. Requires an S3-compatible storage service and credentials. |
| Multi-node training | Requires a PVC with the ReadWriteMany (RWX) access mode so that all training pods can share it. | Works with any cluster. Each pod uploads independently using local emptyDir storage. |
| Storage capacity | Limited by PVC size, which must be provisioned in advance. | Scales with bucket capacity. You do not need to size storage in advance. |
| Checkpoint portability | Tied to the specific cluster and namespace. | Portable. Checkpoints are accessible from any cluster and can be shared with collaborators. |
| Cost | Pay for persistent block storage continuously, even when no training jobs are running. | Pay for storage used and data transfer. Lifecycle policies can manage costs. |
| JIT checkpointing | Supported. The SDK saves training state directly to the PVC. | Supported. The SDK saves training state to local storage, then uploads it to S3. |
6.5. Best practices for S3 checkpointing
Follow these best practices when configuring S3 checkpointing for distributed training jobs to avoid common issues such as pod eviction due to storage pressure, inefficient GPU utilization, and slow startup times.
S3 checkpointing works well for distributed training at scale but requires careful configuration to balance storage capacity, GPU efficiency, and recovery granularity. The topics in this section cover how to configure training pods efficiently, estimate storage requirements, choose appropriate training strategies, and monitor storage usage.
6.5.1. GPU distribution guidelines for training jobs
How you distribute GPUs across nodes affects storage usage, checkpoint download times, and training performance. When using S3 storage, each pod independently downloads the model and checkpoints to its local emptyDir volume. Minimizing the number of pods reduces the total storage consumed and the number of redundant downloads.
Consider the following two configurations for training with six GPUs:
- Less efficient configuration (6 pods, 1 GPU each)
trainer = TransformersTrainer(
    func=train_fn,
    num_nodes=6,
    resources_per_node={
        "nvidia.com/gpu": 1,
        "memory": "64Gi",
        "cpu": "4",
    },
    output_dir="s3://my-bucket/llama3-fine-tune",
)
- More efficient configuration (2 pods, 3 GPUs each)
trainer = TransformersTrainer(
    func=train_fn,
    num_nodes=2,
    resources_per_node={
        "nvidia.com/gpu": 3,
        "memory": "192Gi",
        "cpu": "12",
    },
    output_dir="s3://my-bucket/llama3-fine-tune",
)
The second configuration is more efficient for the following reasons:
- Fewer model downloads: Each pod downloads the full model to its local cache. With two pods, the model is downloaded twice instead of six times. This reduces startup time and network bandwidth consumption.
- Fewer checkpoint downloads on resume: When resuming training from an S3 checkpoint, each pod downloads the checkpoint independently. Fewer pods means fewer redundant downloads.
- Faster intra-pod communication: GPUs within the same pod communicate over high-bandwidth NVLink or PCIe using P2P or CUMEM transports, which is faster than inter-pod communication over the network (NCCL over TCP sockets). Packing more GPUs into each pod maximizes the proportion of communication that uses the fast intra-pod path.
- Less total local storage consumed: Each pod requires its own emptyDir volume for model cache, checkpoints, and checkpoint staging. Fewer pods means less total node storage consumed.
When possible, maximize the number of GPUs per node to reduce the total number of pods in your training job.
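As a quick back-of-the-envelope check, using the roughly 25 GB per-pod model cache observed in the DDP benchmarks in the next topic, the two configurations above differ substantially in total download volume:

model_cache_gb = 25  # approximate per-pod model download (DDP 8B benchmark)

# Six 1-GPU pods each download the model independently.
print(6 * model_cache_gb)  # 150 GB downloaded in total

# Two 3-GPU pods download it only twice.
print(2 * model_cache_gb)  # 50 GB downloaded in total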
6.5.2. Understanding local storage requirements
When using S3 checkpointing, each pod writes checkpoints to local emptyDir storage before uploading them to S3. Storage usage fluctuates during training, with temporary spikes during checkpoint consolidation. Provision local storage to handle peak usage rather than average usage, to avoid pod eviction due to storage pressure.
Storage usage follows this general pattern:
- Normal operation: Storage holds the model cache and any locally retained checkpoints.
- Checkpoint save spike: Storage temporarily increases during checkpoint consolidation. This increase is most pronounced for DeepSpeed ZeRO-3, where temporary consolidation storage can be up to 42.5 times the final checkpoint size.
- Training end: The SDK writes final model artifacts to local storage before uploading them to S3.
The following tables show storage measurements from internal benchmarks. Your actual storage requirements can vary depending on your model, precision, training strategy, and dataset. Use these figures as a starting point for capacity planning and validate them with testing against your own workload.
| Strategy | Model size | Training type | Observed peak per pod |
|---|---|---|---|
| DDP | 8B | Full fine-tuning | Up to 200 GB (rank-0) / 150 GB (workers) |
| FSDP | 7B | Full fine-tuning | Up to 200 GB |
| DeepSpeed ZeRO-3 | 70B | LoRA | Up to 150 GB |
The following table breaks down the observed per-pod storage by component:

| Component | DDP (8B) | FSDP (7B) | DeepSpeed (70B LoRA) |
|---|---|---|---|
| Base cache (model download) | ~25 GB | 23 GB | 6 GB |
| Checkpoint download (resume) | ~45 GB | 21 GB | 2.4 GB |
| Local checkpoints | ~90 GB | 84 GB | 16 GB |
| Consolidation peak (temporary) | ~90 GB | 42 GB | 68 GB |
| Final model | 15 GB | 21 GB | 67 GB |
| Safety buffer | ~10 GB | 10.7 GB | 11 GB |
6.5.3. Storage requirements for training workloads
Estimate per-pod local storage requirements for training workloads using a formula that accounts for model cache, checkpoint downloads, consolidation peaks, and final model storage. Example calculations are provided for DDP, FSDP, and DeepSpeed ZeRO-3 strategies.
Use the following formula to estimate per-pod local storage requirements:
Per-pod storage = base_cache + checkpoint_download + (N x checkpoint_size) + consolidation_peak + final_model + safety_buffer
where N is the value of save_total_limit in your training arguments.
- Example calculations based on benchmark data
- DDP (rank-0, 8B full fine-tuning): 25 + 45 + (2 x 45) + 90 + 15 + 10 = 275 GB. With sequential cleanup, approximately 200 GB.
- FSDP (7B full fine-tuning): 23 + 21 + (4 x 21) + 42 + 21 + 9 = 200 GB.
- DeepSpeed ZeRO-3 (70B LoRA): 6 + (10 x 1.6) + 68 + 67 + 11 = 168 GB. With sequential cleanup, approximately 150 GB.
Sequential cleanup automatically removes old checkpoints as new ones are saved, as controlled by the save_total_limit parameter, resulting in the lower storage estimates above. The full calculations show the worst case if cleanup does not occur.
These estimates assume specific benchmark configurations. Run a short test with your own model and configuration to validate storage requirements before starting long training runs.
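The formula is straightforward to script for capacity planning. The following helper applies it directly; the example values reproduce the FSDP calculation above, and the function name is illustrative, not part of the SDK.

def estimate_per_pod_storage_gb(
    base_cache,
    checkpoint_download,
    checkpoint_size,
    save_total_limit,
    consolidation_peak,
    final_model,
    safety_buffer,
):
    # Per-pod storage = base_cache + checkpoint_download
    #   + (N x checkpoint_size) + consolidation_peak + final_model + safety_buffer
    return (
        base_cache
        + checkpoint_download
        + save_total_limit * checkpoint_size
        + consolidation_peak
        + final_model
        + safety_buffer
    )

# FSDP (7B full fine-tuning) example from this section:
# 23 + 21 + (4 x 21) + 42 + 21 + 9 = 200 GB
print(estimate_per_pod_storage_gb(23, 21, 21, 4, 42, 21, 9))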
6.5.4. Checkpoint consolidation peaks
During checkpoint saves, the underlying training framework temporarily requires additional storage to consolidate model state before writing the final checkpoint files. This temporary spike is the most common cause of pod eviction due to storage pressure.
| Strategy | Steady-state checkpoint | Consolidation peak | Peak multiplier |
|---|---|---|---|
| DDP | ~45 GB | ~90 GB | 2x |
| FSDP | 21 GB | 42 GB | 2x |
| DeepSpeed ZeRO-3 | 1.6 GB | 68 GB | 42.5x |
DeepSpeed ZeRO-3 has the largest consolidation peaks. A checkpoint that is only 1.6 GB in its final form can require up to 68 GB of temporary storage during the save operation. If you provision storage based on the steady-state checkpoint size alone, pods might be evicted during checkpoint operations.
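In other words, provision for the peak column, not the steady-state column. A trivial check using the ZeRO-3 row from the table above:

steady_state_gb = 1.6   # ZeRO-3 final checkpoint size
peak_multiplier = 42.5  # observed consolidation multiplier
print(steady_state_gb * peak_multiplier)  # 68.0 GB of temporary storage needed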
6.5.5. Periodic checkpoint configuration
The PeriodicCheckpointConfig class controls how often Kubeflow Trainer saves checkpoints during training and how many recent checkpoints it retains in local storage.
Pass this configuration to your TransformersTrainer:
from kubeflow.trainer import TransformersTrainer, PeriodicCheckpointConfig

checkpoint_config = PeriodicCheckpointConfig(
    save_strategy="steps",  # or "epoch"
    save_steps=50,          # Save every 50 steps
    save_total_limit=2,     # Keep only the 2 most recent checkpoints locally
)

trainer = TransformersTrainer(
    func=train_fn,
    num_nodes=2,
    resources_per_node={"nvidia.com/gpu": 2},
    output_dir="s3://my-bucket/llama3-fine-tune",
    data_connection_name="my-s3-checkpoint-storage",
    periodic_checkpoint_config=checkpoint_config,
)
When you configure periodic checkpoints, consider the following guidelines:
- Avoid setting save_steps too low. Periodic checkpoint saves block GPU computation while the checkpoint is being written. Saving too frequently, such as every 5 steps, can significantly slow down training throughput. Choose an interval that balances recovery granularity with training performance.
- With PVC storage, save_total_limit controls how many checkpoints are kept on the PVC. A high save_total_limit combined with a frequent save_steps can fill the PVC quickly. Monitor PVC usage during training.
- With S3 storage, save_total_limit controls only how many checkpoints are retained on the local emptyDir volume. Every periodic checkpoint and JIT checkpoint that is uploaded to S3 remains in the S3 bucket permanently. The SDK does not automatically delete old checkpoints from S3. You must manage S3 checkpoint cleanup manually, either through S3 lifecycle policies or by deleting objects from the bucket, as shown in the sketch after this list.
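For example, the following sketch sets an S3 lifecycle rule with boto3 that expires old checkpoint objects after 14 days. The bucket name and prefix follow the examples in this chapter, but the checkpoint object layout under your output prefix is an assumption; verify it in your bucket before applying a rule.

import boto3

# For MinIO, Ceph RGW, or other S3-compatible endpoints, also pass
# endpoint_url to boto3.client(). Credentials resolve from the environment.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-checkpoints",
                # Assumed prefix; confirm how checkpoint objects are laid
                # out under your output_dir before enabling this rule.
                "Filter": {"Prefix": "llama3-fine-tune/checkpoint-"},
                "Status": "Enabled",
                "Expiration": {"Days": 14},
            }
        ]
    },
)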
6.5.6. Monitoring storage during training
Monitor local storage consumption on training pods during training runs to detect storage pressure before pods are evicted. Use oc exec with standard Linux commands (df and du) to inspect storage usage on individual pods.
# Check storage usage on a training pod
$ oc exec <pod-name> -- df -h /mnt/kubeflow-checkpoints
# Detailed breakdown by directory
$ oc exec <pod-name> -- du -h /mnt/kubeflow-checkpoints | sort -h | tail -20
6.5.7. Storage characteristics of training strategies
Compare the storage characteristics of FSDP, DDP, and DeepSpeed ZeRO-3 distributed training strategies to plan checkpoint storage capacity and understand performance tradeoffs. Each distributed training strategy has different storage characteristics:
- Fully Sharded Data Parallel (FSDP)
- This strategy features predictable consolidation peaks of two times the final checkpoint size, even checkpoint distribution across pods, and the fastest observed upload and download speeds in internal benchmarks (479 MB/s download and 68 MB/s upload). FSDP is generally the most straightforward strategy for storage planning.
- Distributed Data Parallel (DDP)
- This strategy is suitable for smaller models but creates uneven storage distribution. Only rank-0 saves checkpoints and the final model, so the rank-0 pod requires more storage capacity than worker pods.
- DeepSpeed Zero Redundancy Optimizer stage 3 (ZeRO-3)
- This strategy enables training very large models with parameter-efficient methods such as Low-Rank Adaptation (LoRA), but has significant consolidation peaks (up to 42.5 times the final checkpoint size) that require additional storage planning.
6.6. Known limitations
The following limitations apply to model checkpointing with Kubeflow Trainer v2 on Red Hat OpenShift AI.
- Training pods do not share a model cache, so when downloading pre-trained models (for example, from the Hugging Face Hub), each pod downloads the entire model independently. Training pods must therefore have enough local or mounted storage to hold the entire downloaded model. For example, a 70B parameter model in BF16 precision uses approximately 140 GB of model cache storage per pod.
- TorchElastic enforces a default graceful shutdown period that might be insufficient for checkpointing very large models. If JIT checkpoint operations do not complete within this period, the checkpoint might be incomplete.