Chapter 6. Configuring model checkpointing for distributed training with Kubeflow Trainer v2
Kubeflow Trainer v2 provides model checkpointing for distributed training jobs on Red Hat OpenShift AI. Checkpointing saves the training state (model weights, optimizer state, learning rate scheduler, and current training step) at regular intervals. If a training job is interrupted due to pod preemption, eviction, or node maintenance, training can resume from the latest checkpoint rather than restarting from scratch.
Kubeflow Trainer v2 supports two checkpoint storage backends:
- PersistentVolumeClaim (PVC)
- Checkpoints are saved directly to a persistent volume mounted to the training pods.
- S3-compatible object storage
- Checkpoints are saved to fast local storage first, then uploaded to S3 in the background without blocking GPU training. Supported providers include AWS S3, MinIO, Ceph RGW, and IBM Cloud Object Storage.
With both storage backends, the SDK provides Just-In-Time (JIT) checkpointing, which automatically saves the training state when a termination signal (SIGTERM) is received, ensuring that in-progress training steps are not lost during interruptions.
6.1. Configuring checkpointing with a PersistentVolumeClaim
Configure Kubeflow Trainer v2 to save model checkpoints to a PersistentVolumeClaim (PVC). When you specify a PVC as the checkpoint destination, all training pods share the same storage, and the SDK handles mounting automatically.
Prerequisites
- You have a PVC with the ReadWriteMany (RWX) access mode in the same namespace as the training job.
- The PVC has enough capacity to store your model checkpoints.
Example
from kubeflow.trainer.rhai.transformers import TransformersTrainer

trainer = TransformersTrainer(
    func=train_fn,
    num_nodes=2,
    resources_per_node={
        "nvidia.com/gpu": 2,
        "memory": "128Gi",
        "cpu": "8",
    },
    output_dir="pvc://my-checkpoints-pvc/llama3-fine-tune",
)
In this configuration, all training pods read checkpoints from and write checkpoints to the shared PVC. No additional storage provisioning is required beyond the PVC itself.
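The examples in this chapter pass a train_fn training function that is not shown. The following minimal sketch illustrates what such a function might look like with Hugging Face Transformers. The model, dataset, and local output path are illustrative assumptions, not values required by the SDK.

# Minimal illustrative training function. The model, dataset, and
# output path below are hypothetical placeholders, not SDK requirements.
def train_fn():
    from datasets import load_dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    model_name = "gpt2"  # placeholder; substitute your base model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Placeholder dataset; replace with your own fine-tuning data.
    dataset = load_dataset("imdb", split="train[:1%]")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = dataset.map(tokenize, batched=True, remove_columns=["text", "label"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="/mnt/kubeflow-checkpoints",  # assumed local checkpoint path
            per_device_train_batch_size=1,
            num_train_epochs=1,
        ),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()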
6.2. Configuring checkpointing with S3-compatible object storage
Configure Kubeflow Trainer v2 to save model checkpoints to S3-compatible object storage. This approach uploads checkpoints asynchronously so that checkpoint writes do not block GPU training.
Prerequisites
- You have an S3-compatible object storage bucket.
Procedure
Create a data connection in the Red Hat OpenShift AI dashboard.
- In the Red Hat OpenShift AI dashboard, navigate to Data Science Projects and select your project.
- Go to the Connections tab.
- Click Add connection and select S3-compatible object storage.
Fill in the connection details:
- Name: A descriptive name for the connection (for example, my-s3-checkpoint-storage).
- Access key: Your S3 access key ID.
- Secret key: Your S3 secret access key.
- Endpoint: Your S3 endpoint URL (for example, https://s3.amazonaws.com for AWS S3, or your MinIO or Ceph endpoint).
- Region: The S3 region (for example, us-east-1).
- Bucket: The S3 bucket name.
- Click Add connection. This creates a Kubernetes secret in your project namespace.
Note the resource name of the connection. You can find this on the Connections tab.
Important: If you rename a connection after creating it, the underlying Kubernetes secret retains its original name. For example, if you create a connection named s3-storage-connection and later rename it to s3-storage-connection-old, the secret is still named s3-storage-connection.
Configure the training job.
Specify the data connection resource name as data_connection_name in your TransformersTrainer configuration, as shown in the following example:

from kubeflow.trainer.rhai.transformers import TransformersTrainer

trainer = TransformersTrainer(
    func=train_fn,
    num_nodes=2,
    resources_per_node={
        "nvidia.com/gpu": 2,
        "memory": "128Gi",
        "cpu": "8",
    },
    output_dir="s3://my-bucket/llama3-fine-tune",
    data_connection_name="my-s3-checkpoint-storage",
)
The SDK reads the S3 credentials from the Kubernetes secret and exposes them as environment variables to the training pods. You do not need to pass credentials in the env parameter.
Disable SSL verification only for endpoints with self-signed certificates in non-production environments. Disabling SSL verification in production exposes training data and credentials to potential interception.
6.3. S3 checkpointing workflow
When you configure S3 as the checkpoint storage, the SDK uses a local-first architecture. Checkpoints are saved to local storage on each pod first, then uploaded to S3 in the background. This design avoids blocking GPU training during upload operations.
The SDK automatically provisions an emptyDir volume on each training pod for local checkpoint staging. The volume uses the local disk of the node. Kubernetes creates the volume when the pod starts and deletes it when the pod terminates.
The checkpointing lifecycle follows these phases:
- Training start (resume): If a previous checkpoint exists in S3, the SDK downloads it to local storage and automatically resumes training from the latest valid checkpoint.
- During training (periodic save): Hugging Face Transformers saves checkpoints to local storage at intervals configured by save_steps. The SDK moves completed checkpoints to a staging directory and uploads them to S3 using a background thread. Training continues immediately without waiting for the upload to finish.
- Preemption or termination (JIT save): If the pod receives a SIGTERM signal, the SDK saves the current training state at the next safe synchronization point before the job exits. The SDK then uploads the JIT checkpoint to S3.
- Training end (final upload): The SDK waits for any pending uploads to complete, then uploads the final trained model artifacts to S3.
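Conceptually, the JIT save phase works like a signal handler that sets a flag, with the training loop checking the flag at step boundaries. The following sketch illustrates the pattern only; it is not the SDK's implementation, and train_step and save_checkpoint are hypothetical placeholders.

import signal

stop_requested = False

def _handle_sigterm(signum, frame):
    # Record that a termination signal arrived; defer the save to a safe point.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, _handle_sigterm)

def train_step(step):
    pass  # hypothetical placeholder for one training step

def save_checkpoint(step):
    print(f"JIT checkpoint saved at step {step}")  # hypothetical placeholder

for step in range(1000):
    train_step(step)
    if stop_requested:  # next safe synchronization point: a step boundary
        save_checkpoint(step)
        break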
6.4. PVC and S3 checkpoint storage comparison
Compare PVC and S3 checkpoint storage backends across setup complexity, multi-node support, capacity, portability, cost, and JIT checkpointing to choose the option that best fits your environment.
| Consideration | PVC | S3 |
|---|---|---|
| Setup complexity | Simpler. Mount a PersistentVolumeClaim directly to the training pods. | Complex. Requires an S3-compatible storage service and credentials. |
| Multi-node training | Requires a PVC with the ReadWriteMany (RWX) access mode so that all training pods can share it. | Works with any cluster. Each pod uploads independently using local emptyDir storage. |
| Storage capacity | Limited by PVC size, which must be provisioned in advance. | Scales with bucket capacity. You do not need to size storage in advance. |
| Checkpoint portability | Tied to the specific cluster and namespace. | Portable. Checkpoints are accessible from any cluster and can be shared with collaborators. |
| Cost | Pay for persistent block storage continuously, even when no training jobs are running. | Pay for storage used and data transfer. Lifecycle policies can manage costs. |
| JIT checkpointing | Supported. The SDK saves training state directly to the PVC. | Supported. The SDK saves training state to local storage, then uploads it to S3. |
6.5. Best practices for S3 checkpointing
Follow these best practices when configuring S3 checkpointing for distributed training jobs to avoid common issues such as pod eviction due to storage pressure, inefficient GPU utilization, and slow startup times.
S3 checkpointing works well for distributed training at scale but requires careful configuration to balance storage capacity, GPU efficiency, and recovery granularity. The topics in this section cover how to configure training pods efficiently, estimate storage requirements, choose appropriate training strategies, and monitor storage usage.
6.5.1. GPU distribution guidelines for training jobs
How you distribute GPUs across nodes affects storage usage, checkpoint download times, and training performance. When using S3 storage, each pod independently downloads the model and checkpoints to its local emptyDir volume. Minimizing the number of pods reduces the total storage consumed and the number of redundant downloads.
Consider the following two configurations for training with six GPUs:
- Less efficient configuration (6 pods, 1 GPU each)
trainer = TransformersTrainer(
    func=train_fn,
    num_nodes=6,
    resources_per_node={
        "nvidia.com/gpu": 1,
        "memory": "64Gi",
        "cpu": "4",
    },
    output_dir="s3://my-bucket/llama3-fine-tune",
)
- More efficient configuration (2 pods, 3 GPUs each)
trainer = TransformersTrainer(
    func=train_fn,
    num_nodes=2,
    resources_per_node={
        "nvidia.com/gpu": 3,
        "memory": "192Gi",
        "cpu": "12",
    },
    output_dir="s3://my-bucket/llama3-fine-tune",
)
The second configuration is more efficient for the following reasons:
- Fewer model downloads: Each pod downloads the full model to its local cache. With two pods, the model is downloaded twice instead of six times. This reduces startup time and network bandwidth consumption.
- Fewer checkpoint downloads on resume: When resuming training from an S3 checkpoint, each pod downloads the checkpoint independently. Fewer pods means fewer redundant downloads.
- Faster intra-pod communication: GPUs within the same pod communicate over high-bandwidth NVLink or PCIe using P2P or CUMEM transports, which is faster than inter-pod communication over the network (NCCL over TCP sockets). Packing more GPUs into each pod maximizes the proportion of communication that uses the fast intra-pod path.
- Less total local storage consumed: Each pod requires its own emptyDir volume for model cache, checkpoints, and checkpoint staging. Fewer pods means less total node storage consumed.
When possible, maximize the number of GPUs per node to reduce the total number of pods in your training job.
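As a quick back-of-the-envelope check, using the roughly 25 GB per-pod model cache observed in the DDP benchmarks in the next topic, the two configurations above differ substantially in total download volume:

model_cache_gb = 25  # approximate per-pod model download (DDP 8B benchmark)

# Six 1-GPU pods each download the model independently.
print(6 * model_cache_gb)  # 150 GB downloaded in total

# Two 3-GPU pods download it only twice.
print(2 * model_cache_gb)  # 50 GB downloaded in total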
6.5.2. Understanding local storage requirements
When using S3 checkpointing, each pod writes checkpoints to local emptyDir storage before uploading them to S3. Storage usage fluctuates during training, with temporary spikes during checkpoint consolidation. Provision local storage to handle peak usage rather than average usage, to avoid pod eviction due to storage pressure.
Storage usage follows this general pattern:
- Normal operation: Storage holds the model cache and any locally retained checkpoints.
- Checkpoint save spike: Storage temporarily increases during checkpoint consolidation. This increase is most pronounced for DeepSpeed ZeRO-3, where temporary consolidation storage can be up to 42.5 times the final checkpoint size.
- Training end: The SDK writes final model artifacts to local storage before uploading them to S3.
The following tables show storage measurements from internal benchmarks. Your actual storage requirements can vary depending on your model, precision, training strategy, and dataset. Use these figures as a starting point for capacity planning and validate them with testing against your own workload.
| Strategy | Model size | Training type | Observed peak per pod |
|---|---|---|---|
| DDP | 8B | Full fine-tuning | Up to 200 GB (rank-0) / 150 GB (workers) |
| FSDP | 7B | Full fine-tuning | Up to 200 GB |
| DeepSpeed ZeRO-3 | 70B | LoRA | Up to 150 GB |
The following table breaks down the observed per-pod storage by component:

| Component | DDP (8B) | FSDP (7B) | DeepSpeed (70B LoRA) |
|---|---|---|---|
| Base cache (model download) | ~25 GB | 23 GB | 6 GB |
| Checkpoint download (resume) | ~45 GB | 21 GB | 2.4 GB |
| Local checkpoints | ~90 GB | 84 GB | 16 GB |
| Consolidation peak (temporary) | ~90 GB | 42 GB | 68 GB |
| Final model | 15 GB | 21 GB | 67 GB |
| Safety buffer | ~10 GB | 10.7 GB | 11 GB |
6.5.3. Storage requirements for training workloads
Estimate per-pod local storage requirements for training workloads using a formula that accounts for model cache, checkpoint downloads, consolidation peaks, and final model storage. Example calculations are provided for DDP, FSDP, and DeepSpeed ZeRO-3 strategies.
Use the following formula to estimate per-pod local storage requirements:
Per-pod storage = base_cache + checkpoint_download + (N x checkpoint_size) + consolidation_peak + final_model + safety_buffer
where N is the value of save_total_limit in your training arguments.
- Example calculations based on benchmark data
- DDP (rank-0, 8B full fine-tuning): 25 + 45 + (2 x 45) + 90 + 15 + 10 = 275 GB. With sequential cleanup, approximately 200 GB.
- FSDP (7B full fine-tuning): 23 + 21 + (4 x 21) + 42 + 21 + 9 = 200 GB.
- DeepSpeed ZeRO-3 (70B LoRA): 6 + (10 x 1.6) + 68 + 67 + 11 = 168 GB. With sequential cleanup, approximately 150 GB.
Sequential cleanup automatically removes old checkpoints as new ones are saved, as controlled by the save_total_limit parameter, resulting in the lower storage estimates above. The full calculations show the worst case if cleanup does not occur.
These estimates assume specific benchmark configurations. Run a short test with your own model and configuration to validate storage requirements before starting long training runs.
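The formula is straightforward to script for capacity planning. The following helper applies it directly; the example values reproduce the FSDP calculation above, and the function name is illustrative, not part of the SDK.

def estimate_per_pod_storage_gb(
    base_cache,
    checkpoint_download,
    checkpoint_size,
    save_total_limit,
    consolidation_peak,
    final_model,
    safety_buffer,
):
    # Per-pod storage = base_cache + checkpoint_download
    #   + (N x checkpoint_size) + consolidation_peak + final_model + safety_buffer
    return (
        base_cache
        + checkpoint_download
        + save_total_limit * checkpoint_size
        + consolidation_peak
        + final_model
        + safety_buffer
    )

# FSDP (7B full fine-tuning) example from this section:
# 23 + 21 + (4 x 21) + 42 + 21 + 9 = 200 GB
print(estimate_per_pod_storage_gb(23, 21, 21, 4, 42, 21, 9))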
6.5.4. Checkpoint consolidation peaks
During checkpoint saves, the underlying training framework temporarily requires additional storage to consolidate model state before writing the final checkpoint files. This temporary spike is the most common cause of pod eviction due to storage pressure.
| Strategy | Steady-state checkpoint | Consolidation peak | Peak multiplier |
|---|---|---|---|
| DDP | ~45 GB | ~90 GB | 2x |
| FSDP | 21 GB | 42 GB | 2x |
| DeepSpeed ZeRO-3 | 1.6 GB | 68 GB | 42.5x |
DeepSpeed ZeRO-3 has the largest consolidation peaks. A checkpoint that is only 1.6 GB in its final form can require up to 68 GB of temporary storage during the save operation. If you provision storage based on the steady-state checkpoint size alone, pods might be evicted during checkpoint operations.
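In other words, provision for the peak column, not the steady-state column. A trivial check using the ZeRO-3 row from the table above:

steady_state_gb = 1.6   # ZeRO-3 final checkpoint size
peak_multiplier = 42.5  # observed consolidation multiplier
print(steady_state_gb * peak_multiplier)  # 68.0 GB of temporary storage needed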
6.5.5. Periodic checkpoint configuration
The PeriodicCheckpointConfig class controls how often Kubeflow Trainer saves checkpoints during training and how many recent checkpoints it retains in local storage.
Pass this configuration to your TransformersTrainer:
from kubeflow.trainer import TransformersTrainer, PeriodicCheckpointConfig

checkpoint_config = PeriodicCheckpointConfig(
    save_strategy="steps",  # or "epoch"
    save_steps=50,          # Save every 50 steps
    save_total_limit=2,     # Keep only the 2 most recent checkpoints locally
)

trainer = TransformersTrainer(
    func=train_fn,
    num_nodes=2,
    resources_per_node={"nvidia.com/gpu": 2},
    output_dir="s3://my-bucket/llama3-fine-tune",
    data_connection_name="my-s3-checkpoint-storage",
    periodic_checkpoint_config=checkpoint_config,
)
When you configure periodic checkpoints, consider the following guidelines:
- Avoid setting save_steps too low. Periodic checkpoint saves block GPU computation while the checkpoint is being written. Saving too frequently, such as every 5 steps, can significantly slow down training throughput. Choose an interval that balances recovery granularity with training performance.
- With PVC storage, save_total_limit controls how many checkpoints are kept on the PVC. A high save_total_limit combined with a frequent save_steps can fill the PVC quickly. Monitor PVC usage during training.
- With S3 storage, save_total_limit controls only how many checkpoints are retained on the local emptyDir volume. Every periodic checkpoint and JIT checkpoint that is uploaded to S3 remains in the S3 bucket permanently. The SDK does not automatically delete old checkpoints from S3. You must manage S3 checkpoint cleanup manually, either through S3 lifecycle policies or by deleting objects from the bucket, as shown in the sketch after this list.
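For example, the following sketch sets an S3 lifecycle rule with boto3 that expires old checkpoint objects after 14 days. The bucket name and prefix follow the examples in this chapter, but the checkpoint object layout under your output prefix is an assumption; verify it in your bucket before applying a rule.

import boto3

# For MinIO, Ceph RGW, or other S3-compatible endpoints, also pass
# endpoint_url to boto3.client(). Credentials resolve from the environment.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-checkpoints",
                # Assumed prefix; confirm how checkpoint objects are laid
                # out under your output_dir before enabling this rule.
                "Filter": {"Prefix": "llama3-fine-tune/checkpoint-"},
                "Status": "Enabled",
                "Expiration": {"Days": 14},
            }
        ]
    },
)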
6.5.6. Monitoring storage during training
Monitor local storage consumption on training pods during training runs to detect storage pressure before pods are evicted. Use oc exec with standard Linux commands (df and du) to inspect storage usage on individual pods.
# Check storage usage on a training pod
$ oc exec <pod-name> -- df -h /mnt/kubeflow-checkpoints
# Detailed breakdown by directory
$ oc exec <pod-name> -- du -h /mnt/kubeflow-checkpoints | sort -h | tail -20
6.5.7. Storage characteristics of training strategies
Compare the storage characteristics of FSDP, DDP, and DeepSpeed ZeRO-3 distributed training strategies to plan checkpoint storage capacity and understand performance tradeoffs. Each distributed training strategy has different storage characteristics:
- Fully Sharded Data Parallel (FSDP)
- This strategy features predictable consolidation peaks of two times the final checkpoint size, even checkpoint distribution across pods, and the fastest observed upload and download speeds in internal benchmarks (479 MB/s download and 68 MB/s upload). FSDP is generally the most straightforward strategy for storage planning.
- Distributed Data Parallel (DDP)
- This strategy is suitable for smaller models but creates uneven storage distribution. Only rank-0 saves checkpoints and the final model, so the rank-0 pod requires more storage capacity than worker pods.
- DeepSpeed Zero Redundancy Optimizer stage 3 (ZeRO-3)
- This strategy enables training very large models with parameter-efficient methods such as Low-Rank Adaptation (LoRA), but has significant consolidation peaks (up to 42.5 times the final checkpoint size) that require additional storage planning.
6.6. Known limitations
The following limitations apply to model checkpointing with Kubeflow Trainer v2 on Red Hat OpenShift AI.
- Training pods do not share a model cache, so when downloading pre-trained models (for example, from the Hugging Face Hub), each pod downloads the entire model independently. Training pods must therefore have enough local or mounted storage to hold the entire downloaded model. For example, a 70B parameter model in BF16 precision uses approximately 140 GB of model cache storage per pod.
- TorchElastic enforces a default graceful shutdown period that might be insufficient for checkpointing very large models. If JIT checkpoint operations do not complete within this period, the checkpoint might be incomplete.