Chapter 6. Configuring model checkpointing for distributed training with Kubeflow Trainer v2

Kubeflow Trainer v2 provides model checkpointing for distributed training jobs on Red Hat OpenShift AI. Checkpointing saves the training state (model weights, optimizer state, learning rate scheduler, and current training step) at regular intervals. If a training job is interrupted due to pod preemption, eviction, or node maintenance, training can resume from the latest checkpoint rather than restarting from scratch.

Kubeflow Trainer v2 supports two checkpoint storage backends:

PersistentVolumeClaim (PVC)
Checkpoints are saved directly to a persistent volume mounted to the training pods.
S3-compatible object storage
Checkpoints are saved to fast local storage first, then uploaded to S3 in the background without blocking GPU training. Supported providers include AWS S3, MinIO, Ceph RGW, and IBM Cloud Object Storage.

With both storage backends, the SDK provides Just-In-Time (JIT) checkpointing, which automatically saves the training state when a termination signal (SIGTERM) is received, ensuring that in-progress training steps are not lost during interruptions.
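The JIT mechanics can be sketched as a deferred signal handler. The following is an illustrative pattern only, not the SDK's actual implementation: the handler sets a flag on SIGTERM, and the training loop checks the flag at the next safe step boundary before exiting.

```python
import signal

class JITCheckpointer:
    """Illustrative sketch of Just-In-Time checkpointing (not an SDK class)."""

    def __init__(self):
        self.terminating = False
        # Register a handler so a pod termination signal is not lost.
        signal.signal(signal.SIGTERM, self._handle_sigterm)

    def _handle_sigterm(self, signum, frame):
        # Only set a flag here; saving is deferred to a safe sync point
        # so an in-progress training step is never corrupted.
        self.terminating = True

    def maybe_save(self, save_fn, step):
        # Called by the training loop at each step boundary.
        # save_fn is a placeholder for whatever persists the training state.
        if self.terminating:
            save_fn(step)
            return True
        return False
```

In a real training loop, `maybe_save` would run once per step, so at most one partial step of progress is lost on preemption.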

6.1. Saving checkpoints to a PersistentVolumeClaim

Configure Kubeflow Trainer v2 to save model checkpoints to a PersistentVolumeClaim (PVC). When you specify a PVC as the checkpoint destination, all training pods share the same storage, and the SDK handles mounting automatically.

Prerequisites

  • You have a PVC with the ReadWriteMany (RWX) access mode in the same namespace as the training job.
  • The PVC has enough capacity to store your model checkpoints.
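If your cluster has an RWX-capable storage class, a PVC such as the following can satisfy these prerequisites. The storage class name and the 500Gi size are assumptions for illustration; substitute values appropriate for your cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-checkpoints-pvc
spec:
  accessModes:
    - ReadWriteMany          # required for multi-node training jobs
  resources:
    requests:
      storage: 500Gi         # size for your expected checkpoint volume
  # An RWX-capable class (for example, a CephFS or NFS-backed class)
  # is assumed here; the name is cluster-specific.
  storageClassName: ocs-storagecluster-cephfs
```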

Example

from kubeflow.trainer.rhai.transformers import TransformersTrainer

trainer = TransformersTrainer(
    func=train_fn,
    num_nodes=2,
    resources_per_node={
        "nvidia.com/gpu": 2,
        "memory": "128Gi",
        "cpu": "8",
    },
    output_dir="pvc://my-checkpoints-pvc/llama3-fine-tune",
)

In this configuration, all training pods read checkpoints from and write checkpoints to the shared PVC. No additional storage provisioning is required beyond the PVC itself.

6.2. Saving checkpoints to S3-compatible object storage

Configure Kubeflow Trainer v2 to save model checkpoints to S3-compatible object storage. This approach uploads checkpoints asynchronously so that checkpoint writes do not block GPU training.

Prerequisites

  • You have an S3-compatible object storage bucket.

Procedure

  1. Create a data connection in the Red Hat OpenShift AI dashboard.

    1. In the Red Hat OpenShift AI dashboard, navigate to Data Science Projects and select your project.
    2. Go to the Connections tab.
    3. Click Add connection and select S3-compatible object storage.
    4. Fill in the connection details:

      1. Name: A descriptive name for the connection (for example, my-s3-checkpoint-storage).
      2. Access key: Your S3 access key ID.
      3. Secret key: Your S3 secret access key.
      4. Endpoint: Your S3 endpoint URL (for example, https://s3.amazonaws.com for AWS S3, or your MinIO or Ceph endpoint).
      5. Region: The S3 region (for example, us-east-1).
      6. Bucket: The S3 bucket name.
    5. Click Add connection. This creates a Kubernetes secret in your project namespace.
    6. Note the resource name of the connection. You can find this on the Connections tab.

      Important

      If you rename a connection after creating it, the underlying Kubernetes secret retains its original name. For example, if you create a connection named s3-storage-connection and later rename it to s3-storage-connection-old, the secret is still named s3-storage-connection.

  2. Configure the training job.

    1. Specify the data connection resource name as data_connection_name in your TransformersTrainer configuration, as shown in the following example:

      from kubeflow.trainer.rhai.transformers import TransformersTrainer
      
      trainer = TransformersTrainer(
          func=train_fn,
          num_nodes=2,
          resources_per_node={
              "nvidia.com/gpu": 2,
              "memory": "128Gi",
              "cpu": "8",
          },
          output_dir="s3://my-bucket/llama3-fine-tune",
          data_connection_name="my-s3-checkpoint-storage",
      )

The SDK reads the S3 credentials from the Kubernetes secret and exposes them as environment variables to the training pods. You do not need to pass credentials in the env parameter.

Important

Disable SSL verification only for endpoints with self-signed certificates in non-production environments. Disabling SSL verification in production exposes training data and credentials to potential interception.

6.3. S3 checkpointing workflow

When you configure S3 as the checkpoint storage, the SDK uses a local-first architecture. Checkpoints are saved to local storage on each pod first, then uploaded to S3 in the background. This design avoids blocking GPU training during upload operations.

S3 checkpointing workflow diagram

The SDK automatically provisions an emptyDir volume on each training pod for local checkpoint staging. The volume uses the local disk of the node. Kubernetes creates the volume when the pod starts and deletes it when the pod terminates.

The checkpointing lifecycle follows these phases:

  1. Training start (resume): If a previous checkpoint exists in S3, the SDK downloads it to local storage and automatically resumes training from the latest valid checkpoint.
  2. During training (periodic save): Hugging Face Transformers saves checkpoints to local storage at intervals configured by save_steps. The SDK moves completed checkpoints to a staging directory and uploads them to S3 using a background thread. Training continues immediately without waiting for the upload to finish.
  3. Preemption or termination (JIT save): If the pod receives a SIGTERM signal, the SDK saves the current training state at the next safe synchronization point before the job exits. The SDK then uploads the JIT checkpoint to S3.
  4. Training end (final upload): The SDK waits for any pending uploads to complete, then uploads the final trained model artifacts to S3.
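The periodic-save phase above can be sketched with a hypothetical staging-and-upload helper. The class and its `upload_fn` callback are illustrative, not SDK APIs: a completed checkpoint is moved into a staging directory (a fast rename on the same filesystem), and a background thread drains the upload queue so the training loop never blocks on S3 I/O.

```python
import queue
import shutil
import threading
from pathlib import Path

class AsyncUploader:
    """Illustrative local-first upload pattern (not an SDK class)."""

    def __init__(self, staging_dir, upload_fn):
        self.staging = Path(staging_dir)
        self.staging.mkdir(parents=True, exist_ok=True)
        self.upload_fn = upload_fn  # placeholder for an S3 upload call
        self.q = queue.Queue()
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def submit(self, checkpoint_dir):
        # Move (not copy) the finished checkpoint into staging: on the same
        # filesystem this is a fast rename, so training resumes immediately.
        staged = self.staging / Path(checkpoint_dir).name
        shutil.move(str(checkpoint_dir), staged)
        self.q.put(staged)

    def _drain(self):
        # Background thread: upload staged checkpoints one at a time.
        while True:
            item = self.q.get()
            if item is None:
                break
            self.upload_fn(item)
            self.q.task_done()

    def finish(self):
        # Final phase: wait for all pending uploads to complete.
        self.q.put(None)
        self.worker.join()
```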

6.4. PVC and S3 checkpoint storage comparison

Compare PVC and S3 checkpoint storage backends across setup complexity, multi-node support, capacity, portability, cost, and JIT checkpointing to choose the option that best fits your environment.

| Consideration | PVC | S3 |
| --- | --- | --- |
| Setup complexity | Simpler. Mount a PersistentVolumeClaim directly to the training pods. | More complex. Requires an S3-compatible storage service and credentials. |
| Multi-node training | Requires a ReadWriteMany (RWX) storage class, which might not be available in all clusters. | Works with any cluster. Each pod uploads independently using local emptyDir storage. |
| Storage capacity | Limited by PVC size, which must be provisioned in advance. | Scales with bucket capacity. You do not need to size storage in advance. |
| Checkpoint portability | Tied to the specific cluster and namespace. | Portable. Checkpoints are accessible from any cluster and can be shared with collaborators. |
| Cost | Pay for persistent block storage continuously, even when no training jobs are running. | Pay for storage used and data transfer. Lifecycle policies can manage costs. |
| JIT checkpointing | Supported. The SDK saves training state directly to the PVC. | Supported. The SDK saves training state to local storage, then uploads it to S3. |

6.5. Best practices for S3 checkpointing

Follow these best practices when configuring S3 checkpointing for distributed training jobs to avoid common issues such as pod eviction due to storage pressure, inefficient GPU utilization, and slow startup times.

S3 checkpointing works well for distributed training at scale but requires careful configuration to balance storage capacity, GPU efficiency, and recovery granularity. The topics in this section cover how to configure training pods efficiently, estimate storage requirements, choose appropriate training strategies, and monitor storage usage.

6.5.1. Configuring training pods efficiently

How you distribute GPUs across nodes affects storage usage, checkpoint download times, and training performance. When using S3 storage, each pod independently downloads the model and checkpoints to its local emptyDir volume. Minimizing the number of pods reduces the total storage consumed and the number of redundant downloads.

Consider the following two configurations for training with six GPUs:

Less efficient configuration (6 pods, 1 GPU each)
trainer = TransformersTrainer(
    func=train_fn,
    num_nodes=6,
    resources_per_node={
        "nvidia.com/gpu": 1,
        "memory": "64Gi",
        "cpu": "4",
    },
    output_dir="s3://my-bucket/llama3-fine-tune",
)
More efficient configuration (2 pods, 3 GPUs each)
trainer = TransformersTrainer(
    func=train_fn,
    num_nodes=2,
    resources_per_node={
        "nvidia.com/gpu": 3,
        "memory": "192Gi",
        "cpu": "12",
    },
    output_dir="s3://my-bucket/llama3-fine-tune",
)

The second configuration is more efficient for the following reasons:

  • Fewer model downloads: Each pod downloads the full model to its local cache. With two pods, the model is downloaded twice instead of six times. This reduces startup time and network bandwidth consumption.
  • Fewer checkpoint downloads on resume: When resuming training from an S3 checkpoint, each pod downloads the checkpoint independently. Fewer pods means fewer redundant downloads.
  • Faster intra-pod communication: GPUs within the same pod communicate via high-bandwidth NVLink or PCIe using P2P or CUMEM, which is faster than inter-pod communication over the network using NCCL over TCP or Socket. Packing more GPUs in each pod maximizes the proportion of communication that uses the fast intra-pod path.
  • Less total local storage consumed: Each pod requires its own emptyDir volume for model cache, checkpoints, and checkpoint staging. Fewer pods means less total node storage consumed.

When possible, maximize the number of GPUs per node to reduce the total number of pods in your training job.

6.5.2. Understanding local storage requirements

When using S3 checkpointing, each pod writes checkpoints to local emptyDir storage before uploading them to S3. Storage usage fluctuates during training, with temporary spikes during checkpoint consolidation. Provision local storage to handle peak usage rather than average usage, to avoid pod eviction due to storage pressure.

Storage usage follows this general pattern:

  • Normal operation: Storage holds the model cache and any locally retained checkpoints.
  • Checkpoint save spike: Storage temporarily increases during checkpoint consolidation. This increase is most pronounced for DeepSpeed ZeRO-3, where temporary consolidation storage can be up to 42.5 times the final checkpoint size.
  • Training end: The SDK writes final model artifacts to local storage before uploading them to S3.

The following tables show storage measurements from internal benchmarks. Your actual storage requirements can vary depending on your model, precision, training strategy, and dataset. Use these figures as a starting point for capacity planning and validate with testing against your own workload.

Table 6.1. Observed peak local storage per pod

| Strategy | Model size | Training type | Observed peak per pod |
| --- | --- | --- | --- |
| DDP | 8B | Full fine-tuning | Up to 200 GB (rank-0) / 150 GB (workers) |
| FSDP | 7B | Full fine-tuning | Up to 200 GB |
| DeepSpeed ZeRO-3 | 70B | LoRA | Up to 150 GB |

Table 6.2. Storage breakdown by component (measured)

| Component | DDP (8B) | FSDP (7B) | DeepSpeed (70B LoRA) |
| --- | --- | --- | --- |
| Base cache (model download) | ~25 GB | 23 GB | 6 GB |
| Checkpoint download (resume) | ~45 GB | 21 GB | 2.4 GB |
| Local checkpoints | ~90 GB | 84 GB | 16 GB |
| Consolidation peak (temporary) | ~90 GB | 42 GB | 68 GB |
| Final model | 15 GB | 21 GB | 67 GB |
| Safety buffer | ~10 GB | 10.7 GB | 11 GB |

6.5.3. Storage requirements for training workloads

Estimate per-pod local storage requirements for training workloads using a formula that accounts for model cache, checkpoint downloads, consolidation peaks, and final model storage. Example calculations are provided for DDP, FSDP, and DeepSpeed ZeRO-3 strategies.

Use the following formula to estimate per-pod local storage requirements:

Per-pod storage = base_cache + checkpoint_download + (N x checkpoint_size) + consolidation_peak + final_model + safety_buffer

where:

N
is the value of save_total_limit in your training arguments.
Example calculations based on benchmark data
  • DDP (rank-0, 8B full fine-tuning): 25 + 45 + (2 x 45) + 90 + 15 + 10 = 275 GB. With sequential cleanup, approximately 200 GB.
  • FSDP (7B full fine-tuning): 23 + 21 + (4 x 21) + 42 + 21 + 9 = 200 GB.
  • DeepSpeed ZeRO-3 (70B LoRA): 6 + (10 x 1.6) + 68 + 67 + 11 = 168 GB. With sequential cleanup, approximately 150 GB.

Sequential cleanup automatically removes old checkpoints as new ones are saved, as controlled by the save_total_limit parameter, resulting in the lower storage estimates above. The full calculations show the worst case if cleanup does not occur.
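The storage formula can be expressed directly as a small helper, shown here as a planning sketch with all figures in GB; the DDP and FSDP inputs below are the benchmark numbers from the tables above.

```python
def per_pod_storage(base_cache, checkpoint_download, save_total_limit,
                    checkpoint_size, consolidation_peak, final_model,
                    safety_buffer):
    """Worst-case per-pod local storage estimate in GB.

    Implements: base_cache + checkpoint_download + (N x checkpoint_size)
    + consolidation_peak + final_model + safety_buffer,
    where N is save_total_limit.
    """
    return (base_cache + checkpoint_download
            + save_total_limit * checkpoint_size
            + consolidation_peak + final_model + safety_buffer)

# DDP rank-0, 8B full fine-tuning (benchmark figures):
print(per_pod_storage(25, 45, 2, 45, 90, 15, 10))  # 275
```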

Note

These estimates assume specific benchmark configurations. Run a short test with your own model and configuration to validate storage requirements before starting long training runs.

6.5.4. Checkpoint consolidation peaks

During checkpoint saves, the underlying training framework temporarily requires additional storage to consolidate model state before writing the final checkpoint files. This temporary spike is the most common cause of pod eviction due to storage pressure.

| Strategy | Steady-state checkpoint | Consolidation peak | Peak multiplier |
| --- | --- | --- | --- |
| DDP | ~45 GB | ~90 GB | 2x |
| FSDP | 21 GB | 42 GB | 2x |
| DeepSpeed ZeRO-3 | 1.6 GB | 68 GB | 42.5x |

DeepSpeed ZeRO-3 has the largest consolidation peaks. A checkpoint that is only 1.6 GB in its final form can require up to 68 GB of temporary storage during the save operation. If you provision storage based on the steady-state checkpoint size alone, pods might be evicted during checkpoint operations.

6.5.5. Periodic checkpoint configuration

The PeriodicCheckpointConfig class controls how often Kubeflow Trainer saves checkpoints during training and how many recent checkpoints it retains in local storage.

Pass this configuration to your TransformersTrainer:

from kubeflow.trainer import TransformersTrainer, PeriodicCheckpointConfig

checkpoint_config = PeriodicCheckpointConfig(
    save_strategy="steps",   # or "epoch"
    save_steps=50,           # Save every 50 steps
    save_total_limit=2,      # Keep only the 2 most recent checkpoints locally
)

trainer = TransformersTrainer(
    func=train_fn,
    num_nodes=2,
    resources_per_node={"nvidia.com/gpu": 2},
    output_dir="s3://my-bucket/llama3-fine-tune",
    data_connection_name="my-s3-checkpoint-storage",
    periodic_checkpoint_config=checkpoint_config,
)

When you configure periodic checkpoints, consider the following guidelines:

  • Avoid setting save_steps too low. Periodic checkpoint saves block GPU computation while the checkpoint is being written. Saving too frequently, such as every 5 steps, can significantly slow down training throughput. Choose an interval that balances recovery granularity with training performance.
  • With PVC storage, save_total_limit controls how many checkpoints are kept on the PVC. A high save_total_limit combined with a frequent save_steps can fill the PVC quickly. Monitor PVC usage during training.
  • With S3 storage, save_total_limit controls only how many checkpoints are retained on the local emptyDir volume. Every periodic checkpoint and JIT checkpoint that is uploaded to S3 remains in the S3 bucket permanently. The SDK does not automatically delete old checkpoints from S3. You must manage S3 checkpoint cleanup manually, either through S3 lifecycle policies or by deleting objects from the bucket.
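One way to manage S3 checkpoint cleanup is a bucket lifecycle rule. The helper below is a hypothetical sketch: it builds a rule in the standard S3 lifecycle configuration format, which you could then apply with boto3's put_bucket_lifecycle_configuration (the prefix and retention period are example values).

```python
def checkpoint_expiry_rule(prefix, days):
    """Build an S3 lifecycle rule that expires checkpoint objects
    under `prefix` after `days` days (standard S3 lifecycle format)."""
    return {
        "ID": f"expire-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }

# Applying the rule with boto3 requires credentials for the bucket:
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-bucket",
#     LifecycleConfiguration={
#         "Rules": [checkpoint_expiry_rule("llama3-fine-tune/", 14)]
#     },
# )
```

Note that an expiry rule deletes old checkpoints unconditionally; scope the prefix carefully so the final model artifacts are not removed.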

6.5.6. Monitoring storage during training

Monitor local storage consumption on training pods during training runs to detect storage pressure before pods are evicted. Use oc exec with standard Linux commands (df and du) to inspect storage usage on individual pods.

# Check storage usage on a training pod
$ oc exec <pod-name> -- df -h /mnt/kubeflow-checkpoints

# Detailed breakdown by directory
$ oc exec <pod-name> -- du -h /mnt/kubeflow-checkpoints | sort -h | tail -20
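The same check can be performed programmatically, for example from a periodic callback inside train_fn. This is an illustrative helper, not an SDK function; the mount path matches the commands above and the 90% threshold is an arbitrary example.

```python
import shutil

def storage_pressure(path="/mnt/kubeflow-checkpoints", threshold=0.9):
    """Return True when the filesystem containing `path` is more than
    `threshold` (fraction) full, so the caller can warn or skip a save."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total >= threshold
```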

6.5.7. Distributed training strategy storage characteristics

Compare the storage characteristics of FSDP, DDP, and DeepSpeed ZeRO-3 distributed training strategies to plan checkpoint storage capacity and understand performance tradeoffs. Each distributed training strategy has different storage characteristics:

Fully Sharded Data Parallel (FSDP)
This strategy features predictable consolidation peaks of two times the final checkpoint size, even checkpoint distribution across pods, and the fastest observed upload and download speeds in internal benchmarks (479 MB/s download and 68 MB/s upload). FSDP is generally the most straightforward strategy for storage planning.
Distributed Data Parallel (DDP)
This strategy is suitable for smaller models but creates uneven storage distribution. Only rank-0 saves checkpoints and the final model, so the rank-0 pod requires more storage capacity than worker pods.
DeepSpeed Zero Redundancy Optimizer (ZeRO-3)
This strategy enables training very large models with parameter-efficient methods such as Low-Rank Adaptation (LoRA), but has significant consolidation peaks (up to 42.5 times the final checkpoint size) that require additional storage planning.

6.6. Known limitations

The following limitations apply to model checkpointing with Kubeflow Trainer v2 on Red Hat OpenShift AI.

  • Training pods do not share a model cache, so when downloading pre-trained models (for example, from Hugging Face Hub), each pod downloads the entire model independently. Training pods must therefore have enough local or mounted storage to hold the entire downloaded model. For example, a 70B parameter model in BF16 precision uses approximately 140 GB of storage in the model cache per pod.
  • TorchElastic enforces a default graceful shutdown period that might be insufficient for checkpointing very large models. If JIT checkpoint operations do not complete within this period, the checkpoint might be incomplete.