Democratizing AI Model Training on Kubernetes: Introducing Kubeflow Trainer V2
- Background and Evolution
- User Personas
- Python SDK
- Simplified API
- Extensibility and Pipeline Framework
- LLM Fine-Tuning Support
- Dataset and Model Initializers
- Use of JobSet API
- Kueue Integration
- MPI Support
- Gang-Scheduling
- Fault Tolerance Improvements
- What’s Next?
- Migration from Training Operator v1
- Resources and Community
Running machine learning workloads on Kubernetes can be challenging. Distributed training and LLM fine-tuning, in particular, involve managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge. The Kubeflow Trainer v2 (KF Trainer) was created to hide this complexity by abstracting Kubernetes away from AI Practitioners and providing the easiest, most scalable way to run distributed PyTorch jobs.
The main goals of Kubeflow Trainer v2 include:
- Make AI/ML workloads easier to manage at scale
- Provide a Pythonic interface to train models
- Deliver the easiest and most scalable PyTorch distributed training on Kubernetes
- Add built-in support for fine-tuning large language models
- Abstract Kubernetes complexity from AI Practitioners
- Consolidate efforts between Kubernetes Batch WG and Kubeflow community
We’re deeply grateful to all contributors and community members who made the Trainer v2 possible with their hard work and valuable feedback. We’d like to give special recognition to andreyvelich, tenzen-y, electronic-waste, astefanutti, ironicbo, mahdikhashan, kramaranya, harshal292004, akshaychitneni, chenyi015 and the rest of the contributors. We would also like to highlight ahg-g, kannon92, and vsoch whose feedback was essential while we designed the Kubeflow Trainer architecture together with the Batch WG. See the full contributor list for everyone who helped make this release possible.
Background and Evolution
Kubeflow Trainer v2 represents the next evolution of the Kubeflow Training Operator, building on over seven years of experience running ML workloads on Kubernetes. The journey began in 2017 when the Kubeflow project introduced TFJob to orchestrate TensorFlow training on Kubernetes. At that time, Kubernetes lacked many of the advanced batch processing features needed for distributed ML training, so the community had to implement these capabilities from scratch.
Over the years, the project expanded to support multiple ML frameworks including PyTorch, MXNet, MPI, and XGBoost through various specialized operators. In 2021, these were consolidated into the unified Training Operator v1. Meanwhile, the Kubernetes community introduced the Batch Working Group, developing important APIs like JobSet, Kueue, Indexed Jobs, and PodFailurePolicy that improved HPC and AI workload management.
Trainer v2 leverages these Kubernetes-native improvements to make use of existing functionality and not reinvent the wheel. This collaboration between the Kubernetes and Kubeflow communities delivers a more standardized approach to ML training on Kubernetes.
User Personas
One of the main challenges with ML training on Kubernetes is that it often requires AI Practitioners to understand Kubernetes concepts and the infrastructure used for training, which distracts them from their primary focus: developing models.
The KF Trainer v2 addresses this by separating the infrastructure configuration from the training job definition. This separation is built around three new custom resource definitions (CRDs):
- TrainingRuntime - a namespace-scoped resource that contains the infrastructure details required for a training job, such as the training image to use, the failure policy, and the gang-scheduling configuration.
- ClusterTrainingRuntime - similar to TrainingRuntime, but cluster-scoped.
- TrainJob - specifies the training job configuration, including the training code to run, the config for pulling the training dataset and model, and a reference to the training runtime.
The diagram below shows how different personas interact with these custom resources:
- Platform Administrators define and manage the infrastructure configurations required for training jobs using TrainingRuntimes or ClusterTrainingRuntimes.
- AI Practitioners focus on model development using the simplified TrainJob resource or the Python SDK wrapper, providing a reference to the training runtime created by Platform Administrators (see the sketch after this list).
Python SDK
The KF Trainer v2 introduces a redesigned Python SDK, which is intended to be the primary interface for AI Practitioners. The SDK provides a unified interface across multiple ML frameworks and cloud environments, abstracting away the underlying Kubernetes complexity.
The diagram below illustrates how Kubeflow Trainer provides a consistent experience for running ML jobs across different ML frameworks, Kubernetes infrastructures, and cloud providers:
Kubeflow Trainer v2 supports multiple ML frameworks through pre-configured runtimes. The table below shows the current framework support:
The SDK makes it easier for users familiar with Python to create, manage, and monitor training jobs, without requiring them to deal with any YAML definitions:
from kubeflow.trainer import CustomTrainer, TrainerClient

client = TrainerClient()


def my_train_func():
    """User-defined function that runs on each distributed node process."""
    import os

    import torch
    import torch.distributed as dist
    from torch.utils.data import DataLoader, DistributedSampler

    # Set up PyTorch distributed.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    local_rank = int(os.getenv("LOCAL_RANK", 0))
    dist.init_process_group(backend=backend)

    # Define your model, dataset, and training loop.
    model = YourModel()
    dataset = YourDataset()
    train_loader = DataLoader(dataset, sampler=DistributedSampler(dataset))

    # Your training logic here.
    for epoch in range(num_epochs):
        for batch in train_loader:
            # Forward pass, backward pass, optimizer step
            ...

    # Wait for the distributed training to complete.
    dist.barrier()
    if dist.get_rank() == 0:
        print("Training is finished")

    # Clean up PyTorch distributed.
    dist.destroy_process_group()


job_name = client.train(
    runtime=client.get_runtime("torch-distributed"),
    trainer=CustomTrainer(
        func=my_train_func,
        num_nodes=5,
        resources_per_node={
            "gpu": 2,
        },
    ),
)

job = client.get_job(name=job_name)
for step in job.steps:
    print(f"Step: {step.name}, Status: {step.status}")

client.get_job_logs(job_name, follow=True)
The SDK handles all Kubernetes API interactions. This eliminates the need for AI Practitioners to directly interact with the Kubernetes API.
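Beyond submitting jobs, the same client can manage their lifecycle. A short sketch, assuming the list_jobs() and delete_job() methods available in current SDK releases:
# Inspect the TrainJobs in the namespace, then clean up the one we created.
for job in client.list_jobs():
    print(f"TrainJob: {job.name}, Status: {job.status}")

client.delete_job(job_name)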
Simplified API
Previously, in the Kubeflow Training Operator, users worked with a different custom resource for each ML framework, each with its own framework-specific configuration. The KF Trainer v2 replaces these multiple CRDs with a single, unified TrainJob API that works across ML frameworks.
For example, here's what a PyTorch training job looks like using KF Trainer v1:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"
In the KF Trainer v2, creating an equivalent job becomes much simpler:
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
  trainer:
    numNodes: 2
    image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
    command:
      - "python3"
      - "/opt/pytorch-mnist/mnist.py"
      - "--epochs=1"
  runtimeRef:
    name: torch-distributed
    apiGroup: trainer.kubeflow.org
    kind: ClusterTrainingRuntime
Additional infrastructure and Kubernetes-specific details are provided in the referenced runtime definition, and managed separately by Platform Administrators.
In the future, we might support other runtimes in addition to TrainingRuntime and ClusterTrainingRuntime, for example a SlurmRuntime.
Extensibility and Pipeline Framework
One of the challenges in KF Trainer v1 was supporting additional ML frameworks, especially closed-source ones. The v2 architecture addresses this by introducing a Pipeline Framework that allows Platform Administrators to extend the set of plugins and support orchestration for their custom in-house ML frameworks.
The diagram below shows an overview of the Kubeflow Trainer Pipeline Framework:
The framework works through a series of phases (Startup, PreExecution, Build, and PostExecution), each with extension points where custom plugins can hook in. This approach allows adding support for new frameworks, custom validation logic, or specialized training orchestration without changing the underlying system.
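To make the idea concrete, here is an illustrative Python sketch of the plugin pattern. The actual Pipeline Framework is implemented in Go inside the Trainer controller manager, and all names below are hypothetical:
from typing import Protocol


class Plugin(Protocol):
    """Hypothetical extension points a custom plugin can implement."""

    def validate(self, train_job: dict) -> None: ...  # PreExecution phase hook
    def build(self, train_job: dict) -> list[dict]: ...  # Build phase hook


class MyFrameworkPlugin:
    def validate(self, train_job: dict) -> None:
        if train_job["spec"]["trainer"]["numNodes"] < 1:
            raise ValueError("numNodes must be at least 1")

    def build(self, train_job: dict) -> list[dict]:
        # Translate the TrainJob into the Kubernetes objects to create,
        # for example a JobSet tailored to the custom framework.
        return [{"kind": "JobSet", "metadata": {"name": train_job["metadata"]["name"]}}]


def reconcile(train_job: dict, plugins: list[Plugin]) -> list[dict]:
    # PreExecution phase: every plugin validates the TrainJob.
    for plugin in plugins:
        plugin.validate(train_job)
    # Build phase: collect the resources each plugin wants to apply.
    resources = []
    for plugin in plugins:
        resources.extend(plugin.build(train_job))
    return resources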
LLM Fine-Tuning Support
Another improvement in Trainer v2 is its built-in support for fine-tuning large language models, for which we provide two types of trainers:
- BuiltinTrainer - already includes the fine-tuning logic, so AI Practitioners can start fine-tuning quickly by adjusting only the parameters.
- CustomTrainer - allows users to provide their own training function that encapsulates the entire LLM fine-tuning logic.
In the first release, we support the TorchTune LLM Trainer as the initial option for BuiltinTrainer. For TorchTune, we provide pre-configured runtimes (ClusterTrainingRuntime) that currently support Llama-3.2-1B-Instruct and Llama-3.2-3B-Instruct in the manifests. This approach means that in the future, we can add more frameworks, such as unsloth, as additional BuiltinTrainer options.
Here's an example using the BuiltinTrainer with TorchTune:
from kubeflow.trainer import (
    BuiltinTrainer,
    DataFormat,
    HuggingFaceDatasetInitializer,
    HuggingFaceModelInitializer,
    Initializer,
    Runtime,
    TorchTuneConfig,
    TorchTuneInstructDataset,
    TrainerClient,
)

client = TrainerClient()

job_name = client.train(
    runtime=Runtime(name="torchtune-llama3.2-1b"),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="hf://tatsu-lab/alpaca/data",
        ),
        model=HuggingFaceModelInitializer(
            storage_uri="hf://meta-llama/Llama-3.2-1B-Instruct",
            access_token="<YOUR_HF_TOKEN>",  # Replace with your Hugging Face token
        ),
    ),
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dataset_preprocess_config=TorchTuneInstructDataset(
                source=DataFormat.PARQUET,
            ),
            resources_per_node={
                "gpu": 1,
            },
        ),
    ),
)
This example uses a built-in runtime that pulls a foundation Llama model and fine-tunes it on a dataset downloaded from Hugging Face, with the TorchTune configuration provided by the AI Practitioner. For more details, please refer to this example.
Dataset and Model Initializers
Trainer v2 provides dedicated initializers for datasets and models, which significantly simplify the setup process. Instead of each training pod independently downloading large models and datasets, initializers handle this once and share the data across all training nodes through a shared volume.
This approach saves both time and resources: it prevents network slowdowns from redundant downloads and reduces GPU waiting time during setup by offloading data loading to CPU-based initializers, preserving expensive GPU resources for the actual training.
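For instance, a custom training function can simply read from the shared volume instead of downloading anything itself. A rough sketch, assuming the default /workspace/dataset and /workspace/model paths used by the initializers:
from kubeflow.trainer import (
    CustomTrainer,
    HuggingFaceDatasetInitializer,
    HuggingFaceModelInitializer,
    Initializer,
    TrainerClient,
)

client = TrainerClient()


def fine_tune():
    # The initializers have already downloaded the dataset and model to the
    # shared volume, so every training node reads them from local paths.
    dataset_path = "/workspace/dataset"  # assumed default initializer mount path
    model_path = "/workspace/model"      # assumed default initializer mount path
    ...  # load the model and dataset from the shared volume and run training


job_name = client.train(
    runtime=client.get_runtime("torch-distributed"),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(storage_uri="hf://tatsu-lab/alpaca/data"),
        model=HuggingFaceModelInitializer(storage_uri="hf://meta-llama/Llama-3.2-1B-Instruct"),
    ),
    trainer=CustomTrainer(func=fine_tune, num_nodes=2, resources_per_node={"gpu": 1}),
)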
Use of JobSet API
Under the hood, the KF Trainer v2 uses JobSet, a Kubernetes-native API for managing groups of jobs. This integration allows the KF Trainer v2 to better utilize standard Kubernetes features instead of trying to recreate them.
Kueue Integration
Resource management is improved through integration with Kueue, a Kubernetes-native queueing system.
The KF Trainer v2 includes initial support for Kueue through Pod Integration, which allows individual training pods to be queued when resources are busy.
We are working on native Kueue support for TrainJob to provide richer queueing features in future releases.
MPI Support
The KF Trainer v2 also provides MPI v2 support, which includes automatic generation of SSH keys for secure inter-node communication, boosting MPI performance on Kubernetes.
The diagram above shows how this works in practice: the KF Trainer automatically handles SSH key generation and MPI communication between training pods, which allows frameworks like DeepSpeed to coordinate training across multiple GPU nodes without requiring manual configuration of inter-node communication.
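From the SDK, launching an MPI-backed job looks the same as any other TrainJob. A brief sketch, assuming your Platform Administrators have installed an MPI-based runtime (the deepspeed-distributed name below is an assumption):
from kubeflow.trainer import CustomTrainer, TrainerClient

client = TrainerClient()


def deepspeed_train():
    # Training code that relies on MPI for inter-node communication;
    # the Trainer provisions SSH keys and hostfiles automatically.
    import deepspeed
    ...


job_name = client.train(
    runtime=client.get_runtime("deepspeed-distributed"),  # assumed runtime name
    trainer=CustomTrainer(
        func=deepspeed_train,
        num_nodes=4,
        resources_per_node={"gpu": 2},
    ),
)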
Gang-Scheduling
Gang-scheduling is an important feature for distributed training that ensures all pods in a training job are scheduled together or not at all. This prevents scenarios where only some pods are scheduled while others remain pending due to resource constraints, which would waste GPU resources and prevent training from starting.
The KF Trainer v2 provides built-in gang-scheduling support through the PodGroupPolicy API. This creates PodGroup resources that ensure all required pods can be scheduled simultaneously before the training job starts.
Platform Administrators can configure gang-scheduling in their TrainingRuntime or ClusterTrainingRuntime definitions. Here’s an example:
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed-gang-scheduling
spec:
  mlPolicy:
    numNodes: 3
    torch:
      numProcPerNode: 2
  podGroupPolicy:
    coscheduling:
      scheduleTimeoutSeconds: 120
  # ... rest of runtime configuration
Currently, KF Trainer v2 supports the Co-Scheduling plugin from the Kubernetes scheduler-plugins project. Support for Volcano and the KAI Scheduler is coming in future releases to provide more advanced scheduling capabilities.
Fault Tolerance Improvements
Training jobs can sometimes fail due to node issues or other problems. The KF Trainer v2 improves the handling of such failures by supporting the Kubernetes PodFailurePolicy, which allows users to define specific rules for different types of failures, such as restarting the job after temporary node issues or terminating it after critical errors.
What’s Next?
Future enhancements will continue to improve the user experience, integrate deeper with other Kubeflow components, and support more training frameworks. Upcoming features include:
- Local Execution - run training jobs locally without Kubernetes
- Unified Kubeflow SDK - a single SDK for all Kubeflow projects
- Trainer UI - a user interface to expose high-level metrics for training jobs and monitor training logs
- Native Kueue integration - improve resource management and scheduling capabilities for TrainJob resources
- Model Registry integrations - export trained models directly to Model Registry
- Distributed Data Cache - in-memory Apache Arrow caching for tabular datasets
- Volcano support - advanced AI-specific scheduling with gang scheduling, priority queues, and resource management capabilities
- JAX runtime support - ClusterTrainingRuntime for JAX distributed training
- KAI Scheduler support - NVIDIA’s GPU-optimized scheduler for AI workloads
Migration from Training Operator v1
For users migrating from Kubeflow Training Operator v1, check out the Migration Guide.
Resources and Community
For more information about Trainer V2, check out the Kubeflow Trainer documentation and the design proposal for technical implementation details.
For more details about Kubeflow Trainer, you can also watch our KubeCon presentations:
- Democratizing AI Model Training on Kubernetes with Kubeflow TrainJob and JobSet
- From High Performance Computing To AI Workloads on Kubernetes: MPI Runtime in Kubeflow TrainJob
Join the community via the #kubeflow-trainer channel on CNCF Slack, or attend the AutoML and Training Working Group meetings to contribute or ask questions. Your feedback, contributions, and questions are always welcome!