Unified Training Operator release announcement
The Kubeflow Training Operator Working Group introduced several enhancements in the recent Kubeflow 1.4 release. The most significant is the new unified training operator, which enables Kubernetes custom resources (CRs) for many popular training frameworks: TensorFlow, PyTorch, MXNet, and XGBoost. In addition, the tf-operator repository has been renamed to training-operator.
This single operator provides several valuable benefits:
- Better resource utilization - In releases prior to 1.4, each framework had its own controller managing its distributed jobs. The unified training operator manages distributed jobs for all supported frameworks, which improves resource utilization and performance.
- Less maintenance overhead - The unified training operator reduces the effort of maintaining distributed job controllers across frameworks. By default, all supported schemas (TFJob, PyTorchJob, MXNetJob, XGBoostJob) are enabled; the `enable-scheme` flag can be used to enable only the framework(s) needed in a given deployment environment.
- Easy adoption of new operators - Common code is abstracted out of the framework-specific implementations, which makes it easier to adopt new operators with less code. The common infrastructure code can be reused by many of the new operator efforts. Reference: Paddle operator proposal, DGL operator proposal
- Better developer experience - Common features can be shared across frameworks without code duplication, creating a developer-friendly environment. For example, Prometheus monitoring and job scheduling are implemented once and are available to all frameworks without any extra code.
The unified training operator’s manifests include an enhanced training operator, which manages custom resource definitions for TFJob, PyTorchJob, MXNetJob, and XGBoostJob. All individual operator repositories, including pytorch-operator, mxnet-operator, and xgboost-operator, will be archived soon. Please check out the latest release for more details and give it a try!
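As a quick illustration of what "giving it a try" looks like, the sketch below submits a PyTorchJob custom resource with the generic Kubernetes Python client (rather than the new Kubeflow training SDK). The namespace, job name, container image, and command are illustrative placeholders, not values from this release.

```python
# Minimal sketch: submitting a PyTorchJob custom resource to the unified
# training operator using the official Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster

def replica_spec(replicas: int) -> dict:
    """Pod template shared by master and workers; image and command are placeholders."""
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {
            "spec": {
                "containers": [{
                    "name": "pytorch",  # the PyTorchJob controller expects this container name
                    "image": "example.com/my-pytorch-training:latest",  # hypothetical image
                    "command": ["python", "/workspace/train.py"],       # hypothetical entrypoint
                }]
            }
        },
    }

pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "pytorch-example", "namespace": "kubeflow"},
    "spec": {"pytorchReplicaSpecs": {"Master": replica_spec(1), "Worker": replica_spec(2)}},
}

# PyTorchJob is one of the custom resources served by the training operator.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="kubeflow",
    plural="pytorchjobs", body=pytorch_job,
)
```

The same pattern applies to the other frameworks by swapping in the corresponding kind, replica spec field, and resource plural (for example, TFJob with tfjobs).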
Release highlights
The Kubeflow 1.4 release includes the following major changes to training.
Unified Training Operator changes
- Unified Training Operator for TF, PyTorch, MXNet, XGBoost #1302 #1295 #1294 #1293 #1296
- More common code refactoring for reusability #1297
- API code restructuring to consistent format #1300
- Prometheus counters for all frameworks #1375
- Python SDK for all frameworks #1420
- API doc for all frameworks #1370
- Restructuring of examples across all frameworks #1373 #1391
Common package updates
- Make training container port customizable to support profiling #131
- Optimize the TTL setting of all Jobs #137
- More appropriate use of expectation for Jobs #139
MPI Operator updates
- Scalability improvements to reduce pressure on kube-apiserver #360
- V2beta1 MPIJob API #366 #378
- Intel MPI Support #389 #403 #417 #425
MPI Operator roadmap
The MPI framework integration with the unified training operator is under development and is planned for delivery in the next release (i.e., post-1.4). For now, the MPI Operator needs to be installed separately using the MPIJob manifests.
Acknowledgement
The unified training operator is the outcome of efforts from all existing Kubeflow training operators and aims to provide a unified and simplified experience for both users and developers. We’d like to thank everyone who has contributed to and maintained the original operators.
- PyTorch Operator: list of contributors and maintainers
- MPI Operator: list of contributors and maintainers
- XGBoost Operator: list of contributors and maintainers
- MXNet Operator: list of contributors and maintainers
Join the WG-Training
If you want to help, or are looking for issues to work on, feel free to check the resources below!
Slack: #wg-training
Community: wg-training
Issues: https://github.com/kubeflow/training-operator/issues