The Kubeflow v1.5 software release helps improve ML model accuracy, lower infrastructure costs, and simplify operations, and it delivers a more consistent user experience across components.

Lower infrastructure costs and improve model accuracy

Several enhancements in Kubeflow v1.5 lower infrastructure costs and help improve model accuracy. For example, new elastic training support augments the Kubeflow training operator for PyTorch, allowing PyTorch workers to be scaled up and down for fault-tolerant, elastic training. Training jobs can continue without restarting from scratch even if a worker fails. Elastic training also enables the use of ephemeral or spot instances, which can save infrastructure costs. See the elastic training proposal for more details.
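
To make the feature concrete, here is a minimal sketch of an elastic PyTorchJob created with the Kubernetes Python client. The elasticPolicy field names follow the elastic training proposal and the training operator examples; the namespace, image and command are placeholders you would replace with your own.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
api = client.CustomObjectsApi()

# Illustrative elastic PyTorchJob; image, command and namespace are placeholders.
pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "elastic-example", "namespace": "kubeflow-user-example-com"},
    "spec": {
        # Elastic policy: let the operator scale workers between min and max.
        "elasticPolicy": {
            "rdzvBackend": "c10d",
            "minReplicas": 1,
            "maxReplicas": 3,
            "maxRestarts": 100,
        },
        "pytorchReplicaSpecs": {
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [{
                    "name": "pytorch",
                    "image": "example.com/my-elastic-training:latest",  # placeholder image
                    "command": ["python", "-m", "torch.distributed.run", "train.py"],
                }]}},
            }
        },
    },
}

api.create_namespaced_custom_object(
    group="kubeflow.org", version="v1", plural="pytorchjobs",
    namespace="kubeflow-user-example-com", body=pytorch_job,
)
```

With minReplicas and maxReplicas set, the operator can keep the job running as spot or ephemeral workers come and go instead of failing the whole job.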

In addition, v1.5 extends notebook monitoring and culling: kernel monitoring shuts down notebook servers that remain inactive once a configurable timeout expires, and Kubeflow v1.5 assesses notebook server UI activity with greater precision.
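
As a rough sketch, culling is typically tuned on the notebook controller. The deployment name, container name and environment variable names below are assumptions based on the notebook-controller manifests, so verify them against your installation before applying anything like this.

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Assumed names: the notebook-controller deployment, its "manager" container,
# and the culling environment variables; check your manifests first.
culling_env = [
    {"name": "ENABLE_CULLING", "value": "true"},
    {"name": "CULL_IDLE_TIME", "value": "60"},        # idle minutes before culling
    {"name": "IDLENESS_CHECK_PERIOD", "value": "5"},  # minutes between checks
]
patch = {"spec": {"template": {"spec": {"containers": [
    {"name": "manager", "env": culling_env}
]}}}}

apps.patch_namespaced_deployment(
    name="notebook-controller-deployment", namespace="kubeflow", body=patch
)
```

Many distributions set these values through their own manifests or a config file instead of a live patch, so treat this purely as an illustration of the knobs involved.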

Finally, v1.5 includes early validation (early stopping) for hyperparameter tuning. This feature improves model accuracy by reducing overfitting to the data sets used during hyperparameter tuning, and it can reduce infrastructure usage by stopping tuning once configurable thresholds are reached.
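
The behavior described above corresponds to Katib's early stopping support. Below is a minimal sketch of an Experiment that enables it with the medianstop algorithm; the trial image, command and parameter are adapted from the upstream Katib examples and are purely illustrative.

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

experiment = {
    "apiVersion": "kubeflow.org/v1beta1",
    "kind": "Experiment",
    "metadata": {"name": "median-stop-example", "namespace": "kubeflow-user-example-com"},
    "spec": {
        "objective": {"type": "maximize", "goal": 0.99,
                      "objectiveMetricName": "Validation-accuracy"},
        "algorithm": {"algorithmName": "random"},
        # Early stopping: prune under-performing trials before they finish.
        "earlyStopping": {"algorithmName": "medianstop",
                          "algorithmSettings": [{"name": "min_trials_required", "value": "2"}]},
        "maxTrialCount": 12,
        "parallelTrialCount": 3,
        "parameters": [{"name": "lr", "parameterType": "double",
                        "feasibleSpace": {"min": "0.01", "max": "0.05"}}],
        "trialTemplate": {
            "primaryContainerName": "training-container",
            "trialParameters": [{"name": "learningRate",
                                 "description": "Learning rate", "reference": "lr"}],
            "trialSpec": {
                "apiVersion": "batch/v1",
                "kind": "Job",
                "spec": {"template": {"spec": {
                    "containers": [{"name": "training-container",
                                    "image": "docker.io/kubeflowkatib/mxnet-mnist:latest",
                                    "command": ["python3", "/opt/mxnet-mnist/mnist.py",
                                                "--lr=${trialParameters.learningRate}"]}],
                    "restartPolicy": "Never"}}},
            },
        },
    },
}

api.create_namespaced_custom_object(
    group="kubeflow.org", version="v1beta1", plural="experiments",
    namespace="kubeflow-user-example-com", body=experiment,
)
```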

Simplified operations

v1.5 simplifies operations in several ways, including easier support for high availability of the Katib controller via new leader election. v1.5 also adds the MPI framework to the unified training operator, which deploys a single operator to handle the most popular frameworks: TensorFlow, PyTorch, MXNet, XGBoost and MPI. Kubeflow Pipelines (KFP) now ships with the Emissary executor as the default, so users no longer need to configure this option manually. This change also lets KFP run on newer Kubernetes versions that rely on the Container Runtime Interface (CRI), while continuing to support Docker containers.
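
For readers curious what an MPI job looks like under the unified training operator, here is a rough sketch submitted with the Kubernetes Python client. The namespace, image, command and container names are placeholders, and the field layout follows the MPIJob v1 API rather than anything specific to this release.

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Minimal MPIJob shape; image and mpirun command are placeholders.
mpi_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "MPIJob",
    "metadata": {"name": "mpi-example", "namespace": "kubeflow-user-example-com"},
    "spec": {
        "slotsPerWorker": 1,
        "mpiReplicaSpecs": {
            "Launcher": {
                "replicas": 1,
                "template": {"spec": {"containers": [{
                    "name": "mpi-launcher",
                    "image": "example.com/my-mpi-training:latest",   # placeholder
                    "command": ["mpirun", "python", "/opt/train.py"],
                }]}},
            },
            "Worker": {
                "replicas": 2,
                "template": {"spec": {"containers": [{
                    "name": "mpi-worker",
                    "image": "example.com/my-mpi-training:latest",   # placeholder
                }]}},
            },
        },
    },
}

api.create_namespaced_custom_object(
    group="kubeflow.org", version="v1", plural="mpijobs",
    namespace="kubeflow-user-example-com", body=mpi_job,
)
```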

In addition, KServe v0.7 is integrated with Kubeflow v1.5. For users still on KFServing, v1.5 also supports installing the KFServing v0.6.1 release, and there is helpful documentation for migrating from KFServing to KServe, which you can find here. On the feature side, KServe v0.7 adds a beta of ModelMesh, which makes model serving easier to scale and provides an architecture that overcomes the pod and IP address limits that cap the number of models a single node or cluster can run.
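
As a quick taste of KServe on Kubeflow v1.5, the sketch below creates a simple InferenceService with the Kubernetes Python client, using the public sklearn iris sample model from the KServe examples; the namespace is a placeholder.

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Sample sklearn iris model from the KServe examples; namespace is illustrative.
isvc = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris", "namespace": "kubeflow-user-example-com"},
    "spec": {
        "predictor": {
            "sklearn": {
                "storageUri": "gs://kfserving-examples/models/sklearn/1.0/model"
            }
        }
    },
}

api.create_namespaced_custom_object(
    group="serving.kserve.io", version="v1beta1", plural="inferenceservices",
    namespace="kubeflow-user-example-com", body=isvc,
)
```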

Note on K8s 1.22: Although many of the Kubeflow services (KFP, AutoML, Training Operator, KServe) have started testing with K8s 1.22, v1.5 was not thoroughly tested with K8s 1.22, and critical work is still outstanding for Kubeflow Notebooks and other central services. While the Community always encourages testing, K8s 1.22 is not formally supported with Kubeflow v1.5. Support for K8s 1.22 is on the roadmap for Kubeflow's next release; if you are interested, please add a comment on this tracking issue.

Streamlined user experience

In v1.5, the Kubeflow User Interface (UI) dashboards for Notebooks, Tensorboards and the Volume Manager more closely match KFP's dashboard, including how dashboard fields are displayed and how operations are performed, giving users a consistent look and feel across the UI. Additionally, the Kubeflow notebooks manager web app form and its configuration template have been updated to let users fully define the PVC objects that will be created for a notebook. End users can now modify crucial parts of the PVCs, such as StorageClasses, which makes it easier to support popular storage offerings, including NFS. AutoML has also updated its SDK, CI framework and parameter settings across frameworks (goptuna, optuna, hyperopt).
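
For illustration, the kind of claim a user can now fully define for a notebook looks like the PVC sketched below, created directly with the Kubernetes Python client; the notebooks web app would normally create it on your behalf, and the StorageClass name is a placeholder for whatever NFS-backed (or other) class your cluster offers.

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Example PVC for a notebook workspace; "nfs-csi" is a placeholder StorageClass.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="my-notebook-workspace"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],          # RWX suits NFS-backed storage
        storage_class_name="nfs-csi",            # placeholder StorageClass name
        resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
    ),
)
core.create_namespaced_persistent_volume_claim(
    namespace="kubeflow-user-example-com", body=pvc
)
```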

Installation, tutorials and documentation

The Kubeflow v1.5 release process includes improvements to Kubeflow's installation, tutorials and documentation. For installation, the Manifests Working Group's documentation provided the starting point for the Kubeflow distributions, and this collaboration has increased the maturity of the code and documentation for Kubeflow's GitOps installation pattern. The v1.5 release process also included a tracking issue for the critical web pages that are updated in each release. Additionally, there are many Working Group documentation updates, and we invite you to open issues when you find documentation that needs improvement. Kubeflow v1.5 also includes two tutorials, which are described below:

Tutorial 1 - E2E MNIST with Kubeflow Tutorial, which provides an end-to-end sequence: start a notebook, run a pipeline, and execute training, hyperparameter tuning and model serving.
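
To give a flavor of the notebook-to-pipeline step in that flow, here is a tiny, self-contained KFP v1 SDK example (not the tutorial's actual MNIST pipeline) that defines and runs a toy pipeline from inside a notebook.

```python
import kfp
from kfp import dsl

# A toy component and pipeline, just to show the notebook -> pipeline step.
def add(a: float, b: float) -> float:
    return a + b

add_op = kfp.components.create_component_from_func(add, base_image="python:3.9")

@dsl.pipeline(name="tiny-demo", description="Minimal KFP v1 pipeline")
def tiny_pipeline(a: float = 1.0, b: float = 2.0):
    add_op(a, b)

# Inside a Kubeflow 1.5 notebook, kfp.Client() usually picks up in-cluster
# credentials automatically; elsewhere you may need to pass host/cookies.
client = kfp.Client()
client.create_run_from_pipeline_func(tiny_pipeline, arguments={"a": 3.0, "b": 4.0})
```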

Tutorial 2 - Deploy and run your first inference service with KServe Tutorial, which shows how to run your first inference service on Kubeflow 1.5: define, create and check an InferenceService, post an inference request and receive the response, and optionally run performance tests.
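
For the "post an inference request" step, a request against the KServe v1 prediction protocol looks roughly like the sketch below; the ingress address, external host and session cookie are placeholders for your environment, and the payload assumes the sklearn iris sample shown earlier.

```python
import requests

# Placeholders: cluster ingress address, the InferenceService's external host,
# and the Kubeflow auth session cookie for your environment.
ingress = "http://<INGRESS_HOST>:<INGRESS_PORT>"
headers = {
    "Host": "sklearn-iris.kubeflow-user-example-com.example.com",
    "Cookie": "authservice_session=<SESSION_TOKEN>",
}
payload = {"instances": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]}

# KServe v1 protocol: POST /v1/models/<name>:predict
resp = requests.post(
    f"{ingress}/v1/models/sklearn-iris:predict", json=payload, headers=headers
)
print(resp.json())  # prints the model's predictions
```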

Join the community

The v1.5 Release Team would like to thank everyone for their efforts on Kubeflow v1.5, especially the users, code contributors, working group leads, and distribution team leads. As the extensive contributions to Kubeflow 1.5 show, the Kubeflow Community is vibrant and diverse, and is solving real-world problems for organizations around the globe.

Excited to contribute your great ideas? The Kubeflow Community Working Groups hold open meetings, public lists, and are always looking for more volunteers and users to unlock the potential of machine learning. If you’re interested in becoming a Kubeflow contributor, please feel free to check out the resources below. We look forward to working with you!