Kubeflow 1.1 improves ML Workflow Productivity, Isolation & Security, and GitOps

The Kubeflow Community’s delivery of Kubeflow 1.1 offers users valuable ML workflow automation with Fairing and Kale along with MXNet and XGBoost distributed training operators. It extends isolation and security through the delivery of multi-user pipelines, CVE scanning, and support for Google’s Private GKE and Anthos. 1.1 also improves Katib’s hyperparameter tuning features by offering new frameworks & algorithms, and enables flexible configuration & tuning options. 1.1 provides a foundation for consistent and repeatable installation and operations using GitOps methodologies powered by blueprints and kpt primitives.

The ML productivity enhancements in 1.1 include end-to-end workflows using Fairing and Kale. The Fairing workflow enables users to build, train and deploy models from a notebook and Fairing improvements include the support of configuring environment variables and mounting secrets. Fairing also added a config map for a deployer and bug fixes for TensorRTSpec. The workflows enabled by Kale include the ability to write model code in a notebook and then automatically build a Kubeflow pipeline that deploys, trains and tunes that model efficiently, using Katib and cached pipeline steps. Kubeflow 1.1 also delivers stable release deliveries of MXNet and XGBoost operators, which simplify distributed training on multiple nodes and speeds model creation.

The isolation and security feature deliveries include Private GKE and Anthos support, a stable version of Kubeflow Pipelines with Multi-User Kubeflow Pipelines support, and a process for Kubeflow container image scanning, CVE reporting, and an optional process for distroless image creation. 1.1 also includes options for authentication and authorization. This includes the option for administrators to turn off self-service namespace creation mode, as admins may have other processes for namespace creation. The Community also developed a best practice to build user authorization in Kubeflow web apps using subject access review.

The 1.1 Katib improvements deliver new frameworks & algorithms, and flexible configuration & tuning options. The new frameworks & algorithms include:

Integrate goptuna framework with (CMA-ES) Covariance Matrix Adaptation Evolution Strategy algorithm.
DARTS (Differentiable Architecture Search) algorithm implementation.
Better support for HP frameworks (chocolate, hyperopt, skopt).

Katib also adds these flexible configuration & tuning options:

Implements a python SDK to run Katib experiments from Kubeflow notebooks.
Enables users to run experiments without the goal defined.
Provides a new trial template UI editor.
Enables users to view experiment and suggestion status in the UI, during the experiment run.
Adds a resume policy for experiments to clean-up suggestion resources.

The installation and operations of Kubeflow have been enhanced to support GitOps methodologies. Several Kubeflow platform providers and software support vendors are developing time-saving GitOps processes to simplify and codify the installation, configuration and operations of the various layers in the Kubeflow 1.1 hardware and software stack. Some examples are provided in the next section.

More details and 1.1 tutorials

Kubeflow 1.1 includes many technical enhancements, which are being delivered via the Community’s release process. Details on the application feature development can be found in the 1.1 KanBan Board and in the Kubeflow Roadmap. As the Kubeflow application improvements are merged, the platform teams (GCP, AWS, IBM, Red Hat, Azure, and Arrikto MiniKF) are working to validate the feature improvements on their respective environments.

Kubeflow 1.1 includes KFServing v0.3, where the focus has been on providing more stability by doing a major move to KNative v1 APIs. Additionally, we added GPU support for PyTorch model servers, and pickled model format support for SKLearn. There were other enhancements vis a vis routing, payload logging, bug fixes etc., details of which can be found here.

Kubeflow 1.1 demo scripts and workflow tutorials are available as validated by the individual platforms. Please find those below:

1.1 users can also leverage several other Kubeflow ecosystem tools including:

Seldon Core 1.1, which handles scaling to thousands of production ML models and provides advanced ML capabilities including Advanced Metrics, Request Logging, Explainers, Outlier Detectors, A/B Tests, Canaries and more.
Feast: A feature store that allows teams ML teams to define, manage, discover, and serve ML features to their models.

What’s Coming and Getting involved

The Kubeflow Community has started planning for its next release. Although we have a nice backlog of issues, our process includes discussions and surveys with users and contributors to validate use cases and their value.

The Community continues to refine its governance and refine this proposal, Proposal for Kubeflow WG Guidelines/Governance. We are actively developing Working Group team charters, tech leads, chairs and members. We look forward to this growth.

The following provides some helpful links to those looking to get involved with the Kubeflow Community:

If you have questions, run into issues, please leverage the Slack channel and/or submit bugs via Kubeflow on GitHub. Thanks from all of us in the Community, and we look forward to your success with Kubeflow 1.1.

Special thanks to Yuan Tang (Ant Group), Josh Bottum (Arrikto), Constantinos Venetsanopoulos (Arrikto), Yannis Zarkadas (Arrikto), Jiaxin Shan (AWS), Dan Sun (Bloomberg), Andrey Velichkevich (Cisco), Krishna Durai (Cisco), Hamel Husain (GitHub), Willem Pienaar (GoJek), Yuan Gong (Google), Jeremy Lewi (Google), Animesh Singh (IBM) and Clive Cox (Seldon) for their help on this post.