Kubeflow 1.10.0 delivers essential updates that enhance the flexibility, efficiency, and scalability of machine learning workflows. The new features span several components, improving both the user experience and system performance.

Highlight features

  • Trainer 2.0
  • New UI for Model Registry
  • Spark Operator as a core Kubeflow component
  • Kubernetes and container security (CISO compatibility)
  • Hyperparameter Optimization for LLMs Fine-Tuning
  • Loop parallelism in Pipelines
  • New parameter distributions for Katib
  • Deeper Model Registry integrations with KServe
  • New Python SDK, OCI storage, and model caching for KServe
  • New security contexts and rootless Istio-CNI integrations for Spark Operator

Kubeflow Platform (Manifests & Security)

The Kubeflow Platform Working Group focuses on simplifying Kubeflow installation, operations, and security. See details below.

Manifests:

  • Spark Operator 2.1.0 is included in the Kubeflow platform, although it is not yet installed by default
  • Documentation updates that make it easier to install, extend, and upgrade Kubeflow
  • For more details and future plans, please consult the 1.10.0 and 1.10.1/1.11.0 milestones

Component versions:

Component        Version     Component       Version
Notebooks        1.10        Kubernetes      1.31-1.33
Dashboard        1.10        Kind            0.26
Pipelines        2.4.1       Kustomize       5.4.3
Katib            0.18        Cert Manager    1.16.1
Trainer          1.9         Knative         1.16
KServe           0.14        Istio           1.24
Model Registry   0.2.15      Dex             2.41
Spark            2.1.0       Oauth2-proxy    7.7

Security:

  • CVE reductions through regular scanning with Trivy
  • Kubernetes and container security best practices:
    • Rootless containers / PodSecurityStandards restricted for: Istio-CNI, Knative, Dex, Oauth2-proxy, Spark
    • 50% done: KFP, Notebooks / Workspaces, Katib, Trainer, KServe, …
    • Istio-CNI as default for rootless Kubeflow postponed to 1.10.1
  • OIDC-authservice has been replaced by oauth2-proxy
  • Oauth2-proxy and Dex documentation for external OIDC authentication (Keycloak and other OIDC providers such as Azure and Google)

Trivy CVE scans (March 25, 2025):

Working Group             Images   Critical CVEs   High CVEs   Medium CVEs   Low CVEs
Katib                     17       11              101         417           734
Pipelines                 15       57              490         4030          1922
Workbenches (Notebooks)   12       12              59          179           224
KServe                    16       21              305         6803          1588
Manifests                 14       8               4           94            52
Trainer                   1        0               0           1             0
Model Registry            6        1               13          153           188
Spark                     1        5               37          1640          141
All Images                81       115             1009        13275         4804

Pipelines

Support for Placeholders in Resource Limits

Kubeflow Pipelines 2.4.1 introduces support for placeholders in resource limits, enhancing flexibility in pipeline execution. This update allows users to define dynamic resource limits using parameterized values, enabling more adaptable and reusable pipeline definitions.
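
As a minimal sketch of the idea (the component, parameter names, and default values below are illustrative), a task's resource limits can now reference pipeline input parameters instead of hard-coded values:

```python
from kfp import dsl


@dsl.component
def train_model(epochs: int) -> str:
    return f"trained for {epochs} epochs"


@dsl.pipeline(name="dynamic-resource-limits")
def training_pipeline(cpu_limit: str = "1", memory_limit: str = "4Gi"):
    task = train_model(epochs=10)
    # Resource limits are resolved from pipeline parameters at runtime,
    # so the same pipeline definition can be reused across environments.
    task.set_cpu_limit(cpu_limit)
    task.set_memory_limit(memory_limit)
```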

Support for Loop Parallelism

Kubeflow Pipelines 2.4.1 introduces a new Parallelism Limit for ParallelFor tasks, giving users the ability to run massively parallel inference pipelines, with more control over parallel execution in their workflows. This feature allows users to specify the maximum number of parallel iterations, preventing resource overutilization and improving system stability. When running large pipelines with GPUs, proper use of this feature could save your team thousands of dollars in compute expenses.
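
A brief sketch of capping fan-out with the ParallelFor parallelism argument (the component and item list are illustrative):

```python
from kfp import dsl


@dsl.component
def run_inference(item: str) -> str:
    return f"prediction for {item}"


@dsl.pipeline(name="batched-inference")
def inference_pipeline():
    items = ["batch-1", "batch-2", "batch-3", "batch-4", "batch-5", "batch-6"]
    # At most two iterations run concurrently; the remaining items queue up,
    # keeping GPU consumption (and cost) bounded.
    with dsl.ParallelFor(items=items, parallelism=2) as item:
        run_inference(item=item)
```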

Implement SubDAG Output Resolution

Kubeflow 1.10 ensures that pipelines using nested DAGs work correctly and reliably when treated as components. Outputs from deeply nested DAGs will now resolve properly, avoiding broken dependencies.
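
As a rough illustration of the pattern, a pipeline can be invoked as a component inside another pipeline, with its output consumed downstream (the components below are illustrative):

```python
from kfp import dsl


@dsl.component
def preprocess(text: str) -> str:
    return text.lower()


@dsl.pipeline(name="inner-pipeline")
def inner_pipeline(text: str) -> str:
    # Compiles to a nested DAG; its return value is the sub-DAG's output.
    return preprocess(text=text).output


@dsl.component
def train(corpus: str) -> str:
    return f"model trained on: {corpus}"


@dsl.pipeline(name="outer-pipeline")
def outer_pipeline(text: str = "Hello Kubeflow"):
    prepared = inner_pipeline(text=text)
    # The nested DAG's output now resolves correctly for downstream tasks.
    train(corpus=prepared.output)
```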

Model Registry

Model Registry introduces a new user interface and enhanced model management capabilities.

Model Registry UI

The new Kubeflow Model Registry UI provides a user-friendly web interface for managing machine learning models within the Kubeflow platform. It centralizes model metadata, version tracking, and artifact management, streamlining MLOps workflows.

Key features include:

  • Easy model registration with custom metadata
  • Comprehensive model management with filtering and sorting
  • Archiving capabilities
  • Version control
  • Metadata editing

Model Registry UI

The UI interacts with the Model Registry’s REST API, making it accessible to users of all technical backgrounds and enhancing collaboration across data science, ML engineering, and MLOps teams.
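
The same REST API can also be used programmatically. Below is a minimal, hedged sketch using the model-registry Python client; the endpoint, author, model name, and artifact URI are placeholders, and argument names may vary between client versions.

```python
from model_registry import ModelRegistry

# Placeholder endpoint and author; point these at your Model Registry service.
registry = ModelRegistry(
    server_address="https://model-registry-service.kubeflow.svc.cluster.local",
    port=443,
    author="data-scientist@example.com",
)

# Register a model version with custom metadata, mirroring what the UI does
# through the same REST API.
model = registry.register_model(
    name="sentiment-classifier",
    uri="s3://my-bucket/models/sentiment/v1",
    version="1.0.0",
    model_format_name="onnx",
    model_format_version="1",
    description="Baseline sentiment model",
    metadata={"accuracy": 0.91, "stage": "staging"},
)
print(model.id)
```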

To get started with the Model Registry UI, which is currently in Alpha, you can follow the instructions here.

The Kubeflow Model Registry UI Team would like to conduct user research to identify possible enhancements we can contribute in future iterations of the Kubeflow Model Registry UI. If you are interested in participating in this study, please fill out this survey.

Custom Storage Initializer

The Model Registry Custom Storage Initializer (CSI) is a custom implementation of the KServe ClusterStorageContainer that allows users to leverage Model Registry metadata to download and deploy models efficiently. With the newest Model Registry release, the CSI can now be installed and used.

You can find detailed installation instructions and a small example in the “Getting Started” section of the Model Registry component on the Kubeflow website.

For additional information and future developments towards better integration with KServe, you can refer to the slides here.

Training Operator (Trainer) & Katib

Kubeflow 1.10 enhances the Training Operator and Katib, providing new tools and APIs for hyperparameter optimization, particularly for large language models.

Moreover, the Kubeflow Training Operator now supports JAX for distributed training, enabling users to leverage JAX’s capabilities for efficient and scalable model training.

Finally, if you want to get involved with Trainer V2, take a look at this KEP and issue.

Hyperparameter Optimization API for LLMs

Katib introduces a new high-level API for hyperparameter tuning, streamlining LLMOps workflows in Kubernetes. This API integrates Katib and the Training Operator to automate hyperparameter optimization, reducing manual effort for data scientists fine-tuning large language models. For more information, refer to the feature release blog post.
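
The sketch below shows the general shape of the new API, based on the Katib SDK examples at the time of writing; import paths, class names, and the model/dataset identifiers are illustrative and may differ in your SDK version.

```python
import kubeflow.katib as katib
from kubeflow.katib import KatibClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceDatasetParams,
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
)
from peft import LoraConfig
from transformers import AutoModelForSequenceClassification, TrainingArguments

client = KatibClient(namespace="kubeflow")

client.tune(
    name="llm-hpo-example",
    # Pull the base model and dataset from Hugging Face (placeholder IDs).
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=AutoModelForSequenceClassification,
    ),
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:3000]",
    ),
    # Hyperparameters to tune are embedded directly in the trainer config.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=TrainingArguments(
            output_dir="results",
            save_strategy="no",
            learning_rate=katib.search.double(min=1e-05, max=5e-05),
            num_train_epochs=3,
        ),
        lora_config=LoraConfig(
            r=katib.search.int(min=8, max=32),
            lora_alpha=8,
            lora_dropout=0.1,
            bias="none",
        ),
    ),
    objective_metric_name="train_loss",
    objective_type="minimize",
    algorithm_name="random",
    max_trial_count=10,
    parallel_trial_count=2,
    resources_per_trial=katib.TrainerResources(
        num_workers=2,
        num_procs_per_worker=1,
        resources_per_worker={"gpu": 1},
    ),
)
```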

Support for Various Parameter Distributions

Katib adds support for multiple probability distributions. Previously limited to uniform sampling, Katib now supports log-uniform, normal, and log-normal distributions, giving data scientists greater flexibility in defining hyperparameter search spaces. This is particularly useful for parameters like learning rates, which benefit from log-uniform sampling, or values expected to vary around a mean, which suit normal distributions.
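
A brief, hedged sketch of requesting different distributions through the Katib Python SDK's search API (the objective function and parameter names are illustrative, and the distribution identifiers may differ slightly by SDK version):

```python
import kubeflow.katib as katib


def objective(parameters):
    # Toy objective that prints a metric in the "name=value" format the
    # default metrics collector parses from stdout.
    loss = (float(parameters["lr"]) - 0.01) ** 2 + float(parameters["dropout"])
    print(f"loss={loss}")


client = katib.KatibClient(namespace="kubeflow")
client.tune(
    name="distribution-demo",
    objective=objective,
    parameters={
        # Learning rates benefit from log-uniform sampling ...
        "lr": katib.search.double(min=1e-5, max=1e-1, distribution="logUniform"),
        # ... while values expected to vary around a mean suit a normal distribution.
        "dropout": katib.search.double(min=0.0, max=0.5, distribution="normal"),
    },
    objective_metric_name="loss",
    objective_type="minimize",
    max_trial_count=12,
    parallel_trial_count=3,
)
```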

Push-Based Metrics Collection

Katib now allows users to push metrics directly to the Katib DB. The new push-based design provides administrative and performance improvements over the existing pull-based design. For further details, please refer to the Push-Based Metrics Collection blog post.
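
A minimal sketch of the push-based flow, adapted from the Katib examples (the objective, parameter names, and package pin are illustrative):

```python
import kubeflow.katib as katib


def objective(parameters):
    # Imports live inside the function because Katib serializes it and runs
    # it in a separate trial container.
    import kubeflow.katib as katib

    loss = (float(parameters["lr"]) - 0.01) ** 2
    # Push the metric directly to the Katib DB instead of printing it for
    # the pull-based sidecar to scrape.
    katib.report_metrics({"loss": loss})


client = katib.KatibClient(namespace="kubeflow")
client.tune(
    name="push-metrics-demo",
    objective=objective,
    parameters={"lr": katib.search.double(min=1e-4, max=1e-1)},
    objective_metric_name="loss",
    objective_type="minimize",
    max_trial_count=6,
    parallel_trial_count=2,
    # Opt in to the push-based metrics collector.
    metrics_collector_config={"kind": "Push"},
    packages_to_install=["kubeflow-katib"],
)
```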

Dashboard & Notebooks

Kubeflow 1.10 improves the observability and usability of Notebooks, while providing updated default images.

Prometheus Metrics for Notebooks

Both the Notebooks component and CRUD backends now feature Prometheus metrics. Notebooks expose custom metrics using the prom-client library, and CRUD backends utilize the prometheus_flask_exporter library. This ensures consistent metrics integration across all backend services.

More Descriptive Error Messages

Error messages for notebook creation failures due to resource constraints are now more descriptive. Users can quickly identify issues such as insufficient resources.

Spark Operator

The Spark Operator, now integrated as a core Kubeflow component, includes several key enhancements focusing on architecture, security, and performance:

  • Rebuilt with Controller Runtime (v2.0.0): Modernized core architecture using controller-runtime, aligning with Kubernetes controller patterns for improved structure, extensibility, and testability.
  • YuniKorn Gang Scheduling Support (v2.0.0): Enables efficient scheduling of Spark driver & executor pods as a group, ideal for large-scale data pipelines with resource guarantees.
  • Enhanced Security Contexts & SeccompProfile Support (v2.1.1): Adds support for seccompProfile: RuntimeDefault & readOnlyRootFilesystem, aligning with Kubernetes Pod Security Standards and minimizing security risk.

KServe

KServe v0.14.1 introduces several essential features that enhance its capabilities for deploying and managing machine learning models.

New Python SDK

The release includes a new Python SDK with both REST and gRPC inference clients, offering asynchronous support and the ability to handle tensor data in binary format.
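
As a rough sketch of the asynchronous REST client (class and argument names follow the KServe 0.14 SDK but may vary between versions; the endpoint and model name are placeholders):

```python
import asyncio

import numpy as np
from kserve import InferenceRESTClient, InferInput, InferRequest
from kserve.inference_client import RESTConfig


async def main():
    # v2 (Open Inference Protocol) REST client with async support.
    client = InferenceRESTClient(config=RESTConfig(protocol="v2"))

    infer_input = InferInput(name="input-0", shape=[1, 4], datatype="FP32")
    # Binary tensor payloads are also supported by the new clients.
    infer_input.set_data_from_numpy(
        np.array([[6.8, 2.8, 4.8, 1.4]], dtype=np.float32)
    )
    request = InferRequest(model_name="sklearn-iris", infer_inputs=[infer_input])

    response = await client.infer(
        "https://sklearn-iris.example.com",  # placeholder endpoint
        request,
        model_name="sklearn-iris",
    )
    print(response.outputs)


asyncio.run(main())
```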

OCI Storage for Models

OCI storage for models has also been promoted to a stable feature, with improved stability achieved by configuring OCI models as init containers.

Model Cache Feature

Additionally, the new Model Cache feature leverages local node storage to reduce model load times, especially for large models, enhancing scalability.

Hugging Face Integration

KServe v0.14.1 further expands integration with Hugging Face, enabling direct model deployment from the Hugging Face Hub via a new hf:// URI scheme.
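
A hedged sketch of what this looks like with the KServe Python client; the model ID and namespace are placeholders, and serving a given model may require additional runtime arguments and GPU resources not shown here.

```python
from kubernetes.client import V1ObjectMeta

from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1ModelFormat,
    V1beta1ModelSpec,
    V1beta1PredictorSpec,
    constants,
)

isvc = V1beta1InferenceService(
    api_version=constants.KSERVE_V1BETA1,
    kind=constants.KSERVE_KIND,
    metadata=V1ObjectMeta(name="bert-sentiment", namespace="kubeflow-user-example-com"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            model=V1beta1ModelSpec(
                model_format=V1beta1ModelFormat(name="huggingface"),
                # The new hf:// scheme points the Hugging Face runtime at a
                # model on the Hugging Face Hub (placeholder model ID).
                storage_uri="hf://distilbert/distilbert-base-uncased-finetuned-sst-2-english",
            )
        )
    ),
)

KServeClient().create(isvc)
```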

What comes next?

If you want to take a peek into the Kubeflow 1.11 roadmap planning and contribute with your ideas, see Notebooks, Manifests & Security, Pipelines, Model Registry, Katib, Training Operator.

How to get started with 1.10

Visit the Kubeflow 1.10 release page or head over to the Getting Started and Support pages.

Join the Community

We would like to thank everyone for their contributions to Kubeflow 1.10, especially Ricardo Martinelli De Oliveira for his work as the v1.10 Release Manager, as well as the entire release team and the working group leads, who relentlessly dedicate their time to this great project.

Release team members: Ricardo Martinelli De Oliveira, Dimitris Poulopoulos, Matteo Mortari, Julius von Kohout, Valentina Rodriguez Sosa, Helber Belmiro, Vraj Bhatt, Diego Lovison, Dagvanorov Lkhagvajav, Sailesh Duddupudi, Manos Vlassis, Tarek Abouzeid, Milos Grubjesic

Working Group leads: Andrey Velichkevich, Julius von Kohout, Mathew Wicks, …

Kubeflow Steering Committee: Andrey Velichkevich, Julius von Kohout, Yuan Tang, Johnu George, Francisco Javier Arceo

Participating Distributions: Charmed Kubeflow (Canonical), Nutanix, OpenShift AI (Red Hat), QBO

You can find more details about Kubeflow distributions here.

Want to help?

The Kubeflow community Working Groups hold open meetings and are always looking for more volunteers and users to unlock the potential of machine learning. If you’re interested in becoming a Kubeflow contributor, please feel free to check out the resources below. We look forward to working with you!