Kubeflow AI Reference Platform 1.11 Release Announcement
Kubeflow AI Reference Platform 1.11 delivers substantial platform improvements focused on scalability, security, and operational efficiency. The release reduces per-namespace overhead, strengthens multi-tenant defaults, and improves overall reliability for running Kubeflow at scale on Kubernetes.
Highlight features
- Trainer v2.1.0 with unified TrainJob API, Python-first workflows, and built-in LLM fine-tuning support
- Multi-tenant S3 storage with per-namespace credentials, with SeaweedFS replacing MinIO as the default backend
- Massive scalability improvements enabling Kubeflow deployments to scale to 1,000+ users, profiles, and namespaces
- Zero pod overhead by default for namespaces and profiles, significantly reducing baseline resource consumption
- Optimized Istio service mesh configuration to dramatically reduce sidecar memory usage and network traffic in large clusters
- Stronger security defaults with Pod Security Standards (restricted for system namespaces, baseline for user namespaces)
- Improved authentication and exposure patterns for KServe inference services, with automated tests and documentation
- Expanded Helm chart support (experimental) to improve modularity and deployment flexibility
- Updates across core components, including Kubeflow Pipelines, Katib, KServe, Model Registry, Istio, and Spark Operator
Kubeflow Platform (Manifests & Security)
The Kubeflow Platform Working Group focuses on simplifying Kubeflow installation, operations, and security. See details below.
Manifests:
- Documentation updates that make it easier to install, extend, and upgrade Kubeflow
- For more details and future plans, please check the 1.12.0 roadmap.
Application versions:

| Notebooks | Dashboard | Pipelines | Katib | Trainer | KServe | Model Registry | Spark |
|---|---|---|---|---|---|---|---|
| 1.10 | 1.10 | 2.15.2 | 0.19.0 | 2.1.0 | 0.15.2 | 0.3.4 | 2.4.0 |

Dependency versions:

| Kubernetes | Kind | Kustomize | Cert Manager | Knative | Istio | Dex | OAuth2-proxy |
|---|---|---|---|---|---|---|---|
| 1.33+ | 0.30.0 | 5.7.1 | 1.16.1 | 1.20 | 1.28 | 2.43 | 7.10 |
Security:
- Pod Security Standards enforced by default
- Network policies enabled by default for critical system namespaces (knative-serving, oauth2-proxy, cert-manager, istio-system, auth) (#3228)
- Improved multi-tenant isolation for object storage, with per-namespace S3 credentials (#3240)
- Authentication enforcement for KServe inference services (#3180)
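As a sketch of how the restricted/baseline defaults are expressed, Pod Security Standards are applied through standard Kubernetes namespace labels; the namespace name below is illustrative, and Kubeflow's manifests set equivalent labels on system and user namespaces.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a            # illustrative user namespace
  labels:
    # baseline is enforced for user namespaces;
    # system namespaces use "restricted" instead
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted
```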
Trivy CVE scans (December 15, 2025):
| Working Group | Images | Critical CVE | High CVE | Medium CVE | Low CVE |
|---|---|---|---|---|---|
| Katib | 18 | 1 | 35 | 158 | 562 |
| Pipelines | 15 | 12 | 432 | 1051 | 1558 |
| Workbenches (Notebooks) | 12 | 39 | 312 | 525 | 267 |
| KServe | 16 | 35 | 535 | 11929 | 1745 |
| Manifests | 15 | 6 | 105 | 256 | 55 |
| Trainer | 9 | 4 | 157 | 9012 | 728 |
| Model Registry | 3 | 3 | 75 | 132 | 36 |
| Spark | 1 | 4 | 22 | 1688 | 151 |
| All Images | 89 | 104 | 1673 | 24751 | 5102 |
Pipelines
This release of KFP introduces several notable changes that users should consider before upgrading. Comprehensive upgrade and documentation notes will follow shortly. In the interim, please note the following key modifications:
Default object store update
Kubeflow Pipelines now defaults to SeaweedFS for the object store deployment, replacing the previous default of MinIO. MinIO remains fully supported, as does any S3-compatible object storage backend; only the default deployment configuration has changed.
Existing MinIO manifests are still available for users who wish to continue using MinIO, though these legacy manifests may be removed in future releases. Users with existing data are advised to back up and restore as needed when switching object store backends.
Database backend upgrade
This release includes a major upgrade to the Gorm database backend, which introduces an automated database index migration for users upgrading from versions prior to 2.15.0. Because this migration does not support rollback, it is strongly recommended that production databases be backed up before performing the upgrade.
Model Registry
Model Registry continues to mature with new capabilities for model discovery, governance, and deeper integration with the Kubeflow ecosystem.
Model Registry UI
The user-friendly web interface for centralized model metadata, version tracking, and artifact management now supports filtering, sorting, archiving, custom metadata, and metadata editing, making it easier for teams to organize and govern their model lifecycle.
Model Catalog
A new Model Catalog feature enables model discovery and sharing with governance controls. A Model Catalog is a pattern in which an organisation defines its validated and approved models, enabling discovery and sharing across teams while ensuring model governance and compliance. Administrators can define multiple catalog sources (including Hugging Face), configure filtering, and control model visibility. Teams can then discover and use approved models from the organisation's catalog. The catalog UI and backend are under active development.
KServe Integration
- Custom Storage Initializer (CSI): enables model download and deployment using model metadata directly from the Registry.
- Reconciliation loop: a deployable Kubernetes controller that observes KServe InferenceServices to automatically populate Model Registry logical-model records, keeping registry audit records of live deployments.
Storage Integrations
- Python client workflows: data scientists can use convenience functions in the Python client to package, store, and register models and their metadata in a single workflow.
- Async Upload Job: a Kubernetes Job for transferring and packaging models (including the KServe ModelCar OCI image format), simplifying model storage operations in production by leveraging Kubernetes scaling and orchestration without additional dependencies.
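As a hedged sketch of the Python client workflow above: the metadata-building function below is plain Python and runs anywhere, while the registry calls appear as comments because they need a running Model Registry; the class and method names follow the `model-registry` package and may differ by version, and all server addresses and model names are illustrative.

```python
def model_metadata(name: str, version: str, uri: str) -> dict:
    """Metadata that a registration call would carry (illustrative fields)."""
    return {
        "name": name,
        "version": version,
        "uri": uri,
        "model_format_name": "onnx",  # assumed format for this sketch
    }

# With a running registry (assumed API, names may vary by release):
#
#   from model_registry import ModelRegistry
#   registry = ModelRegistry("https://registry.example.com", author="alice")
#   registry.register_model(
#       "fraud-detector",
#       "s3://models/fraud/v3/model.onnx",
#       version="3.0.0",
#       model_format_name="onnx",
#       model_format_version="1",
#   )

meta = model_metadata("fraud-detector", "3.0.0", "s3://models/fraud/v3/model.onnx")
print(meta)
```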
Additional Improvements
- Removal of the legacy Google MLMD dependency.
- PostgreSQL support alongside MySQL.
- Multi-architecture container builds (amd64/arm64).
- SBOM generation for container builds and OpenSSF Scorecard CI integration.
Training Operator (Trainer) & Katib
Kubeflow 1.11 includes Trainer v2.1.0, a major architectural evolution that simplifies distributed training on Kubernetes with a unified API, Python-first workflows, and enhanced LLM fine-tuning capabilities.
New API Architecture
Kubeflow Trainer v2 introduces TrainJob, a unified training job API that replaces framework-specific CRDs (PyTorchJob, TFJob, etc.). Infrastructure configuration is now separated into TrainingRuntime and ClusterTrainingRuntime resources, creating a clean boundary between platform engineering (runtime setup) and data science (job submission).
Python-First Experience
- No YAML required: install with `pip install kubeflow` and submit jobs directly from Python notebooks or scripts.
- Local execution mode: develop and test training code locally without a Kubernetes cluster before scaling to production.
- Helm charts: deploy with `helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.1.0`.
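A hedged sketch of the Python-first flow: the training function below is ordinary Python and runs as-is, while the submission calls (shown as comments) follow the Kubeflow SDK's `TrainerClient` and assume `pip install kubeflow` plus access to a cluster with Trainer installed; exact names may differ by SDK version.

```python
def train_fn():
    """Toy training loop: the same code runs locally or inside a TrainJob."""
    losses = []
    loss = 8.0
    for _ in range(3):
        loss /= 2  # stand-in for an optimizer step
        losses.append(loss)
    return losses

# Cluster submission (assumed SDK API; check the Trainer docs for your version):
#
#   from kubeflow.trainer import TrainerClient, CustomTrainer
#   client = TrainerClient()
#   job_name = client.train(
#       trainer=CustomTrainer(func=train_fn, num_nodes=2),
#       runtime=client.get_runtime("torch-distributed"),
#   )

print(train_fn())  # local run, no cluster needed
```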
LLM Fine-Tuning
Built-in support for large language model fine-tuning workflows:
- TorchTune trainer with pre-configured runtimes for Llama 3.2, Qwen 2.5, and more.
- LoRA, QLoRA, and DoRA for parameter-efficient fine-tuning.
- Dataset and model initializers for HuggingFace and S3 storage.
Distributed AI Data Cache
Optional in-memory cache cluster (powered by Apache Arrow and Apache DataFusion) streams datasets directly to GPU nodes with zero-copy transfers, maximizing GPU utilization and minimizing I/O wait times for large-scale training workloads. More details are available in the Kubeflow Trainer documentation.
Scheduler Integrations
- Kueue Topology-aware scheduling and multi-cluster job dispatching for TrainJobs, enabling optimal placement for distributed training across node groups.
- Volcano Gang-scheduling support with PodGroup integration.
- MPI First-class support for MPI-based distributed training workloads on Kubernetes.
Katib
Katib hyperparameter tuning remains compatible with Trainer v2, allowing users to optimize model hyperparameters alongside the new training workflow.
A major addition is the integration with Kubeflow SDK (KEP-46, PR #124). The new OptimizerClient allows users to define and run hyperparameter experiments directly from Python notebooks without writing YAML. You can configure search spaces, objectives, and algorithms using OptimizerClient().optimize(). Each trial runs as a TrainJob with different hyperparameter values, and training code can report metrics using simple Python functions. The client includes standard methods for managing jobs: create_job(), get_job(), list_jobs(), and delete_job().
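As a hedged sketch of the OptimizerClient flow described above: the objective below is plain Python, the client calls appear as comments because they require a cluster, and the signatures are assumptions based on KEP-46 that may differ in the released SDK.

```python
def objective(lr: float) -> float:
    """Toy objective: a quadratic bowl minimized at lr = 0.1."""
    return (lr - 0.1) ** 2

# With the SDK (assumed API from KEP-46 / PR #124):
#
#   from kubeflow.optimizer import OptimizerClient
#   client = OptimizerClient()
#   client.optimize(objective)   # search space, objective metric, and
#                                # algorithm are configured on this call;
#                                # each trial runs as a TrainJob
#   client.list_jobs()           # standard job-management methods:
#                                # create_job / get_job / list_jobs / delete_job

# Local illustration of what the optimizer searches for:
best = min((objective(lr), lr) for lr in (0.01, 0.05, 0.1, 0.2))
print(best)
```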
Spark Operator
The Spark Operator has received broad improvements in Kubeflow 1.11, spanning Spark version support, workload management, scheduling, and operational simplicity.
Broader Spark Support
The operator now supports Apache Spark 4 and introduces Spark Connect, enabling modern client–server Spark interactions. This allows users to connect to Spark sessions remotely and improves compatibility with the evolving Spark ecosystem.
Workload Management & Scheduling
- Suspend / resume SparkApplications: users can now suspend and resume jobs, giving greater control over the workload lifecycle.
- Kueue integration: queue-based workload management and fair sharing of cluster resources across teams.
- Enhanced dynamic allocation: improved shuffle tracking and dynamic-allocation controls for more efficient resource usage.
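As an illustration of the suspend/resume workflow, a SparkApplication might be paused as sketched below; the `suspend` field name mirrors `batch/v1` Jobs and Kueue conventions and is an assumption here, so consult the operator's CRD for the exact spec.

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi            # illustrative application
spec:
  suspend: true             # assumed field; set false (or remove) to resume
  type: Scala
  mode: cluster
  mainClass: org.apache.spark.examples.SparkPi
  sparkVersion: "4.0.0"
```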
Operations & Security
- Automatic CRD upgrades: Helm hooks now handle CRD upgrades automatically, reducing manual steps during upgrades.
- Deprecation of sparkctl: the legacy `sparkctl` CLI has been deprecated in favor of kubectl-native workflows.
- Flexible Ingress & cert-manager support: more configurable Ingress (TLS, annotations, URL patterns) and simplified certificate handling via cert-manager.
Observability
- Structured logging: configurable JSON and console log output formats.
- Better validation: stricter validation of SparkApplication names and specs, catching misconfigurations earlier.
KServe
KServe in Kubeflow 1.11 delivers major improvements across model serving, inference capabilities, and operational maturity.
Multi-Node Inference
KServe now supports multi-node inference, enabling large models to be distributed across multiple nodes using Ray-based serving runtimes. This is critical for deploying very large language models that exceed single-node GPU capacity.
Model Cache Improvements
The Model Cache feature, introduced in v0.14, has been significantly hardened. Fixes include correct URI matching, protection against cache mismatches, support for multiple node groups, and PVC/PV retention after InferenceService deletion, making model caching more reliable for production use.
KEDA Autoscaling Integration
KServe introduces integration with KEDA for event-driven autoscaling, including an external scaler implementation. This gives users more flexible scaling options beyond the built-in Knative and HPA-based autoscalers.
Gateway API Support
Raw deployment mode now supports the Kubernetes Gateway API, providing a modern, standardized alternative to Ingress for routing inference traffic.
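A minimal sketch of routing inference traffic to a raw-deployment predictor with a standard Gateway API HTTPRoute; the gateway, hostname, and Service names here are assumptions for illustration, not KServe defaults.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: sklearn-iris-route
spec:
  parentRefs:
    - name: kserve-ingress-gateway   # assumed Gateway name
  hostnames:
    - sklearn-iris.example.com
  rules:
    - backendRefs:
        - name: sklearn-iris-predictor   # predictor Service for the InferenceService
          port: 80
```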
vLLM & Hugging Face Runtime Updates
- Upgraded vLLM to v0.8.1+ with support for reasoning models, tool calling, embeddings, reranking, and Llama 4 / Qwen 3.
- vLLM V1 engine support and CPU inference via Intel Extension for PyTorch.
- LMCache integration with vLLM for improved KV cache reuse.
- Hugging Face runtime updates include 4-bit quantization support (bitsandbytes), speculative decoding, and deprecation of OpenVINO support.
Inference Graph Enhancements
- InferenceGraphs now support pod spec fields (affinity, tolerations, resources) and well-known labels.
- Improved Istio mesh compatibility and fixed response codes for conditional routing steps.
Operational & Security Improvements
- ModelCar (OCI-based model loading) enabled by default.
- Collocation of transformer and predictor containers in a single pod.
- Stop-and-resume model serving via annotations (serverless mode).
- Configurable label and annotation propagation to serving pods.
- SBOM generation and third-party license inclusion for all images.
- Multiple CVE fixes, including CVE-2025-43859 and CVE-2025-24357.
Kubeflow SDK
Kubeflow 1.11 is the first AI Reference Platform release where users can simply `pip install kubeflow` to start working with AI workloads, with no Kubernetes expertise required. The Kubeflow SDK provides a unified Python interface to train models, run hyperparameter tuning, and manage model artifacts across the Kubeflow ecosystem. It also enables local development without a Kubernetes cluster, so users can iterate on their training code locally before scaling to production. For documentation and examples, visit sdk.kubeflow.org.
Dashboard and Notebooks
The Kubeflow Central Dashboard and Notebooks remain at version 1.10 in this release, providing stable and reliable experiences. Stay tuned for interesting updates in upcoming Kubeflow AI Reference Platform releases.
How to get started with 1.11
Visit the Kubeflow AI Reference Platform 1.11 release page or head over to the Getting Started and Support pages.
Join the Community
We would like to thank everyone who contributed to Kubeflow 1.11, and especially Valentina Rodriguez Sosa for her work as the v1.11 Release Manager. We also extend our thanks to the entire release team and the working group leads, who continuously and generously dedicate their time and expertise to Kubeflow.
Release team members: Valentina Rodriguez Sosa, Anya Kramar, Tarek Abouzeid, Andy Stoneberg, Humair Khan, Matteo Mortari, Adysen Rothman, Jon Burdo, Milos Grubjesic, Vraj Bhatt, Dhanisha Phadate, Alok Dangre
Working Group leads: Andrey Velichkevich, Julius von Kohout, Mathew Wicks, Matteo Mortari
Kubeflow Steering Committee: Andrey Velichkevich, Julius von Kohout, Yuan Tang, Johnu George, Francisco Javier Araceo
You can find more details about Kubeflow distributions here.
Want to help?
The Kubeflow community Working Groups hold open meetings and are always looking for more volunteers and users to unlock the potential of machine learning. If you’re interested in becoming a Kubeflow contributor, please feel free to check out the resources below. We look forward to working with you!
- Visit our Kubeflow website or Kubeflow GitHub Page.
- Join the Kubeflow Slack channel.
- Join the kubeflow-discuss mailing list.
- Attend our weekly community meeting.