Kubeflow AI Reference Platform 1.11 Release Announcement
Kubeflow AI Reference Platform 1.11 delivers substantial platform improvements focused on scalability, security, and operational efficiency. The release reduces per-namespace overhead, strengthens multi-tenant defaults, and improves overall reliability for running Kubeflow at scale on Kubernetes.
Highlight features
- Trainer v2.1.0 with unified TrainJob API, Python-first workflows, and built-in LLM fine-tuning support
- Multi-tenant S3 storage with per-namespace credentials, with SeaweedFS replacing MinIO as the default backend
- Massive scalability improvements enabling Kubeflow deployments to scale to 1,000+ users, profiles, and namespaces
- Zero pod overhead by default for namespaces and profiles, significantly reducing baseline resource consumption
- Optimized Istio service mesh configuration to dramatically reduce sidecar memory usage and network traffic in large clusters
- Stronger security defaults with Pod Security Standards (restricted for system namespaces, baseline for user namespaces)
- Improved authentication and exposure patterns for KServe inference services, with automated tests and documentation
- Expanded Helm chart support (experimental) to improve modularity and deployment flexibility
- Updates across core components, including Kubeflow Pipelines, Katib, KServe, Model Registry, Istio, and Spark Operator
Kubeflow Platform (Manifests & Security)
The Kubeflow Platform Working Group focuses on simplifying Kubeflow installation, operations, and security. See details below.
Manifests:
- Documentation updates that make it easier to install, extend, and upgrade Kubeflow
- For more details and future plans, please check the 1.12.0 roadmap.
Application versions:

| Notebooks | Dashboard | Pipelines | Katib | Trainer | KServe | Model Registry | Spark |
|---|---|---|---|---|---|---|---|
| 1.10 | 1.10 | 2.15.2 | 0.19.0 | 2.1.0 | 0.15.2 | 0.3.4 | 2.4.0 |

Dependency versions:

| Kubernetes | Kind | Kustomize | Cert Manager | Knative | Istio | Dex | OAuth2-proxy |
|---|---|---|---|---|---|---|---|
| 1.33+ | 0.30.0 | 5.7.1 | 1.16.1 | 1.20 | 1.28 | 2.43 | 7.10 |
Security:
- Pod Security Standards enforced by default
- Network policies enabled by default for critical system namespaces (knative-serving, oauth2-proxy, cert-manager, istio-system, auth) (#3228)
- Improved multi-tenant isolation for object storage, with per-namespace S3 credentials (#3240)
- Authentication enforcement for KServe inference services (#3180)
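As a sketch of how the restricted/baseline defaults are expressed, Pod Security Standards are applied through standard Kubernetes namespace labels; the namespace name below is illustrative, and Kubeflow's manifests set equivalent labels on system and user namespaces.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a            # illustrative user namespace
  labels:
    # baseline is enforced for user namespaces;
    # system namespaces use "restricted" instead
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted
```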
Trivy CVE scans (December 15, 2025):
| Working Group | Images | Critical CVE | High CVE | Medium CVE | Low CVE |
|---|---|---|---|---|---|
| Katib | 18 | 1 | 35 | 158 | 562 |
| Pipelines | 15 | 12 | 432 | 1051 | 1558 |
| Workbenches (Notebooks) | 12 | 39 | 312 | 525 | 267 |
| KServe | 16 | 35 | 535 | 11929 | 1745 |
| Manifests | 15 | 6 | 105 | 256 | 55 |
| Trainer | 9 | 4 | 157 | 9012 | 728 |
| Model Registry | 3 | 3 | 75 | 132 | 36 |
| Spark | 1 | 4 | 22 | 1688 | 151 |
| All Images | 89 | 104 | 1673 | 24751 | 5102 |
Pipelines
This release of KFP introduces several notable changes that users should consider before upgrading. Comprehensive upgrade and documentation notes will follow shortly. In the interim, please note the following key modifications:
Default object store update
Kubeflow Pipelines now defaults to SeaweedFS for the object store deployment, replacing the previous default of MinIO. MinIO remains fully supported, as does any S3-compatible object storage backend; only the default deployment configuration has changed.
Existing MinIO manifests are still available for users who wish to continue using MinIO, though these legacy manifests may be removed in future releases. Users with existing data are advised to back up and restore as needed when switching object store backends.
Database backend upgrade
This release includes a major upgrade to the Gorm database backend, which introduces an automated database index migration for users upgrading from versions prior to 2.15.0. Because this migration does not support rollback, it is strongly recommended that production databases be backed up before performing the upgrade.
Model Registry
Model Registry continues to mature with new capabilities for model discovery, governance, and deeper integration with the Kubeflow ecosystem.
Model Registry UI
The user-friendly web interface for centralized model metadata, version tracking, and artifact management now supports filtering, sorting, archiving, custom metadata, and metadata editing, making it easier for teams to organize and govern their model lifecycle.
Model Catalog
A new Model Catalog feature enables model discovery and sharing with governance controls. A Model Catalog is a pattern in which an organisation defines its validated and approved models, enabling discovery and sharing across teams while ensuring model governance and compliance. Administrators can define multiple catalog sources (including Hugging Face), configure filtering, and control model visibility. Teams can then discover and use approved models from the organisation's catalog. The catalog UI and backend are under active development.
KServe Integration
- Custom Storage Initializer (CSI): enables model download and deployment using model metadata directly from the Registry.
- Reconciliation loop: a deployable Kubernetes controller that observes KServe InferenceServices to automatically populate Model Registry logical-model records, keeping registry audit records of live deployments.
Storage Integrations
- Python client workflows: data scientists can use convenience functions in the Python client to package, store, and register models and their metadata in a single workflow.
- Async Upload Job: a Kubernetes Job for transferring and packaging models (including the KServe ModelCar OCI image format), simplifying model storage operations in production by leveraging Kubernetes scaling and orchestration without additional dependencies.
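As a hedged sketch of the Python client workflow above: the metadata-building function below is plain Python and runs anywhere, while the registry calls appear as comments because they need a running Model Registry; the class and method names follow the `model-registry` package and may differ by version, and all server addresses and model names are illustrative.

```python
def model_metadata(name: str, version: str, uri: str) -> dict:
    """Metadata that a registration call would carry (illustrative fields)."""
    return {
        "name": name,
        "version": version,
        "uri": uri,
        "model_format_name": "onnx",  # assumed format for this sketch
    }

# With a running registry (assumed API, names may vary by release):
#
#   from model_registry import ModelRegistry
#   registry = ModelRegistry("https://registry.example.com", author="alice")
#   registry.register_model(
#       "fraud-detector",
#       "s3://models/fraud/v3/model.onnx",
#       version="3.0.0",
#       model_format_name="onnx",
#       model_format_version="1",
#   )

meta = model_metadata("fraud-detector", "3.0.0", "s3://models/fraud/v3/model.onnx")
print(meta)
```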
Additional Improvements
- Removal of the legacy Google MLMD dependency.
- PostgreSQL support alongside MySQL.
- Multi-architecture container builds (amd64/arm64).
- SBOM generation for container builds and OpenSSF Scorecard CI integration.
Training Operator (Trainer) & Katib
Kubeflow 1.11 includes Trainer v2.1.0, a major architectural evolution that simplifies distributed training on Kubernetes with a unified API, Python-first workflows, and enhanced LLM fine-tuning capabilities.
New API Architecture
Kubeflow Trainer v2 introduces TrainJob, a unified training job API that replaces framework-specific CRDs (PyTorchJob, TFJob, etc.). Infrastructure configuration is now separated into TrainingRuntime and ClusterTrainingRuntime resources, creating a clean boundary between platform engineering (runtime setup) and data science (job submission).
Python-First Experience
- No YAML required: install with `pip install kubeflow` and submit jobs directly from Python notebooks or scripts.
- Local execution mode: develop and test training code locally without a Kubernetes cluster before scaling to production.
- Helm charts: deploy with `helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.1.0`.
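A hedged sketch of the Python-first flow: the training function below is ordinary Python and runs as-is, while the submission calls (shown as comments) follow the Kubeflow SDK's `TrainerClient` and assume `pip install kubeflow` plus access to a cluster with Trainer installed; exact names may differ by SDK version.

```python
def train_fn():
    """Toy training loop: the same code runs locally or inside a TrainJob."""
    losses = []
    loss = 8.0
    for _ in range(3):
        loss /= 2  # stand-in for an optimizer step
        losses.append(loss)
    return losses

# Cluster submission (assumed SDK API; check the Trainer docs for your version):
#
#   from kubeflow.trainer import TrainerClient, CustomTrainer
#   client = TrainerClient()
#   job_name = client.train(
#       trainer=CustomTrainer(func=train_fn, num_nodes=2),
#       runtime=client.get_runtime("torch-distributed"),
#   )

print(train_fn())  # local run, no cluster needed
```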
LLM Fine-Tuning
Built-in support for large language model fine-tuning workflows:
- TorchTune trainer with pre-configured runtimes for Llama 3.2, Qwen 2.5, and more.
- LoRA, QLoRA, and DoRA for parameter-efficient fine-tuning.
- Dataset and model initializers for HuggingFace and S3 storage.
Distributed AI Data Cache
Optional in-memory cache cluster (powered by Apache Arrow and Apache DataFusion) streams datasets directly to GPU nodes with zero-copy transfers, maximizing GPU utilization and minimizing I/O wait times for large-scale training workloads. More details are available in the Kubeflow Trainer documentation.
Scheduler Integrations
- Kueue Topology-aware scheduling and multi-cluster job dispatching for TrainJobs, enabling optimal placement for distributed training across node groups.
- Volcano Gang-scheduling support with PodGroup integration.
- MPI First-class support for MPI-based distributed training workloads on Kubernetes.
Katib
Katib hyperparameter tuning remains compatible with Trainer v2, allowing users to optimize model hyperparameters alongside the new training workflow.
A major addition is the integration with Kubeflow SDK (KEP-46, PR #124). The new OptimizerClient allows users to define and run hyperparameter experiments directly from Python notebooks without writing YAML. You can configure search spaces, objectives, and algorithms using OptimizerClient().optimize(). Each trial runs as a TrainJob with different hyperparameter values, and training code can report metrics using simple Python functions. The client includes standard methods for managing jobs: create_job(), get_job(), list_jobs(), and delete_job().
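As a hedged sketch of the OptimizerClient flow described above: the objective below is plain Python, the client calls appear as comments because they require a cluster, and the signatures are assumptions based on KEP-46 that may differ in the released SDK.

```python
def objective(lr: float) -> float:
    """Toy objective: a quadratic bowl minimized at lr = 0.1."""
    return (lr - 0.1) ** 2

# With the SDK (assumed API from KEP-46 / PR #124):
#
#   from kubeflow.optimizer import OptimizerClient
#   client = OptimizerClient()
#   client.optimize(objective)   # search space, objective metric, and
#                                # algorithm are configured on this call;
#                                # each trial runs as a TrainJob
#   client.list_jobs()           # standard job-management methods:
#                                # create_job / get_job / list_jobs / delete_job

# Local illustration of what the optimizer searches for:
best = min((objective(lr), lr) for lr in (0.01, 0.05, 0.1, 0.2))
print(best)
```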
Spark Operator
The Spark Operator has received broad improvements in Kubeflow 1.11, spanning Spark version support, workload management, scheduling, and operational simplicity.
Broader Spark Support
The operator now supports Apache Spark 4 and introduces Spark Connect, enabling modern client–server Spark interactions. This allows users to connect to Spark sessions remotely and improves compatibility with the evolving Spark ecosystem.
Workload Management & Scheduling
- Suspend / resume SparkApplications: users can now suspend and resume jobs, giving greater control over the workload lifecycle.
- Kueue integration: queue-based workload management and fair sharing of cluster resources across teams.
- Enhanced dynamic allocation: improved shuffle tracking and dynamic-allocation controls for more efficient resource usage.
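As an illustration of the suspend/resume workflow, a SparkApplication might be paused as sketched below; the `suspend` field name mirrors `batch/v1` Jobs and Kueue conventions and is an assumption here, so consult the operator's CRD for the exact spec.

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi            # illustrative application
spec:
  suspend: true             # assumed field; set false (or remove) to resume
  type: Scala
  mode: cluster
  mainClass: org.apache.spark.examples.SparkPi
  sparkVersion: "4.0.0"
```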
Operations & Security
- Automatic CRD upgrades: Helm hooks now handle CRD upgrades automatically, reducing manual steps during upgrades.
- Deprecation of sparkctl: the legacy `sparkctl` CLI has been deprecated in favor of kubectl-native workflows.
- Flexible Ingress & cert-manager support: more configurable Ingress (TLS, annotations, URL patterns) and simplified certificate handling via cert-manager.
Observability
- Structured logging: configurable JSON and console log output formats.
- Better validation: stricter validation of SparkApplication names and specs, catching misconfigurations earlier.
KServe
KServe in Kubeflow 1.11 delivers major improvements across model serving, inference capabilities, and operational maturity.
Multi-Node Inference
KServe now supports multi-node inference, enabling large models to be distributed across multiple nodes using Ray-based serving runtimes. This is critical for deploying very large language models that exceed single-node GPU capacity.
Model Cache Improvements
The Model Cache feature, introduced in v0.14, has been significantly hardened. Fixes include correct URI matching, protection against cache mismatches, support for multiple node groups, and PVC/PV retention after InferenceService deletion, making model caching more reliable for production use.
KEDA Autoscaling Integration
KServe introduces integration with KEDA for event-driven autoscaling, including an external scaler implementation. This gives users more flexible scaling options beyond the built-in Knative and HPA-based autoscalers.
Gateway API Support
Raw deployment mode now supports the Kubernetes Gateway API, providing a modern, standardized alternative to Ingress for routing inference traffic.
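A minimal sketch of routing inference traffic to a raw-deployment predictor with a standard Gateway API HTTPRoute; the gateway, hostname, and Service names here are assumptions for illustration, not KServe defaults.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: sklearn-iris-route
spec:
  parentRefs:
    - name: kserve-ingress-gateway   # assumed Gateway name
  hostnames:
    - sklearn-iris.example.com
  rules:
    - backendRefs:
        - name: sklearn-iris-predictor   # predictor Service for the InferenceService
          port: 80
```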
vLLM & Hugging Face Runtime Updates
- Upgraded vLLM to v0.8.1+ with support for reasoning models, tool calling, embeddings, reranking, and Llama 4 / Qwen 3.
- vLLM V1 engine support and CPU inference via Intel Extension for PyTorch.
- LMCache integration with vLLM for improved KV cache reuse.
- Hugging Face runtime updates include 4-bit quantization support (bitsandbytes), speculative decoding, and deprecation of OpenVINO support.
Inference Graph Enhancements
- InferenceGraphs now support pod spec fields (affinity, tolerations, resources) and well-known labels.
- Improved Istio mesh compatibility and fixed response codes for conditional routing steps.
Operational & Security Improvements
- ModelCar (OCI-based model loading) enabled by default.
- Collocation of transformer and predictor containers in a single pod.
- Stop-and-resume model serving via annotations (serverless mode).
- Configurable label and annotation propagation to serving pods.
- SBOM generation and third-party license inclusion for all images.
- Multiple CVE fixes, including CVE-2025-43859 and CVE-2025-24357.
Kubeflow SDK
Kubeflow 1.11 is the first AI Reference Platform release where users can simply `pip install kubeflow` to start working with AI workloads, with no Kubernetes expertise required. The Kubeflow SDK provides a unified Python interface to train models, run hyperparameter tuning, and manage model artifacts across the Kubeflow ecosystem. It also enables local development without a Kubernetes cluster, so users can iterate on their training code locally before scaling to production. For documentation and examples, visit sdk.kubeflow.org.
Dashboard and Notebooks
The Kubeflow Central Dashboard and Notebooks remain at version 1.10 in this release, providing stable and reliable experiences. Stay tuned for interesting updates in upcoming Kubeflow AI Reference Platform releases.
How to get started with 1.11
Visit the Kubeflow AI Reference Platform 1.11 release page or head over to the Getting Started and Support pages.
Join the Community
We would like to thank everyone who contributed to Kubeflow 1.11, and especially Valentina Rodriguez Sosa for her work as the v1.11 Release Manager. We also extend our thanks to the entire release team and the working group leads, who continuously and generously dedicate their time and expertise to Kubeflow.
Release team members: Valentina Rodriguez Sosa, Anya Kramar, Tarek Abouzeid, Andy Stoneberg, Humair Khan, Matteo Mortari, Adysen Rothman, Jon Burdo, Milos Grubjesic, Vraj Bhatt, Dhanisha Phadate, Alok Dangre
Working Group leads: Andrey Velichkevich, Julius von Kohout, Mathew Wicks, Matteo Mortari
Kubeflow Steering Committee: Andrey Velichkevich, Julius von Kohout, Yuan Tang, Johnu George, Francisco Javier Araceo
You can find more details about Kubeflow distributions here.
Want to help?
The Kubeflow community Working Groups hold open meetings and are always looking for more volunteers and users to unlock the potential of machine learning. If you’re interested in becoming a Kubeflow contributor, please feel free to check out the resources below. We look forward to working with you!
- Visit our Kubeflow website or Kubeflow GitHub Page.
- Join the Kubeflow Slack channel.
- Join the kubeflow-discuss mailing list.
- Attend our weekly community meeting.