Releases · volcano-sh/volcano

27 Jun 03:49

JesseStutler

v1.14.3

27fec3b

v1.14.3

What's Changed

Bug fixes

[release-1.14] Fix Network-Topology-Aware scheduling soft mode for jobs and subjobs by @3th4novo in #5513
[release-1.14] Fix scalar in-queue resource accounting to use milli-units by @WHOIM1205 in #5503
[release-1.14] Clear device-related annotations when releasing pods by @archlitchi in #5486
[release-1.14] Fix maxFloat and maxInt display in scheduler resource strings by @r0hansaxena in #5477
[release-1.14] Fall back minAvailable to replicas when partition minPartitions is omitted by @togettoyou in #5459
[release-1.14] Check Ascend vNPU health using Allocatable resources by @DSFans2014 in #5437
[release-1.14] Prevent unbounded growth of job.Status.Conditions by @avinxshKD in #5448
[release-1.14] Fix HAMi vGPU scheduling failures in medium and large clusters by @Copilot in #5434
[release-1.14] Restore ImageStates in scheduler node snapshots by @kitianFresh in #5356
[release-1.14] Reprieve higher-priority pods first during preemption by @dengaosong in #5307

Full Changelog: v1.14.2...v1.14.3

Contributors

kitianFresh, togettoyou, and 7 other contributors


01 Jun 03:33

JesseStutler

v1.15.0

8fc394c

v1.15.0 Latest


Summary

Volcano v1.15.0 further strengthens Volcano as a unified scheduling platform for converged general-purpose and AI computing at scale. As clusters increasingly run batch training, inference, AI Agent, HPC, big-data, and heterogeneous accelerator workloads together, the scheduler must make higher-quality decisions under resource contention while preserving workload-level semantics, queue fairness, topology locality, and operational stability.

The most important addition is Gang-Aware Preemption and Resource Reclamation. It makes eviction decisions gang-aware on both sides: the incoming gang is placed as a unit, and victim jobs are ordered and selected at job/gang granularity to avoid arbitrary partial disruption. v1.15.0 also introduces DRA queue quota in the capacity plugin, pluggable multi-sharding policies, benchmark and performance observability tooling, Kubernetes 1.35 support, NodeGroup preference ordering, Agent Scheduler stability improvements, incremental GPU/vGPU improvements, and opt-in scheduling gates for queue admission control.

What's New

Key Features Overview

Gang-Aware Preemption and Resource Reclamation (Alpha): Makes eviction decisions gang-aware on both sides. Volcano evaluates the preemptor as a whole gang and selects victims at job/gang granularity, reducing random partial disruption across training jobs and supporting HyperNode-scoped victim search when topology is enabled.
DRA Queue Quota in Capacity Plugin: Adds queue quota accounting for Pods that request DRA resources, bringing ResourceClaim usage into the existing capability, deserved, and guarantee model.
Pluggable Multi-Sharding Policy Support (Alpha): Moves sharding configuration from fixed shard parameters to a pluggable policy framework with composable filter, score, and select phases, built-in allocation-rate, warmup, and node-limit policies, and ConfigMap live reload.
Volcano Benchmark and Performance Observability: Provides a benchmark framework for one-click environment setup and performance report collection on Kind/KWOK or existing clusters, helping users establish baseline data and identify scheduler bottlenecks.
Scheduling Gates for Queue Admission (Alpha): Uses opt-in scheduling gates to hold pods blocked by queue capacity, so Cluster Autoscaler and Karpenter do not scale up for quota-only blockers.

Key Feature Details

Gang-Aware Preemption and Resource Reclamation (Alpha)

Background and Motivation:

Volcano's legacy preempt and reclaim actions are task-centric. For gang-style jobs, evicting individual tasks from many different victim jobs can create wide disruption without guaranteeing that the pending gang can be scheduled afterward. Some scheduling systems only make the preemptor gang-aware: they try to place the incoming gang as a unit, but still choose victims task by task. That can protect the incoming job while randomly breaking multiple victim gangs.

Volcano v1.15.0 makes both sides of the eviction decision gang-aware. Victim jobs are ordered and selected at job/gang granularity, so the scheduler can reason about the disruption cost of breaking a victim gang instead of treating every task as an interchangeable victim. This distinction is important even when HyperNode topology is not used, because the scheduler still avoids spreading arbitrary partial evictions across unrelated jobs.

This matters especially for training-style workloads. A task-by-task victim loop can evict one replica from many different jobs. If each job depends on gang semantics, one scheduling cycle may break every victim job while still failing to place the incoming gang. Volcano now groups candidate victim tasks by job and evaluates victim bundles. When bundle splitting is available, the scheduler treats resources above the gang requirement, such as replicas - minAvailable, as lower-cost safe bundles before considering whole-job disruption. Bundle ordering follows the action policy first: gangPreempt is priority-driven, and gangReclaim is fairness-driven. Efficiency, measured as local gain inside the selected HyperNode versus global disruption, is used only after those policy constraints are satisfied.
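The grouping step described above can be illustrated with a minimal sketch. It is not Volcano's implementation: the `Task` type, the `build_victim_bundles` function, and the "safe"/"whole" labels are hypothetical names chosen for illustration, and the sketch assumes the number of candidate tasks per job stands in for the job's replicas.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for a running victim candidate."""
    name: str
    job: str


def build_victim_bundles(tasks, min_available):
    """Group candidate victims by job and split each job into bundles.

    Returns (kind, job, task_names) tuples. A "safe" bundle holds the
    replicas above the job's minAvailable; a "whole" bundle holds the
    remaining tasks, whose eviction would break the gang.
    """
    by_job = defaultdict(list)
    for task in tasks:
        by_job[task.job].append(task.name)

    bundles = []
    for job, names in by_job.items():
        # replicas - minAvailable, never negative
        surplus = max(len(names) - min_available[job], 0)
        if surplus:
            bundles.append(("safe", job, names[:surplus]))
        if surplus < len(names):
            bundles.append(("whole", job, names[surplus:]))

    # Safe bundles come before whole-job disruption; the sort is stable,
    # so the original job order is kept within each kind.
    bundles.sort(key=lambda bundle: 0 if bundle[0] == "safe" else 1)
    return bundles


candidates = [Task("a-0", "a"), Task("a-1", "a"), Task("a-2", "a"),
              Task("b-0", "b"), Task("b-1", "b")]
for bundle in build_victim_bundles(candidates, {"a": 2, "b": 2}):
    print(bundle)
```

A real scheduler would additionally order bundles by priority (gangPreempt) or queue fairness (gangReclaim) before applying any efficiency signal; the sketch only shows the job-level grouping and the safe-bundle split.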

When HyperNode topology is configured, the new actions additionally scope victim search to HyperNode candidates. Volcano evaluates preemption and reclaim inside a selected topology scope rather than freely preempting across topology domains. The scheduler then runs plugin filtering and placement simulation before committing evictions and nominations, so eviction and placement are evaluated as one scheduling decision.

Alpha Feature Notice: Gang-aware preemption and reclamation is alpha and must be enabled explicitly through gangPreempt and gangReclaim. Future releases may merge these dedicated actions with the legacy preempt and reclaim actions after the rollout is validated.

Key Capabilities:

Preemptor-gang placement: Evaluates whether the incoming gang can be placed as a whole before eviction is selected.
Victim-gang awareness: Groups victim candidates by job/gang, prioritizes lower-cost victim bundles such as replicas above minAvailable, and avoids spreading partial disruption across many jobs.
Topology-scoped eviction: When HyperNode topology is enabled, searches victims inside the selected topology scope instead of freely preempting across topology domains.
Policy-aware victim ordering: Uses priority for gangPreempt and queue fairness for gangReclaim, with efficiency used as a secondary ordering signal.

Configuration:

actions: "enqueue, allocate, backfill, gangPreempt, gangReclaim"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: drf
  - name: predicates
  - name: nodeorder
  - name: binpack

Note: Do not configure gangPreempt/gangReclaim together with the legacy preempt/reclaim actions in the same scheduler action list.
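As a rough pre-deployment check for the note above, the following sketch scans an `actions` string and reports a mix of gang-aware and legacy eviction actions. The function name `conflicting_actions` is hypothetical; Volcano may or may not perform an equivalent validation itself.

```python
def conflicting_actions(actions: str) -> list[str]:
    """Return the eviction actions that conflict, or an empty list.

    A conflict exists when gangPreempt/gangReclaim appear in the same
    action list as the legacy preempt/reclaim actions.
    """
    configured = [name.strip() for name in actions.split(",") if name.strip()]
    gang = {"gangPreempt", "gangReclaim"}
    legacy = {"preempt", "reclaim"}
    if gang & set(configured) and legacy & set(configured):
        return [name for name in configured if name in gang | legacy]
    return []


print(conflicting_actions("enqueue, allocate, backfill, gangPreempt, gangReclaim"))
print(conflicting_actions("enqueue, allocate, preempt, gangReclaim"))
```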

PRs: #5250, #4780, #5170
Design Docs: Gang-Aware Eviction Design, EvictableFn Evolution for Gang Eviction
Contributors: @vzhou-p

DRA Queue Quota in Capacity Plugin

Background and Motivation:

Previous Volcano releases already supported scheduling Pods that request Kubernetes Dynamic Resource Allocation resources. The missing part was queue quota: DRA ResourceClaim requests were not accounted against capability, deserved, or guarantee, so queues could not control DRA resource usage the same way they control CPU, memory, and extended resources.

Kubernetes DRA introduces DeviceClass, ResourceClaim, ResourceClaimTemplate, and ResourceSlice, while Volcano queues already manage quota through capability, deserved, and guarantee. v1.15.0 brings DRA resources into that queue quota model instead of requiring a separate DRA-only quota API.

The capacity plugin now accounts DRA resource requests for queue enqueue and allocation decisions. Operators can limit whole devices or consumable device dimensions such as virtual GPU cores and memory. Shared ResourceClaims are deduplicated so multiple pods referencing the same logical claim do not inflate queue usage.
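The deduplication of shared ResourceClaims can be pictured with a small sketch. It is an illustration under assumptions, not the capacity plugin's code: the pod and claim dictionaries, the `queue_dra_usage` function, and the use of `namespace/name` as the claim identity are all hypothetical. Only the `deviceclass/<name>` key form is taken from the Queue example in these notes.

```python
def queue_dra_usage(pods):
    """Sum whole-device DRA usage for a queue, counting each claim once.

    pods: list of {"name": str, "claims": [(claim_id, device_class, count)]}
    claim_id is assumed to identify one logical ResourceClaim.
    """
    seen_claims = set()
    usage = {}
    for pod in pods:
        for claim_id, device_class, count in pod["claims"]:
            if claim_id in seen_claims:
                # Another pod already referenced this shared claim.
                continue
            seen_claims.add(claim_id)
            key = f"deviceclass/{device_class}"
            usage[key] = usage.get(key, 0) + count
    return usage


pods = [
    {"name": "worker-0", "claims": [("ml/shared-gpus", "gpu.nvidia.com", 2)]},
    {"name": "worker-1", "claims": [("ml/shared-gpus", "gpu.nvidia.com", 2)]},
    {"name": "eval-0", "claims": [("ml/eval-gpu", "gpu.nvidia.com", 1)]},
]
print(queue_dra_usage(pods))  # {'deviceclass/gpu.nvidia.com': 3}
```

Without the `seen_claims` check, the two workers sharing one claim would be charged four devices instead of two.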

Compatibility Note: DRA quota requires Kubernetes DRA support and a DRA-capable driver. Some DRA allocation modes remain outside the first quota-accounting scope.

Key Capabilities:

Whole-device quota: Controls DRA DeviceClass device counts at queue level.
Consumable-capacity quota: Controls device dimensions such as cores or memory through queue quota.
Existing queue semantics: Applies the same capability, deserved, and guarantee model used by other queue resources.
ResourceClaim-aware accounting: Accounts direct claims, template-created claims, and shared claims without inflating queue usage.

Configuration:

kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, reclaim"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: capacity
        arguments:
          capacity.DynamicResourceAllocationEnable: true
          capacity.DRAConsumableCapacityEnable: true
      - name: nodeorder

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-team
spec:
  reclaimable: true
  capability:
    cpu: "100"
    memory: "200Gi"
    "nvidia.com/gpu": "4"
    "deviceclass/gpu.nvidia.com": "8"
    "cores.deviceclass/hami-core-gpu.project-hami.io": "800"
    "memory.deviceclass/hami-core-gpu.project-hami.io": "320Gi"

PRs: #5058
Design Doc: DeviceClass Quota Support in Capacity Plugin
User Guide: DeviceClass Quota User Guide
Contributors: @xu-wentao

Pluggable Multi-Sharding Policy Support (Alpha)

Background and Motivation:

The v1.14.0 Sharding Controller introduced dynamic node scheduling shards for multi-scheduler deployments. v1.15.0 builds on that architecture by adding pluggable multi-sharding policy support. In...

Contributors

goyalankit, praveen0raj, and 42 other contributors


09 May 01:42

JesseStutler

v1.14.2

92d6e4c

v1.14.2

Important:
This release fixes a security vulnerability and multiple bugs. We strongly advise all users to upgrade immediately to protect their systems and data.

Security Fixes

CVE-2026-44247: Webhook Server OOM via unbounded HTTP request body size

A security vulnerability has been discovered in the Volcano webhook server that could allow a pod with network access to the webhook endpoint to cause a denial of service by sending an arbitrarily large HTTP request body, leading to the webhook server being killed by OOM.
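These notes do not describe how the fix was implemented. As a general illustration of the mitigation class for unbounded request bodies, the sketch below reads a body in chunks and aborts once a size limit is exceeded, so memory use stays bounded. The names `read_bounded` and `BodyTooLarge` and the limit value are assumptions for illustration only.

```python
import io


class BodyTooLarge(Exception):
    """Raised when a request body exceeds the configured limit."""


def read_bounded(stream, limit, chunk_size=64 * 1024):
    """Read a request body, refusing to buffer more than `limit` bytes."""
    chunks = []
    total = 0
    while True:
        # Never ask for more than one byte past the limit.
        chunk = stream.read(min(chunk_size, limit + 1 - total))
        if not chunk:
            break
        total += len(chunk)
        if total > limit:
            raise BodyTooLarge(f"request body exceeds {limit} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


body = read_bounded(io.BytesIO(b'{"kind": "AdmissionReview"}'), limit=1024)
print(len(body))
```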

Affected Versions:

volcano <= v1.14.1
volcano <= v1.13.2
volcano <= v1.12.3

Fixed Versions:

volcano v1.14.2
volcano v1.13.3
volcano v1.12.4

This vulnerability was reported by @bugbunny-research and mitigated by @JesseStutler.

CVSS Rating: Moderate (6.8) CVSS:3.1/AV:A/AC:L/PR:L/UI:N/S:C/C:N/I:N/A:H
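The 6.8 score can be reproduced from the vector string using the base-score equations and metric weights published in the CVSS v3.1 specification. The sketch below covers base metrics only; the function and constant names are illustrative.

```python
import math

# Metric weights from the CVSS v3.1 specification.
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.5}


def roundup(value):
    """Round up to one decimal place as defined in the specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector):
    """Compute the CVSS v3.1 base score for a vector string."""
    metrics = dict(part.split(":") for part in vector.split("/")[1:])
    changed = metrics["S"] == "C"
    pr = (PR_CHANGED if changed else PR_UNCHANGED)[metrics["PR"]]
    iss = 1 - ((1 - WEIGHTS["C"][metrics["C"]])
               * (1 - WEIGHTS["I"][metrics["I"]])
               * (1 - WEIGHTS["A"][metrics["A"]]))
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    exploitability = (8.22 * WEIGHTS["AV"][metrics["AV"]]
                      * WEIGHTS["AC"][metrics["AC"]]
                      * pr * WEIGHTS["UI"][metrics["UI"]])
    if impact <= 0:
        return 0.0
    if changed:
        return roundup(min(1.08 * (impact + exploitability), 10))
    return roundup(min(impact + exploitability, 10))


print(base_score("CVSS:3.1/AV:A/AC:L/PR:L/UI:N/S:C/C:N/I:N/A:H"))  # 6.8
```

For this vector the impact sub-score is about 3.99 and the exploitability sub-score about 2.27; the changed scope applies the 1.08 multiplier, giving 6.76, which rounds up to 6.8.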

Bug Fixes

fix: remove duplicated session close call (#5056 @qi-min)
Update KUBE_VERSION to 1.34.1 in webhook-manager Dockerfile (#5063 @hajnalmt)
Update root queue capability and enhance queue validation logic (#5080 @guoqinwill)
Fix shared mutable objects in scheduler snapshot clones (#5092 @zhifei92)
fix: panic and restart of volcano scheduler pods on install (#5144 @Tau721)
Fix Agent Scheduler multi worker optimistic parallel scheduling concurrently conflict error (#5162 @JesseStutler)
Fix inaccurate E2E duration metric in agent-scheduler (#5165 @Copilot)
fix: process panic caused by concurrent map writes (#5182 @zhifei92)
wait event handler completed before start scheduling (#5183 @qi-min)
Rollback unnecessary deepcopy in snapshot (#5185 @zhifei92)
fix event handlers cache (#5188 @qi-min)
fix: highestTierName in partitionPolicy or subGroupPolicy fails to restrict scheduling to specified HyperNode tiers (#5203 @Tau721)
fix(capacity): avoid false exceeds on missing parent scalar keys (#5218 @hajnalmt)
reduce node priority if nodes wait to be checked in binder (#5260 @qi-min)
fix(scheduler): prevent preemptorTasks overwrite in multi-queue preemption (#5263 @hajnalmt)
enhancement(scheduler): honor QueueOrderFn in preempt action (#5268 @hajnalmt)
Fix: Stabilize predicates plugin execution order and rollback semantics (#5286 @wangyang0616)

Full Changelog: v1.14.1...v1.14.2

Contributors

qi-min, hajnalmt, and 6 other contributors


09 May 01:47

JesseStutler

v1.13.3

5fe5adf

v1.13.3

Important:
This release fixes a security vulnerability and multiple bugs. We strongly advise all users to upgrade immediately to protect their systems and data.

Security Fixes

CVE-2026-44247: Webhook Server OOM via unbounded HTTP request body size

Affected Versions:

volcano <= v1.14.1
volcano <= v1.13.2
volcano <= v1.12.3

Fixed Versions:

volcano v1.14.2
volcano v1.13.3
volcano v1.12.4

This vulnerability was reported by @bugbunny-research and mitigated by @JesseStutler.

CVSS Rating: Moderate (6.8) CVSS:3.1/AV:A/AC:L/PR:L/UI:N/S:C/C:N/I:N/A:H

Bug Fixes

Rollback unnecessary deepcopy in snapshot (#5186 @zhifei92)
wait event handler completed before start scheduling (#5200 @qi-min)
fix(scheduler): prevent preemptorTasks overwrite in multi-queue preemption (#5265 @hajnalmt)
enhancement(scheduler): honor QueueOrderFn in preempt action (#5269 @hajnalmt)

Full Changelog: v1.13.2...v1.13.3

Contributors

qi-min, hajnalmt, and 3 other contributors


09 May 01:49

JesseStutler

v1.12.4

28043a4

v1.12.4

Important:
This release fixes a security vulnerability and multiple bugs. We strongly advise all users to upgrade immediately to protect their systems and data.

Security Fixes

CVE-2026-44247: Webhook Server OOM via unbounded HTTP request body size

Affected Versions:

volcano <= v1.14.1
volcano <= v1.13.2
volcano <= v1.12.3

Fixed Versions:

volcano v1.14.2
volcano v1.13.3
volcano v1.12.4

This vulnerability was reported by @bugbunny-research and mitigated by @JesseStutler.

CVSS Rating: Moderate (6.8) CVSS:3.1/AV:A/AC:L/PR:L/UI:N/S:C/C:N/I:N/A:H

Bug Fixes

wait event handler completed before start scheduling (#5201 @qi-min)
fix(scheduler): prevent preemptorTasks overwrite in multi-queue preemption and honor QueueOrderFn (#5270 @hajnalmt)

Full Changelog: v1.12.3...v1.12.4

Contributors

qi-min, hajnalmt, and 2 other contributors


30 Mar 13:16

JesseStutler

v1.13.2

8f7a52a

v1.13.2

What's Changed

Bug fixes

cherrypick 4829 to release 1.13: keep terminating pod in job by @kingeasternsun in #4860
[release-1.13] fix potential panic on numa resources info updating in snapshot by @qi-min in #4897
[release-1.13] Fix gpu resource error by @sailorvii in #4916
[release-1.13] Update metrics_client_prometheus.go by @nitindhiman314e in #4931
[release-1.13] Fix shared mutable objects in scheduler snapshot clones by @zhifei92 in #5093

Full Changelog: v1.13.1...v1.13.2

Contributors

kingeasternsun, qi-min, and 3 other contributors


14 Feb 01:57

JesseStutler

v1.14.1

49b8f40

v1.14.1

What's Changed

Bug fixes

[release-1.14] Fixed issue where jobs with subgroups but not hard networkTopology.mode could not be scheduled. by @JesseStutler in #5041
[release-1.14] fix: The AllocatedHyperNode recovery for SubJobs during scheduler restart may not be the lowest tier. by @ouyangshengjia in #5012

Full Changelog: v1.14.0...v1.14.1

Contributors

JesseStutler and ouyangshengjia


31 Jan 06:34

JesseStutler

v1.14.0

6f86a47

v1.14.0

Summary

Volcano v1.14.0 establishes Volcano as a unified scheduling platform for diverse workloads at scale. This release introduces a scalable multi-scheduler architecture with dynamic node scheduling shards, enabling multiple schedulers to coordinate efficiently across large clusters. A new Agent Scheduler provides fast scheduling for latency-sensitive AI Agent workloads while working seamlessly with the Volcano batch scheduler. Network topology aware scheduling gains significant enhancements, including HyperNode-level binpacking, SubGroup policies, and multi-level gang scheduling across Job and SubGroup scopes. Volcano Global integration advances with HyperJob for multi-cluster training and data-aware scheduling. Colocation now supports generic operating systems with CPU Throttling, Memory QoS, and Cgroup V2. Additionally, integrated Ascend vNPU scheduling enables efficient sharing of Ascend AI accelerators.

What's New

Key Features Overview

Scalable Multi-Scheduler with Dynamic Node Scheduling Shard (Alpha): Dynamically compute candidate node pools for schedulers with extensible strategies
Fast Scheduling for AI Agent Workloads (Alpha): A new Agent Scheduler for latency-sensitive AI Agent workloads is introduced, working in coordination with Volcano batch scheduler to establish a unified scheduling platform
Network Topology Aware Scheduling Enhancements: Support hyperNode-level binpacking, SubGroup level network topology aware scheduling, and multi-level gang scheduling across Job and SubGroup scopes for distributed workloads
Volcano Global Enhancements: HyperJob for multi-cluster training and data-aware scheduling for federated environments
Colocation for Generic OS: CPU Throttling, Memory QoS, CPU Burst with Cgroup V2 support on Ubuntu, CentOS, and other generic operating systems
Ascend vNPU Scheduling: Integrated support for Ascend 310P/910 series vNPU scheduling with MindCluster and HAMi modes

Key Feature Details

Scalable Multi-Scheduler with Dynamic Node Scheduling Shard (Alpha)

Background and Motivation:

As Volcano evolves to support diverse scheduling workloads at massive scale, the single scheduler architecture faces significant challenges. Different workload types (batch training, AI agents, microservices) have distinct scheduling requirements and resource utilization patterns. A single scheduler becomes a bottleneck, and static resource allocation leads to inefficient cluster utilization.

The Sharding Controller introduces a scalable multi-scheduler architecture that dynamically computes candidate node pools for each scheduler. Unlike strict partitioning, the Sharding Controller calculates dynamic candidate node pools rather than enforcing hard isolation between schedulers. This flexible approach enables Volcano to serve as a unified scheduling platform for diverse workloads while maintaining high throughput and low latency.

Alpha Feature Notice: This feature is currently in alpha stage. The NodeShard CRD (Node Scheduling Shard) API structure and the underlying scheduling shard concepts are actively evolving.

Key Capabilities:

Dynamic Node Scheduling Shard Strategies: Compute dynamic candidate node pools based on various policies. Currently supports scheduling shard by CPU utilization, with an extensible design to support more policies in the future.
NodeShard CRD: Manages dynamic candidate node pools for specific schedulers.
Large-scale Cluster Support: Architecture designed to support large-scale clusters by distributing load across multiple schedulers
Scheduler Coordination: Enable seamless coordination among various scheduler combinations (e.g., multiple Batch Schedulers, or a mix of Agent and Batch Schedulers), establishing Volcano as a unified scheduling platform

Configuration:

# Sharding Controller startup flags
--scheduler-configs="volcano:volcano:0.0:0.6:false:2:100,agent-scheduler:agent:0.7:1.0:true:2:100"
--shard-sync-period=60s
--enable-node-event-trigger=true

# Config format: name:type:min_util:max_util:prefer_warmup:min_nodes:max_nodes
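To make the colon-separated format concrete, here is a minimal parsing sketch for the `--scheduler-configs` value. The field order follows the format comment above; the `ShardConfig` type and `parse_scheduler_configs` function are hypothetical and are not part of the Sharding Controller's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardConfig:
    name: str
    type: str
    min_util: float
    max_util: float
    prefer_warmup: bool
    min_nodes: int
    max_nodes: int


def parse_scheduler_configs(value):
    """Parse name:type:min_util:max_util:prefer_warmup:min_nodes:max_nodes entries."""
    configs = []
    for entry in value.split(","):
        fields = entry.strip().split(":")
        if len(fields) != 7:
            raise ValueError(f"expected 7 fields, got {len(fields)}: {entry!r}")
        name, kind, min_util, max_util, warmup, min_nodes, max_nodes = fields
        configs.append(ShardConfig(
            name=name,
            type=kind,
            min_util=float(min_util),
            max_util=float(max_util),
            prefer_warmup=(warmup == "true"),
            min_nodes=int(min_nodes),
            max_nodes=int(max_nodes),
        ))
    return configs


flag = "volcano:volcano:0.0:0.6:false:2:100,agent-scheduler:agent:0.7:1.0:true:2:100"
for config in parse_scheduler_configs(flag):
    print(config)
```

In the example value, the batch scheduler is given the 0.0 to 0.6 utilization range and the agent scheduler the 0.7 to 1.0 range with warmup preferred, each with between 2 and 100 nodes.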

PR: #4777
Design Doc: Sharding Controller Design
Contributors: @ssfffss, @Haoran, @qi-min

Fast Scheduling for AI Agent Workloads (Alpha)

Background and Motivation:

AI Agent workloads are latency-sensitive with frequent task creation, requiring ultra-fast scheduling with high throughput. The Volcano batch scheduler is optimized for batch workloads and processes pods at fixed intervals, which cannot guarantee low latency for Agent workloads. To establish Volcano as a unified scheduling platform for both batch and latency-sensitive workloads, we introduce a dedicated Agent Scheduler.

The Agent Scheduler works in coordination with the Volcano batch scheduler through the Sharding Controller (introduced in the "Scalable Multi-Scheduler with Dynamic Node Scheduling Shard" feature above). This architecture positions Volcano as a unified scheduling platform capable of handling diverse workload types.

Alpha Feature Notice: This feature is currently in alpha stage and under active development. The Agent Scheduler related APIs, configuration options, and scheduling algorithms may be refined in future releases.

Key Capabilities:

Fast-Path Scheduling: Independent scheduler optimized for latency-sensitive workloads such as AI Agent workloads
Multi-Worker Parallel Scheduling: Multiple workers process pods concurrently from the scheduling queue, increasing throughput
Optimistic Concurrency Control: Conflict-Aware Binder resolves scheduling conflicts before executing real binding
Optimized Scheduling Queue: Enhanced queue mechanism with urgent retry support
Unified Platform Integration: Seamless coordination with Volcano batch scheduler via Sharding Controller
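The optimistic concurrency idea in the list above can be sketched in a few lines. These notes do not describe the Conflict-Aware Binder's internals, so everything here is an assumption for illustration: the class name, the CPU-only capacity model, and the retry behavior are hypothetical. Workers pick nodes from a possibly stale snapshot, and a single binder re-checks capacity before committing.

```python
class ConflictAwareBinder:
    """Toy binder that validates each proposed placement before committing it."""

    def __init__(self, free_cpu):
        self.free_cpu = dict(free_cpu)  # node -> free millicores (authoritative)
        self.bound = {}                 # pod -> node

    def bind(self, pod, node, cpu):
        """Commit the placement if it still fits; report a conflict otherwise."""
        if self.free_cpu.get(node, 0) < cpu:
            return False  # conflict: another worker consumed the capacity
        self.free_cpu[node] -= cpu
        self.bound[pod] = node
        return True


binder = ConflictAwareBinder({"node-a": 1000, "node-b": 1000})

# Two workers scheduled from the same snapshot and both chose node-a.
print(binder.bind("pod-1", "node-a", 800))  # True
print(binder.bind("pod-2", "node-a", 800))  # False, conflict detected
# The losing worker retries against another node.
print(binder.bind("pod-2", "node-b", 800))  # True
```

The design trade-off is that workers never block each other while scheduling; the cost is an occasional retry when two of them race for the same capacity.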

Issue: #4722
PRs: #4804, #4801, #4805
Design Doc: Agent Scheduler Design
Contributors: @qi-min, @JesseStutler, @handan-yxh

Network Topology Aware Scheduling Enhancements

Background and Motivation:

Volcano v1.14.0 brings significant enhancements to network topology aware scheduling, addressing the growing demands of distributed workloads including LLM training, HPC, and other network-intensive applications.

Key Enhancements:

SubGroup Level Topology Awareness: Support fine-grained network topology constraints at the SubGroup/Partition level.
Flexible Network Tier Configuration: Support highestTierName for specifying maximum network tier constraints by name.
Multi-Level Gang Scheduling: Improved gang scheduling to support both Job-level and SubGroup-level consistency.
Volcano Job Partitioning: Enable partitioning of Volcano Jobs for better resource management and fault isolation.
HyperNode-Level Binpacking: Optimization for resource utilization across network topology boundaries.

Configuration Example - Volcano Job:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: llm-training-job
spec:
  # ...other fields
  networkTopology:
    mode: hard
    highestTierAllowed: 2  # Job can cross up to Tier 2 HyperNodes
  tasks:
  - name: trainer
    replicas: 8
    partitionPolicy:
      totalPartitions: 2    # Split into 2 partitions
      partitionSize: 4      # 4 pods per partition
      minPartitions: 2      # Minimum 2 partitions required
      networkTopology:
        mode: hard
        highestTierAllowed: 1  # Each partition must stay within Tier 1
    template:
      spec:
        containers:
        - name: trainer
          image: training-image:v1
          resources:
            requests:
              nvidia.com/gpu: 8
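The numbers in the partitionPolicy example fit together as 2 partitions of 4 pods for 8 replicas. The sketch below checks that relationship and derives a minimum pod count. It rests on assumptions: that replicas must equal totalPartitions times partitionSize (suggested by the example comments), and that the gang minimum is minPartitions times partitionSize. The fallback to replicas when minPartitions is omitted mirrors the v1.14.3 fix listed in these notes. The function name is hypothetical.

```python
def min_available_pods(replicas, total_partitions, partition_size, min_partitions=None):
    """Validate a partition policy and return the assumed minimum pod count."""
    if total_partitions * partition_size != replicas:
        raise ValueError(
            f"{total_partitions} partitions x {partition_size} pods != {replicas} replicas"
        )
    if min_partitions is None:
        # Fall back to requiring every replica.
        return replicas
    if not 0 < min_partitions <= total_partitions:
        raise ValueError("minPartitions must be between 1 and totalPartitions")
    return min_partitions * partition_size


print(min_available_pods(replicas=8, total_partitions=2, partition_size=4, min_partitions=2))  # 8
print(min_available_pods(replicas=8, total_partitions=2, partition_size=4, min_partitions=1))  # 4
```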

Configuration Example - PodGroup SubGroupPolicy:

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: llm-training-pg
spec:
  minMember: 4
  networkTopology:
    mode: hard
    highestTierAllowed: 2
  subGroupPolicy:
  - name: "trainer"
    subGroupSize: 4
    labelSelector:
      matchLabels:
        volcano.sh/task-spec: trainer
    matchLabelKeys:
    - volcano.sh/partition-id
    networkTopology:
      mode: hard
      highestTierAllowed: 1

Issues: #4188, #4368, #4869
PRs: #4721, #4810, #4795, #4785, #4889
Design Doc: Network Topology Aware Scheduling
Contributors: @ouyangshengjia, @3sunny, @zhaoqi, @wangyang0616, @MondayCha, @Tau721

Colocation for Generic OS

This release brings comprehensive improvements to Volcano's colocation capabilities, with a major milestone: support for generic operating systems (Ubuntu, CentOS, etc.) in addition to OpenEuler. This enables broader adoption of Volcano Agent for resource sharing between online and offline workloads.

New Features in v1.14.0:

CPU Throttling (CPU Suppression)

The CPU usage of online pod...

Contributors

aman-kumar, zhaoqi, and 53 other contributors


18 Jan 08:34

JesseStutler

v1.12.3

afdefdd

v1.12.3

What's Changed

Bug fixes

[Cherry-pick v1.12] add hcclrank job plugin by @wangdongyang1 in #4555
Automated cherry pick of #4347: When some scalar resources are 0 in deserved, hierarchical queue validation cannot pass by @wuxiaobao in #4586
Automated cherry pick of #4590: add permissions for managing namespaces in admission rules by @suyiiyii in #4594
[Cherry-pick v1.12] fix mpi job plugin panic when mpi job only has master task by @wangdongyang1 in #4619
[Cherry-pick v1.12]Sync kube-scheduler:Improve CSILimits plugin accuracy by using VolumeAttachments by @guoqinwill in #4627
Automated cherry pick of #4599: fix: report all scalar metrics for each queue by @hajnalmt in #4651
[Cherry-pick 1.12] fix: Initialize realCapability field in newQueueAttr by @dafu-wu in #4695
[cherry-pick 1.12]Scheduling main loop blocked and timeout due to un-released PreBind lock in Volcano by @guoqinwill in #4699
[release-1.12] Cherry-pick #4786 and #4792: fix replicaset KubeGroupNameAnnotation handling and replicaSet podgroup update synchronization by @hajnalmt in #4843
Automated cherry pick of #4829: keep terminating pod in job by @wangdongyang1 in #4861
[release-1.12] fix potential panic on numa resources info updating in snapshot by @qi-min in #4898
[release-1.12] Fix gpu resource error by @ChenW66 in #4915
[release-1.12] Fix: Changes to task members in a PodGroup caused task validity checks to fail during scheduling by @ouyangshengjia in #4920
[release-1.12] Fix scheduler panic when metrics are disabled by @Copilot in #4921
[release-1.12] Update metrics_client_prometheus.go by @nitindhiman314e in #4932

Maintenance

[release-1.12] Add Free Disk Space step to E2E workflows by @Copilot in #4851

Full Changelog: v1.12.2...v1.12.3

Contributors

qi-min, hajnalmt, and 8 other contributors


23 Dec 11:32

JesseStutler

v1.13.1

0c6f3bf

v1.13.1

What's Changed

Bug fixes

Automated cherry pick of #4670: fix: ci err caused by ray e2e default image by @Wonki4 in #4681
[Cherry-pick 1.13] fix: Initialize realCapability field in newQueueAttr by @dafu-wu in #4694
[cherry-pick 1.13]Scheduling main loop blocked and timeout due to un-released PreBind lock in Volcano by @guoqinwill in #4700
[release-1.13] Fix scheduler panic when metrics are disabled by @Copilot in #4770
Cherry-pick PR #4786 to release-1.13: Fix replicaSet podgroup update synchronization by @jiahuat in #4799
[release-1.13] fix: replicaset KubeGroupNameAnnotation handling by @hajnalmt in #4826
[release-1.13] fix: constant cache warnings by @hajnalmt in #4831
[release-1.13] fix: capacity plugin's preemptivefn logic by @hajnalmt in #4830
[release-1.13] Fix: Changes to task members in a PodGroup caused task validity checks to fail during scheduling by @ouyangshengjia in #4852

Maintenance

[release-1.13] Add Free Disk Space step to E2E workflows by @Copilot in #4763

Full Changelog: v1.13.0...v1.13.1

Contributors

hajnalmt, jiahuat, and 4 other contributors

