From: Wanpeng Li <wanpengli@tencent.com>
This series addresses long-standing yield_to() inefficiencies in
virtualized environments through two complementary mechanisms: a vCPU
debooster in the scheduler and IPI-aware directed yield in KVM.
Problem Statement
-----------------
In overcommitted virtualization scenarios, vCPUs frequently spin on locks
held by other vCPUs that are not currently running. The kernel's
paravirtual spinlock support detects these situations and calls yield_to()
to boost the lock holder, allowing it to run and release the lock.
However, the current implementation has two critical limitations:
1. Scheduler-side limitation:
yield_to_task_fair() relies solely on set_next_buddy() to provide
preference to the target vCPU. This buddy mechanism only offers
immediate, transient preference. Once the buddy hint expires (typically
after one scheduling decision), the yielding vCPU may preempt the target
again, especially in nested cgroup hierarchies where vruntime domains
differ.
This creates a ping-pong effect: the lock holder runs briefly, gets
preempted before completing critical sections, and the yielding vCPU
spins again, triggering another futile yield_to() cycle. The overhead
accumulates rapidly in workloads with high lock contention.
2. KVM-side limitation:
kvm_vcpu_on_spin() attempts to identify which vCPU to yield to through
directed yield candidate selection. However, it lacks awareness of IPI
communication patterns. When a vCPU sends an IPI and spins waiting for
a response (common in inter-processor synchronization), the current
heuristics often fail to identify the IPI receiver as the yield target.
Instead, the code may boost an unrelated vCPU based on coarse-grained
preemption state, missing opportunities to accelerate actual IPI
response handling. This is particularly problematic when the IPI receiver
is runnable but not scheduled, as lock-holder-detection logic doesn't
capture the IPI dependency relationship.
Combined, these issues cause excessive lock hold times, cache thrashing,
and degraded throughput in overcommitted environments, particularly
affecting workloads with fine-grained synchronization patterns.
Solution Overview
-----------------
The series introduces two orthogonal improvements that work synergistically:
Part 1: Scheduler vCPU Debooster (patches 1-5)
Augment yield_to_task_fair() with bounded vruntime penalties to provide
sustained preference beyond the buddy mechanism. When a vCPU yields to a
target, apply a carefully tuned vruntime penalty to the yielding vCPU,
ensuring the target maintains scheduling advantage for longer periods.
The mechanism is EEVDF-aware and cgroup-hierarchy-aware:
- Locate the lowest common ancestor (LCA) in the cgroup hierarchy where
both the yielding and target tasks coexist. This ensures vruntime
adjustments occur at the correct hierarchy level, maintaining fairness
across cgroup boundaries.
- Update EEVDF scheduler fields (vruntime, deadline, vlag) atomically to
keep the scheduler state consistent. The penalty shifts the yielding
task's virtual deadline forward, allowing the target to run.
- Apply queue-size-adaptive penalties that scale from 6.0× scheduling
granularity for 2-task scenarios (strong preference) down to 1.0× for
large queues (>12 tasks), balancing preference against starvation risks.
- Implement reverse-pair debouncing: when task A yields to B, then B yields
to A within a short window (~600us), downscale the penalty to prevent
ping-pong oscillation.
- Rate-limit penalty application to 6ms intervals to prevent pathological
overhead when yields occur at very high frequency.
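For illustration, below is a minimal sketch of the penalty shaping described in the bullets above, assuming a linear taper between the two endpoints. The helper names, parameters, and exact taper are expository assumptions rather than the code added in patches 2-4, and the 6ms rate-limit gate is assumed to be checked by the caller:

/* Queue-size-adaptive scale, in tenths of a scheduling slice (illustrative). */
static u64 yield_deboost_scale_x10(unsigned int nr_queued)
{
	if (nr_queued <= 2)
		return 60;			/* 6.0x: strong preference for 2-task queues */
	if (nr_queued > 12)
		return 10;			/* 1.0x: conservative on large queues */
	return 60 - (nr_queued - 2) * 5;	/* linear taper in between */
}

/*
 * Penalty charged to the yielding entity at the LCA level where the
 * yielder and the target actually compete (see the LCA bullet above).
 */
static u64 yield_deboost_penalty(struct sched_entity *se_yield,
				 unsigned int nr_queued,
				 bool reverse_pair_recent)
{
	/* One weight-adjusted slice of vruntime as the base unit. */
	u64 penalty = calc_delta_fair(sysctl_sched_base_slice, se_yield);

	penalty = penalty * yield_deboost_scale_x10(nr_queued) / 10;

	/*
	 * Reverse-pair debounce: if the target yielded to us within ~600us,
	 * the pair is ping-ponging, so downscale (quartering is illustrative)
	 * rather than pile on.
	 */
	if (reverse_pair_recent)
		penalty >>= 2;

	return penalty;
}

Because calc_delta_fair() scales by the entity's weight, a heavier yielder receives a proportionally smaller virtual-time penalty, mirroring how update_curr() converts execution time into vruntime.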
The debooster works *with* the buddy mechanism rather than replacing it:
set_next_buddy() provides immediate preference for the next scheduling
decision, while the vruntime penalty sustains that preference over
subsequent decisions. This dual approach proves especially effective in
nested cgroup scenarios where buddy hints alone are insufficient.
Part 2: KVM IPI-Aware Directed Yield (patches 6-10)
Enhance kvm_vcpu_on_spin() with lightweight IPI tracking to improve
directed yield candidate selection. Track sender/receiver relationships
when IPIs are delivered and use this information to prioritize yield
targets.
The tracking mechanism:
- Hooks into kvm_irq_delivery_to_apic() to detect unicast fixed IPIs (the
common case for inter-processor synchronization). When exactly one
destination vCPU receives an IPI, record the sender→receiver relationship
with a monotonic timestamp.
In high-VM-density scenarios, software-based IPI tracking via interrupt
delivery interception is particularly valuable: it captures precise
sender/receiver relationships that the directed-yield logic can act on,
and in overcommitted environments the resulting scheduling benefit can
complement or even exceed that of hardware-accelerated interrupt
delivery.
- Uses lockless READ_ONCE/WRITE_ONCE accessors for minimal overhead. The
per-vCPU ipi_context structure is carefully designed to avoid cache line
bouncing.
- Implements a short recency window (50ms default) to avoid stale IPI
information inflating boost priority on throughput-sensitive workloads.
Old IPI relationships are naturally aged out.
- Clears IPI context on EOI with two-stage precision: unconditionally clear
the receiver's context (it processed the interrupt), but only clear the
sender's pending flag if the receiver matches and the IPI is recent. This
prevents unrelated EOIs from prematurely clearing valid IPI state.
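To make the shape of the tracking concrete, here is a rough sketch of the per-vCPU context and the two hooks. The structure layout, the ipi_ctx() accessor, and the ipi_window_ns default are illustrative assumptions based on the description above, not the exact code from patches 7-8:

/* 50ms recency window; a module parameter in the series. */
static u64 ipi_window_ns = 50 * NSEC_PER_MSEC;

struct kvm_vcpu_ipi_context {
	int pending_receiver;	/* vcpu_idx of last unicast fixed-IPI target, -1 if none */
	int last_sender;	/* vcpu_idx of the vCPU that last sent us an IPI, -1 if none */
	u64 sent_ns;		/* monotonic timestamp of the send */
};

/* Hooked from interrupt delivery when exactly one destination matched. */
static void kvm_track_ipi(struct kvm_vcpu *sender, struct kvm_vcpu *receiver)
{
	struct kvm_vcpu_ipi_context *s = ipi_ctx(sender);
	struct kvm_vcpu_ipi_context *r = ipi_ctx(receiver);

	WRITE_ONCE(s->pending_receiver, receiver->vcpu_idx);
	WRITE_ONCE(s->sent_ns, ktime_get_mono_fast_ns());
	WRITE_ONCE(r->last_sender, sender->vcpu_idx);
}

/* Hooked from EOI handling on the receiver: two-stage clear. */
static void kvm_clear_ipi_on_eoi(struct kvm_vcpu *receiver)
{
	struct kvm_vcpu_ipi_context *r = ipi_ctx(receiver);
	int sender_idx = READ_ONCE(r->last_sender);
	struct kvm_vcpu *sender;
	struct kvm_vcpu_ipi_context *s;

	/* Stage 1: the receiver has processed its interrupt; always clear. */
	WRITE_ONCE(r->last_sender, -1);

	if (sender_idx < 0)
		return;
	sender = kvm_get_vcpu(receiver->kvm, sender_idx);
	if (!sender)
		return;
	s = ipi_ctx(sender);

	/*
	 * Stage 2: only clear the sender's pending flag if it still points
	 * at this receiver and the IPI is recent, so unrelated EOIs do not
	 * wipe valid state.
	 */
	if (READ_ONCE(s->pending_receiver) == receiver->vcpu_idx &&
	    ktime_get_mono_fast_ns() - READ_ONCE(s->sent_ns) < ipi_window_ns)
		WRITE_ONCE(s->pending_receiver, -1);
}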
The candidate selection follows a priority hierarchy:
Priority 1: Confirmed IPI receiver
If the spinning vCPU recently sent an IPI to another vCPU and that IPI
is still pending (within the recency window), unconditionally boost the
receiver. This directly addresses the "spinning on IPI response" case.
Priority 2: Fast pending interrupt
Leverage arch-specific kvm_arch_dy_has_pending_interrupt() for
compatibility with existing optimizations.
Priority 3: Preempted in kernel mode
Fall back to traditional preemption-based logic when yield_to_kernel_mode
is requested, ensuring compatibility with existing workloads.
A two-round fallback mechanism provides a safety net: if the first round
with strict IPI-aware selection finds no eligible candidate (e.g., due to
missed IPI context or transient runnable set changes), a second round
applies relaxed selection gated only by preemption state. This is
controlled by the enable_relaxed_boost module parameter (default on).
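A condensed sketch of how the priority order and the relaxed second round could look inside the candidate loop of kvm_vcpu_on_spin(); the helper name and the exact gating are again assumptions for illustration, reusing the hypothetical ipi_ctx() accessor and ipi_window_ns from the earlier sketch:

static bool dy_boost_candidate(struct kvm_vcpu *me, struct kvm_vcpu *vcpu,
			       bool yield_to_kernel_mode, bool relaxed_round)
{
	struct kvm_vcpu_ipi_context *ctx = ipi_ctx(me);

	/* Priority 1: vcpu is the still-recent receiver of an IPI we sent. */
	if (READ_ONCE(ctx->pending_receiver) == vcpu->vcpu_idx &&
	    ktime_get_mono_fast_ns() - READ_ONCE(ctx->sent_ns) < ipi_window_ns)
		return true;

	/* Priority 2: arch-reported fast pending interrupt. */
	if (kvm_arch_dy_has_pending_interrupt(vcpu))
		return true;

	/* Priority 3: traditional heuristic - preempted (in kernel mode if asked). */
	if (READ_ONCE(vcpu->preempted) &&
	    (!yield_to_kernel_mode || kvm_arch_vcpu_preempted_in_kernel(vcpu)))
		return true;

	/*
	 * Second round (enable_relaxed_boost): if the strict pass found no
	 * eligible candidate, fall back to preemption state alone.
	 */
	if (relaxed_round && READ_ONCE(vcpu->preempted))
		return true;

	return false;
}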
Implementation Details
----------------------
Both mechanisms are designed for minimal overhead and runtime control:
- All locking occurs under existing rq->lock or per-vCPU locks; no new
lock contention is introduced.
- Penalty calculations use integer arithmetic with overflow protection.
- IPI tracking uses monotonic timestamps (ktime_get_mono_fast_ns()) for
efficient, race-free recency checks.
Advantages over paravirtualization approaches:
- No guest OS modification required: This solution operates entirely within
the host kernel, providing transparent optimization without guest kernel
changes or recompilation.
- Guest OS agnostic: Works uniformly across Linux, Windows, and other guest
operating systems, unlike PV TLB shootdown which requires guest-side
paravirtual driver support.
- Broader applicability: Captures IPI patterns from all synchronization
primitives (spinlocks, RCU, smp_call_function, etc.), not limited to
specific paravirtualized operations like TLB shootdown.
- Deployment simplicity: Existing VM images benefit immediately without
guest kernel updates, critical for production environments with diverse
guest OS versions and configurations.
- Runtime controls allow disabling features if needed:
* /sys/kernel/debug/sched/sched_vcpu_debooster_enabled
* /sys/module/kvm/parameters/ipi_tracking_enabled
* /sys/module/kvm/parameters/enable_relaxed_boost
- The infrastructure is incrementally introduced: early patches add inert
scaffolding that can be verified for zero performance impact before
activation.
Performance Results
-------------------
Test environment: Intel Xeon, 16 physical cores, 16 vCPUs per VM
Dbench 16 clients per VM (filesystem metadata operations):
2 VMs: +14.4% throughput (lock contention reduction)
3 VMs: +9.8% throughput
4 VMs: +6.7% throughput
PARSEC Dedup benchmark, simlarge input (memory-intensive):
2 VMs: +47.1% throughput (IPI-heavy synchronization)
3 VMs: +28.1% throughput
4 VMs: +1.7% throughput
PARSEC VIPS benchmark, simlarge input (compute-intensive):
2 VMs: +26.2% throughput (balanced sync and compute)
3 VMs: +12.7% throughput
4 VMs: +6.0% throughput
Analysis:
- Gains are most pronounced at moderate overcommit (2-3 VMs). At this level,
contention is significant enough to benefit from better yield behavior,
but context switch overhead remains manageable.
- Dedup shows the strongest improvement (+47.1% at 2 VMs) due to its
IPI-heavy synchronization patterns. The IPI-aware directed yield
precisely targets the bottleneck.
- At 4 VMs (heavier overcommit), gains diminish as general CPU contention
dominates. However, performance never regresses, indicating the mechanisms
gracefully degrade.
- In certain high-density, resource-overcommitted deployments, the performance
benefit of APICv can be constrained by scheduling and contention patterns. In
such cases, software-based IPI tracking serves as a complementary optimization
path, offering targeted scheduling hints without requiring APICv to be
disabled. The practical choice should be weighed against workload
characteristics and platform configuration.
- Dbench benefits primarily from the scheduler-side debooster, as its lock
patterns involve less IPI spinning and more direct lock holder boosting.
The performance gains stem from three factors:
1. Lock holders receive sustained CPU time to complete critical sections,
reducing overall lock hold duration and cascading contention.
2. IPI receivers are promptly scheduled when senders spin, minimizing IPI
response latency and reducing wasted spin cycles.
3. Better cache utilization results from reduced context switching between
lock waiters and holders.
Patch Organization
------------------
The series is organized for incremental review and bisectability:
Patches 1-5: Scheduler vCPU debooster
Patch 1: Add infrastructure (per-rq tracking, sysctl, debugfs entry)
Infrastructure is inert; no functional change.
Patch 2: Add rate-limiting and validation helpers
Static functions with comprehensive safety checks.
Patch 3: Add cgroup LCA finder for hierarchical yield
Implements CONFIG_FAIR_GROUP_SCHED-aware LCA location.
Patch 4: Add penalty calculation and application logic
Core algorithms with queue-size adaptation and debouncing.
Patch 5: Wire up yield deboost in yield_to_task_fair()
Activation patch. Includes Dbench performance data.
Patches 6-10: KVM IPI-aware directed yield
Patch 6: Fix last_boosted_vcpu index assignment bug
Standalone bugfix for existing code.
Patch 7: Add IPI tracking infrastructure
Per-vCPU context, module parameters, helper functions.
Infrastructure is inert until activated.
Patch 8: Integrate IPI tracking with LAPIC interrupt delivery
Hook into kvm_irq_delivery_to_apic() and EOI handling.
Patch 9: Implement IPI-aware directed yield candidate selection
Replace candidate selection logic with priority-based approach.
Includes PARSEC performance data.
Patch 10: Add relaxed boost as safety net
Two-round fallback mechanism for robustness.
Each patch compiles and boots independently. Performance data is presented
where the relevant mechanism becomes active (patches 5 and 9).
Testing
-------
Workloads tested:
- Dbench (filesystem metadata stress)
- PARSEC benchmarks (Dedup, VIPS, Ferret, Blackscholes)
- Kernel compilation (make -j16 in each VM)
No regressions observed on any configuration. The mechanisms show neutral
to positive impact across diverse workloads.
Future Work
-----------
Potential extensions beyond this series:
- Adaptive recency window: dynamically adjust ipi_window_ns based on
observed workload patterns.
- Extended tracking: consider multi-round IPI patterns (A→B→C→A).
- Cross-NUMA awareness: penalty scaling based on NUMA distances.
These are intentionally deferred to keep this series focused and reviewable.
Wanpeng Li (10):
sched: Add vCPU debooster infrastructure
sched/fair: Add rate-limiting and validation helpers
sched/fair: Add cgroup LCA finder for hierarchical yield
sched/fair: Add penalty calculation and application logic
sched/fair: Wire up yield deboost in yield_to_task_fair()
KVM: Fix last_boosted_vcpu index assignment bug
KVM: x86: Add IPI tracking infrastructure
KVM: x86/lapic: Integrate IPI tracking with interrupt delivery
KVM: Implement IPI-aware directed yield candidate selection
KVM: Relaxed boost as safety net
arch/x86/include/asm/kvm_host.h | 8 +
arch/x86/kvm/lapic.c | 172 +++++++++++++++-
arch/x86/kvm/x86.c | 6 +
arch/x86/kvm/x86.h | 4 +
include/linux/kvm_host.h | 1 +
kernel/sched/core.c | 7 +-
kernel/sched/debug.c | 3 +
kernel/sched/fair.c | 336 ++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 9 +
virt/kvm/kvm_main.c | 81 +++++++-
10 files changed, 611 insertions(+), 16 deletions(-)
--
2.43.0
On 10.11.25 04:32, Wanpeng Li wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
>
> This series addresses long-standing yield_to() inefficiencies in
> virtualized environments through two complementary mechanisms: a vCPU
> debooster in the scheduler and IPI-aware directed yield in KVM.
>
[...]
>
> This creates a ping-pong effect: the lock holder runs briefly, gets
> preempted before completing critical sections, and the yielding vCPU
> spins again, triggering another futile yield_to() cycle. The overhead
> accumulates rapidly in workloads with high lock contention.

I can certainly confirm that on s390 we do see that yield_to does not
always work as expected. Our spinlock code is lock holder aware, so our
KVM always yields correctly, but often enough the hint is ignored or
bounced back as you describe. So I am certainly interested in that part.

I need to look more closely into the other part.
Hi Christian,

On Mon, 10 Nov 2025 at 20:02, Christian Borntraeger
<borntraeger@linux.ibm.com> wrote:
>
> On 10.11.25 04:32, Wanpeng Li wrote:
> > From: Wanpeng Li <wanpengli@tencent.com>
> >
> > This series addresses long-standing yield_to() inefficiencies in
> > virtualized environments through two complementary mechanisms: a vCPU
> > debooster in the scheduler and IPI-aware directed yield in KVM.
> >
[...]
>
> I can certainly confirm that on s390 we do see that yield_to does not
> always work as expected. Our spinlock code is lock holder aware, so our
> KVM always yields correctly, but often enough the hint is ignored or
> bounced back as you describe. So I am certainly interested in that part.
>
> I need to look more closely into the other part.

Thanks for the confirmation and interest! It's valuable to hear that
s390 observes similar yield_to() behavior where the hint gets ignored
or bounced back despite correct lock holder identification.

Since your spinlock code is already lock-holder-aware and KVM yields
to the correct target, the scheduler-side improvements (patches 1-5)
should directly address the ping-pong issue you're seeing. The
vruntime penalties are designed to sustain the preference beyond the
transient buddy hint, which should reduce the bouncing effect.

Best regards,
Wanpeng
On 12.11.25 06:01, Wanpeng Li wrote:
> Hi Christian,
>
[...]
>
> Since your spinlock code is already lock-holder-aware and KVM yields
> to the correct target, the scheduler-side improvements (patches 1-5)
> should directly address the ping-pong issue you're seeing. The
> vruntime penalties are designed to sustain the preference beyond the
> transient buddy hint, which should reduce the bouncing effect.

So we will play a bit with the first patches and check for performance
improvements.

I am curious: I did a quick unit test with 2 CPUs ping-ponging on a
counter, and I do see "more than count" numbers of yield hypercalls with
that test case (as before) - something like 40060000 yields instead of
4000000 for a perfect ping-pong. If I comment out your rate limit code I
hit exactly the 4000000.
Can you maybe outline a bit why the rate limit is important and needed?
Hi Christian,

On Tue, 18 Nov 2025 at 16:12, Christian Borntraeger
<borntraeger@linux.ibm.com> wrote:
>
[...]
>
> So we will play a bit with the first patches and check for performance
> improvements.
>
> I am curious: I did a quick unit test with 2 CPUs ping-ponging on a
> counter, and I do see "more than count" numbers of yield hypercalls with
> that test case (as before) - something like 40060000 yields instead of
> 4000000 for a perfect ping-pong. If I comment out your rate limit code I
> hit exactly the 4000000.
> Can you maybe outline a bit why the rate limit is important and needed?

Good catch! The 10× inflation is actually expected behavior. The key
insight is that the rate limit filters penalty applications, not yield
hypercalls. In your ping-pong test with 4M counter increments, PLE
hardware fires multiple times per lock acquisition (roughly 10 times
based on your numbers), and each exit triggers kvm_vcpu_on_spin().

Without the rate limit, every yield immediately applies a vruntime
penalty. In a tight ping-pong this causes over-penalization: the yielding
vCPU becomes so deprioritized that it effectively starves, which
paradoxically neutralizes the debooster effect. You see "exactly 4M" not
because it is working optimally, but because excessive penalties create a
pathological equilibrium where subsequent yields are suppressed by
starvation.

With a 6ms rate limit, all 40M hypercalls still occur (PLE still fires),
but only the first yield in each burst applies a penalty while subsequent
ones are filtered. This gives roughly 4M penalties (one per actual lock
acquisition) instead of 40M, providing sustained advantage without
over-penalization.

The 6ms threshold was empirically tuned as roughly 2× a typical timeslice
to filter intra-lock PLE bursts while preserving responsiveness to
legitimate contention. Your test validates the design by showing that the
rate limit prevents penalty amplification even in the tightest ping-pong
scenario.

I'll post v2 after the merge window with code comments addressing this
and other review feedback, which should be more suitable for performance
evaluation.

Wanpeng
Hello Wanpeng,
I haven't looked at the entire series and the penalty calculation math
but I've a few questions looking at the cover-letter.
On 11/10/2025 9:02 AM, Wanpeng Li wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
>
> This series addresses long-standing yield_to() inefficiencies in
> virtualized environments through two complementary mechanisms: a vCPU
> debooster in the scheduler and IPI-aware directed yield in KVM.
>
> Problem Statement
> -----------------
>
> In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> held by other vCPUs that are not currently running. The kernel's
> paravirtual spinlock support detects these situations and calls yield_to()
> to boost the lock holder, allowing it to run and release the lock.
>
> However, the current implementation has two critical limitations:
>
> 1. Scheduler-side limitation:
>
> yield_to_task_fair() relies solely on set_next_buddy() to provide
> preference to the target vCPU. This buddy mechanism only offers
> immediate, transient preference. Once the buddy hint expires (typically
> after one scheduling decision), the yielding vCPU may preempt the target
> again, especially in nested cgroup hierarchies where vruntime domains
> differ.
So what you are saying is there are configurations out there where vCPUs
of the same guest are put in different cgroups? Why? Does the use case
warrant enabling the cpu controller for the subtree? Are you running
with the "NEXT_BUDDY" sched feat enabled?
If they are in the same cgroup, the recent optimizations/fixes to
yield_task_fair() in queue:sched/core should help remedy some of the
problems you might be seeing.
For multiple cgroups, perhaps you can extend yield_task_fair() to do:
( Only build and boot tested on top of
git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
at commit f82a0f91493f "sched/deadline: Minor cleanup in
select_task_rq_dl()" )
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b4617d631549..87560f5a18b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
* which yields immediately again; without the condition the vruntime
* ends up quickly running away.
*/
- if (entity_eligible(cfs_rq, se)) {
+ do {
+ cfs_rq = cfs_rq_of(se);
+
+ /*
+ * Another entity will be selected at next pick.
+ * Single entity on cfs_rq can never be ineligible.
+ */
+ if (!entity_eligible(cfs_rq, se))
+ break;
+
se->vruntime = se->deadline;
se->deadline += calc_delta_fair(se->slice, se);
- }
+
+ /*
+ * If we have more than one runnable task queued below
+ * this cfs_rq, the next pick will likely go for a
+ * different entity now that we have advanced the
+ * vruntime and the deadline of the running entity.
+ */
+ if (cfs_rq->h_nr_runnable > 1)
+ break;
+ } while ((se = parent_entity(se)));
}
static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
---
With that, I'm pretty sure there is a good chance we'll not select the
hierarchy that did a yield_to() unless there is a large discrepancy in
their weights and just advancing se->vruntime to se->deadline once isn't
enough to make it ineligible and you'll have to do it multiple times (at
which point that cgroup hierarchy needs to be studied).
As for the problem that NEXT_BUDDY hint is used only once, you can
perhaps reintroduce LAST_BUDDY which does a set_next_buddy() for
the "prev" task during schedule?
>
> This creates a ping-pong effect: the lock holder runs briefly, gets
> preempted before completing critical sections, and the yielding vCPU
> spins again, triggering another futile yield_to() cycle. The overhead
> accumulates rapidly in workloads with high lock contention.
>
> 2. KVM-side limitation:
>
> kvm_vcpu_on_spin() attempts to identify which vCPU to yield to through
> directed yield candidate selection. However, it lacks awareness of IPI
> communication patterns. When a vCPU sends an IPI and spins waiting for
> a response (common in inter-processor synchronization), the current
> heuristics often fail to identify the IPI receiver as the yield target.
Can't that be solved on the KVM end? Also shouldn't Patch 6 be on top
with a "Fixes:" tag.
>
> Instead, the code may boost an unrelated vCPU based on coarse-grained
> preemption state, missing opportunities to accelerate actual IPI
> response handling. This is particularly problematic when the IPI receiver
> is runnable but not scheduled, as lock-holder-detection logic doesn't
> capture the IPI dependency relationship.
Are you saying the yield_to() is called with an incorrect target vCPU?
>
> Combined, these issues cause excessive lock hold times, cache thrashing,
> and degraded throughput in overcommitted environments, particularly
> affecting workloads with fine-grained synchronization patterns.
>
--
Thanks and Regards,
Prateek
Hi Prateek,
On Tue, 11 Nov 2025 at 14:28, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Wanpeng,
>
> I haven't looked at the entire series and the penalty calculation math
> but I've a few questions looking at the cover-letter.
Thanks for the review and the thoughtful questions.
>
> On 11/10/2025 9:02 AM, Wanpeng Li wrote:
> > From: Wanpeng Li <wanpengli@tencent.com>
> >
> > This series addresses long-standing yield_to() inefficiencies in
> > virtualized environments through two complementary mechanisms: a vCPU
> > debooster in the scheduler and IPI-aware directed yield in KVM.
> >
> > Problem Statement
> > -----------------
> >
> > In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> > held by other vCPUs that are not currently running. The kernel's
> > paravirtual spinlock support detects these situations and calls yield_to()
> > to boost the lock holder, allowing it to run and release the lock.
> >
> > However, the current implementation has two critical limitations:
> >
> > 1. Scheduler-side limitation:
> >
> > yield_to_task_fair() relies solely on set_next_buddy() to provide
> > preference to the target vCPU. This buddy mechanism only offers
> > immediate, transient preference. Once the buddy hint expires (typically
> > after one scheduling decision), the yielding vCPU may preempt the target
> > again, especially in nested cgroup hierarchies where vruntime domains
> > differ.
>
> So what you are saying is there are configurations out there where vCPUs
> of same guest are put in different cgroups? Why? Does the use case
> warrant enabling the cpu controller for the subtree? Are you running
You're right to question this. The problematic scenario occurs with
nested cgroup hierarchies, which is common when VMs are deployed with
cgroup-based resource management. Even when all vCPUs of a single
guest are in the same leaf cgroup, that leaf sits under parent cgroups
with their own vruntime domains.
The issue manifests when:
- set_next_buddy() provides preference at the leaf level
- But vruntime competition happens at parent levels
- The buddy hint gets "diluted" when pick_task_fair() walks up the hierarchy
The cpu controller is typically enabled in these deployments for quota
enforcement and weight-based sharing. That said, the debooster
mechanism is designed to be general-purpose: it handles any scenario
where yield_to() crosses cgroup boundaries, whether due to nested
hierarchies or sibling cgroups.
> with the "NEXT_BUDDY" sched feat enabled?
Yes, NEXT_BUDDY is enabled. The problem is that set_next_buddy()
provides only immediate, transient preference. Once the buddy hint is
consumed (typically after one pick_next_task_fair() call), the
yielding vCPU can preempt the target again if their vruntime values
haven't diverged sufficiently.
>
> If they are in the same cgroup, the recent optimizations/fixes to
> yield_task_fair() in queue:sched/core should help remedy some of the
> problems you might be seeing.
Agreed - the recent yield_task_fair() improvements in queue:sched/core
(EEVDF-based vruntime = deadline with hierarchical walk) are valuable.
However, our patchset focuses on yield_to() rather than yield(), which
has different semantics:
- yield_task_fair(): "I voluntarily give up CPU, pick someone else"
→ Recent improvements handle this well with hierarchical walk
- yield_to_task_fair(): "I want *this specific task* to run
instead" → Requires finding the LCA of yielder and target, then
applying penalties at that level to influence their relative
competition
The debooster extends yield_to() to handle cross-cgroup scenarios
where the yielder and target may be in different subtrees.
>
> For multiple cgroups, perhaps you can extend yield_task_fair() to do:
Thanks for the suggestion. Your hierarchical walk approach shares
similarities with our implementation. A few questions on the details:
>
> ( Only build and boot tested on top of
> git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
> at commit f82a0f91493f "sched/deadline: Minor cleanup in
> select_task_rq_dl()" )
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b4617d631549..87560f5a18b3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
> * which yields immediately again; without the condition the vruntime
> * ends up quickly running away.
> */
> - if (entity_eligible(cfs_rq, se)) {
> + do {
> + cfs_rq = cfs_rq_of(se);
> +
> + /*
> + * Another entity will be selected at next pick.
> + * Single entity on cfs_rq can never be ineligible.
> + */
> + if (!entity_eligible(cfs_rq, se))
> + break;
> +
> se->vruntime = se->deadline;
Setting vruntime = deadline zeros out lag. Does this cause fairness
drift with repeated yields? We explicitly recalculate vlag after
adjustment to preserve EEVDF invariants.
> se->deadline += calc_delta_fair(se->slice, se);
> - }
> +
> + /*
> + * If we have more than one runnable task queued below
> + * this cfs_rq, the next pick will likely go for a
> + * different entity now that we have advanced the
> + * vruntime and the deadline of the running entity.
> + */
> + if (cfs_rq->h_nr_runnable > 1)
Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
correctly. Shouldn't the penalty apply at the LCA of yielder and
target? Otherwise the vruntime adjustment might not affect the level
where they actually compete.
> + break;
> + } while ((se = parent_entity(se)));
> }
>
> static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
> ---
Fixed one-slice penalties underperformed in our testing (dbench:
+14.4%/+9.8%/+6.7% for 2/3/4 VMs). We found adaptive scaling (6.0×
down to 1.0× based on queue size) necessary to balance effectiveness
against starvation.
>
> With that, I'm pretty sure there is a good chance we'll not select the
> hierarchy that did a yield_to() unless there is a large discrepancy in
> their weights and just advancing se->vruntime to se->deadline once isn't
> enough to make it ineligible and you'll have to do it multiple time (at
> which point that cgroup hierarchy needs to be studied).
>
> As for the problem that NEXT_BUDDY hint is used only once, you can
> perhaps reintroduce LAST_BUDDY which sets does a set_next_buddy() for
> the "prev" task during schedule?
That's an interesting idea. However, LAST_BUDDY was removed from the
scheduler due to concerns about fairness and latency regressions in
general workloads. Reintroducing it globally might regress non-vCPU
workloads.
Our approach is more targeted: apply vruntime penalties specifically
in the yield_to() path (controlled by debugfs flag), avoiding impact
on general scheduling. The debooster is inert unless explicitly
enabled and rate-limited to prevent pathological overhead.
>
> >
> > This creates a ping-pong effect: the lock holder runs briefly, gets
> > preempted before completing critical sections, and the yielding vCPU
> > spins again, triggering another futile yield_to() cycle. The overhead
> > accumulates rapidly in workloads with high lock contention.
> >
> > 2. KVM-side limitation:
> >
> > kvm_vcpu_on_spin() attempts to identify which vCPU to yield to through
> > directed yield candidate selection. However, it lacks awareness of IPI
> > communication patterns. When a vCPU sends an IPI and spins waiting for
> > a response (common in inter-processor synchronization), the current
> > heuristics often fail to identify the IPI receiver as the yield target.
>
> Can't that be solved on the KVM end?
Yes, the IPI tracking is entirely KVM-side (patches 6-10). The
scheduler-side debooster (patches 1-5) and KVM-side IPI tracking are
orthogonal mechanisms:
- Debooster: sustains yield_to() preference regardless of *who* is
yielding to whom
- IPI tracking: improves *which* target is selected when a vCPU spins
Both showed independent gains in our testing, and combined effects
were approximately additive.
> Also shouldn't Patch 6 be on top with a "Fixes:" tag.
You're right. Patch 6 (last_boosted_vcpu bug fix) is a standalone
bugfix and should be at the top with a Fixes tag. I'll reorder it in
v2 with:
Fixes: 7e513617da71 ("KVM: Rework core loop of kvm_vcpu_on_spin() to
use a single for-loop")
>
> >
> > Instead, the code may boost an unrelated vCPU based on coarse-grained
> > preemption state, missing opportunities to accelerate actual IPI
> > response handling. This is particularly problematic when the IPI receiver
> > is runnable but not scheduled, as lock-holder-detection logic doesn't
> > capture the IPI dependency relationship.
>
> Are you saying the yield_to() is called with an incorrect target vCPU?
Yes - more precisely, the issue is in kvm_vcpu_on_spin()'s target
selection logic before yield_to() is called. Without IPI tracking, it
relies on preemption state, which doesn't capture "vCPU waiting for
IPI response from specific other vCPU."
The IPI tracking records sender→receiver relationships at interrupt
delivery time (patch 8), enabling kvm_vcpu_on_spin() to directly boost
the IPI receiver when the sender spins (patch 9). This addresses
scenarios where the spinning vCPU is waiting for IPI acknowledgment
rather than lock release.
Performance (16 pCPU host, 16 vCPUs/VM, PARSEC workloads):
- Dedup: +47.1%/+28.1%/+1.7% for 2/3/4 VMs
- VIPS: +26.2%/+12.7%/+6.0% for 2/3/4 VMs
Gains are most pronounced at moderate overcommit where the IPI
receiver is often runnable but not scheduled.
Thanks again for the review and suggestions.
Best regards,
Wanpeng
Hello Wanpeng,
On 11/12/2025 10:24 AM, Wanpeng Li wrote:
>>
>> ( Only build and boot tested on top of
>> git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
>> at commit f82a0f91493f "sched/deadline: Minor cleanup in
>> select_task_rq_dl()" )
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b4617d631549..87560f5a18b3 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
>> * which yields immediately again; without the condition the vruntime
>> * ends up quickly running away.
>> */
>> - if (entity_eligible(cfs_rq, se)) {
>> + do {
>> + cfs_rq = cfs_rq_of(se);
>> +
>> + /*
>> + * Another entity will be selected at next pick.
>> + * Single entity on cfs_rq can never be ineligible.
>> + */
>> + if (!entity_eligible(cfs_rq, se))
>> + break;
>> +
>> se->vruntime = se->deadline;
>
> Setting vruntime = deadline zeros out lag. Does this cause fairness
> drift with repeated yields? We explicitly recalculate vlag after
> adjustment to preserve EEVDF invariants.
We only push deadline when the entity is eligible. Ineligible entity
will break out above. Also I don't get how adding a penalty to an
entity in the cgroup hierarchy of the yielding task, when there are
other runnable tasks, is considered "preserv(ing) EEVDF invariants".
>
>> se->deadline += calc_delta_fair(se->slice, se);
>> - }
>> +
>> + /*
>> + * If we have more than one runnable task queued below
>> + * this cfs_rq, the next pick will likely go for a
>> + * different entity now that we have advanced the
>> + * vruntime and the deadline of the running entity.
>> + */
>> + if (cfs_rq->h_nr_runnable > 1)
>
> Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
> correctly. Shouldn't the penalty apply at the LCA of yielder and
> target? Otherwise the vruntime adjustment might not affect the level
> where they actually compete.
So here is the case I'm going after - consider the following
hierarchy:
        root
       /    \
     CG0    CG1
      |      |
      A      B

CG* are cgroups and [A-Z]* are tasks
A decides to yield to B, and advances its deadline on CG0's timeline.
Currently, if CG0 is eligible and CG1 isn't, pick will still select
CG0 which will in turn select task A and it'll yield again. This
cycle repeats until the vruntime of CG0 turns large enough to make itself
ineligible and route the EEVDF pick to CG1.
Now consider:
        root
       /    \
     CG0    CG1
     / \     |
    A   C    B
Same scenario: A yields to B. A advances its vruntime and deadline
as part of the yield. Now, why should CG0 sacrifice its fair share of
runtime for A when task B is runnable? Just because one task decided
to yield to another task in a different cgroup doesn't mean other
waiting tasks on that hierarchy should suffer.
>
>> + break;
>> + } while ((se = parent_entity(se)));
>> }
>>
>> static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
>> ---
>
> Fixed one-slice penalties underperformed in our testing (dbench:
> +14.4%/+9.8%/+6.7% for 2/3/4 VMs). We found adaptive scaling (6.0×
> down to 1.0× based on queue size) necessary to balance effectiveness
> against starvation.
If all vCPUs of a VM are in the same cgroup - yield_to() should work
just fine. If this "target" task is not selected, then either some
entity in the hierarchy, or the task itself, is ineligible and the EEVDF
pick has decided to go with something else.
It is not "starvation" but rather you've received your fair share
of "proportional runtime" and now you wait. If you really want to
follow EEVDF maybe you compute the vlag and if it is behind the
avg_vruntime, you account it to the "target" task - that would be
in the spirit of the EEVDF algorithm.
>
>>
>> With that, I'm pretty sure there is a good chance we'll not select the
>> hierarchy that did a yield_to() unless there is a large discrepancy in
>> their weights and just advancing se->vruntime to se->deadline once isn't
>> enough to make it ineligible and you'll have to do it multiple time (at
>> which point that cgroup hierarchy needs to be studied).
>>
>> As for the problem that NEXT_BUDDY hint is used only once, you can
>> perhaps reintroduce LAST_BUDDY which sets does a set_next_buddy() for
>> the "prev" task during schedule?
>
> That's an interesting idea. However, LAST_BUDDY was removed from the
> scheduler due to concerns about fairness and latency regressions in
> general workloads. Reintroducing it globally might regress non-vCPU
> workloads.
>
> Our approach is more targeted: apply vruntime penalties specifically
> in the yield_to() path (controlled by debugfs flag), avoiding impact
> on general scheduling. The debooster is inert unless explicitly
> enabled and rate-limited to prevent pathological overhead.
Yeah, I'm still not on board with the idea but maybe I don't see the
vision. Hope other scheduler folks can chime in.
>
>>
>>>
>>> This creates a ping-pong effect: the lock holder runs briefly, gets
>>> preempted before completing critical sections, and the yielding vCPU
>>> spins again, triggering another futile yield_to() cycle. The overhead
>>> accumulates rapidly in workloads with high lock contention.
>>>
>>> 2. KVM-side limitation:
>>>
>>> kvm_vcpu_on_spin() attempts to identify which vCPU to yield to through
>>> directed yield candidate selection. However, it lacks awareness of IPI
>>> communication patterns. When a vCPU sends an IPI and spins waiting for
>>> a response (common in inter-processor synchronization), the current
>>> heuristics often fail to identify the IPI receiver as the yield target.
>>
>> Can't that be solved on the KVM end?
>
> Yes, the IPI tracking is entirely KVM-side (patches 6-10). The
> scheduler-side debooster (patches 1-5) and KVM-side IPI tracking are
> orthogonal mechanisms:
> - Debooster: sustains yield_to() preference regardless of *who* is
> yielding to whom
> - IPI tracking: improves *which* target is selected when a vCPU spins
>
> Both showed independent gains in our testing, and combined effects
> were approximately additive.
I'll try to look at the KVM bits but I'm not familiar enough with
those bits enough to review it well :)
>
>> Also shouldn't Patch 6 be on top with a "Fixes:" tag.
>
> You're right. Patch 6 (last_boosted_vcpu bug fix) is a standalone
> bugfix and should be at the top with a Fixes tag. I'll reorder it in
> v2 with:
> Fixes: 7e513617da71 ("KVM: Rework core loop of kvm_vcpu_on_spin() to
> use a single for-loop")
Thank you.
>
>>
>>>
>>> Instead, the code may boost an unrelated vCPU based on coarse-grained
>>> preemption state, missing opportunities to accelerate actual IPI
>>> response handling. This is particularly problematic when the IPI receiver
>>> is runnable but not scheduled, as lock-holder-detection logic doesn't
>>> capture the IPI dependency relationship.
>>
>> Are you saying the yield_to() is called with an incorrect target vCPU?
>
> Yes - more precisely, the issue is in kvm_vcpu_on_spin()'s target
> selection logic before yield_to() is called. Without IPI tracking, it
> relies on preemption state, which doesn't capture "vCPU waiting for
> IPI response from specific other vCPU."
>
> The IPI tracking records sender→receiver relationships at interrupt
> delivery time (patch 8), enabling kvm_vcpu_on_spin() to directly boost
> the IPI receiver when the sender spins (patch 9). This addresses
> scenarios where the spinning vCPU is waiting for IPI acknowledgment
> rather than lock release.
>
> Performance (16 pCPU host, 16 vCPUs/VM, PARSEC workloads):
> - Dedup: +47.1%/+28.1%/+1.7% for 2/3/4 VMs
> - VIPS: +26.2%/+12.7%/+6.0% for 2/3/4 VMs
>
> Gains are most pronounced at moderate overcommit where the IPI
> receiver is often runnable but not scheduled.
>
> Thanks again for the review and suggestions.
>
> Best regards,
> Wanpeng
--
Thanks and Regards,
Prateek
Hi Prateek,
On Thu, 13 Nov 2025 at 12:42, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Wanpeng,
>
> On 11/12/2025 10:24 AM, Wanpeng Li wrote:
> >>
> >> ( Only build and boot tested on top of
> >> git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
> >> at commit f82a0f91493f "sched/deadline: Minor cleanup in
> >> select_task_rq_dl()" )
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index b4617d631549..87560f5a18b3 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
> >> * which yields immediately again; without the condition the vruntime
> >> * ends up quickly running away.
> >> */
> >> - if (entity_eligible(cfs_rq, se)) {
> >> + do {
> >> + cfs_rq = cfs_rq_of(se);
> >> +
> >> + /*
> >> + * Another entity will be selected at next pick.
> >> + * Single entity on cfs_rq can never be ineligible.
> >> + */
> >> + if (!entity_eligible(cfs_rq, se))
> >> + break;
> >> +
> >> se->vruntime = se->deadline;
> >
> > Setting vruntime = deadline zeros out lag. Does this cause fairness
> > drift with repeated yields? We explicitly recalculate vlag after
> > adjustment to preserve EEVDF invariants.
>
> We only push deadline when the entity is eligible. Ineligible entity
> will break out above. Also I don't get how adding a penalty to an
> entity in the cgroup hierarchy of the yielding task when there are
> other runnable tasks considered as "preserve(ing) EEVDF invariants".
Our penalty preserves EEVDF invariants by recalculating all scheduler state:
se->vruntime = new_vruntime;
se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
update_min_vruntime(cfs_rq); // maintains cfs_rq consistency
This is the same update pattern used in update_curr(). The EEVDF
relationship lag = (V - v) * w remains valid—vlag becomes more
negative as vruntime increases. The presence of other runnable tasks
doesn't affect the mathematical correctness; each entity's lag is
computed independently relative to avg_vruntime.
>
> >
> >> se->deadline += calc_delta_fair(se->slice, se);
> >> - }
> >> +
> >> + /*
> >> + * If we have more than one runnable task queued below
> >> + * this cfs_rq, the next pick will likely go for a
> >> + * different entity now that we have advanced the
> >> + * vruntime and the deadline of the running entity.
> >> + */
> >> + if (cfs_rq->h_nr_runnable > 1)
> >
> > Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
> > correctly. Shouldn't the penalty apply at the LCA of yielder and
> > target? Otherwise the vruntime adjustment might not affect the level
> > where they actually compete.
>
> So here is the case I'm going after - consider the following
> hierarchy:
>
>         root
>        /    \
>      CG0    CG1
>       |      |
>       A      B
>
> CG* are cgroups and, [A-Z]* are tasks
>
> A decides to yield to B, and advances its deadline on CG0's timeline.
> Currently, if CG0 is eligible and CG1 isn't, pick will still select
> CG0 which will in-turn select task A and it'll yield again. This
> cycle repeates until vruntime of CG0 turns large enough to make itself
> ineligible and route the EEVDF pick to CG1.
Yes, natural convergence works, but requires multiple cycles. Your
h_nr_runnable > 1 stops propagation when another entity might be
picked, but "might" depends on vruntime ordering which needs time to
develop. Our penalty forces immediate ineligibility at the LCA. One
penalty application vs N natural yield cycles.
>
> Now consider:
>
>
>         root
>        /    \
>      CG0    CG1
>      / \     |
>     A   C    B
>
> Same scenario: A yields to B. A advances its vruntime and deadline
> as a prt of yield. Now, why should CG0 sacrifice its fair share of
> runtime for A when task B is runnable? Just because one task decided
> to yield to another task in a different cgroup doesn't mean other
> waiting tasks on that hierarchy suffer.
You're right that C suffers unfairly if it's independent work. This is
a known tradeoff. The rationale: when A spins on B's lock, we apply
the penalty at the LCA (root in your example) because that's where A
and B compete. This ensures B gets scheduled. The side effect is C
loses CPU time even though it's not involved in the dependency. In
practice: VMs typically put all vCPUs in one cgroup—no independent C
exists. If C exists and is affected by the same lock, the penalty
helps overall progress. If C is truly independent, it loses one
scheduling slice worth of time.
>
> >
> >> + break;
> >> + } while ((se = parent_entity(se)));
> >> }
> >>
> >> static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
> >> ---
> >
> > Fixed one-slice penalties underperformed in our testing (dbench:
> > +14.4%/+9.8%/+6.7% for 2/3/4 VMs). We found adaptive scaling (6.0×
> > down to 1.0× based on queue size) necessary to balance effectiveness
> > against starvation.
>
> If all vCPUs of a VM are in the same cgroup - yield_to() should work
> just fine. If this "target" task is not selected then either some
> entity in the hierarchy, or the task is ineligible and EEVDF pick has
> decided to go with something else.
>
> It is not "starvation" but rather you've received you for fair share
> of "proportional runtime" and now you wait. If you really want to
> follow EEVDF maybe you compute the vlag and if it is behind the
> avg_vruntime, you account it to the "target" task - that would be
> in the spirit of the EEVDF algorithm.
You're right about the terminology—it's priority inversion, not
starvation. On crediting the target: this is philosophically
interesting but has practical issues. 1) Only helps if the target's
vlag < 0 (already lagging). If the lock holder is ahead (vlag > 0), no
effect. 2) Doesn't prevent the yielder from being re-picked at the LCA
if it's still most eligible. Accounting-wise: the spinner consumes
real CPU cycles. Our penalty charges that consumption. Crediting the
target gives service it didn't receive—arguably less consistent with
proportional fairness.
Regards,
Wanpeng
Hello Wanpeng,
On 11/13/2025 2:03 PM, Wanpeng Li wrote:
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index b4617d631549..87560f5a18b3 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
>>>> * which yields immediately again; without the condition the vruntime
>>>> * ends up quickly running away.
>>>> */
>>>> - if (entity_eligible(cfs_rq, se)) {
>>>> + do {
>>>> + cfs_rq = cfs_rq_of(se);
>>>> +
>>>> + /*
>>>> + * Another entity will be selected at next pick.
>>>> + * Single entity on cfs_rq can never be ineligible.
>>>> + */
>>>> + if (!entity_eligible(cfs_rq, se))
>>>> + break;
>>>> +
>>>> se->vruntime = se->deadline;
>>>
>>> Setting vruntime = deadline zeros out lag. Does this cause fairness
>>> drift with repeated yields? We explicitly recalculate vlag after
>>> adjustment to preserve EEVDF invariants.
>>
>> We only push deadline when the entity is eligible. Ineligible entity
>> will break out above. Also I don't get how adding a penalty to an
>> entity in the cgroup hierarchy of the yielding task when there are
>> other runnable tasks considered as "preserve(ing) EEVDF invariants".
>
> Our penalty preserves EEVDF invariants by recalculating all scheduler state:
> se->vruntime = new_vruntime;
> se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
> se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
> update_min_vruntime(cfs_rq); // maintains cfs_rq consistency
So your exact implementation in yield_deboost_apply_penalty() is:
> + new_vruntime = se_y_lca->vruntime + penalty;
> +
> + /* Validity check */
> + if (new_vruntime <= se_y_lca->vruntime)
> + return;
> +
> + se_y_lca->vruntime = new_vruntime;
You've updated this vruntime to something that you've seen fit based on
your performance data - better performance is not necessarily fair.
update_curr() uses:
/* Time elapsed. */
delta_exec = now - se->exec_start;
se->exec_start = now;
curr->vruntime += calc_delta_fair(delta_exec, curr);
"delta_exec" is based on the amount of time entity has run as opposed
to the penalty calculation which simply advances the vruntime by half a
slice because someone in the hierarchy decided to yield.
Also assume the vCPU yielding and the target is on the same cgroup -
you'll advance the vruntime of task in yield_deboost_apply_penalty() and
then again in yield_task_fair()?
> + se_y_lca->deadline = se_y_lca->vruntime + calc_delta_fair(se_y_lca->slice, se_y_lca);
> + se_y_lca->vlag = avg_vruntime(cfs_rq_common) - se_y_lca->vruntime;
There is no point in setting vlag for a running entity
> + update_min_vruntime(cfs_rq_common);
> This is the same update pattern used in update_curr(). The EEVDF
> relationship lag = (V - v) * w remains valid—vlag becomes more
> negative as vruntime increases.
Sure "V" just moves to the new avg_vruntime() to give the 0-lag
point but modifying the vruntime arbitrarily doesn't seem fair to
me.
> The presence of other runnable tasks
> doesn't affect the mathematical correctness; each entity's lag is
> computed independently relative to avg_vruntime.
>
>>
>>>
>>>> se->deadline += calc_delta_fair(se->slice, se);
>>>> - }
>>>> +
>>>> + /*
>>>> + * If we have more than one runnable task queued below
>>>> + * this cfs_rq, the next pick will likely go for a
>>>> + * different entity now that we have advanced the
>>>> + * vruntime and the deadline of the running entity.
>>>> + */
>>>> + if (cfs_rq->h_nr_runnable > 1)
>>>
>>> Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
>>> correctly. Shouldn't the penalty apply at the LCA of yielder and
>>> target? Otherwise the vruntime adjustment might not affect the level
>>> where they actually compete.
>>
>> So here is the case I'm going after - consider the following
>> hierarchy:
>>
>> root
>> / \
>> CG0 CG1
>> | |
>> A B
>>
>> CG* are cgroups and, [A-Z]* are tasks
>>
>> A decides to yield to B, and advances its deadline on CG0's timeline.
>> Currently, if CG0 is eligible and CG1 isn't, pick will still select
>> CG0 which will in-turn select task A and it'll yield again. This
>> cycle repeats until vruntime of CG0 turns large enough to make itself
>> ineligible and route the EEVDF pick to CG1.
>
> Yes, natural convergence works, but requires multiple cycles. Your
> h_nr_runnable > 1 stops propagation when another entity might be
> picked, but "might" depends on vruntime ordering which needs time to
> develop. Our penalty forces immediate ineligibility at the LCA. One
> penalty application vs N natural yield cycles.
>
>>
>> Now consider:
>>
>>
>> root
>> / \
>> CG0 CG1
>> / \ |
>> A C B
>>
>> Same scenario: A yields to B. A advances its vruntime and deadline
>> as a part of yield. Now, why should CG0 sacrifice its fair share of
>> runtime for A when task B is runnable? Just because one task decided
>> to yield to another task in a different cgroup doesn't mean other
>> waiting tasks on that hierarchy suffer.
>
> You're right that C suffers unfairly if it's independent work. This is
> a known tradeoff.
So KVM is only one of the users of yield_to(). This whole debouncer
infrastructure seems to be overcomplicating all this. If anything
is yielding across cgroup boundary - that seems like bad
configuration and if necessary, the previous suggestion does stuff
fairly. I don't mind accounting the lost time in
yield_to_task_fair() and account it to target task but apart from
that, I don't think any of it is "fair".
Again, maybe it is only me, and everyone else who has dealt with
virtualization sees the vision.
--
Thanks and Regards,
Prateek
Hi Prateek,
On Thu, 13 Nov 2025 at 17:48, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Wanpeng,
>
> On 11/13/2025 2:03 PM, Wanpeng Li wrote:
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index b4617d631549..87560f5a18b3 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
> >>>> * which yields immediately again; without the condition the vruntime
> >>>> * ends up quickly running away.
> >>>> */
> >>>> - if (entity_eligible(cfs_rq, se)) {
> >>>> + do {
> >>>> + cfs_rq = cfs_rq_of(se);
> >>>> +
> >>>> + /*
> >>>> + * Another entity will be selected at next pick.
> >>>> + * Single entity on cfs_rq can never be ineligible.
> >>>> + */
> >>>> + if (!entity_eligible(cfs_rq, se))
> >>>> + break;
> >>>> +
> >>>> se->vruntime = se->deadline;
> >>>
> >>> Setting vruntime = deadline zeros out lag. Does this cause fairness
> >>> drift with repeated yields? We explicitly recalculate vlag after
> >>> adjustment to preserve EEVDF invariants.
> >>
> >> We only push deadline when the entity is eligible. Ineligible entity
> >> will break out above. Also I don't get how adding a penalty to an
> >> entity in the cgroup hierarchy of the yielding task when there are
> >> other runnable tasks is considered as "preserve(ing) EEVDF invariants".
> >
> > Our penalty preserves EEVDF invariants by recalculating all scheduler state:
> > se->vruntime = new_vruntime;
> > se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
> > se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
> > update_min_vruntime(cfs_rq); // maintains cfs_rq consistency
>
> So your exact implementation in yield_deboost_apply_penalty() is:
>
> > + new_vruntime = se_y_lca->vruntime + penalty;
> > +
> > + /* Validity check */
> > + if (new_vruntime <= se_y_lca->vruntime)
> > + return;
> > +
> > + se_y_lca->vruntime = new_vruntime;
>
> You've updated this vruntime to something that you've seen fit based on
> your performance data - better performance is not necessarily fair.
>
> update_curr() uses:
>
> /* Time elapsed. */
> delta_exec = now - se->exec_start;
> se->exec_start = now;
>
> curr->vruntime += calc_delta_fair(delta_exec, curr);
>
>
> "delta_exec" is based on the amount of time entity has run as opposed
> to the penalty calculation which simply advances the vruntime by half a
> slice because someone in the hierarchy decided to yield.
CFS already separates time accounting from policy enforcement.
place_entity() modifies vruntime based on lag without time
passage—it's placement policy, not time accounting. Similarly,
yield_task_fair() advances the deadline without consuming time—policy
to trigger reschedule. Our penalty follows this established pattern:
bounded vruntime adjustment to implement yield_to() semantics in
hierarchical scheduling. Time accounting (update_curr) and
scheduling policy (placement, yielding, penalties) are distinct
mechanisms in CFS.
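To make that distinction concrete, here is a throwaway userspace sketch
(not the series' code; every type and helper below is a made-up stand-in
for sched_entity, calc_delta_fair() and friends). It contrasts time
accounting, where vruntime only moves by weighted CPU time actually
consumed, with a policy-style bounded penalty that repositions the
entity without any time passing:

#include <stdint.h>
#include <stdio.h>

struct toy_entity {
	uint64_t vruntime;
	uint64_t deadline;
	uint64_t slice;		/* nominal slice in ns */
	unsigned long weight;	/* load weight, NICE_0 assumed == 1024 */
};

/* Weighted delta, simplified: delta * NICE_0 / weight. */
static uint64_t toy_calc_delta_fair(uint64_t delta, const struct toy_entity *se)
{
	return delta * 1024 / se->weight;
}

/* Time accounting: vruntime advances only by CPU time actually consumed. */
static void toy_update_curr(struct toy_entity *se, uint64_t delta_exec)
{
	se->vruntime += toy_calc_delta_fair(delta_exec, se);
}

/* Policy adjustment: a bounded penalty repositions the entity, no time passes. */
static void toy_apply_yield_penalty(struct toy_entity *se, uint64_t penalty)
{
	se->vruntime += penalty;
	se->deadline = se->vruntime + toy_calc_delta_fair(se->slice, se);
}

int main(void)
{
	struct toy_entity se = { .vruntime = 0, .deadline = 0,
				 .slice = 3000000, .weight = 1024 };

	toy_update_curr(&se, 500000);		/* ran for 0.5 ms */
	toy_apply_yield_penalty(&se, 1500000);	/* yielded: half of a 3 ms slice */
	printf("vruntime=%llu deadline=%llu\n",
	       (unsigned long long)se.vruntime,
	       (unsigned long long)se.deadline);
	return 0;
}

The point is only that both kinds of vruntime movement already coexist
in CFS; the penalty is the second kind, deliberately bounded.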
>
> Also assume the vCPU yielding and the target are on the same cgroup -
> you'll advance the vruntime of task in yield_deboost_apply_penalty() and
> then again in yield_task_fair()?
This is deliberate. When tasks share the same cgroup, they need both
hierarchy-level and leaf-level adjustments.
yield_deboost_apply_penalty() positions the task in cgroup timeline
(affects picking at that level), while yield_task_fair() advances the
deadline (triggers immediate reschedule). Without both, same-cgroup
yield loses effectiveness—the task would be repicked despite yielding.
The double adjustment ensures yield works at both the task level and
across hierarchy levels. This matches CFS's multi-level scheduling
philosophy.
>
>
> > + se_y_lca->deadline = se_y_lca->vruntime + calc_delta_fair(se_y_lca->slice, se_y_lca);
> > + se_y_lca->vlag = avg_vruntime(cfs_rq_common) - se_y_lca->vruntime;
>
> There is no point in setting vlag for a running entity
Maintaining invariants when modifying scheduler state is standard
practice throughout fair.c. reweight_entity() updates vlag for curr
when changing weights to preserve the lag relationship. We follow the
same principle—when artificially advancing vruntime, recalculate vlag
to maintain vlag = V - v. This prevents inconsistency when the entity
later dequeues. It's defensive correctness at negligible cost. The
alternative—leaving vlag stale—risks subtle bugs when scheduler state
assumptions are violated.
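As a tiny illustration of the stale-vlag concern (again a hypothetical
userspace toy rather than the patch itself), the idea is simply that any
artificial vruntime change recomputes the lag against the current 0-lag
point in the same place:

#include <stdint.h>
#include <stdio.h>

struct toy_rq { int64_t avg_vruntime; };	/* the 0-lag point V */
struct toy_se { int64_t vruntime; int64_t vlag; };

/* Any artificial vruntime change recomputes lag = V - v immediately. */
static void toy_set_vruntime(struct toy_rq *rq, struct toy_se *se, int64_t v)
{
	se->vruntime = v;
	se->vlag = rq->avg_vruntime - se->vruntime;
}

int main(void)
{
	struct toy_rq rq = { .avg_vruntime = 5000 };
	struct toy_se se = { .vruntime = 4000, .vlag = 1000 };

	toy_set_vruntime(&rq, &se, 6000);		/* artificial advance */
	printf("vlag=%lld\n", (long long)se.vlag);	/* -1000; the stale 1000 would be wrong */
	return 0;
}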
>
> > + update_min_vruntime(cfs_rq_common);
>
> > This is the same update pattern used in update_curr(). The EEVDF
> > relationship lag = (V - v) * w remains valid—vlag becomes more
> > negative as vruntime increases.
>
> Sure "V" just moves to the new avg_vruntime() to give the 0-lag
> point but modifying the vruntime arbitrarily doesn't seem fair to
> me.
yield_to() API explicitly requests directed unfairness. CFS already
implements unfairness mechanisms: nice values, cgroup weights,
set_next_buddy() immediate preference. Without our mechanism,
yield_to() silently fails across cgroups—buddy hints vanish at
hierarchy boundaries where EEVDF makes independent decisions. We make
the documented API functional. The real question: should yield_to()
work in production environments (nested cgroups)? If yes, vruntime
adjustment is necessary. If not, deprecate the API.
>
> > The presence of other runnable tasks
> > doesn't affect the mathematical correctness; each entity's lag is
> > computed independently relative to avg_vruntime.
> >
> >>
> >>>
> >>>> se->deadline += calc_delta_fair(se->slice, se);
> >>>> - }
> >>>> +
> >>>> + /*
> >>>> + * If we have more than one runnable task queued below
> >>>> + * this cfs_rq, the next pick will likely go for a
> >>>> + * different entity now that we have advanced the
> >>>> + * vruntime and the deadline of the running entity.
> >>>> + */
> >>>> + if (cfs_rq->h_nr_runnable > 1)
> >>>
> >>> Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
> >>> correctly. Shouldn't the penalty apply at the LCA of yielder and
> >>> target? Otherwise the vruntime adjustment might not affect the level
> >>> where they actually compete.
> >>
> >> So here is the case I'm going after - consider the following
> >> hierarchy:
> >>
> >> root
> >> / \
> >> CG0 CG1
> >> | |
> >> A B
> >>
> >> CG* are cgroups and, [A-Z]* are tasks
> >>
> >> A decides to yield to B, and advances its deadline on CG0's timeline.
> >> Currently, if CG0 is eligible and CG1 isn't, pick will still select
> >> CG0 which will in-turn select task A and it'll yield again. This
> >> cycle repeats until vruntime of CG0 turns large enough to make itself
> >> ineligible and route the EEVDF pick to CG1.
> >
> > Yes, natural convergence works, but requires multiple cycles. Your
> > h_nr_runnable > 1 stops propagation when another entity might be
> > picked, but "might" depends on vruntime ordering which needs time to
> > develop. Our penalty forces immediate ineligibility at the LCA. One
> > penalty application vs N natural yield cycles.
> >
> >>
> >> Now consider:
> >>
> >>
> >> root
> >> / \
> >> CG0 CG1
> >> / \ |
> >> A C B
> >>
> >> Same scenario: A yields to B. A advances its vruntime and deadline
> >> as a part of yield. Now, why should CG0 sacrifice its fair share of
> >> runtime for A when task B is runnable? Just because one task decided
> >> to yield to another task in a different cgroup doesn't mean other
> >> waiting tasks on that hierarchy suffer.
> >
> > You're right that C suffers unfairly if it's independent work. This is
> > a known tradeoff.
>
> So KVM is only one of the users of yield_to(). This whole debouncer
> infrastructure seems to be overcomplicating all this. If anything
> is yielding across cgroup boundary - that seems like bad
> configuration and if necessary, the previous suggestion does stuff
> fairly. I don't mind accounting the lost time in
> yield_to_task_fair() and account it to target task but apart from
> that, I don't think any of it is "fair".
Time-transfer fails fundamentally: lock holders often have higher
vruntime (ran more), so crediting them backwards doesn't change EEVDF
pick order. Our penalty pushes yielder back—effective regardless. The
infrastructure addresses real measured problems: rate limiting
prevents overhead, debounce stops ping-pong accumulation, LCA
targeting fixes hierarchy picking. Nested cgroups are production
standard (systemd, containers, cloud)—not misconfiguration.
Performance gains prove yield_to was broken. Open to simplifications,
but they must actually solve the hierarchical scheduling problem.
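For reference, a rough userspace sketch of the LCA walk (hypothetical
types; the real code walks se->parent/cfs_rq via for_each_sched_entity(),
not this toy struct), using the root/CG0/CG1 example from earlier in the
thread:

#include <stdio.h>

/* Hypothetical stand-in for a sched_entity and its hierarchy linkage. */
struct toy_se {
	const char *name;
	struct toy_se *parent;	/* group entity one level up, NULL at the root rq */
	int depth;		/* 0 for entities queued on the root runqueue */
};

/*
 * Walk both hierarchies up until the two entities sit on the same
 * runqueue and return the yielder-side entity at that level: that is
 * the entity whose timeline has to move for the pick to change.
 */
static struct toy_se *toy_yielder_se_at_lca(struct toy_se *y, struct toy_se *t)
{
	while (y->depth > t->depth)
		y = y->parent;
	while (t->depth > y->depth)
		t = t->parent;
	while (y->parent != t->parent) {
		y = y->parent;
		t = t->parent;
	}
	return y;
}

int main(void)
{
	/* root <- {CG0, CG1}, CG0 <- {A}, CG1 <- {B}, as in the example above. */
	struct toy_se cg0 = { "CG0", NULL, 0 }, cg1 = { "CG1", NULL, 0 };
	struct toy_se a = { "A", &cg0, 1 }, b = { "B", &cg1, 1 };

	printf("penalize: %s\n", toy_yielder_se_at_lca(&a, &b)->name);	/* CG0 */
	return 0;
}

For A yielding to B across cgroups, the walk lands on CG0's entity on
the root runqueue, which is the level where A and B actually compete.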
Regards,
Wanpeng
Hello Wanpeng,

On 11/12/2025 10:24 AM, Wanpeng Li wrote:
>>> Problem Statement
>>> -----------------
>>>
>>> In overcommitted virtualization scenarios, vCPUs frequently spin on locks
>>> held by other vCPUs that are not currently running. The kernel's
>>> paravirtual spinlock support detects these situations and calls yield_to()
>>> to boost the lock holder, allowing it to run and release the lock.
>>>
>>> However, the current implementation has two critical limitations:
>>>
>>> 1. Scheduler-side limitation:
>>>
>>> yield_to_task_fair() relies solely on set_next_buddy() to provide
>>> preference to the target vCPU. This buddy mechanism only offers
>>> immediate, transient preference. Once the buddy hint expires (typically
>>> after one scheduling decision), the yielding vCPU may preempt the target
>>> again, especially in nested cgroup hierarchies where vruntime domains
>>> differ.
>>
>> So what you are saying is there are configurations out there where vCPUs
>> of same guest are put in different cgroups? Why? Does the use case
>> warrant enabling the cpu controller for the subtree? Are you running
>
> You're right to question this. The problematic scenario occurs with
> nested cgroup hierarchies, which is common when VMs are deployed with
> cgroup-based resource management. Even when all vCPUs of a single
> guest are in the same leaf cgroup, that leaf sits under parent cgroups
> with their own vruntime domains.
>
> The issue manifests when:
> - set_next_buddy() provides preference at the leaf level
> - But vruntime competition happens at parent levels

If that is the case, then NEXT_BUDDY is ineligible as a result of its
vruntime being higher than the weighted average of other entities.
Won't this break fairness?

Let me go look at the series and come back.

> - The buddy hint gets "diluted" when pick_task_fair() walks up the hierarchy
>
--
Thanks and Regards,
Prateek
Hi Prateek,

On Wed, 12 Nov 2025 at 14:07, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Wanpeng,
>
> On 11/12/2025 10:24 AM, Wanpeng Li wrote:
> >>> Problem Statement
> >>> -----------------
> >>>
> >>> In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> >>> held by other vCPUs that are not currently running. The kernel's
> >>> paravirtual spinlock support detects these situations and calls yield_to()
> >>> to boost the lock holder, allowing it to run and release the lock.
> >>>
> >>> However, the current implementation has two critical limitations:
> >>>
> >>> 1. Scheduler-side limitation:
> >>>
> >>> yield_to_task_fair() relies solely on set_next_buddy() to provide
> >>> preference to the target vCPU. This buddy mechanism only offers
> >>> immediate, transient preference. Once the buddy hint expires (typically
> >>> after one scheduling decision), the yielding vCPU may preempt the target
> >>> again, especially in nested cgroup hierarchies where vruntime domains
> >>> differ.
> >>
> >> So what you are saying is there are configurations out there where vCPUs
> >> of same guest are put in different cgroups? Why? Does the use case
> >> warrant enabling the cpu controller for the subtree? Are you running
> >
> > You're right to question this. The problematic scenario occurs with
> > nested cgroup hierarchies, which is common when VMs are deployed with
> > cgroup-based resource management. Even when all vCPUs of a single
> > guest are in the same leaf cgroup, that leaf sits under parent cgroups
> > with their own vruntime domains.
> >
> > The issue manifests when:
> > - set_next_buddy() provides preference at the leaf level
> > - But vruntime competition happens at parent levels
>
> If that is the case, then NEXT_BUDDY is ineligible as a result of its
> vruntime being higher than the weighted average of other entities.
> Won't this break fairness?

Yes, it does break strict vruntime fairness temporarily. That's
intentional.

The problem: buddy expires after one pick, then vruntime wins →
ping-pong. The spinning vCPU wastes CPU while the lock holder stays
preempted.

The fix applies a bounded vruntime penalty to the yielder at the
cgroup LCA level:

Bounds:
* Rate limited: 6ms minimum interval between deboosting
* Queue-adaptive caps: 6.0× gran for 2-task ping-pong, decays to
  1.0× gran for large queues (12+)
* Debounce: 600µs window detects A→B→A reverse patterns and reduces penalty
* Hierarchy-aware: Applied at LCA, so same-cgroup yields have localized impact

Why acceptable: Current behavior is already unfair—wasting CPU on
spinning instead of productive work. Bounded vruntime penalty lets the
lock holder complete faster, reducing overall waste. The scheduler
still converges to fairness—the penalty just gives the boosted task
sustained advantage until it finishes the critical section.

Runtime toggle available via
/sys/kernel/debug/sched/sched_vcpu_debooster_enabled if degradation
observed. Dbench results show net throughput wins (+6-14%) outweigh
the temporary fairness deviation.

Regards,
Wanpeng
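To make the bounds in the mail above concrete, a hypothetical userspace
sketch follows; the 6ms, 600µs and 6.0×/1.0× numbers are taken from the
mail, while the base granularity value, the linear decay between the two
endpoints, and all names are assumptions rather than the series' actual
code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_GRAN_NS		750000ULL	/* assumed base granularity (~0.75 ms) */
#define TOY_MIN_INTERVAL_NS	6000000ULL	/* 6 ms rate limit from the mail */
#define TOY_DEBOUNCE_NS		600000ULL	/* 600 µs A->B->A debounce window */

/*
 * Cap scales from 6.0x gran at 2 runnable tasks down to 1.0x at 12+;
 * the linear decay in between is an assumption, the mail only gives
 * the two endpoints.
 */
static uint64_t toy_penalty_cap(unsigned int nr_runnable)
{
	unsigned int tenths;	/* cap expressed in tenths of a granularity */

	if (nr_runnable <= 2)
		tenths = 60;
	else if (nr_runnable >= 12)
		tenths = 10;
	else
		tenths = 60 - (nr_runnable - 2) * 50 / 10;

	return TOY_GRAN_NS * tenths / 10;
}

/* A deboost is skipped entirely if one happened less than 6 ms ago. */
static bool toy_rate_limited(uint64_t now_ns, uint64_t last_deboost_ns)
{
	return now_ns - last_deboost_ns < TOY_MIN_INTERVAL_NS;
}

int main(void)
{
	for (unsigned int nr = 2; nr <= 12; nr += 5)
		printf("nr=%u cap=%lluns\n", nr,
		       (unsigned long long)toy_penalty_cap(nr));
	printf("rate_limited=%d\n", toy_rate_limited(10000000ULL, 5000000ULL));
	return 0;
}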