From: Wanpeng Li <wanpengli@tencent.com>
This series addresses long-standing yield_to() inefficiencies in
virtualized environments through two complementary mechanisms: a vCPU
debooster in the scheduler and IPI-aware directed yield in KVM.
Problem Statement
-----------------
In overcommitted virtualization scenarios, vCPUs frequently spin on locks
held by other vCPUs that are not currently running. The kernel's
paravirtual spinlock support detects these situations and calls yield_to()
to boost the lock holder, allowing it to run and release the lock.
However, the current implementation has two critical limitations:
1. Scheduler-side limitation:
yield_to_task_fair() relies solely on set_next_buddy() to provide
preference to the target vCPU. This buddy mechanism only offers
immediate, transient preference. Once the buddy hint expires (typically
after one scheduling decision), the yielding vCPU may preempt the target
again, especially in nested cgroup hierarchies where vruntime domains
differ.
This creates a ping-pong effect: the lock holder runs briefly, gets
preempted before completing critical sections, and the yielding vCPU
spins again, triggering another futile yield_to() cycle. The overhead
accumulates rapidly in workloads with high lock contention.
2. KVM-side limitation:
kvm_vcpu_on_spin() attempts to identify which vCPU to yield to through
directed yield candidate selection. However, it lacks awareness of IPI
communication patterns. When a vCPU sends an IPI and spins waiting for
a response (common in inter-processor synchronization), the current
heuristics often fail to identify the IPI receiver as the yield target.
Instead, the code may boost an unrelated vCPU based on coarse-grained
preemption state, missing opportunities to accelerate actual IPI
response handling. This is particularly problematic when the IPI receiver
is runnable but not scheduled, as lock-holder-detection logic doesn't
capture the IPI dependency relationship.
Combined, these issues cause excessive lock hold times, cache thrashing,
and degraded throughput in overcommitted environments, particularly
affecting workloads with fine-grained synchronization patterns.
Solution Overview
-----------------
The series introduces two orthogonal improvements that work synergistically:
Part 1: Scheduler vCPU Debooster (patches 1-5)
Augment yield_to_task_fair() with bounded vruntime penalties to provide
sustained preference beyond the buddy mechanism. When a vCPU yields to a
target, apply a carefully tuned vruntime penalty to the yielding vCPU,
ensuring the target maintains scheduling advantage for longer periods.
The mechanism is EEVDF-aware and cgroup-hierarchy-aware:
- Locate the lowest common ancestor (LCA) in the cgroup hierarchy where
both the yielding and target tasks coexist. This ensures vruntime
adjustments occur at the correct hierarchy level, maintaining fairness
across cgroup boundaries.
- Update EEVDF scheduler fields (vruntime, deadline, vlag) atomically to
keep the scheduler state consistent. The penalty shifts the yielding
task's virtual deadline forward, allowing the target to run.
- Apply queue-size-adaptive penalties that scale from 6.0× scheduling
granularity for 2-task scenarios (strong preference) down to 1.0× for
large queues (>12 tasks), balancing preference against starvation risks.
- Implement reverse-pair debouncing: when task A yields to B, then B yields
to A within a short window (~600us), downscale the penalty to prevent
ping-pong oscillation.
- Rate-limit penalty application to 6ms intervals to prevent pathological
overhead when yields occur at very high frequency.
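For illustration, below is a minimal sketch of the penalty shaping described in the bullets above, assuming a linear taper between the two endpoints. The helper names, parameters, and exact taper are expository assumptions rather than the code added in patches 2-4, and the 6ms rate-limit gate is assumed to be checked by the caller:

/* Queue-size-adaptive scale, in tenths of a scheduling slice (illustrative). */
static u64 yield_deboost_scale_x10(unsigned int nr_queued)
{
	if (nr_queued <= 2)
		return 60;			/* 6.0x: strong preference for 2-task queues */
	if (nr_queued > 12)
		return 10;			/* 1.0x: conservative on large queues */
	return 60 - (nr_queued - 2) * 5;	/* linear taper in between */
}

/*
 * Penalty charged to the yielding entity at the LCA level where the
 * yielder and the target actually compete (see the LCA bullet above).
 */
static u64 yield_deboost_penalty(struct sched_entity *se_yield,
				 unsigned int nr_queued,
				 bool reverse_pair_recent)
{
	/* One weight-adjusted slice of vruntime as the base unit. */
	u64 penalty = calc_delta_fair(sysctl_sched_base_slice, se_yield);

	penalty = penalty * yield_deboost_scale_x10(nr_queued) / 10;

	/*
	 * Reverse-pair debounce: if the target yielded to us within ~600us,
	 * the pair is ping-ponging, so downscale (quartering is illustrative)
	 * rather than pile on.
	 */
	if (reverse_pair_recent)
		penalty >>= 2;

	return penalty;
}

Because calc_delta_fair() scales by the entity's weight, a heavier yielder receives a proportionally smaller virtual-time penalty, mirroring how update_curr() converts execution time into vruntime.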
The debooster works *with* the buddy mechanism rather than replacing it:
set_next_buddy() provides immediate preference for the next scheduling
decision, while the vruntime penalty sustains that preference over
subsequent decisions. This dual approach proves especially effective in
nested cgroup scenarios where buddy hints alone are insufficient.
Part 2: KVM IPI-Aware Directed Yield (patches 6-10)
Enhance kvm_vcpu_on_spin() with lightweight IPI tracking to improve
directed yield candidate selection. Track sender/receiver relationships
when IPIs are delivered and use this information to prioritize yield
targets.
The tracking mechanism:
- Hooks into kvm_irq_delivery_to_apic() to detect unicast fixed IPIs (the
common case for inter-processor synchronization). When exactly one
destination vCPU receives an IPI, record the sender→receiver relationship
with a monotonic timestamp.
In high-VM-density scenarios, software-based IPI tracking via interrupt
delivery interception is particularly valuable: it captures precise
sender/receiver relationships that the directed-yield logic can act on,
and in overcommitted environments the resulting scheduling benefit can
complement or even exceed that of hardware-accelerated interrupt
delivery.
- Uses lockless READ_ONCE/WRITE_ONCE accessors for minimal overhead. The
per-vCPU ipi_context structure is carefully designed to avoid cache line
bouncing.
- Implements a short recency window (50ms default) to avoid stale IPI
information inflating boost priority on throughput-sensitive workloads.
Old IPI relationships are naturally aged out.
- Clears IPI context on EOI with two-stage precision: unconditionally clear
the receiver's context (it processed the interrupt), but only clear the
sender's pending flag if the receiver matches and the IPI is recent. This
prevents unrelated EOIs from prematurely clearing valid IPI state.
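To make the shape of the tracking concrete, here is a rough sketch of the per-vCPU context and the two hooks. The structure layout, the ipi_ctx() accessor, and the ipi_window_ns default are illustrative assumptions based on the description above, not the exact code from patches 7-8:

/* 50ms recency window; a module parameter in the series. */
static u64 ipi_window_ns = 50 * NSEC_PER_MSEC;

struct kvm_vcpu_ipi_context {
	int pending_receiver;	/* vcpu_idx of last unicast fixed-IPI target, -1 if none */
	int last_sender;	/* vcpu_idx of the vCPU that last sent us an IPI, -1 if none */
	u64 sent_ns;		/* monotonic timestamp of the send */
};

/* Hooked from interrupt delivery when exactly one destination matched. */
static void kvm_track_ipi(struct kvm_vcpu *sender, struct kvm_vcpu *receiver)
{
	struct kvm_vcpu_ipi_context *s = ipi_ctx(sender);
	struct kvm_vcpu_ipi_context *r = ipi_ctx(receiver);

	WRITE_ONCE(s->pending_receiver, receiver->vcpu_idx);
	WRITE_ONCE(s->sent_ns, ktime_get_mono_fast_ns());
	WRITE_ONCE(r->last_sender, sender->vcpu_idx);
}

/* Hooked from EOI handling on the receiver: two-stage clear. */
static void kvm_clear_ipi_on_eoi(struct kvm_vcpu *receiver)
{
	struct kvm_vcpu_ipi_context *r = ipi_ctx(receiver);
	int sender_idx = READ_ONCE(r->last_sender);
	struct kvm_vcpu *sender;
	struct kvm_vcpu_ipi_context *s;

	/* Stage 1: the receiver has processed its interrupt; always clear. */
	WRITE_ONCE(r->last_sender, -1);

	if (sender_idx < 0)
		return;
	sender = kvm_get_vcpu(receiver->kvm, sender_idx);
	if (!sender)
		return;
	s = ipi_ctx(sender);

	/*
	 * Stage 2: only clear the sender's pending flag if it still points
	 * at this receiver and the IPI is recent, so unrelated EOIs do not
	 * wipe valid state.
	 */
	if (READ_ONCE(s->pending_receiver) == receiver->vcpu_idx &&
	    ktime_get_mono_fast_ns() - READ_ONCE(s->sent_ns) < ipi_window_ns)
		WRITE_ONCE(s->pending_receiver, -1);
}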
The candidate selection follows a priority hierarchy:
Priority 1: Confirmed IPI receiver
If the spinning vCPU recently sent an IPI to another vCPU and that IPI
is still pending (within the recency window), unconditionally boost the
receiver. This directly addresses the "spinning on IPI response" case.
Priority 2: Fast pending interrupt
Leverage arch-specific kvm_arch_dy_has_pending_interrupt() for
compatibility with existing optimizations.
Priority 3: Preempted in kernel mode
Fall back to traditional preemption-based logic when yield_to_kernel_mode
is requested, ensuring compatibility with existing workloads.
A two-round fallback mechanism provides a safety net: if the first round
with strict IPI-aware selection finds no eligible candidate (e.g., due to
missed IPI context or transient runnable set changes), a second round
applies relaxed selection gated only by preemption state. This is
controlled by the enable_relaxed_boost module parameter (default on).
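A condensed sketch of how the priority order and the relaxed second round could look inside the candidate loop of kvm_vcpu_on_spin(); the helper name and the exact gating are again assumptions for illustration, reusing the hypothetical ipi_ctx() accessor and ipi_window_ns from the earlier sketch:

static bool dy_boost_candidate(struct kvm_vcpu *me, struct kvm_vcpu *vcpu,
			       bool yield_to_kernel_mode, bool relaxed_round)
{
	struct kvm_vcpu_ipi_context *ctx = ipi_ctx(me);

	/* Priority 1: vcpu is the still-recent receiver of an IPI we sent. */
	if (READ_ONCE(ctx->pending_receiver) == vcpu->vcpu_idx &&
	    ktime_get_mono_fast_ns() - READ_ONCE(ctx->sent_ns) < ipi_window_ns)
		return true;

	/* Priority 2: arch-reported fast pending interrupt. */
	if (kvm_arch_dy_has_pending_interrupt(vcpu))
		return true;

	/* Priority 3: traditional heuristic - preempted (in kernel mode if asked). */
	if (READ_ONCE(vcpu->preempted) &&
	    (!yield_to_kernel_mode || kvm_arch_vcpu_preempted_in_kernel(vcpu)))
		return true;

	/*
	 * Second round (enable_relaxed_boost): if the strict pass found no
	 * eligible candidate, fall back to preemption state alone.
	 */
	if (relaxed_round && READ_ONCE(vcpu->preempted))
		return true;

	return false;
}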
Implementation Details
----------------------
Both mechanisms are designed for minimal overhead and runtime control:
- All locking occurs under existing rq->lock or per-vCPU locks; no new
lock contention is introduced.
- Penalty calculations use integer arithmetic with overflow protection.
- IPI tracking uses monotonic timestamps (ktime_get_mono_fast_ns()) for
efficient, race-free recency checks.
Advantages over paravirtualization approaches:
- No guest OS modification required: This solution operates entirely within
the host kernel, providing transparent optimization without guest kernel
changes or recompilation.
- Guest OS agnostic: Works uniformly across Linux, Windows, and other guest
operating systems, unlike PV TLB shootdown which requires guest-side
paravirtual driver support.
- Broader applicability: Captures IPI patterns from all synchronization
primitives (spinlocks, RCU, smp_call_function, etc.), not limited to
specific paravirtualized operations like TLB shootdown.
- Deployment simplicity: Existing VM images benefit immediately without
guest kernel updates, critical for production environments with diverse
guest OS versions and configurations.
- Runtime controls allow disabling features if needed:
* /sys/kernel/debug/sched/sched_vcpu_debooster_enabled
* /sys/module/kvm/parameters/ipi_tracking_enabled
* /sys/module/kvm/parameters/enable_relaxed_boost
- The infrastructure is incrementally introduced: early patches add inert
scaffolding that can be verified for zero performance impact before
activation.
Performance Results
-------------------
Test environment: Intel Xeon, 16 physical cores, 16 vCPUs per VM
Dbench 16 clients per VM (filesystem metadata operations):
2 VMs: +14.4% throughput (lock contention reduction)
3 VMs: +9.8% throughput
4 VMs: +6.7% throughput
PARSEC Dedup benchmark, simlarge input (memory-intensive):
2 VMs: +47.1% throughput (IPI-heavy synchronization)
3 VMs: +28.1% throughput
4 VMs: +1.7% throughput
PARSEC VIPS benchmark, simlarge input (compute-intensive):
2 VMs: +26.2% throughput (balanced sync and compute)
3 VMs: +12.7% throughput
4 VMs: +6.0% throughput
Analysis:
- Gains are most pronounced at moderate overcommit (2-3 VMs). At this level,
contention is significant enough to benefit from better yield behavior,
but context switch overhead remains manageable.
- Dedup shows the strongest improvement (+47.1% at 2 VMs) due to its
IPI-heavy synchronization patterns. The IPI-aware directed yield
precisely targets the bottleneck.
- At 4 VMs (heavier overcommit), gains diminish as general CPU contention
dominates. However, performance never regresses, indicating the mechanisms
gracefully degrade.
- In certain high-density, resource-overcommitted deployments, the performance
benefit of APICv can be constrained by scheduling and contention patterns. In
such cases, software-based IPI tracking serves as a complementary optimization
path, offering targeted scheduling hints without requiring APICv to be
disabled. The practical choice should be weighed against workload
characteristics and platform configuration.
- Dbench benefits primarily from the scheduler-side debooster, as its lock
patterns involve less IPI spinning and more direct lock holder boosting.
The performance gains stem from three factors:
1. Lock holders receive sustained CPU time to complete critical sections,
reducing overall lock hold duration and cascading contention.
2. IPI receivers are promptly scheduled when senders spin, minimizing IPI
response latency and reducing wasted spin cycles.
3. Better cache utilization results from reduced context switching between
lock waiters and holders.
Patch Organization
------------------
The series is organized for incremental review and bisectability:
Patches 1-5: Scheduler vCPU debooster
Patch 1: Add infrastructure (per-rq tracking, sysctl, debugfs entry)
Infrastructure is inert; no functional change.
Patch 2: Add rate-limiting and validation helpers
Static functions with comprehensive safety checks.
Patch 3: Add cgroup LCA finder for hierarchical yield
Implements CONFIG_FAIR_GROUP_SCHED-aware LCA location.
Patch 4: Add penalty calculation and application logic
Core algorithms with queue-size adaptation and debouncing.
Patch 5: Wire up yield deboost in yield_to_task_fair()
Activation patch. Includes Dbench performance data.
Patches 6-10: KVM IPI-aware directed yield
Patch 6: Fix last_boosted_vcpu index assignment bug
Standalone bugfix for existing code.
Patch 7: Add IPI tracking infrastructure
Per-vCPU context, module parameters, helper functions.
Infrastructure is inert until activated.
Patch 8: Integrate IPI tracking with LAPIC interrupt delivery
Hook into kvm_irq_delivery_to_apic() and EOI handling.
Patch 9: Implement IPI-aware directed yield candidate selection
Replace candidate selection logic with priority-based approach.
Includes PARSEC performance data.
Patch 10: Add relaxed boost as safety net
Two-round fallback mechanism for robustness.
Each patch compiles and boots independently. Performance data is presented
where the relevant mechanism becomes active (patches 5 and 9).
Testing
-------
Workloads tested:
- Dbench (filesystem metadata stress)
- PARSEC benchmarks (Dedup, VIPS, Ferret, Blackscholes)
- Kernel compilation (make -j16 in each VM)
No regressions observed on any configuration. The mechanisms show neutral
to positive impact across diverse workloads.
Future Work
-----------
Potential extensions beyond this series:
- Adaptive recency window: dynamically adjust ipi_window_ns based on
observed workload patterns.
- Extended tracking: consider multi-round IPI patterns (A→B→C→A).
- Cross-NUMA awareness: penalty scaling based on NUMA distances.
These are intentionally deferred to keep this series focused and reviewable.
Wanpeng Li (10):
sched: Add vCPU debooster infrastructure
sched/fair: Add rate-limiting and validation helpers
sched/fair: Add cgroup LCA finder for hierarchical yield
sched/fair: Add penalty calculation and application logic
sched/fair: Wire up yield deboost in yield_to_task_fair()
KVM: Fix last_boosted_vcpu index assignment bug
KVM: x86: Add IPI tracking infrastructure
KVM: x86/lapic: Integrate IPI tracking with interrupt delivery
KVM: Implement IPI-aware directed yield candidate selection
KVM: Relaxed boost as safety net
arch/x86/include/asm/kvm_host.h | 8 +
arch/x86/kvm/lapic.c | 172 +++++++++++++++-
arch/x86/kvm/x86.c | 6 +
arch/x86/kvm/x86.h | 4 +
include/linux/kvm_host.h | 1 +
kernel/sched/core.c | 7 +-
kernel/sched/debug.c | 3 +
kernel/sched/fair.c | 336 ++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 9 +
virt/kvm/kvm_main.c | 81 +++++++-
10 files changed, 611 insertions(+), 16 deletions(-)
--
2.43.0
On 10.11.25 04:32, Wanpeng Li wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
>
> This series addresses long-standing yield_to() inefficiencies in
> virtualized environments through two complementary mechanisms: a vCPU
> debooster in the scheduler and IPI-aware directed yield in KVM.
>
[...]
>
> This creates a ping-pong effect: the lock holder runs briefly, gets
> preempted before completing critical sections, and the yielding vCPU
> spins again, triggering another futile yield_to() cycle. The overhead
> accumulates rapidly in workloads with high lock contention.

I can certainly confirm that on s390 we do see that yield_to does not
always work as expected. Our spinlock code is lock holder aware, so our
KVM always yields correctly, but often enough the hint is ignored or
bounced back as you describe. So I am certainly interested in that part.

I need to look more closely into the other part.
Hi Christian,

On Mon, 10 Nov 2025 at 20:02, Christian Borntraeger
<borntraeger@linux.ibm.com> wrote:
>
> On 10.11.25 04:32, Wanpeng Li wrote:
> > From: Wanpeng Li <wanpengli@tencent.com>
> >
> > This series addresses long-standing yield_to() inefficiencies in
> > virtualized environments through two complementary mechanisms: a vCPU
> > debooster in the scheduler and IPI-aware directed yield in KVM.
> >
[...]
>
> I can certainly confirm that on s390 we do see that yield_to does not
> always work as expected. Our spinlock code is lock holder aware, so our
> KVM always yields correctly, but often enough the hint is ignored or
> bounced back as you describe. So I am certainly interested in that part.
>
> I need to look more closely into the other part.

Thanks for the confirmation and interest! It's valuable to hear that
s390 observes similar yield_to() behavior where the hint gets ignored
or bounced back despite correct lock holder identification.

Since your spinlock code is already lock-holder-aware and KVM yields
to the correct target, the scheduler-side improvements (patches 1-5)
should directly address the ping-pong issue you're seeing. The
vruntime penalties are designed to sustain the preference beyond the
transient buddy hint, which should reduce the bouncing effect.

Best regards,
Wanpeng
On 12.11.25 06:01, Wanpeng Li wrote:
> Hi Christian,
>
[...]
>
> Since your spinlock code is already lock-holder-aware and KVM yields
> to the correct target, the scheduler-side improvements (patches 1-5)
> should directly address the ping-pong issue you're seeing. The
> vruntime penalties are designed to sustain the preference beyond the
> transient buddy hint, which should reduce the bouncing effect.

So we will play a bit with the first patches and check for performance
improvements.

I am curious: I did a quick unit test with 2 CPUs ping-ponging on a
counter, and I do see "more than count" numbers of yield hypercalls with
that test case (as before) - something like 40060000 yields instead of
4000000 for a perfect ping-pong. If I comment out your rate limit code I
hit exactly the 4000000.
Can you maybe outline a bit why the rate limit is important and needed?
Hi Christian,

On Tue, 18 Nov 2025 at 16:12, Christian Borntraeger
<borntraeger@linux.ibm.com> wrote:
>
[...]
>
> So we will play a bit with the first patches and check for performance
> improvements.
>
> I am curious: I did a quick unit test with 2 CPUs ping-ponging on a
> counter, and I do see "more than count" numbers of yield hypercalls with
> that test case (as before) - something like 40060000 yields instead of
> 4000000 for a perfect ping-pong. If I comment out your rate limit code I
> hit exactly the 4000000.
> Can you maybe outline a bit why the rate limit is important and needed?

Good catch! The 10× inflation is actually expected behavior. The key
insight is that the rate limit filters penalty applications, not yield
hypercalls. In your ping-pong test with 4M counter increments, PLE
hardware fires multiple times per lock acquisition (roughly 10 times
based on your numbers), and each exit triggers kvm_vcpu_on_spin().

Without the rate limit, every yield immediately applies a vruntime
penalty. In a tight ping-pong this causes over-penalization: the yielding
vCPU becomes so deprioritized that it effectively starves, which
paradoxically neutralizes the debooster effect. You see "exactly 4M" not
because it is working optimally, but because excessive penalties create a
pathological equilibrium where subsequent yields are suppressed by
starvation.

With a 6ms rate limit, all 40M hypercalls still occur (PLE still fires),
but only the first yield in each burst applies a penalty while subsequent
ones are filtered. This gives roughly 4M penalties (one per actual lock
acquisition) instead of 40M, providing sustained advantage without
over-penalization.

The 6ms threshold was empirically tuned as roughly 2× a typical timeslice
to filter intra-lock PLE bursts while preserving responsiveness to
legitimate contention. Your test validates the design by showing that the
rate limit prevents penalty amplification even in the tightest ping-pong
scenario.

I'll post v2 after the merge window with code comments addressing this
and other review feedback, which should be more suitable for performance
evaluation.

Wanpeng
Hello Wanpeng,
I haven't looked at the entire series and the penalty calculation math
but I've a few questions looking at the cover-letter.
On 11/10/2025 9:02 AM, Wanpeng Li wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
>
> This series addresses long-standing yield_to() inefficiencies in
> virtualized environments through two complementary mechanisms: a vCPU
> debooster in the scheduler and IPI-aware directed yield in KVM.
>
> Problem Statement
> -----------------
>
> In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> held by other vCPUs that are not currently running. The kernel's
> paravirtual spinlock support detects these situations and calls yield_to()
> to boost the lock holder, allowing it to run and release the lock.
>
> However, the current implementation has two critical limitations:
>
> 1. Scheduler-side limitation:
>
> yield_to_task_fair() relies solely on set_next_buddy() to provide
> preference to the target vCPU. This buddy mechanism only offers
> immediate, transient preference. Once the buddy hint expires (typically
> after one scheduling decision), the yielding vCPU may preempt the target
> again, especially in nested cgroup hierarchies where vruntime domains
> differ.
So what you are saying is there are configurations out there where vCPUs
of the same guest are put in different cgroups? Why? Does the use case
warrant enabling the cpu controller for the subtree? Are you running
with the "NEXT_BUDDY" sched feat enabled?
If they are in the same cgroup, the recent optimizations/fixes to
yield_task_fair() in queue:sched/core should help remedy some of the
problems you might be seeing.
For multiple cgroups, perhaps you can extend yield_task_fair() to do:
( Only build and boot tested on top of
git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
at commit f82a0f91493f "sched/deadline: Minor cleanup in
select_task_rq_dl()" )
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b4617d631549..87560f5a18b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
* which yields immediately again; without the condition the vruntime
* ends up quickly running away.
*/
- if (entity_eligible(cfs_rq, se)) {
+ do {
+ cfs_rq = cfs_rq_of(se);
+
+ /*
+ * Another entity will be selected at next pick.
+ * Single entity on cfs_rq can never be ineligible.
+ */
+ if (!entity_eligible(cfs_rq, se))
+ break;
+
se->vruntime = se->deadline;
se->deadline += calc_delta_fair(se->slice, se);
- }
+
+ /*
+ * If we have more than one runnable task queued below
+ * this cfs_rq, the next pick will likely go for a
+ * different entity now that we have advanced the
+ * vruntime and the deadline of the running entity.
+ */
+ if (cfs_rq->h_nr_runnable > 1)
+ break;
+ } while ((se = parent_entity(se)));
}
static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
---
With that, I'm pretty sure there is a good chance we'll not select the
hierarchy that did a yield_to() unless there is a large discrepancy in
their weights and just advancing se->vruntime to se->deadline once isn't
enough to make it ineligible and you'll have to do it multiple times (at
which point that cgroup hierarchy needs to be studied).
As for the problem that NEXT_BUDDY hint is used only once, you can
perhaps reintroduce LAST_BUDDY which does a set_next_buddy() for
the "prev" task during schedule?
>
> This creates a ping-pong effect: the lock holder runs briefly, gets
> preempted before completing critical sections, and the yielding vCPU
> spins again, triggering another futile yield_to() cycle. The overhead
> accumulates rapidly in workloads with high lock contention.
>
> 2. KVM-side limitation:
>
> kvm_vcpu_on_spin() attempts to identify which vCPU to yield to through
> directed yield candidate selection. However, it lacks awareness of IPI
> communication patterns. When a vCPU sends an IPI and spins waiting for
> a response (common in inter-processor synchronization), the current
> heuristics often fail to identify the IPI receiver as the yield target.
Can't that be solved on the KVM end? Also shouldn't Patch 6 be on top
with a "Fixes:" tag.
>
> Instead, the code may boost an unrelated vCPU based on coarse-grained
> preemption state, missing opportunities to accelerate actual IPI
> response handling. This is particularly problematic when the IPI receiver
> is runnable but not scheduled, as lock-holder-detection logic doesn't
> capture the IPI dependency relationship.
Are you saying the yield_to() is called with an incorrect target vCPU?
>
> Combined, these issues cause excessive lock hold times, cache thrashing,
> and degraded throughput in overcommitted environments, particularly
> affecting workloads with fine-grained synchronization patterns.
>
--
Thanks and Regards,
Prateek
Hi Prateek,
On Tue, 11 Nov 2025 at 14:28, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Wanpeng,
>
> I haven't looked at the entire series and the penalty calculation math
> but I've a few questions looking at the cover-letter.
Thanks for the review and the thoughtful questions.
>
> On 11/10/2025 9:02 AM, Wanpeng Li wrote:
> > From: Wanpeng Li <wanpengli@tencent.com>
> >
> > This series addresses long-standing yield_to() inefficiencies in
> > virtualized environments through two complementary mechanisms: a vCPU
> > debooster in the scheduler and IPI-aware directed yield in KVM.
> >
> > Problem Statement
> > -----------------
> >
> > In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> > held by other vCPUs that are not currently running. The kernel's
> > paravirtual spinlock support detects these situations and calls yield_to()
> > to boost the lock holder, allowing it to run and release the lock.
> >
> > However, the current implementation has two critical limitations:
> >
> > 1. Scheduler-side limitation:
> >
> > yield_to_task_fair() relies solely on set_next_buddy() to provide
> > preference to the target vCPU. This buddy mechanism only offers
> > immediate, transient preference. Once the buddy hint expires (typically
> > after one scheduling decision), the yielding vCPU may preempt the target
> > again, especially in nested cgroup hierarchies where vruntime domains
> > differ.
>
> So what you are saying is there are configurations out there where vCPUs
> of same guest are put in different cgroups? Why? Does the use case
> warrant enabling the cpu controller for the subtree? Are you running
You're right to question this. The problematic scenario occurs with
nested cgroup hierarchies, which is common when VMs are deployed with
cgroup-based resource management. Even when all vCPUs of a single
guest are in the same leaf cgroup, that leaf sits under parent cgroups
with their own vruntime domains.
The issue manifests when:
- set_next_buddy() provides preference at the leaf level
- But vruntime competition happens at parent levels
- The buddy hint gets "diluted" when pick_task_fair() walks up the hierarchy
The cpu controller is typically enabled in these deployments for quota
enforcement and weight-based sharing. That said, the debooster
mechanism is designed to be general-purpose: it handles any scenario
where yield_to() crosses cgroup boundaries, whether due to nested
hierarchies or sibling cgroups.
> with the "NEXT_BUDDY" sched feat enabled?
Yes, NEXT_BUDDY is enabled. The problem is that set_next_buddy()
provides only immediate, transient preference. Once the buddy hint is
consumed (typically after one pick_next_task_fair() call), the
yielding vCPU can preempt the target again if their vruntime values
haven't diverged sufficiently.
>
> If they are in the same cgroup, the recent optimizations/fixes to
> yield_task_fair() in queue:sched/core should help remedy some of the
> problems you might be seeing.
Agreed - the recent yield_task_fair() improvements in queue:sched/core
(EEVDF-based vruntime = deadline with hierarchical walk) are valuable.
However, our patchset focuses on yield_to() rather than yield(), which
has different semantics:
- yield_task_fair(): "I voluntarily give up CPU, pick someone else"
→ Recent improvements handle this well with hierarchical walk
- yield_to_task_fair(): "I want *this specific task* to run
instead" → Requires finding the LCA of yielder and target, then
applying penalties at that level to influence their relative
competition
The debooster extends yield_to() to handle cross-cgroup scenarios
where the yielder and target may be in different subtrees.
>
> For multiple cgroups, perhaps you can extend yield_task_fair() to do:
Thanks for the suggestion. Your hierarchical walk approach shares
similarities with our implementation. A few questions on the details:
>
> ( Only build and boot tested on top of
> git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
> at commit f82a0f91493f "sched/deadline: Minor cleanup in
> select_task_rq_dl()" )
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b4617d631549..87560f5a18b3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
> * which yields immediately again; without the condition the vruntime
> * ends up quickly running away.
> */
> - if (entity_eligible(cfs_rq, se)) {
> + do {
> + cfs_rq = cfs_rq_of(se);
> +
> + /*
> + * Another entity will be selected at next pick.
> + * Single entity on cfs_rq can never be ineligible.
> + */
> + if (!entity_eligible(cfs_rq, se))
> + break;
> +
> se->vruntime = se->deadline;
Setting vruntime = deadline zeros out lag. Does this cause fairness
drift with repeated yields? We explicitly recalculate vlag after
adjustment to preserve EEVDF invariants.
> se->deadline += calc_delta_fair(se->slice, se);
> - }
> +
> + /*
> + * If we have more than one runnable task queued below
> + * this cfs_rq, the next pick will likely go for a
> + * different entity now that we have advanced the
> + * vruntime and the deadline of the running entity.
> + */
> + if (cfs_rq->h_nr_runnable > 1)
Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
correctly. Shouldn't the penalty apply at the LCA of yielder and
target? Otherwise the vruntime adjustment might not affect the level
where they actually compete.
> + break;
> + } while ((se = parent_entity(se)));
> }
>
> static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
> ---
Fixed one-slice penalties underperformed in our testing (dbench:
+14.4%/+9.8%/+6.7% for 2/3/4 VMs). We found adaptive scaling (6.0×
down to 1.0× based on queue size) necessary to balance effectiveness
against starvation.
>
> With that, I'm pretty sure there is a good chance we'll not select the
> hierarchy that did a yield_to() unless there is a large discrepancy in
> their weights and just advancing se->vruntime to se->deadline once isn't
> enough to make it ineligible and you'll have to do it multiple time (at
> which point that cgroup hierarchy needs to be studied).
>
> As for the problem that NEXT_BUDDY hint is used only once, you can
> perhaps reintroduce LAST_BUDDY which sets does a set_next_buddy() for
> the "prev" task during schedule?
That's an interesting idea. However, LAST_BUDDY was removed from the
scheduler due to concerns about fairness and latency regressions in
general workloads. Reintroducing it globally might regress non-vCPU
workloads.
Our approach is more targeted: apply vruntime penalties specifically
in the yield_to() path (controlled by debugfs flag), avoiding impact
on general scheduling. The debooster is inert unless explicitly
enabled and rate-limited to prevent pathological overhead.
>
> >
> > This creates a ping-pong effect: the lock holder runs briefly, gets
> > preempted before completing critical sections, and the yielding vCPU
> > spins again, triggering another futile yield_to() cycle. The overhead
> > accumulates rapidly in workloads with high lock contention.
> >
> > 2. KVM-side limitation:
> >
> > kvm_vcpu_on_spin() attempts to identify which vCPU to yield to through
> > directed yield candidate selection. However, it lacks awareness of IPI
> > communication patterns. When a vCPU sends an IPI and spins waiting for
> > a response (common in inter-processor synchronization), the current
> > heuristics often fail to identify the IPI receiver as the yield target.
>
> Can't that be solved on the KVM end?
Yes, the IPI tracking is entirely KVM-side (patches 6-10). The
scheduler-side debooster (patches 1-5) and KVM-side IPI tracking are
orthogonal mechanisms:
- Debooster: sustains yield_to() preference regardless of *who* is
yielding to whom
- IPI tracking: improves *which* target is selected when a vCPU spins
Both showed independent gains in our testing, and combined effects
were approximately additive.
> Also shouldn't Patch 6 be on top with a "Fixes:" tag.
You're right. Patch 6 (last_boosted_vcpu bug fix) is a standalone
bugfix and should be at the top with a Fixes tag. I'll reorder it in
v2 with:
Fixes: 7e513617da71 ("KVM: Rework core loop of kvm_vcpu_on_spin() to
use a single for-loop")
>
> >
> > Instead, the code may boost an unrelated vCPU based on coarse-grained
> > preemption state, missing opportunities to accelerate actual IPI
> > response handling. This is particularly problematic when the IPI receiver
> > is runnable but not scheduled, as lock-holder-detection logic doesn't
> > capture the IPI dependency relationship.
>
> Are you saying the yield_to() is called with an incorrect target vCPU?
Yes - more precisely, the issue is in kvm_vcpu_on_spin()'s target
selection logic before yield_to() is called. Without IPI tracking, it
relies on preemption state, which doesn't capture "vCPU waiting for
IPI response from specific other vCPU."
The IPI tracking records sender→receiver relationships at interrupt
delivery time (patch 8), enabling kvm_vcpu_on_spin() to directly boost
the IPI receiver when the sender spins (patch 9). This addresses
scenarios where the spinning vCPU is waiting for IPI acknowledgment
rather than lock release.
Performance (16 pCPU host, 16 vCPUs/VM, PARSEC workloads):
- Dedup: +47.1%/+28.1%/+1.7% for 2/3/4 VMs
- VIPS: +26.2%/+12.7%/+6.0% for 2/3/4 VMs
Gains are most pronounced at moderate overcommit where the IPI
receiver is often runnable but not scheduled.
Thanks again for the review and suggestions.
Best regards,
Wanpeng
Hello Wanpeng,
On 11/12/2025 10:24 AM, Wanpeng Li wrote:
>>
>> ( Only build and boot tested on top of
>> git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
>> at commit f82a0f91493f "sched/deadline: Minor cleanup in
>> select_task_rq_dl()" )
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b4617d631549..87560f5a18b3 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
>> * which yields immediately again; without the condition the vruntime
>> * ends up quickly running away.
>> */
>> - if (entity_eligible(cfs_rq, se)) {
>> + do {
>> + cfs_rq = cfs_rq_of(se);
>> +
>> + /*
>> + * Another entity will be selected at next pick.
>> + * Single entity on cfs_rq can never be ineligible.
>> + */
>> + if (!entity_eligible(cfs_rq, se))
>> + break;
>> +
>> se->vruntime = se->deadline;
>
> Setting vruntime = deadline zeros out lag. Does this cause fairness
> drift with repeated yields? We explicitly recalculate vlag after
> adjustment to preserve EEVDF invariants.
We only push deadline when the entity is eligible. Ineligible entity
will break out above. Also I don't get how adding a penalty to an
entity in the cgroup hierarchy of the yielding task, when there are
other runnable tasks, is considered "preserv(ing) EEVDF invariants".
>
>> se->deadline += calc_delta_fair(se->slice, se);
>> - }
>> +
>> + /*
>> + * If we have more than one runnable task queued below
>> + * this cfs_rq, the next pick will likely go for a
>> + * different entity now that we have advanced the
>> + * vruntime and the deadline of the running entity.
>> + */
>> + if (cfs_rq->h_nr_runnable > 1)
>
> Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
> correctly. Shouldn't the penalty apply at the LCA of yielder and
> target? Otherwise the vruntime adjustment might not affect the level
> where they actually compete.
So here is the case I'm going after - consider the following
hierarchy:
        root
       /    \
     CG0    CG1
      |      |
      A      B

CG* are cgroups and [A-Z]* are tasks
A decides to yield to B, and advances its deadline on CG0's timeline.
Currently, if CG0 is eligible and CG1 isn't, pick will still select
CG0 which will in turn select task A and it'll yield again. This
cycle repeats until the vruntime of CG0 turns large enough to make itself
ineligible and route the EEVDF pick to CG1.
Now consider:
        root
       /    \
     CG0    CG1
     / \     |
    A   C    B
Same scenario: A yields to B. A advances its vruntime and deadline
as part of the yield. Now, why should CG0 sacrifice its fair share of
runtime for A when task B is runnable? Just because one task decided
to yield to another task in a different cgroup doesn't mean other
waiting tasks on that hierarchy should suffer.
>
>> + break;
>> + } while ((se = parent_entity(se)));
>> }
>>
>> static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
>> ---
>
> Fixed one-slice penalties underperformed in our testing (dbench:
> +14.4%/+9.8%/+6.7% for 2/3/4 VMs). We found adaptive scaling (6.0×
> down to 1.0× based on queue size) necessary to balance effectiveness
> against starvation.
If all vCPUs of a VM are in the same cgroup - yield_to() should work
just fine. If this "target" task is not selected, then either some
entity in the hierarchy, or the task itself, is ineligible and the EEVDF
pick has decided to go with something else.
It is not "starvation" but rather you've received your fair share
of "proportional runtime" and now you wait. If you really want to
follow EEVDF maybe you compute the vlag and if it is behind the
avg_vruntime, you account it to the "target" task - that would be
in the spirit of the EEVDF algorithm.
>
>>
>> With that, I'm pretty sure there is a good chance we'll not select the
>> hierarchy that did a yield_to() unless there is a large discrepancy in
>> their weights and just advancing se->vruntime to se->deadline once isn't
>> enough to make it ineligible and you'll have to do it multiple time (at
>> which point that cgroup hierarchy needs to be studied).
>>
>> As for the problem that NEXT_BUDDY hint is used only once, you can
>> perhaps reintroduce LAST_BUDDY which sets does a set_next_buddy() for
>> the "prev" task during schedule?
>
> That's an interesting idea. However, LAST_BUDDY was removed from the
> scheduler due to concerns about fairness and latency regressions in
> general workloads. Reintroducing it globally might regress non-vCPU
> workloads.
>
> Our approach is more targeted: apply vruntime penalties specifically
> in the yield_to() path (controlled by debugfs flag), avoiding impact
> on general scheduling. The debooster is inert unless explicitly
> enabled and rate-limited to prevent pathological overhead.
Yeah, I'm still not on board with the idea but maybe I don't see the
vision. Hope other scheduler folks can chime in.
>
>>
>>>
>>> This creates a ping-pong effect: the lock holder runs briefly, gets
>>> preempted before completing critical sections, and the yielding vCPU
>>> spins again, triggering another futile yield_to() cycle. The overhead
>>> accumulates rapidly in workloads with high lock contention.
>>>
>>> 2. KVM-side limitation:
>>>
>>> kvm_vcpu_on_spin() attempts to identify which vCPU to yield to through
>>> directed yield candidate selection. However, it lacks awareness of IPI
>>> communication patterns. When a vCPU sends an IPI and spins waiting for
>>> a response (common in inter-processor synchronization), the current
>>> heuristics often fail to identify the IPI receiver as the yield target.
>>
>> Can't that be solved on the KVM end?
>
> Yes, the IPI tracking is entirely KVM-side (patches 6-10). The
> scheduler-side debooster (patches 1-5) and KVM-side IPI tracking are
> orthogonal mechanisms:
> - Debooster: sustains yield_to() preference regardless of *who* is
> yielding to whom
> - IPI tracking: improves *which* target is selected when a vCPU spins
>
> Both showed independent gains in our testing, and combined effects
> were approximately additive.
I'll try to look at the KVM bits but I'm not familiar enough with
those bits enough to review it well :)
>
>> Also shouldn't Patch 6 be on top with a "Fixes:" tag.
>
> You're right. Patch 6 (last_boosted_vcpu bug fix) is a standalone
> bugfix and should be at the top with a Fixes tag. I'll reorder it in
> v2 with:
> Fixes: 7e513617da71 ("KVM: Rework core loop of kvm_vcpu_on_spin() to
> use a single for-loop")
Thank you.
>
>>
>>>
>>> Instead, the code may boost an unrelated vCPU based on coarse-grained
>>> preemption state, missing opportunities to accelerate actual IPI
>>> response handling. This is particularly problematic when the IPI receiver
>>> is runnable but not scheduled, as lock-holder-detection logic doesn't
>>> capture the IPI dependency relationship.
>>
>> Are you saying the yield_to() is called with an incorrect target vCPU?
>
> Yes - more precisely, the issue is in kvm_vcpu_on_spin()'s target
> selection logic before yield_to() is called. Without IPI tracking, it
> relies on preemption state, which doesn't capture "vCPU waiting for
> IPI response from specific other vCPU."
>
> The IPI tracking records sender→receiver relationships at interrupt
> delivery time (patch 8), enabling kvm_vcpu_on_spin() to directly boost
> the IPI receiver when the sender spins (patch 9). This addresses
> scenarios where the spinning vCPU is waiting for IPI acknowledgment
> rather than lock release.
>
> Performance (16 pCPU host, 16 vCPUs/VM, PARSEC workloads):
> - Dedup: +47.1%/+28.1%/+1.7% for 2/3/4 VMs
> - VIPS: +26.2%/+12.7%/+6.0% for 2/3/4 VMs
>
> Gains are most pronounced at moderate overcommit where the IPI
> receiver is often runnable but not scheduled.
>
> Thanks again for the review and suggestions.
>
> Best regards,
> Wanpeng
--
Thanks and Regards,
Prateek
Hi Prateek,
On Thu, 13 Nov 2025 at 12:42, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Wanpeng,
>
> On 11/12/2025 10:24 AM, Wanpeng Li wrote:
> >>
> >> ( Only build and boot tested on top of
> >> git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
> >> at commit f82a0f91493f "sched/deadline: Minor cleanup in
> >> select_task_rq_dl()" )
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index b4617d631549..87560f5a18b3 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
> >> * which yields immediately again; without the condition the vruntime
> >> * ends up quickly running away.
> >> */
> >> - if (entity_eligible(cfs_rq, se)) {
> >> + do {
> >> + cfs_rq = cfs_rq_of(se);
> >> +
> >> + /*
> >> + * Another entity will be selected at next pick.
> >> + * Single entity on cfs_rq can never be ineligible.
> >> + */
> >> + if (!entity_eligible(cfs_rq, se))
> >> + break;
> >> +
> >> se->vruntime = se->deadline;
> >
> > Setting vruntime = deadline zeros out lag. Does this cause fairness
> > drift with repeated yields? We explicitly recalculate vlag after
> > adjustment to preserve EEVDF invariants.
>
> We only push deadline when the entity is eligible. Ineligible entity
> will break out above. Also I don't get how adding a penalty to an
> entity in the cgroup hierarchy of the yielding task when there are
> other runnable tasks considered as "preserve(ing) EEVDF invariants".
Our penalty preserves EEVDF invariants by recalculating all scheduler state:
se->vruntime = new_vruntime;
se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
update_min_vruntime(cfs_rq); // maintains cfs_rq consistency
This is the same update pattern used in update_curr(). The EEVDF
relationship lag = (V - v) * w remains valid—vlag becomes more
negative as vruntime increases. The presence of other runnable tasks
doesn't affect the mathematical correctness; each entity's lag is
computed independently relative to avg_vruntime.
>
> >
> >> se->deadline += calc_delta_fair(se->slice, se);
> >> - }
> >> +
> >> + /*
> >> + * If we have more than one runnable task queued below
> >> + * this cfs_rq, the next pick will likely go for a
> >> + * different entity now that we have advanced the
> >> + * vruntime and the deadline of the running entity.
> >> + */
> >> + if (cfs_rq->h_nr_runnable > 1)
> >
> > Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
> > correctly. Shouldn't the penalty apply at the LCA of yielder and
> > target? Otherwise the vruntime adjustment might not affect the level
> > where they actually compete.
>
> So here is the case I'm going after - consider the following
> hierarchy:
>
>         root
>        /    \
>      CG0    CG1
>       |      |
>       A      B
>
> CG* are cgroups and, [A-Z]* are tasks
>
> A decides to yield to B, and advances its deadline on CG0's timeline.
> Currently, if CG0 is eligible and CG1 isn't, pick will still select
> CG0 which will in-turn select task A and it'll yield again. This
> cycle repeates until vruntime of CG0 turns large enough to make itself
> ineligible and route the EEVDF pick to CG1.
Yes, natural convergence works, but requires multiple cycles. Your
h_nr_runnable > 1 stops propagation when another entity might be
picked, but "might" depends on vruntime ordering which needs time to
develop. Our penalty forces immediate ineligibility at the LCA. One
penalty application vs N natural yield cycles.
>
> Now consider:
>
>
>         root
>        /    \
>      CG0    CG1
>      / \     |
>     A   C    B
>
> Same scenario: A yields to B. A advances its vruntime and deadline
> as a prt of yield. Now, why should CG0 sacrifice its fair share of
> runtime for A when task B is runnable? Just because one task decided
> to yield to another task in a different cgroup doesn't mean other
> waiting tasks on that hierarchy suffer.
You're right that C suffers unfairly if it's independent work. This is
a known tradeoff. The rationale: when A spins on B's lock, we apply
the penalty at the LCA (root in your example) because that's where A
and B compete. This ensures B gets scheduled. The side effect is C
loses CPU time even though it's not involved in the dependency. In
practice: VMs typically put all vCPUs in one cgroup—no independent C
exists. If C exists and is affected by the same lock, the penalty
helps overall progress. If C is truly independent, it loses one
scheduling slice worth of time.
>
> >
> >> + break;
> >> + } while ((se = parent_entity(se)));
> >> }
> >>
> >> static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
> >> ---
> >
> > Fixed one-slice penalties underperformed in our testing (dbench:
> > +14.4%/+9.8%/+6.7% for 2/3/4 VMs). We found adaptive scaling (6.0×
> > down to 1.0× based on queue size) necessary to balance effectiveness
> > against starvation.
>
> If all vCPUs of a VM are in the same cgroup - yield_to() should work
> just fine. If this "target" task is not selected then either some
> entity in the hierarchy, or the task is ineligible and EEVDF pick has
> decided to go with something else.
>
> It is not "starvation" but rather you've received you for fair share
> of "proportional runtime" and now you wait. If you really want to
> follow EEVDF maybe you compute the vlag and if it is behind the
> avg_vruntime, you account it to the "target" task - that would be
> in the spirit of the EEVDF algorithm.
You're right about the terminology—it's priority inversion, not
starvation. On crediting the target: this is philosophically
interesting but has practical issues. 1) Only helps if the target's
vlag < 0 (already lagging). If the lock holder is ahead (vlag > 0), no
effect. 2) Doesn't prevent the yielder from being re-picked at the LCA
if it's still most eligible. Accounting-wise: the spinner consumes
real CPU cycles. Our penalty charges that consumption. Crediting the
target gives service it didn't receive—arguably less consistent with
proportional fairness.
Regards,
Wanpeng
Hello Wanpeng,
On 11/13/2025 2:03 PM, Wanpeng Li wrote:
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index b4617d631549..87560f5a18b3 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
>>>> * which yields immediately again; without the condition the vruntime
>>>> * ends up quickly running away.
>>>> */
>>>> - if (entity_eligible(cfs_rq, se)) {
>>>> + do {
>>>> + cfs_rq = cfs_rq_of(se);
>>>> +
>>>> + /*
>>>> + * Another entity will be selected at next pick.
>>>> + * Single entity on cfs_rq can never be ineligible.
>>>> + */
>>>> + if (!entity_eligible(cfs_rq, se))
>>>> + break;
>>>> +
>>>> se->vruntime = se->deadline;
>>>
>>> Setting vruntime = deadline zeros out lag. Does this cause fairness
>>> drift with repeated yields? We explicitly recalculate vlag after
>>> adjustment to preserve EEVDF invariants.
>>
>> We only push deadline when the entity is eligible. Ineligible entity
>> will break out above. Also I don't get how adding a penalty to an
>> entity in the cgroup hierarchy of the yielding task when there are
>> other runnable tasks considered as "preserve(ing) EEVDF invariants".
>
> Our penalty preserves EEVDF invariants by recalculating all scheduler state:
> se->vruntime = new_vruntime;
> se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
> se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
> update_min_vruntime(cfs_rq); // maintains cfs_rq consistency
So your exact implementation in yield_deboost_apply_penalty() is:
> + new_vruntime = se_y_lca->vruntime + penalty;
> +
> + /* Validity check */
> + if (new_vruntime <= se_y_lca->vruntime)
> + return;
> +
> + se_y_lca->vruntime = new_vruntime;
You've updated this vruntime to something that you've seen fit based on
your performance data - better performance is not necessarily fair.
update_curr() uses:
/* Time elapsed. */
delta_exec = now - se->exec_start;
se->exec_start = now;
curr->vruntime += calc_delta_fair(delta_exec, curr);
"delta_exec" is based on the amount of time entity has run as opposed
to the penalty calculation which simply advances the vruntime by half a
slice because someone in the hierarchy decided to yield.
Also assume the vCPU yielding and the target is on the same cgroup -
you'll advance the vruntime of task in yield_deboost_apply_penalty() and
then again in yield_task_fair()?
> + se_y_lca->deadline = se_y_lca->vruntime + calc_delta_fair(se_y_lca->slice, se_y_lca);
> + se_y_lca->vlag = avg_vruntime(cfs_rq_common) - se_y_lca->vruntime;
There is no point in setting vlag for a running entity
> + update_min_vruntime(cfs_rq_common);
> This is the same update pattern used in update_curr(). The EEVDF
> relationship lag = (V - v) * w remains valid—vlag becomes more
> negative as vruntime increases.
Sure "V" just moves to the new avg_vruntime() to give the 0-lag
point but modifying the vruntime arbitrarily doesn't seem fair to
me.
> The presence of other runnable tasks
> doesn't affect the mathematical correctness; each entity's lag is
> computed independently relative to avg_vruntime.
>
>>
>>>
>>>> se->deadline += calc_delta_fair(se->slice, se);
>>>> - }
>>>> +
>>>> + /*
>>>> + * If we have more than one runnable task queued below
>>>> + * this cfs_rq, the next pick will likely go for a
>>>> + * different entity now that we have advanced the
>>>> + * vruntime and the deadline of the running entity.
>>>> + */
>>>> + if (cfs_rq->h_nr_runnable > 1)
>>>
>>> Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
>>> correctly. Shouldn't the penalty apply at the LCA of yielder and
>>> target? Otherwise the vruntime adjustment might not affect the level
>>> where they actually compete.
>>
>> So here is the case I'm going after - consider the following
>> hierarchy:
>>
>> root
>> / \
>> CG0 CG1
>> | |
>> A B
>>
>> CG* are cgroups and, [A-Z]* are tasks
>>
>> A decides to yield to B, and advances its deadline on CG0's timeline.
>> Currently, if CG0 is eligible and CG1 isn't, pick will still select
>> CG0 which will in-turn select task A and it'll yield again. This
>> cycle repeats until vruntime of CG0 turns large enough to make itself
>> ineligible and route the EEVDF pick to CG1.
>
> Yes, natural convergence works, but requires multiple cycles. Your
> h_nr_runnable > 1 stops propagation when another entity might be
> picked, but "might" depends on vruntime ordering which needs time to
> develop. Our penalty forces immediate ineligibility at the LCA. One
> penalty application vs N natural yield cycles.
>
>>
>> Now consider:
>>
>>
>> root
>> / \
>> CG0 CG1
>> / \ |
>> A C B
>>
>> Same scenario: A yields to B. A advances its vruntime and deadline
>> as a part of yield. Now, why should CG0 sacrifice its fair share of
>> runtime for A when task B is runnable? Just because one task decided
>> to yield to another task in a different cgroup doesn't mean other
>> waiting tasks on that hierarchy suffer.
>
> You're right that C suffers unfairly if it's independent work. This is
> a known tradeoff.
So KVM is only one of the users of yield_to(). This whole debouncer
infrastructure seems to be overcomplicating all this. If anything
is yielding across cgroup boundary - that seems like bad
configuration and if necessary, the previous suggestion does stuff
fairly. I don't mind accounting the lost time in
yield_to_task_fair() and account it to target task but apart from
that, I don't think any of it is "fair".
Again, maybe it is only me, and everyone else who has dealt with
virtualization sees the vision.
--
Thanks and Regards,
Prateek
Hi Prateek,
On Thu, 13 Nov 2025 at 17:48, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Wanpeng,
>
> On 11/13/2025 2:03 PM, Wanpeng Li wrote:
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index b4617d631549..87560f5a18b3 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
> >>>> * which yields immediately again; without the condition the vruntime
> >>>> * ends up quickly running away.
> >>>> */
> >>>> - if (entity_eligible(cfs_rq, se)) {
> >>>> + do {
> >>>> + cfs_rq = cfs_rq_of(se);
> >>>> +
> >>>> + /*
> >>>> + * Another entity will be selected at next pick.
> >>>> + * Single entity on cfs_rq can never be ineligible.
> >>>> + */
> >>>> + if (!entity_eligible(cfs_rq, se))
> >>>> + break;
> >>>> +
> >>>> se->vruntime = se->deadline;
> >>>
> >>> Setting vruntime = deadline zeros out lag. Does this cause fairness
> >>> drift with repeated yields? We explicitly recalculate vlag after
> >>> adjustment to preserve EEVDF invariants.
> >>
> >> We only push deadline when the entity is eligible. Ineligible entity
> >> will break out above. Also I don't get how adding a penalty to an
> >> entity in the cgroup hierarchy of the yielding task when there are
> >> other runnable tasks is considered as "preserve(ing) EEVDF invariants".
> >
> > Our penalty preserves EEVDF invariants by recalculating all scheduler state:
> > se->vruntime = new_vruntime;
> > se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
> > se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
> > update_min_vruntime(cfs_rq); // maintains cfs_rq consistency
>
> So your exact implementation in yield_deboost_apply_penalty() is:
>
> > + new_vruntime = se_y_lca->vruntime + penalty;
> > +
> > + /* Validity check */
> > + if (new_vruntime <= se_y_lca->vruntime)
> > + return;
> > +
> > + se_y_lca->vruntime = new_vruntime;
>
> You've updated this vruntime to something that you've seen fit based on
> your performance data - better performance is not necessarily fair.
>
> update_curr() uses:
>
> /* Time elapsed. */
> delta_exec = now - se->exec_start;
> se->exec_start = now;
>
> curr->vruntime += calc_delta_fair(delta_exec, curr);
>
>
> "delta_exec" is based on the amount of time entity has run as opposed
> to the penalty calculation which simply advances the vruntime by half a
> slice because someone in the hierarchy decided to yield.
CFS already separates time accounting from policy enforcement.
place_entity() modifies vruntime based on lag without time
passage—it's placement policy, not time accounting. Similarly,
yield_task_fair() advances the deadline without consuming time—policy
to trigger reschedule. Our penalty follows this established pattern:
bounded vruntime adjustment to implement yield_to() semantics in
hierarchical scheduling. Time accounting (update_curr) and
scheduling policy (placement, yielding, penalties) are distinct
mechanisms in CFS.
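To make that distinction concrete, here is a throwaway userspace sketch
(not the series' code; every type and helper below is a made-up stand-in
for sched_entity, calc_delta_fair() and friends). It contrasts time
accounting, where vruntime only moves by weighted CPU time actually
consumed, with a policy-style bounded penalty that repositions the
entity without any time passing:

#include <stdint.h>
#include <stdio.h>

struct toy_entity {
	uint64_t vruntime;
	uint64_t deadline;
	uint64_t slice;		/* nominal slice in ns */
	unsigned long weight;	/* load weight, NICE_0 assumed == 1024 */
};

/* Weighted delta, simplified: delta * NICE_0 / weight. */
static uint64_t toy_calc_delta_fair(uint64_t delta, const struct toy_entity *se)
{
	return delta * 1024 / se->weight;
}

/* Time accounting: vruntime advances only by CPU time actually consumed. */
static void toy_update_curr(struct toy_entity *se, uint64_t delta_exec)
{
	se->vruntime += toy_calc_delta_fair(delta_exec, se);
}

/* Policy adjustment: a bounded penalty repositions the entity, no time passes. */
static void toy_apply_yield_penalty(struct toy_entity *se, uint64_t penalty)
{
	se->vruntime += penalty;
	se->deadline = se->vruntime + toy_calc_delta_fair(se->slice, se);
}

int main(void)
{
	struct toy_entity se = { .vruntime = 0, .deadline = 0,
				 .slice = 3000000, .weight = 1024 };

	toy_update_curr(&se, 500000);		/* ran for 0.5 ms */
	toy_apply_yield_penalty(&se, 1500000);	/* yielded: half of a 3 ms slice */
	printf("vruntime=%llu deadline=%llu\n",
	       (unsigned long long)se.vruntime,
	       (unsigned long long)se.deadline);
	return 0;
}

The point is only that both kinds of vruntime movement already coexist
in CFS; the penalty is the second kind, deliberately bounded.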
>
> Also assume the vCPU yielding and the target are on the same cgroup -
> you'll advance the vruntime of task in yield_deboost_apply_penalty() and
> then again in yield_task_fair()?
This is deliberate. When tasks share the same cgroup, they need both
hierarchy-level and leaf-level adjustments.
yield_deboost_apply_penalty() positions the task in cgroup timeline
(affects picking at that level), while yield_task_fair() advances the
deadline (triggers immediate reschedule). Without both, same-cgroup
yield loses effectiveness—the task would be repicked despite yielding.
The double adjustment ensures yield works at both the task level and
across hierarchy levels. This matches CFS's multi-level scheduling
philosophy.
>
>
> > + se_y_lca->deadline = se_y_lca->vruntime + calc_delta_fair(se_y_lca->slice, se_y_lca);
> > + se_y_lca->vlag = avg_vruntime(cfs_rq_common) - se_y_lca->vruntime;
>
> There is no point in setting vlag for a running entity
Maintaining invariants when modifying scheduler state is standard
practice throughout fair.c. reweight_entity() updates vlag for curr
when changing weights to preserve the lag relationship. We follow the
same principle—when artificially advancing vruntime, recalculate vlag
to maintain vlag = V - v. This prevents inconsistency when the entity
later dequeues. It's defensive correctness at negligible cost. The
alternative—leaving vlag stale—risks subtle bugs when scheduler state
assumptions are violated.
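As a tiny illustration of the stale-vlag concern (again a hypothetical
userspace toy rather than the patch itself), the idea is simply that any
artificial vruntime change recomputes the lag against the current 0-lag
point in the same place:

#include <stdint.h>
#include <stdio.h>

struct toy_rq { int64_t avg_vruntime; };	/* the 0-lag point V */
struct toy_se { int64_t vruntime; int64_t vlag; };

/* Any artificial vruntime change recomputes lag = V - v immediately. */
static void toy_set_vruntime(struct toy_rq *rq, struct toy_se *se, int64_t v)
{
	se->vruntime = v;
	se->vlag = rq->avg_vruntime - se->vruntime;
}

int main(void)
{
	struct toy_rq rq = { .avg_vruntime = 5000 };
	struct toy_se se = { .vruntime = 4000, .vlag = 1000 };

	toy_set_vruntime(&rq, &se, 6000);		/* artificial advance */
	printf("vlag=%lld\n", (long long)se.vlag);	/* -1000; the stale 1000 would be wrong */
	return 0;
}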
>
> > + update_min_vruntime(cfs_rq_common);
>
> > This is the same update pattern used in update_curr(). The EEVDF
> > relationship lag = (V - v) * w remains valid—vlag becomes more
> > negative as vruntime increases.
>
> Sure "V" just moves to the new avg_vruntime() to give the 0-lag
> point but modifying the vruntime arbitrarily doesn't seem fair to
> me.
yield_to() API explicitly requests directed unfairness. CFS already
implements unfairness mechanisms: nice values, cgroup weights,
set_next_buddy() immediate preference. Without our mechanism,
yield_to() silently fails across cgroups—buddy hints vanish at
hierarchy boundaries where EEVDF makes independent decisions. We make
the documented API functional. The real question: should yield_to()
work in production environments (nested cgroups)? If yes, vruntime
adjustment is necessary. If not, deprecate the API.
>
> > The presence of other runnable tasks
> > doesn't affect the mathematical correctness; each entity's lag is
> > computed independently relative to avg_vruntime.
> >
> >>
> >>>
> >>>> se->deadline += calc_delta_fair(se->slice, se);
> >>>> - }
> >>>> +
> >>>> + /*
> >>>> + * If we have more than one runnable task queued below
> >>>> + * this cfs_rq, the next pick will likely go for a
> >>>> + * different entity now that we have advanced the
> >>>> + * vruntime and the deadline of the running entity.
> >>>> + */
> >>>> + if (cfs_rq->h_nr_runnable > 1)
> >>>
> >>> Stopping at h_nr_runnable > 1 may not handle cross-cgroup yield_to()
> >>> correctly. Shouldn't the penalty apply at the LCA of yielder and
> >>> target? Otherwise the vruntime adjustment might not affect the level
> >>> where they actually compete.
> >>
> >> So here is the case I'm going after - consider the following
> >> hierarchy:
> >>
> >> root
> >> / \
> >> CG0 CG1
> >> | |
> >> A B
> >>
> >> CG* are cgroups and, [A-Z]* are tasks
> >>
> >> A decides to yield to B, and advances its deadline on CG0's timeline.
> >> Currently, if CG0 is eligible and CG1 isn't, pick will still select
> >> CG0 which will in-turn select task A and it'll yield again. This
> >> cycle repeats until vruntime of CG0 turns large enough to make itself
> >> ineligible and route the EEVDF pick to CG1.
> >
> > Yes, natural convergence works, but requires multiple cycles. Your
> > h_nr_runnable > 1 stops propagation when another entity might be
> > picked, but "might" depends on vruntime ordering which needs time to
> > develop. Our penalty forces immediate ineligibility at the LCA. One
> > penalty application vs N natural yield cycles.
> >
> >>
> >> Now consider:
> >>
> >>
> >> root
> >> / \
> >> CG0 CG1
> >> / \ |
> >> A C B
> >>
> >> Same scenario: A yields to B. A advances its vruntime and deadline
> >> as a part of yield. Now, why should CG0 sacrifice its fair share of
> >> runtime for A when task B is runnable? Just because one task decided
> >> to yield to another task in a different cgroup doesn't mean other
> >> waiting tasks on that hierarchy suffer.
> >
> > You're right that C suffers unfairly if it's independent work. This is
> > a known tradeoff.
>
> So KVM is only one of the users of yield_to(). This whole debouncer
> infrastructure seems to be overcomplicating all this. If anything
> is yielding across cgroup boundary - that seems like bad
> configuration and if necessary, the previous suggestion does stuff
> fairly. I don't mind accounting the lost time in
> yield_to_task_fair() and account it to target task but apart from
> that, I don't think any of it is "fair".
Time-transfer fails fundamentally: lock holders often have higher
vruntime (ran more), so crediting them backwards doesn't change EEVDF
pick order. Our penalty pushes yielder back—effective regardless. The
infrastructure addresses real measured problems: rate limiting
prevents overhead, debounce stops ping-pong accumulation, LCA
targeting fixes hierarchy picking. Nested cgroups are production
standard (systemd, containers, cloud)—not misconfiguration.
Performance gains prove yield_to was broken. Open to simplifications,
but they must actually solve the hierarchical scheduling problem.
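For reference, a rough userspace sketch of the LCA walk (hypothetical
types; the real code walks se->parent/cfs_rq via for_each_sched_entity(),
not this toy struct), using the root/CG0/CG1 example from earlier in the
thread:

#include <stdio.h>

/* Hypothetical stand-in for a sched_entity and its hierarchy linkage. */
struct toy_se {
	const char *name;
	struct toy_se *parent;	/* group entity one level up, NULL at the root rq */
	int depth;		/* 0 for entities queued on the root runqueue */
};

/*
 * Walk both hierarchies up until the two entities sit on the same
 * runqueue and return the yielder-side entity at that level: that is
 * the entity whose timeline has to move for the pick to change.
 */
static struct toy_se *toy_yielder_se_at_lca(struct toy_se *y, struct toy_se *t)
{
	while (y->depth > t->depth)
		y = y->parent;
	while (t->depth > y->depth)
		t = t->parent;
	while (y->parent != t->parent) {
		y = y->parent;
		t = t->parent;
	}
	return y;
}

int main(void)
{
	/* root <- {CG0, CG1}, CG0 <- {A}, CG1 <- {B}, as in the example above. */
	struct toy_se cg0 = { "CG0", NULL, 0 }, cg1 = { "CG1", NULL, 0 };
	struct toy_se a = { "A", &cg0, 1 }, b = { "B", &cg1, 1 };

	printf("penalize: %s\n", toy_yielder_se_at_lca(&a, &b)->name);	/* CG0 */
	return 0;
}

For A yielding to B across cgroups, the walk lands on CG0's entity on
the root runqueue, which is the level where A and B actually compete.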
Regards,
Wanpeng
Hello Wanpeng,

On 11/12/2025 10:24 AM, Wanpeng Li wrote:
>>> Problem Statement
>>> -----------------
>>>
>>> In overcommitted virtualization scenarios, vCPUs frequently spin on locks
>>> held by other vCPUs that are not currently running. The kernel's
>>> paravirtual spinlock support detects these situations and calls yield_to()
>>> to boost the lock holder, allowing it to run and release the lock.
>>>
>>> However, the current implementation has two critical limitations:
>>>
>>> 1. Scheduler-side limitation:
>>>
>>> yield_to_task_fair() relies solely on set_next_buddy() to provide
>>> preference to the target vCPU. This buddy mechanism only offers
>>> immediate, transient preference. Once the buddy hint expires (typically
>>> after one scheduling decision), the yielding vCPU may preempt the target
>>> again, especially in nested cgroup hierarchies where vruntime domains
>>> differ.
>>
>> So what you are saying is there are configurations out there where vCPUs
>> of same guest are put in different cgroups? Why? Does the use case
>> warrant enabling the cpu controller for the subtree? Are you running
>
> You're right to question this. The problematic scenario occurs with
> nested cgroup hierarchies, which is common when VMs are deployed with
> cgroup-based resource management. Even when all vCPUs of a single
> guest are in the same leaf cgroup, that leaf sits under parent cgroups
> with their own vruntime domains.
>
> The issue manifests when:
> - set_next_buddy() provides preference at the leaf level
> - But vruntime competition happens at parent levels

If that is the case, then NEXT_BUDDY is ineligible as a result of its
vruntime being higher than the weighted average of other entities.
Won't this break fairness?

Let me go look at the series and come back.

> - The buddy hint gets "diluted" when pick_task_fair() walks up the hierarchy
>
--
Thanks and Regards,
Prateek
Hi Prateek,

On Wed, 12 Nov 2025 at 14:07, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Wanpeng,
>
> On 11/12/2025 10:24 AM, Wanpeng Li wrote:
> >>> Problem Statement
> >>> -----------------
> >>>
> >>> In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> >>> held by other vCPUs that are not currently running. The kernel's
> >>> paravirtual spinlock support detects these situations and calls yield_to()
> >>> to boost the lock holder, allowing it to run and release the lock.
> >>>
> >>> However, the current implementation has two critical limitations:
> >>>
> >>> 1. Scheduler-side limitation:
> >>>
> >>> yield_to_task_fair() relies solely on set_next_buddy() to provide
> >>> preference to the target vCPU. This buddy mechanism only offers
> >>> immediate, transient preference. Once the buddy hint expires (typically
> >>> after one scheduling decision), the yielding vCPU may preempt the target
> >>> again, especially in nested cgroup hierarchies where vruntime domains
> >>> differ.
> >>
> >> So what you are saying is there are configurations out there where vCPUs
> >> of same guest are put in different cgroups? Why? Does the use case
> >> warrant enabling the cpu controller for the subtree? Are you running
> >
> > You're right to question this. The problematic scenario occurs with
> > nested cgroup hierarchies, which is common when VMs are deployed with
> > cgroup-based resource management. Even when all vCPUs of a single
> > guest are in the same leaf cgroup, that leaf sits under parent cgroups
> > with their own vruntime domains.
> >
> > The issue manifests when:
> > - set_next_buddy() provides preference at the leaf level
> > - But vruntime competition happens at parent levels
>
> If that is the case, then NEXT_BUDDY is ineligible as a result of its
> vruntime being higher than the weighted average of other entities.
> Won't this break fairness?

Yes, it does break strict vruntime fairness temporarily. That's
intentional.

The problem: buddy expires after one pick, then vruntime wins →
ping-pong. The spinning vCPU wastes CPU while the lock holder stays
preempted.

The fix applies a bounded vruntime penalty to the yielder at the
cgroup LCA level:

Bounds:
* Rate limited: 6ms minimum interval between deboosting
* Queue-adaptive caps: 6.0× gran for 2-task ping-pong, decays to
  1.0× gran for large queues (12+)
* Debounce: 600µs window detects A→B→A reverse patterns and reduces penalty
* Hierarchy-aware: Applied at LCA, so same-cgroup yields have localized impact

Why acceptable: Current behavior is already unfair—wasting CPU on
spinning instead of productive work. Bounded vruntime penalty lets the
lock holder complete faster, reducing overall waste. The scheduler
still converges to fairness—the penalty just gives the boosted task
sustained advantage until it finishes the critical section.

Runtime toggle available via
/sys/kernel/debug/sched/sched_vcpu_debooster_enabled if degradation
observed. Dbench results show net throughput wins (+6-14%) outweigh
the temporary fairness deviation.

Regards,
Wanpeng
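To make the bounds in the mail above concrete, a hypothetical userspace
sketch follows; the 6ms, 600µs and 6.0×/1.0× numbers are taken from the
mail, while the base granularity value, the linear decay between the two
endpoints, and all names are assumptions rather than the series' actual
code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_GRAN_NS		750000ULL	/* assumed base granularity (~0.75 ms) */
#define TOY_MIN_INTERVAL_NS	6000000ULL	/* 6 ms rate limit from the mail */
#define TOY_DEBOUNCE_NS		600000ULL	/* 600 µs A->B->A debounce window */

/*
 * Cap scales from 6.0x gran at 2 runnable tasks down to 1.0x at 12+;
 * the linear decay in between is an assumption, the mail only gives
 * the two endpoints.
 */
static uint64_t toy_penalty_cap(unsigned int nr_runnable)
{
	unsigned int tenths;	/* cap expressed in tenths of a granularity */

	if (nr_runnable <= 2)
		tenths = 60;
	else if (nr_runnable >= 12)
		tenths = 10;
	else
		tenths = 60 - (nr_runnable - 2) * 50 / 10;

	return TOY_GRAN_NS * tenths / 10;
}

/* A deboost is skipped entirely if one happened less than 6 ms ago. */
static bool toy_rate_limited(uint64_t now_ns, uint64_t last_deboost_ns)
{
	return now_ns - last_deboost_ns < TOY_MIN_INTERVAL_NS;
}

int main(void)
{
	for (unsigned int nr = 2; nr <= 12; nr += 5)
		printf("nr=%u cap=%lluns\n", nr,
		       (unsigned long long)toy_penalty_cap(nr));
	printf("rate_limited=%d\n", toy_rate_limited(10000000ULL, 5000000ULL));
	return 0;
}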