sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff

[PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff

Posted by Shrikanth Hegde 2 months, 1 week ago

In virtualized environments there is often vCPU overcommit, i.e. the sum
of CPUs in all guests (virtual CPUs, aka vCPUs) exceeds the number of
underlying physical CPUs (managed by the host, aka pCPUs).

When many guests ask for CPU at the same time, the host/hypervisor cannot
satisfy all of them and has to preempt one vCPU to run another. If
the guests co-ordinate and ask for less CPU overall, that reduces the
vCPU threads in the host, and vCPU preemption goes down.

Steal time is an indication of the underlying contention. If the guests
reduce their vCPU requests in proportion to it, that would achieve
the desired outcome.

An added advantage is that it would reduce lockholder preemption.
A vCPU may be holding a spinlock and still get preempted. Such cases
become rarer since there is less vCPU preemption, and the lockholder will
run to completion since it would have disabled preemption in the guest.
Workloads could run with time-slice extension to reduce lockholder
preemption for userspace locks, and this could help reduce lockholder
preemption due to vCPU preemption even for kernelspace locks.

Currently there is no infra in the scheduler which moves tasks away from
some CPUs without breaking userspace affinities. CPU hotplug or an
isolated cpuset would move tasks off some CPUs at runtime,
but if a task is affined to specific CPUs, taking those CPUs away
results in its affinity list being reset. That breaks the user affinities.
Since this is driven by the scheduler rather than by the user, we can't do
that. So a new infra is needed, preferably a lightweight one.
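
As a rough illustration of that constraint (not code from the series), the
user-space sketch below models CPU sets as 64-bit masks. effective_cpus()
is a hypothetical helper: it narrows placement to preferred CPUs without
ever modifying the task's affinity mask. The fallback to the full affinity
when the two masks do not intersect is an assumption made for this sketch.

```c
#include <stdint.h>

/* Hypothetical helper: CPU sets are modeled as 64-bit masks (bit n = CPU n). */
static uint64_t effective_cpus(uint64_t affinity, uint64_t preferred)
{
	uint64_t allowed = affinity & preferred;

	/*
	 * Assumption: a task affined only to non-preferred CPUs keeps
	 * running within its affinity; the affinity itself is never reset.
	 */
	return allowed ? allowed : affinity;
}
```

For example, a task affined to CPUs 4-7 while CPUs 0-5 are preferred would
be placed on CPUs 4-5, and its affinity list stays 4-7 throughout.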

Core idea is:
- Maintain set of CPUs which can be used by workload. It is denoted as
  cpu_preferred_mask
- Periodically compute the steal time. If steal time is high/low based
  on the thresholds, either reduce/increase the preferred CPUs.
- If a CPU is marked as non-preferred, push the task running on it if
  possible.
- Use this CPU state in wakeup and load balance to ensure tasks run
  within preferred CPUs.
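
The periodic check in the second bullet can be sketched in user-space C as
below. This is only an illustration under stated assumptions, not the code
from the series: steal_ratio(), steal_decide() and the enum are hypothetical
names, and the unit (hundredths of a percent of the sampled CPU time) is an
assumption, since the units of the thresholds are not stated here.

```c
#include <stdint.h>

enum steal_action { STEAL_KEEP, STEAL_SHRINK, STEAL_GROW };

/*
 * Steal time as hundredths of a percent of the total CPU time sampled
 * (interval_ns on each of nr_cpus CPUs). The unit is an assumption.
 */
static unsigned int steal_ratio(uint64_t steal_ns, uint64_t interval_ns,
				unsigned int nr_cpus)
{
	uint64_t total = interval_ns * nr_cpus;

	return total ? (unsigned int)(steal_ns * 10000 / total) : 0;
}

/* Above the high threshold: fewer preferred CPUs. Below the low one: more. */
static enum steal_action steal_decide(unsigned int ratio, unsigned int low,
				      unsigned int high)
{
	if (ratio > high)
		return STEAL_SHRINK;
	if (ratio < low)
		return STEAL_GROW;
	return STEAL_KEEP;
}
```

With 50ms of steal over a 1s period on 4 CPUs, steal_ratio() gives 125,
which sits between a low of 100 and a high of 200, so nothing changes.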

For the host kernel, there is no steal time, so no changes to its preferred
CPUs. So series would affect only the guest kernels.

The current series implements a simple steal time monitor, which
reduces/increases the number of cores by 1 depending on the steal time.
It also implements a very simple method to avoid oscillations. If there
is a need for more complex mechanisms for these, then doing them
via steal time governors may be an idea. One needs to enable the
feature STEAL_MONITOR to see the steal time values being processed and
preferred CPUs being set correctly. On most systems, where there
is no steal time, preferred CPUs will be the same as online CPUs.
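
The cover letter does not spell out the oscillation-avoidance method, so the
sketch below is only one plausible reading, written as user-space C. All
names are hypothetical. It steps the preferred core count by 1, clamps it
between 1 and the online core count, and (as an assumption) refuses to grow
while the steal value is still rising compared to the previous period.

```c
struct steal_state {
	unsigned int preferred_cores;	/* cores currently marked preferred */
	unsigned int online_cores;	/* upper bound for preferred_cores */
	unsigned int prev_ratio;	/* steal value seen in the previous period */
};

static void steal_adjust(struct steal_state *s, unsigned int ratio,
			 unsigned int low, unsigned int high)
{
	/* Assumed damping rule: remember the direction of the steal value. */
	int rising = ratio > s->prev_ratio;

	if (ratio > high) {
		/* Contention: give up one core, but keep at least one. */
		if (s->preferred_cores > 1)
			s->preferred_cores--;
	} else if (ratio < low && !rising) {
		/* Calm and not getting worse: take one core back. */
		if (s->preferred_cores < s->online_cores)
			s->preferred_cores++;
	}
	s->prev_ratio = ratio;
}
```

Stepping by one core per period keeps each decision cheap and bounded; the
direction check trades slower recovery for fewer back-and-forth changes.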

I will attach the irqbalance patch which detects changes in this
mask and re-adjusts the irq affinities. The series doesn't address the
irqbalance=n case; I am assuming many distros have irqbalance=y by default.

Discussion at LPC 2025:
https://www.youtube.com/watch?v=sZKpHVUUy1g

*** Please provide your suggestions and comments ***

=====================================================================
Patch Layout:
PATCH    01: Remove stale schedstats. Independent of the series.
PATCH 02-04: Introduce cpu_preferred_mask.
PATCH 05-09: Make scheduler aware of this mask.
PATCH    10: Push the current task in sched_tick if cpu is non-preferred.
PATCH    11: Add a new schedstat.
PATCH    12: Add a new sched feature: STEAL_MONITOR
PATCH 13-17: Periodically calculating steal time and take appropriate
             action.

======================================================================
Performance Numbers:
baseline: tip/master at 8a5f70eb7e4f (Merge branch into tip/master: 'x86/tdx')

on PowerPC: powerVM hypervisor:
+++++++++
Daytrader
+++++++++ 
It is a database workload which simulates stock live trading.
There are two VMs. The same workload is run in both VMs at the same time.
VM1 is bigger than VM2.

Note: VM1 sees 20% steal time, and VM2 sees 10% steal time with
baseline.


(with series: STEAL_MONITOR=y and Default debug steal_mon values)
On VM1:
			baseline		with_series
Throughput		1x			1.3x 
On VM2:
                        baseline                with_series
Throughput              1x                      1.1x


(with series: STEAL_MONITOR=y and Period 100, High 200, Low 100)
On VM1:
                        baseline                with_series
Throughput:             1x                      1.45x
On VM2:
                        baseline                with_series
Throughput:             1x                      1.13x

Verdict: Shows good improvement with default values. Even better when
the debug knobs are tuned.

+++++++++
Hackbench 
+++++++++
(with series: STEAL_MONITOR=y and Period 100, High 200, Low 100)
On VM1:
			baseline		with_series
10 groups		10.3			 8.5
30 groups		40.8			25.5
60 groups		77.2			47.8

on VM2:
			baseline		with_series
10 groups		 8.4			 7.5
30 groups		25.3			19.8
60 groups		41.7			36.3

Verdict: With tuned values, shows very good improvement.

==========================================================================
Since v1:
- A new name - Preferred CPUs and cpu_preferred_mask
  I had initially used the name as "Usable CPUs", but this seemed
  better. I thought of pv_preferred too, but left it as it could be too long.

- Arch independent code. Everything happens in scheduler. steal time is
  generic construct and this would help avoid each architecture doing the
  same thing more or less. Dropped powerpc code.

- Removed hacks around wakeups. Made it part of available_idle_cpu,
  which takes care of many of the wakeup decisions. Same for rt code.

- Implement a work function to calculate the steal times and enforce the
  policy decisions. This ensures sched_tick doesn't suffer any major
  latency.

- Steal time computation is gated with sched feature STEAL_MONITOR to
  avoid any overheads in systems which don't have vCPU overcommit.
  Feature is disabled by default.

- CPU_CAPACITY=1 was not considered since one needs the state of all CPUs
  which have this special value. Computing that in hotpath is not ideal.

- Using cpuset was not considered since it was quite tricky, given there
  are different versions and cgroups are natively user driven.

v1: https://lore.kernel.org/all/20251119124449.1149616-1-sshegde@linux.ibm.com/#t
earlier versions: https://lore.kernel.org/all/236f4925-dd3c-41ef-be04-47708c9ce129@linux.ibm.com/

TODO:
- Splicing of CPUs across NUMA nodes when CPUs aren't split equally.
- irq affinity when irqbalance=n. Not sure if this is worth it.
- Avoid running any unbound housekeeping work on non-preferred CPUs 
  such as in find_new_ilb. Tried, but showed a little regression in 
  no noise case. So didn't consider.
- This currently works for kernel built with CONFIG_SCHED_SMT. Didn't
  want to sprinkle too many ifdefs there. Not sure if there is any
  system which needs this feature but !SMT. If so, let me know.
  Seeing those ifdefs makes me wonder, Maybe we could cleanup
  CONFIG_SCHED_SMT with cpumask_of(cpu) in case  of !SMT?
- Performance numbers in KVM with x86, s390. 

Sorry for sending it this late. This series is the one which is meant
for discussion at OSPM 2026.


Shrikanth Hegde (17):
  sched/debug: Remove unused schedstats
  sched/docs: Document cpu_preferred_mask and Preferred CPU concept
  cpumask: Introduce cpu_preferred_mask
  sysfs: Add preferred CPU file
  sched/core: allow only preferred CPUs in is_cpu_allowed
  sched/fair: Select preferred CPU at wakeup when possible
  sched/fair: load balance only among preferred CPUs
  sched/rt: Select a preferred CPU for wakeup and pulling rt task
  sched/core: Keep tick on non-preferred CPUs until tasks are out
  sched/core: Push current task from non preferred CPU
  sched/debug: Add migration stats due to non preferred CPUs
  sched/feature: Add STEAL_MONITOR feature
  sched/core: Introduce a simple steal monitor
  sched/core: Compute steal values at regular intervals
  sched/core: Handle steal values and mark CPUs as preferred
  sched/core: Mark the direction of steal values to avoid oscillations
  sched/debug: Add debug knobs for steal monitor

 .../ABI/testing/sysfs-devices-system-cpu      |  11 +
 Documentation/scheduler/sched-arch.rst        |  48 ++++
 Documentation/scheduler/sched-debug.rst       |  27 +++
 drivers/base/cpu.c                            |  12 +
 include/linux/cpumask.h                       |  22 ++
 include/linux/sched.h                         |   4 +-
 kernel/cpu.c                                  |   6 +
 kernel/sched/core.c                           | 219 +++++++++++++++++-
 kernel/sched/cpupri.c                         |   4 +
 kernel/sched/debug.c                          |  10 +-
 kernel/sched/fair.c                           |   8 +-
 kernel/sched/features.h                       |   3 +
 kernel/sched/rt.c                             |   4 +
 kernel/sched/sched.h                          |  41 ++++
 14 files changed, 409 insertions(+), 10 deletions(-)

-- 
2.47.3

Re: [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff

Posted by Shrikanth Hegde 2 months ago


On 4/8/26 12:49 AM, Shrikanth Hegde wrote:
> In the virtualized environment, often there is vCPU overcommit. i.e. sum
> of CPUs in all guests(virtual CPU aka vCPU) exceed the underlying physical CPU
> (managed by host aka pCPU).


Patch to write custom CPUs into preferred CPUs.

This might help one echo specific CPUs based on their hardware
topology. It could be used to find the different patterns across
HW and the kind of arch specific hooks one might need
if the generic STEAL_MONITOR can't cater to all needs.

Note: This disables the generic steal monitor when a custom mask is
provided and re-enables it once an empty mask is echoed.
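
For anyone scripting this from C rather than a shell, a minimal user-space
sketch follows. write_cpulist() is a hypothetical helper, not part of the
patch. The sysfs path used in the usage note is the one the irqbalance
patch in this thread reads, and writing to it is assumed to need root.

```c
#include <stdio.h>

/*
 * Write a cpulist string (e.g. "0-3") followed by a newline to a
 * sysfs-style file. Returns 0 on success, -1 on failure.
 */
static int write_cpulist(const char *path, const char *cpulist)
{
	FILE *f = fopen(path, "w");
	int err;

	if (!f)
		return -1;

	err = (fprintf(f, "%s\n", cpulist) < 0);

	/* Errors from the store callback may only show up at flush time. */
	if (fclose(f) != 0)
		err = 1;

	return err ? -1 : 0;
}
```

Calling write_cpulist("/sys/devices/system/cpu/preferred", "0-7") would set
a custom mask. Passing "" writes a bare newline, which is assumed here to
parse as the empty mask and hand control back to the generic steal monitor.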

---
  drivers/base/cpu.c    | 54 ++++++++++++++++++++++++++++++++++++++++++-
  include/linux/sched.h |  3 +++
  kernel/sched/core.c   |  4 ++++
  3 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 0a6cf37f2001..133f28b15906 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -392,12 +392,64 @@ static int cpu_uevent(const struct device *dev, struct kobj_uevent_env *env)
  #endif

  #ifdef CONFIG_PARAVIRT
+static ssize_t preferred_store(struct device *dev,
+			      struct device_attribute *attr,
+			      const char *buf, size_t count)
+{
+	cpumask_var_t temp_mask;
+	int retval = 0;
+	int cpu;
+
+	if (!alloc_cpumask_var(&temp_mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	retval = cpulist_parse(buf, temp_mask);
+	if (retval)
+		goto free_mask;
+
+	/* ALL cpus can't be marked as paravirt */
+	if (cpumask_equal(temp_mask, cpu_online_mask)) {
+		retval = -EINVAL;
+		goto free_mask;
+	}
+	if (cpumask_weight(temp_mask) > num_online_cpus()) {
+		retval = -EINVAL;
+		goto free_mask;
+	}
+
+	/* Echoing > means all CPUs are preferred and Enables generic steal monitor */
+	if (cpumask_empty(temp_mask)) {
+		static_branch_disable(&disable_generic_steal_mon);
+		cpumask_copy((struct cpumask *)&__cpu_preferred_mask, cpu_online_mask);
+
+	} else {
+		/*
+		 * Explicit Specification of Usable CPUs and Disables generic steal
+		 * monitor
+		 */
+		static_branch_enable(&disable_generic_steal_mon);
+		cpumask_copy((struct cpumask *)&__cpu_preferred_mask, temp_mask);
+
+		/* Enable tick on nohz_full cpu */
+		for_each_cpu_andnot(cpu, cpu_online_mask, temp_mask) {
+			if (tick_nohz_full_cpu(cpu))
+				tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
+		}
+	}
+
+	retval = count;
+
+free_mask:
+	free_cpumask_var(temp_mask);
+	return retval;
+}
+
  static ssize_t preferred_show(struct device *dev,
  			      struct device_attribute *attr, char *buf)
  {
  	return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(cpu_preferred_mask));
  }
-static DEVICE_ATTR_RO(preferred);
+static DEVICE_ATTR_RW(preferred);
  #endif

  const struct bus_type cpu_subsys = {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6c0d5d36f21c..3760c8047ffe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2515,4 +2515,7 @@ extern void migrate_enable(void);

  DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())

+#ifdef CONFIG_PARAVIRT
+DECLARE_STATIC_KEY_FALSE(disable_generic_steal_mon);
+#endif
  #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cb9110f95ebf..680da55070f8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11339,6 +11339,7 @@ void sched_push_current_non_preferred_cpu(struct rq *rq)
  }

  struct steal_monitor_t steal_mon;
+DEFINE_STATIC_KEY_FALSE(disable_generic_steal_mon);

  void sched_init_steal_monitor(void)
  {
@@ -11428,6 +11429,9 @@ void sched_trigger_steal_computation(int cpu)
  	if (likely(cpu != first_hk_cpu))
  		return;

+	if (static_branch_unlikely(&disable_generic_steal_mon))
+		return;
+
  	/*
  	 * Since everything is updated by first housekeeping CPU,
  	 * There is no need for complex syncronization.
-- 
2.47.3

Re: [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff

Posted by Hillf Danton 2 months, 1 week ago

On Wed,  8 Apr 2026 00:49:33 +0530 Shrikanth Hegde wrote:
> In the virtualized environment, often there is vCPU overcommit. i.e. sum
> of CPUs in all guests(virtual CPU aka vCPU) exceed the underlying physical CPU
> (managed by host aka pCPU). 
> 
> When many guests ask for CPU at the same time, host/hypervisor would
> fail to satisfy that ask and has to preempt one vCPU to run another. If
> the guests co-ordinate and ask for less CPU overall, that reduces the
> vCPU threads in host, and vCPU preemption goes down.
> 
> Steal time is an indication of the underlying contention. Based on that,
> if the guests reduce the vCPU request that proportionally, it would achieve
> the desired outcome.
> 
> The added advantage is, it would reduce the lockholder preemption.
> A vCPU maybe holding a spinlock, but still could get preempted. Such cases
> will reduce since there is less vCPU preemption and lockholder will run to
> completion since it would have disabled preemption in the guest.
> Workload could run with time-slice extention to reduce lockholder
> preemption for userspace locks, and this could help reduce lockholder
> preemption even for kernelspace due to vCPU preemption.
> 
> Currently there is no infra in scheduler which moves away the task from
> some CPUs without breaking the userspace affinities. CPU hotplug,
> isolated CPUset would achieve moving the task off some CPUs at runtime,
> But if some task is affined to specific CPUs, taking those CPUs away
> results in affinity list being reset. That breaks the user affinities,
> Since this is driven by scheduler rather than user doing so, can't do
> that. So need a new infra. It would be better if it is lightweight.
> 
> Core idea is:
> - Maintain set of CPUs which can be used by workload. It is denoted as
>   cpu_preferred_mask
> - Periodically compute the steal time. If steal time is high/low based
>   on the thresholds, either reduce/increase the preferred CPUs.
> - If a CPU is marked as non-preferred, push the task running on it if
>   possible.
> - Use this CPU state in wakeup and load balance to ensure tasks run
>   within preferred CPUs.
> 
> For the host kernel, there is no steal time, so no changes to its preferred
> CPUs. So series would affect only the guest kernels.
> 
Changes are added to guest in order to detect if pCPU is overloaded, and if
that is true (I mean it is layer violation), why not ask the pCPU governor,
hypervisor, to monitor the loads on pCPU and migrate vCPUs forth and back
if necessary.

Re: [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff

Posted by Shrikanth Hegde 2 months, 1 week ago

Hi Hillf.

On 4/8/26 3:44 PM, Hillf Danton wrote:
> On Wed,  8 Apr 2026 00:49:33 +0530 Shrikanth Hegde wrote:

>> Core idea is:
>> - Maintain set of CPUs which can be used by workload. It is denoted as
>>    cpu_preferred_mask
>> - Periodically compute the steal time. If steal time is high/low based
>>    on the thresholds, either reduce/increase the preferred CPUs.
>> - If a CPU is marked as non-preferred, push the task running on it if
>>    possible.
>> - Use this CPU state in wakeup and load balance to ensure tasks run
>>    within preferred CPUs.
>>
>> For the host kernel, there is no steal time, so no changes to its preferred
>> CPUs. So series would affect only the guest kernels.
>>
> Changes are added to guest in order to detect if pCPU is overloaded, and if
> that is true (I mean it is layer violation), why not ask the pCPU governor,
> hypervisor, to monitor the loads on pCPU and migrate vCPUs forth and back
> if necessary.
> 

AFAIK, there is no information in the host scheduler on what
each vCPU is running. It may be holding a mutex, a spinlock with irqs disabled,
or may be in interrupt context. Moving/migrating the vCPU threads without
that knowledge will hurt the guest. And it has to ensure fairness.

This has to work across different archs; some have Linux as the hypervisor, some
have non-Linux hypervisors, such as powerpc and s390.

Steal time in the guest is a common construct in all archs. I don't think such
commonality exists in host schedulers.

If done in the guest, the guest actually knows what it is running and what's more important.
It can make better decisions IMHO.

Re: [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff

Posted by Hillf Danton 2 months ago

On Wed, 8 Apr 2026 19:19:05 +0530 Shrikanth Hegde wrote:
>On 4/8/26 3:44 PM, Hillf Danton wrote:
>> On Wed,  8 Apr 2026 00:49:33 +0530 Shrikanth Hegde wrote:
>>> Core idea is:
>>> - Maintain set of CPUs which can be used by workload. It is denoted as
>>>    cpu_preferred_mask
>>> - Periodically compute the steal time. If steal time is high/low based
>>>    on the thresholds, either reduce/increase the preferred CPUs.
>>> - If a CPU is marked as non-preferred, push the task running on it if
>>>    possible.
>>> - Use this CPU state in wakeup and load balance to ensure tasks run
>>>    within preferred CPUs.
>>>
>>> For the host kernel, there is no steal time, so no changes to its preferred
>>> CPUs. So series would affect only the guest kernels.
>>>
>> Changes are added to guest in order to detect if pCPU is overloaded, and if
>> that is true (I mean it is layer violation), why not ask the pCPU governor,
>> hypervisor, to monitor the loads on pCPU and migrate vCPUs forth and back
>> if necessary.
>> 
>
> AFAIK, there in no information in the host scheduler on what
> each vCPU is running. It maybe holding a mutex, spinlock with irq disabled

This is what layer means (particularly in the data center environment).

> or maybe in interrupt context. Moving/migrating the vCPUs threads without
> that knowledge will hurt the guest. And it has to ensure fairness.
> 
We have to pay the cost for vCPU.

> This has to work across different archs, some have linux as hypervisor, some
> has non-linux hypervisor such as powerpc, s390.
> 
Yeah, in the car cockpit product environment in Shenzhen Linux, Android and
XYZ guests run on QNX, and your steal time approach looks half baked.

> Steal time in guest is common construct in all archs. I don't think such
> commonality exists in host schedulers.
> 
> If done in guest, guest actually knows what it is running and whats more important.
> It can make better decisions IMHO.

Re: [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff

Posted by Shrikanth Hegde 2 months ago

Hi Hillf.

On 4/9/26 10:45 AM, Hillf Danton wrote:
> On Wed, 8 Apr 2026 19:19:05 +0530 Shrikanth Hegde wrote:
>> On 4/8/26 3:44 PM, Hillf Danton wrote:
>>> On Wed,  8 Apr 2026 00:49:33 +0530 Shrikanth Hegde wrote:
>>>> Core idea is:
>>>> - Maintain set of CPUs which can be used by workload. It is denoted as
>>>>     cpu_preferred_mask
>>>> - Periodically compute the steal time. If steal time is high/low based
>>>>     on the thresholds, either reduce/increase the preferred CPUs.
>>>> - If a CPU is marked as non-preferred, push the task running on it if
>>>>     possible.
>>>> - Use this CPU state in wakeup and load balance to ensure tasks run
>>>>     within preferred CPUs.
>>>>
>>>> For the host kernel, there is no steal time, so no changes to its preferred
>>>> CPUs. So series would affect only the guest kernels.
>>>>
>>> Changes are added to guest in order to detect if pCPU is overloaded, and if
>>> that is true (I mean it is layer violation), why not ask the pCPU governor,
>>> hypervisor, to monitor the loads on pCPU and migrate vCPUs forth and back
>>> if necessary.
>>>
>>
>> AFAIK, there in no information in the host scheduler on what
>> each vCPU is running. It maybe holding a mutex, spinlock with irq disabled
> 
> This is what layer means (particularly in the data center environment).
> 

Host / hypervisor scheduler
- Schedules vCPU threads as opaque entities.
Has no visibility into:
- whether a vCPU is holding a spinlock
- whether IRQs are disabled
- whether a guest mutex is contended
- guest scheduler state
Can only ensure fairness between vCPUs

Guest scheduler
Knows exact task‑level semantics
- lock ownership
- preemption state
- affinity constraints.
But does not control pCPUs directly, unless there is vCPU pinning.

Steal time is precisely the contract boundary between those layers.
So this is not a layer violation. The guest is acting on its CPUs based on
the hint which the host already provides.

Actual layer violation would be:
- host peeking into guest scheduler data
- host deciding which guest vCPUs are “important”
- host understanding guest locks or IRQ state

Or maybe I am not understanding what you mean by layer violation.
If so, please explain it to me.

Today, why is steal time being reported?
So that the guest/host can make appropriate decisions, right?

When you see high steal values, you have two choices.
Either increase the underlying resource by re-partitioning the host
with more cores, or reduce the incoming requests from the guests such that
the host can meet them. If the host is already at max cores,
then there is only one option.

One could ask: with the series, high steal values may not be seen, so how will
the system admin resize the host? Just look at preferred vs online. If they are
not the same, then there was contention and preferred became a subset of online.
We might have to update the documentation of the steal time section.
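
A monitoring tool could do that comparison along the lines of the user-space
sketch below. It assumes at most 64 CPUs, and both function names are
hypothetical. The inputs would be the contents of
/sys/devices/system/cpu/preferred (added by this series) and
/sys/devices/system/cpu/online.

```c
#include <stdint.h>
#include <stdlib.h>

/* Parse a cpulist such as "0-3,8" into a 64-bit mask. Returns -1 on error. */
static int cpulist_to_mask(const char *s, uint64_t *mask)
{
	*mask = 0;
	while (*s && *s != '\n') {
		char *end;
		unsigned long lo = strtoul(s, &end, 10), hi;

		if (end == s)
			return -1;
		hi = lo;
		if (*end == '-') {
			s = end + 1;
			hi = strtoul(s, &end, 10);
			if (end == s)
				return -1;
		}
		if (lo > hi || hi > 63)
			return -1;
		if (*end != ',' && *end != '\n' && *end != '\0')
			return -1;
		for (unsigned long cpu = lo; cpu <= hi; cpu++)
			*mask |= UINT64_C(1) << cpu;
		s = (*end == ',') ? end + 1 : end;
	}
	return 0;
}

/* 1 if preferred differs from online, 0 if equal, -1 on a parse error. */
static int saw_contention(const char *preferred, const char *online)
{
	uint64_t p, o;

	if (cpulist_to_mask(preferred, &p) || cpulist_to_mask(online, &o))
		return -1;
	return p != o;
}
```

Note this only reflects the state at the moment the files are read; a tool
that wants a history would have to sample periodically.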

>> or maybe in interrupt context. Moving/migrating the vCPUs threads without
>> that knowledge will hurt the guest. And it has to ensure fairness.
>>
> We have to pay the cost for vCPU.
> 
>> This has to work across different archs, some have linux as hypervisor, some
>> has non-linux hypervisor such as powerpc, s390.
>>
> Yeah, in the car cockpit product environment in Shenzhen Linux, Android and
> XYZ guests run on QNX, and your steal time approach looks half baked.
> 

They likely don't have this problem. IIUC, automotive hypervisors prefer
deterministic behavior. Having steal time brings unbounded latency.

If the guests are not Linux, then yes, the same logic will have to be there in each guest.
But that problem exists in the other direction too. You have to inform the host somehow which of my
vCPU threads are important. That is going to be way more complex IMHO.
Even in Linux we don't have that interface today. And then you repeat the same in every non-Linux
guest. One could say that is even worse.

If the guests are all indeed Linux, then the solution would work just fine.

Just to reiterate:
- Host kernel - No change, as it can't have the steal time construct. Minimal overhead.
- Guest without steal time - No functional change. Minimal overhead.
- Guest with steal time - NO_STEAL_MONITOR - No functional change. Minimal overhead.
- Guest with steal time - STEAL_MONITOR - Functional changes - Steal driven vCPU backoff.

>> Steal time in guest is common construct in all archs. I don't think such
>> commonality exists in host schedulers.
>>
>> If done in guest, guest actually knows what it is running and whats more important.
>> It can make better decisions IMHO.

Re: [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff

Posted by Shrikanth Hegde 2 months, 1 week ago

> I will attach the irqbalance patch which detects the changes in this
> mask and re-adjusts the irq affinities. Series doesn't address when
> irqbalance=n. Assuming many distros have irqbalance=y by default.
> 

Subject: [PATCH] irqbalance: Check for changes in cpu_preferred_mask

---
  cputree.c    | 28 +++++++++++++++++++++++++++-
  irqbalance.c |  2 ++
  irqbalance.h |  1 +
  3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/cputree.c b/cputree.c
index 9baa264..1db3422 100644
--- a/cputree.c
+++ b/cputree.c
@@ -56,6 +56,11 @@ cpumask_t banned_cpus;
  
  cpumask_t cpu_online_map;
  
+/* This can dynamically change. If any change in mask detect
+ * and trigger a rebuild
+ */
+cpumask_t cpu_preferred_mask;
+
  /*
     it's convenient to have the complement of banned_cpus available so that
     the AND operator can be used to mask out unwanted cpus
@@ -506,15 +511,36 @@ void clear_work_stats(void)
  	for_each_object(numa_nodes, clear_obj_stats, NULL);
  }
  
+void parse_preferred_cpus(void)
+{
+	cpumask_t preferred;
+	char *path = NULL;
+
+	path = "/sys/devices/system/cpu/preferred";
+	cpus_clear(preferred);
+	process_one_line(path, get_mask_from_cpulist, &preferred);
+
+	/* Did anything change compared to earlier */
+	if (!cpus_equal(preferred, cpu_preferred_mask)) {
+		log(TO_CONSOLE, LOG_INFO, "cpu preferred mask changed\n");
+		need_rebuild = 1;
+	}
+
+	cpus_copy(cpu_preferred_mask, preferred);
+}
  
  void parse_cpu_tree(void)
  {
  	DIR *dir;
  	struct dirent *entry;
+	char buffer[4096];
  
  	setup_banned_cpus();
  
-	cpus_complement(unbanned_cpus, banned_cpus);
+	cpus_andnot(unbanned_cpus, cpu_preferred_mask, banned_cpus);
+
+	cpumask_scnprintf(buffer, 4096, unbanned_cpus);
+	log(TO_CONSOLE, LOG_INFO, "Unbanned CPUs: %s\n", buffer);
  
  	dir = opendir("/sys/devices/system/cpu");
  	if (!dir)
diff --git a/irqbalance.c b/irqbalance.c
index f80244c..f3d46b8 100644
--- a/irqbalance.c
+++ b/irqbalance.c
@@ -229,6 +229,7 @@ static void parse_command_line(int argc, char **argv)
  static void build_object_tree(void)
  {
  	build_numa_node_list();
+	parse_preferred_cpus();
  	parse_cpu_tree();
  	rebuild_irq_db();
  }
@@ -275,6 +276,7 @@ gboolean scan(gpointer data __attribute__((unused)))
  	log(TO_CONSOLE, LOG_INFO, "\n\n\n-----------------------------------------------------------------------------\n");
  	clear_work_stats();
  	parse_proc_interrupts();
+	parse_preferred_cpus();
  
  
  	/* cope with cpu hotplug -- detected during /proc/interrupts parsing */
diff --git a/irqbalance.h b/irqbalance.h
index 47e40cc..593b183 100644
--- a/irqbalance.h
+++ b/irqbalance.h
@@ -57,6 +57,7 @@ void migrate_irq_obj(struct topo_obj *from, struct topo_obj *to, struct irq_info
  void activate_mappings(void);
  void clear_cpu_tree(void);
  void free_cpu_topo(gpointer data);
+extern void parse_preferred_cpus(void);
  /*===================NEW BALANCER FUNCTIONS============================*/
  
  /*
-- 
2.47.3