sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff

[PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff

Posted by Shrikanth Hegde 4 weeks ago

This version follows the OSPM26 discussion[1]. There was
a good discussion around this problem and there was feedback on some
of the implementation bits. Some of it has been tried/implemented
and a few items have been deferred.

*** Review and feedback is much appreciated!! ***

[1]:https://youtu.be/adxUKFPlOp0

Briefly, the core idea is:
- Maintain a set of CPUs which can be used by the workload. It is
  denoted as cpu_preferred_mask.
- Periodically compute the steal time. If steal time is above the high
  threshold, reduce the preferred CPUs; if it is below the low
  threshold, increase them.
- If a CPU is marked as non-preferred, push the task running on it if
  possible.
- Use this CPU state in wakeup and load balance to ensure tasks run
  within preferred CPUs.
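
The periodic step above can be sketched in userspace C. This is only an
illustration, not the code from the series: struct steal_monitor,
steal_pct() and steal_monitor_step() are hypothetical names, the
thresholds are plain percentages, and keeping at least one preferred
CPU is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state; the series tracks a cpumask, not just a count. */
struct steal_monitor {
	unsigned int nr_online;    /* CPUs currently online */
	unsigned int nr_preferred; /* CPUs the workload should use */
	unsigned int high_pct;     /* shrink when steal% is above this */
	unsigned int low_pct;      /* grow when steal% is below this */
};

/* Steal time as a percentage of the sampling window. */
static unsigned int steal_pct(uint64_t steal_ns, uint64_t window_ns)
{
	if (window_ns == 0)
		return 0;
	return (unsigned int)(steal_ns * 100 / window_ns);
}

/*
 * One periodic step. Returns -1 if a CPU was taken out of the preferred
 * set, +1 if one was added back, 0 if nothing changed.
 */
static int steal_monitor_step(struct steal_monitor *m,
			      uint64_t steal_ns, uint64_t window_ns)
{
	unsigned int pct = steal_pct(steal_ns, window_ns);

	/* Assumed invariant: 1 <= preferred <= online. */
	assert(m->nr_preferred >= 1 && m->nr_preferred <= m->nr_online);

	if (pct > m->high_pct && m->nr_preferred > 1) {
		m->nr_preferred--;
		return -1;
	}
	if (pct < m->low_pct && m->nr_preferred < m->nr_online) {
		m->nr_preferred++;
		return 1;
	}
	return 0;
}
```

Between the two thresholds nothing changes, which gives the step a
dead band and keeps it from flapping on every sample.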

For more details on the idea, problem statement and performance numbers,
please refer to the cover letter of v2[2] and the OSPM talk[1].

==========================================================================
Note: This series expects the dependent series mentioned below to be
applied on the base (tip/master).
base: 4d034938b6b1 ("Merge branch into tip/master: 'x86/tdx'")
Dependent series: https://lore.kernel.org/all/20260513133934.380347-1-sshegde@linux.ibm.com/#t

==========================================================================
Changes since v2[2]:

- Introduce a new config CONFIG_PREFERRED_CPU and make the user select
  it to get this feature. This was suggested by Yury Norov.
  This removes the dependency on PARAVIRT, which should make s390
  folks happy.

- With CONFIG_PREFERRED_CPU=n, the preferred state is the same as the
  online state.

- With CONFIG_PREFERRED_CPU=y, maintain the design construct that
  preferred is always a subset of online.

- Create a debugfs folder called steal_monitor in sched. Move away from
  sched_feat since it offers no easy way to call additional code on
  enable/disable. This is essential when one disables the feature,
  since preferred then has to be made the same as online to maintain
  that construct.

- With feature=off, the preferred state is the same as the online state.
  The feature is still based on a static key to avoid any runtime
  overhead.

- Prevent the ifdeffery from spreading to many files. Now the ifdeffery
  is confined mainly to */sched.h, cpumask.h and debug.c. Some ifdeffery
  has been kept to avoid code bloat and to avoid introducing debug files
  when config=n.

- Using the active mask instead of the preferred mask (one of the ideas
  suggested). This was tried. When there is high steal time,
  a CPU marked as not-active isn't available to workloads which are
  pinned to it. That would break user affinities.
  Also, the active mask is heavily used and its semantics are well
  known. So decided not to use it.

- Support the feature for CONFIG_SCHED_SMT=n as well. Note that some may
  have interpreted my earlier comment as being about whether SMT is
  supported or not. The limitation was actually with
  CONFIG_SCHED_SMT=n (which is rare btw). It was due to the ifdeffery
  around cpu_smt_mask, which was not pretty.
  With the effort of removing the ifdeffery around it [3], this series
  supports CONFIG_SCHED_SMT=n too.

- Introduce arch-specific handling for inc/dec of preferred CPUs. This
  was an ask from s390, as it may have a good hint from HW on which
  specific CPUs to take out. I am hoping the current hooks would work
  for s390. Please let me know if they work or not.

- Added comments around the O(N^2) complexity in rare cases for
  select_fallback_rq. (Yury Norov)

- irqbalance=n was considered not important. It was also quite hard to
  send interrupts on non-preferred CPUs. A patch was sent[4] as a
  reply to the previous version, which covers irqbalance=y.

- Performance numbers from v2 (x86, powerpc, s390) showed nice
  improvements in some cases without any major regression. Numbers are
  expected to be similar for this series.
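
The "preferred is always a subset of online" construct, and the reset
to online when the feature is switched off, can be modelled with plain
bitmasks. This is a toy userspace sketch under stated assumptions:
struct cpu_state and the set_*() helpers are hypothetical names, it
handles at most 64 CPUs, and treating a newly onlined CPU as preferred
is an assumption. The series itself uses struct cpumask and a static
key rather than a bool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one bit per CPU, at most 64 CPUs. */
struct cpu_state {
	uint64_t online;
	uint64_t preferred;
	bool feature_enabled;
};

/* The construct to maintain: no preferred bit outside online. */
static bool preferred_is_subset(const struct cpu_state *s)
{
	return (s->preferred & ~s->online) == 0;
}

static void set_cpu_online(struct cpu_state *s, unsigned int cpu, bool online)
{
	assert(cpu < 64);
	uint64_t bit = UINT64_C(1) << cpu;

	if (online) {
		s->online |= bit;
		s->preferred |= bit;	/* assumption: new CPUs start preferred */
	} else {
		s->online &= ~bit;
		s->preferred &= ~bit;	/* an offline CPU cannot stay preferred */
	}
	assert(preferred_is_subset(s));
}

static void set_cpu_preferred(struct cpu_state *s, unsigned int cpu, bool pref)
{
	assert(cpu < 64);
	uint64_t bit = UINT64_C(1) << cpu;

	if (!s->feature_enabled)
		return;			/* preferred must stay equal to online */
	if (pref)
		s->preferred |= bit & s->online;	/* ignore offline CPUs */
	else
		s->preferred &= ~bit;
	assert(preferred_is_subset(s));
}

/* Disabling needs extra work, which is why a bare feature flag is not enough. */
static void set_feature(struct cpu_state *s, bool enable)
{
	s->feature_enabled = enable;
	if (!enable)
		s->preferred = s->online;
}
```

The set_feature() helper is the piece that a plain sched_feat toggle
could not express: turning the feature off has a side effect on the
mask.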

==========================================================================
TODO/OPEN Questions: 

- SCHED_EXT is still pending. I tried adding a few checks in
  scx_idle_test_and_clear_cpu and pick_idle_cpu_in_node, and pushing the
  sched_ext task in tick. But it still hasn't worked with scx_simple.
  I will try to figure it out. But I may need help since
  I have yet to wade into the deeper waters of sched_ext.

- Use a PELT-like signal to smooth the steal time. This may help
  avoid oscillations. The current one works to a certain extent.

- NUMA splicing when dec/inc preferred CPUs. Left it out for now, as the
  simple method works quite well. NUMA splicing is going to be heavy.
  Is it really necessary? Are there common topologies with weird CPU
  distributions across NUMA nodes?

- Consider not changing the state of isolcpus, since one usually pins
  the workload on them anyway. Not a typical use case though.

- Corner cases when there are multiple VMs and each may have only one
  core. Are those cases worth taking a look at?

- Add cpumask_check at appropriate places.

- Currently it works if all the guests enable the feature. If not, one
  guest may take advantage of another. Is that to be fixed? Since this
  has to be enabled by admins, is that still a valid concern?
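
On the smoothing item above: a minimal sketch of what a smoothed steal
signal could look like. This is a plain exponentially weighted moving
average, not PELT itself (PELT uses a geometric series over fixed
periods); steal_ewma_update() and STEAL_EWMA_SHIFT are hypothetical
names and the 1/4 weight is an arbitrary choice for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical weight: each new sample contributes 1/4 of the average. */
#define STEAL_EWMA_SHIFT 2

static_assert(STEAL_EWMA_SHIFT > 0 && STEAL_EWMA_SHIFT < 31,
	      "shift must leave room in a 32-bit divisor");

/*
 * avg_new = avg + (sample - avg) / 2^shift
 *
 * Signed 64-bit math so that a sample below the average pulls it down.
 * Integer division truncates toward zero, so differences smaller than
 * 2^shift are rounded away.
 */
static uint32_t steal_ewma_update(uint32_t avg, uint32_t sample)
{
	int64_t diff = (int64_t)sample - (int64_t)avg;

	return (uint32_t)((int64_t)avg + diff / (1 << STEAL_EWMA_SHIFT));
}
```

Feeding the smoothed value rather than the raw sample into the
threshold comparison means a single noisy interval moves the signal by
only a quarter of its distance from the average.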

[2] v2: https://lore.kernel.org/all/20260407191950.643549-1-sshegde@linux.ibm.com/#t
[3]: https://lore.kernel.org/all/20260506110052.9974-1-sshegde@linux.ibm.com/#t
[4]: https://lore.kernel.org/all/8beafb01-f891-4b13-8eae-c6f3face7001@linux.ibm.com/


PS: There were several suggestions in the OSPM discussion. Some have
been incorporated; those intentionally deferred (such as sched_ext)
are mentioned above; the rest might have been overlooked.

Please let me know if any specific suggestion should be prioritized
or reconsidered. Please review.

Shrikanth Hegde (20):
  sched/debug: Remove unused schedstats
  sched/docs: Document cpu_preferred_mask and Preferred CPU concept
  kconfig: Provide PREFERRED_CPU option
  cpumask: Introduce cpu_preferred_mask
  sysfs: Add preferred CPU file
  sched/core: allow only preferred CPUs in is_cpu_allowed
  sched/fair: Select preferred CPU at wakeup when possible
  sched/fair: load balance only among preferred CPUs
  sched/rt: Select a preferred CPU for wakeup and pulling rt task
  sched/core: Keep tick on non-preferred CPUs until tasks are out
  sched/core: Push current task from non preferred CPU
  sched/debug: Add migration stats due to non preferred CPUs
  sched/debug: Create debugfs folder steal_monitor
  sched/debug: Provide debugfs to enable/disable steal monitor
  sched/core: Introduce a simple steal monitor
  sched/core: Compute steal values at regular intervals
  sched/core: Introduce default arch handling code for inc/dec preferred
    CPUs
  sched/core: Handle steal values and mark CPUs as preferred
  sched/core: Mark the direction of steal values to avoid oscillations
  sched/debug: Add debug knobs for steal monitor

 .../ABI/testing/sysfs-devices-system-cpu      |  11 +
 Documentation/scheduler/sched-arch.rst        |  49 ++++
 Documentation/scheduler/sched-debug.rst       |  32 +++
 drivers/base/cpu.c                            |   8 +
 include/linux/cpumask.h                       |  21 +-
 include/linux/sched.h                         |  21 +-
 kernel/Kconfig.preempt                        |  13 +
 kernel/cpu.c                                  |  16 ++
 kernel/sched/core.c                           | 255 +++++++++++++++++-
 kernel/sched/cpupri.c                         |   1 +
 kernel/sched/debug.c                          |  51 +++-
 kernel/sched/fair.c                           |   6 +-
 kernel/sched/rt.c                             |   4 +
 kernel/sched/sched.h                          |  27 ++
 14 files changed, 505 insertions(+), 10 deletions(-)

-- 
2.47.3

Re: [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff

Posted by Shrikanth Hegde 2 days, 16 hours ago

On 5/14/26 8:51 PM, Shrikanth Hegde wrote:

> TODO/OPEN Questions:
> 
> - SCHED_EXT is still pending. I tried adding few checks in
>    scx_idle_test_and_clear_cpu, pick_idle_cpu_in_node and push the
>    sched_ext task in tick. But it hasn't still worked with scx_simple.
>    I will try to figure it out. But i may need help since
>    I am yet wade deeper waters in sched_ext.
> 

In addition to pushing the task out at sched_tick, I did some hacking
around the wakeup path (scx_select_cpu_dfl) to make it aware of the
cpu_preferred state. That didn't make it work.
Then I tried to push the task out in enqueue_task_scx.
That makes it somewhat work, but there is still a warning with scx_simple.

Given there are many different scx_ flavours, it might make sense to add a new one
which is aware of the preferred state and can manage it without any change to the
default implementations, using custom select_cpu/enqueue/dequeue etc.

So I am thinking the sched_ext bit will have to be pursued later, as I still
don't have much knowledge there.

Hacked code that was tried for scx_simple:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9a92a479720d..4009ca1ec027 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11361,7 +11361,8 @@ void sched_push_current_non_preferred_cpu(struct rq *rq)

         /* Push only if it is FAIR/RT class */
         if (push_task->sched_class != &fair_sched_class &&
-           push_task->sched_class != &rt_sched_class)
+           push_task->sched_class != &rt_sched_class &&
+           push_task->sched_class != &ext_sched_class)
                 return;

         if (kthread_is_per_cpu(push_task) ||
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f5a3233ead1a..a7a87f849e03 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2043,6 +2043,11 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int core_enq_
                 goto out;
         }

+       if (!cpu_preferred(rq->cpu)) {
+               sched_push_current_non_preferred_cpu(rq);
+               return;
+       }
+
         set_task_runnable(rq, p);
         p->scx.flags |= SCX_TASK_QUEUED;
         rq->scx.nr_running++;

Re: [PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff

Posted by Shrikanth Hegde 1 week, 1 day ago


On 5/14/26 8:51 PM, Shrikanth Hegde wrote:
> This version is after the OSPM26 Discussion[1]. There was
> a good discussion around this problem and there were feedback on some
> of the implementation bits. Some of them have been tried/implemented
> and few have been deferred.
> 
> *** Review and feedback is much appreciated!! ***
> 

Gentle Ping. Please review.