Very briefly,
- Maintain a set of CPUs which can be used by the workload. It is denoted
as cpu_preferred_mask.
- Periodically compute the steal time. If steal time is above the high
threshold, reduce the preferred CPUs; if it is below the low threshold,
increase them.
- If a CPU is marked as non-preferred, push the task running on it if
possible.
- Use this CPU state in wakeup and load balance to ensure tasks run
within preferred CPUs.
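
The threshold step above can be pictured with a small userspace sketch.
This is only an illustration, not the code in the series: the 8-CPU
bitmask, the STEAL_HIGH_PCT/STEAL_LOW_PCT values, the helper names and
the choice of which CPU to drop or bring back are all assumptions made
for this example.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS         8	/* hypothetical machine size */
#define STEAL_HIGH_PCT 20	/* hypothetical: shrink above this */
#define STEAL_LOW_PCT   5	/* hypothetical: grow below this */

/* Count the CPUs currently set in the mask. */
static unsigned int mask_weight(uint64_t mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/*
 * One monitor period: take the current preferred mask and the steal
 * percentage seen in that period, and return the updated mask.
 */
static uint64_t adjust_preferred(uint64_t preferred, unsigned int steal_pct)
{
	unsigned int nr = mask_weight(preferred);
	int cpu;

	/* Sketch invariant: at least one CPU always stays preferred. */
	assert(nr >= 1 && nr <= NR_CPUS);

	if (steal_pct > STEAL_HIGH_PCT && nr > 1) {
		/* High steal: drop the highest-numbered preferred CPU. */
		for (cpu = NR_CPUS - 1; cpu >= 0; cpu--) {
			if (preferred & (UINT64_C(1) << cpu))
				return preferred & ~(UINT64_C(1) << cpu);
		}
	} else if (steal_pct < STEAL_LOW_PCT && nr < NR_CPUS) {
		/* Low steal: bring back the lowest-numbered non-preferred CPU. */
		for (cpu = 0; cpu < NR_CPUS; cpu++) {
			if (!(preferred & (UINT64_C(1) << cpu)))
				return preferred | (UINT64_C(1) << cpu);
		}
	}

	/* Between the thresholds, or already at a limit: no change. */
	return preferred;
}
```

With these assumed values, a fully preferred mask seeing 30% steal loses
one CPU, and a later period with 2% steal brings it back. The gap
between the two thresholds leaves the mask untouched for values in
between.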
For more details on the idea, problem statement and performance numbers,
please refer to the cover letter of v2 [2] and the OSPM talk [1].
*** Please review and provide your feedback!! ***
[1] OSPM talk: https://youtu.be/adxUKFPlOp0
[2] v2: https://lore.kernel.org/all/20260407191950.643549-1-sshegde@linux.ibm.com/#t
[3] v3: https://lore.kernel.org/all/20260514152204.481115-1-sshegde@linux.ibm.com/
v3->v4:
- Made preferred a subset of active instead of online. (K Prateek Nayak,
Peter Zijlstra)
- Dropped RT patch
- Decided that a generic sched_ext change doesn't make sense. Hence it has
to be a custom sched_ext scheduler with its own select_cpu,
enqueue/dequeue etc. This will be done later.
- Changed is_cpu_allowed/select_fallback_rq to avoid N**2 (K Prateek
Nayak). Two bits of information are encoded there. Let me know if this
needs to be split up into two.
- Add concurrency protection for enabling/disabling steal monitor (Ilya
Leoshkevich)
- Dropped tmp_mask and reset steal_monitor state (Ilya Leoshkevich)
- Added a few cpumask_check (Yury Norov)
- Picked up tag for patch 1. (Thanks to K Prateek Nayak)
- Decided not to add too much complexity for NUMA splicing.
There is no major TODO item at this point. There are a few minor
additions which may be good to do, provided the numbers show they are
worth it. Performance numbers are expected to be the same as v2.
base: tip/sched/core at
c095741713d1 ("sched/fair: Fix newidle vs core-sched")
Shrikanth Hegde (20):
sched/debug: Remove unused schedstats
sched/docs: Document cpu_preferred_mask and Preferred CPU concept
kconfig: Provide PREFERRED_CPU option
cpumask: Introduce cpu_preferred_mask
sysfs: Add preferred CPU file
sched/core: allow only preferred CPUs in is_cpu_allowed
sched/fair: Select preferred CPU at wakeup when possible
sched/fair: load balance only among preferred CPUs
sched/core: Keep tick on non-preferred CPUs until tasks are out
sched/core: Push current task from non preferred CPU
sched/debug: Add migration stats due to non preferred CPUs
sched/debug: Create debugfs folder steal monitor
sched/debug: Provide debugfs to enable/disable steal monitor
sched/core: Introduce a simple steal monitor
sched/core: Compute steal values at regular intervals
sched/core: Introduce default arch handling code for inc/dec preferred
CPUs
sched/core: Handle steal values and mark CPUs as preferred
sched/core: Mark the direction of steal values to avoid oscillations
sched/debug: Add debug knobs for steal monitor
sched/core: Add a few check for valid CPU in inc/dec of preferred CPUs
.../ABI/testing/sysfs-devices-system-cpu | 11 +
Documentation/scheduler/sched-arch.rst | 49 ++++
Documentation/scheduler/sched-debug.rst | 34 +++
drivers/base/cpu.c | 8 +
include/linux/cpumask.h | 21 +-
include/linux/sched.h | 20 +-
kernel/Kconfig.preempt | 13 +
kernel/cpu.c | 14 +
kernel/sched/core.c | 269 +++++++++++++++++-
kernel/sched/debug.c | 55 +++-
kernel/sched/fair.c | 4 +
kernel/sched/sched.h | 42 +++
12 files changed, 531 insertions(+), 9 deletions(-)
--
2.47.3