When a CPU has no more CFS tasks to run, and idle_balance() fails to
find a task, then attempt to steal a task from an overloaded CPU in the
same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
identify candidates. To minimize search time, steal the first migratable
task that is found when the bitmap is traversed. For fairness, search
for migratable tasks on an overloaded CPU in order of next to run.
This simple stealing yields a higher CPU utilization than idle_balance()
alone, because the search is cheap, so it may be called every time the CPU
is about to go idle. idle_balance() does more work because it searches
widely for the busiest queue, so to limit its CPU consumption, it declines
to search if the system is too busy. Simple stealing does not offload the
globally busiest queue, but it is much better than running nothing at all.
The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
reduce cache contention vs the usual bitmap when many threads concurrently
set, clear, and visit elements.
Patch 1 defines the sparsemask type and its operations.
Patches 2, 3, and 4 implement the bitmap of overloaded CPUs.
Patches 5 and 6 refactor existing code for a cleaner merge of later
patches.
Patches 7 and 8 implement task stealing using the overloaded CPUs bitmap.
Patch 9 adds schedstats for comparing the new behavior to the old, and
is provided as a convenience for developers only, not for integration.
The patch series is based on kernel 7.0.0-rc1. It compiles, boots, and
runs with/without each of CONFIG_SCHED_SMT, CONFIG_SMP, CONFIG_SCHED_DEBUG,
and CONFIG_PREEMPT.
Stealing is controlled by the sched feature SCHED_STEAL, which is enabled
by default.
Stealing improves utilization with only a modest CPU overhead in scheduler
code. In the following experiment, hackbench is run with varying numbers
of groups (40 tasks per group), and the delta in /proc/schedstat is shown
for each run, averaged per CPU, augmented with these non-standard stats:
steal - number of times a task is stolen from another CPU.
X6-2: 2 socket * 40 cores * 2 hyperthreads = 160 CPUs
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
hackbench <grps> process 100000
baseline
grps time %busy sched idle wake steal
1 2.182 20.00 35876 17905 17958 0
2 2.391 39.00 67753 33808 33921 0
3 2.871 47.00 100944 48966 51538 0
4 2.928 62.00 114489 55171 59059 0
8 4.852 83.00 219907 92961 121703 0
new
grps time %busy sched idle wake steal %speedup
1 2.229 18.00 45450 22691 22751 52 -2.1
2 2.123 40.00 49975 24977 24990 6 12.6
3 2.690 61.00 56118 22641 32780 9073 6.7
4 2.828 80.00 37927 12828 24165 8442 3.5
8 4.120 95.00 85929 8613 57858 11098 17.8
Elapsed time improves by up to 17.8%, and CPU busy utilization is up
by 1 to 18 percentage points, hitting 95% at peak load.
Note that all test results presented below are based on the
NO_DELAY_DEQUEUE implementation. Although I have implemented the necessary
adaptations to support DELAY_DEQUEUE, I observed a noticeable performance
regression in hackbench when both DELAY_DEQUEUE and SCHED_STEAL are enabled
simultaneously, specifically in heavily overloaded scenarios where the
number of tasks far exceeds the number of CPUs. Any suggestions on how to
address this would be appreciated.
---
Changes from v1 to v2:
- Remove stray find_time hunk from patch 5
- Fix "warning: label out defined but not used" for !CONFIG_SCHED_SMT
- Set SCHED_STEAL_NODE_LIMIT_DEFAULT to 2
- Steal iff avg_idle exceeds the cost of stealing
Changes from v2 to v3:
- Update series for kernel 4.20. Context changes only.
Changes from v3 to v4:
- Avoid 64-bit division on 32-bit processors in compute_skid()
- Replace IF_SMP with inline functions to set idle_stamp
- Push ZALLOC_MASK body into calling function
- Set rq->cfs_overload_cpus in update_top_cache_domain instead of
cpu_attach_domain
- Rewrite sparsemask iterator for complete inlining
- Cull and clean up sparsemask functions and move all into
sched/sparsemask.h
Changes from v4 to v5:
- Adapt overload_set and overload_clear to support the DELAY_DEQUEUE
feature.
- Rebase onto kernel v7.0.0-rc1 and refresh all benchmark results.
- Remove the previous Patch 9 ("sched/fair: disable stealing if too many
NUMA nodes").
Steve Sistare (9):
sched: Provide sparsemask, a reduced contention bitmap
sched/topology: Provide hooks to allocate data shared per LLC
sched/topology: Provide cfs_overload_cpus bitmap
sched/fair: Dynamically update cfs_overload_cpus
sched/fair: Hoist idle_stamp up from idle_balance
sched/fair: Generalize the detach_task interface
sched/fair: Provide can_migrate_task_llc
sched/fair: Steal work from an overloaded CPU when CPU goes idle
sched/fair: Provide idle search schedstats
include/linux/sched/topology.h | 1 +
kernel/sched/core.c | 31 ++-
kernel/sched/fair.c | 340 +++++++++++++++++++++++++++++++--
kernel/sched/features.h | 6 +
kernel/sched/sched.h | 11 ++
kernel/sched/sparsemask.h | 210 ++++++++++++++++++++
kernel/sched/stats.c | 9 +
kernel/sched/stats.h | 13 ++
kernel/sched/topology.c | 96 +++++++++-
9 files changed, 694 insertions(+), 23 deletions(-)
create mode 100644 kernel/sched/sparsemask.h
--
2.34.1
On Fri, Mar 20, 2026 at 05:59:11AM +0000, Chen Jinghuang wrote:
> When a CPU has no more CFS tasks to run, and idle_balance() fails to
> find a task, then attempt to steal a task from an overloaded CPU in the
> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
> identify candidates. To minimize search time, steal the first migratable
> task that is found when the bitmap is traversed. For fairness, search
> for migratable tasks on an overloaded CPU in order of next to run.

This makes no sense. It is the task of newidle to get more work -- if
that is failing for you, then we should fix that, not build a second way
to get tasks.
On 3/20/2026 4:53 PM, Peter Zijlstra wrote:
> On Fri, Mar 20, 2026 at 05:59:11AM +0000, Chen Jinghuang wrote:
>> When a CPU has no more CFS tasks to run, and idle_balance() fails to
>> find a task, then attempt to steal a task from an overloaded CPU in the
>> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
>> identify candidates. To minimize search time, steal the first migratable
>> task that is found when the bitmap is traversed. For fairness, search
>> for migratable tasks on an overloaded CPU in order of next to run.
>
> This makes no sense. It is the task of newidle to get more work -- if
> that is failing for you, then we should fix that, not build a second way
> to get tasks.

Hi Peter,

That is a very valid point. However, profiling data indicates that
newidle_balance() still incurs excessive scanning overhead. Sharing the
expensive tick balance path is becoming an issue on high-core-count
systems. As highlighted in recent ILB threads, profiling sqlite on a
224-CPU Sapphire Rapids shows:

https://lore.kernel.org/all/cover.1690273854.git.yu.c.chen@intel.com/

  6.69%  0.09%  sqlite3  [kernel.kallsyms]  [k] newidle_balance
  5.39%  4.71%  sqlite3  [kernel.kallsyms]  [k] update_sd_lb_stats

Walking the sd hierarchy in update_sd_lb_stats alone consumes over 5% of
CPU cycles. If we spend all that time scanning just to pull a tiny,
short-lived task, the search overhead dwarfs the actual runtime. It's a
net loss.

To mitigate this cost, I'd like to check whether you are open to these
two directions:

1. Goal: As Tim questioned in the thread above: "Do we always have to
   find the busiest group and pull from it? Would a relatively busy
   group be enough?" Instead of chasing absolute fairness, shouldn't
   newidle_balance() prioritize fast task acquisition? A task-stealing
   mechanism can be effective in this regard.

2. Refactoring: Instead of a standalone fix, we could track LLC overload
   in sched_domain_shared and add a lightweight fast path atop
   newidle_balance(), gated by rq->avg_idle.

Do you think refactoring along these lines to integrate into the
existing framework makes sense?

Thanks,
Chen Jinghuang
On 20/03/26 09:53, Peter Zijlstra wrote:
> On Fri, Mar 20, 2026 at 05:59:11AM +0000, Chen Jinghuang wrote:
>> When a CPU has no more CFS tasks to run, and idle_balance() fails to
>> find a task, then attempt to steal a task from an overloaded CPU in the
>> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
>> identify candidates. To minimize search time, steal the first migratable
>> task that is found when the bitmap is traversed. For fairness, search
>> for migratable tasks on an overloaded CPU in order of next to run.
>
> This makes no sense. It is the task of newidle to get more work -- if
> that is failing for you, then we should fix that, not build a second way
> to get tasks.

Yeah I don't recall exactly how it was stitched together, but IIRC
Steve's patches had stealing as the default newidle balance up until we
got to NUMA balancing because it acted funny there.