This patch series introduces infrastructure for cache-aware load
balancing, with the goal of co-locating tasks that share data within
the same Last Level Cache (LLC) domain. By improving cache locality,
the scheduler can reduce cache bouncing and cache misses, ultimately
improving data access efficiency. The design builds on the initial
prototype from Peter [1].
This initial implementation treats threads within the same process as
entities that are likely to share data. During load balancing, the
scheduler attempts to aggregate such threads onto the same LLC domain
whenever possible.
Most of the feedback received on v2 has been addressed. There were
discussions around grouping tasks using mechanisms other than process
membership. While we agree that more flexible grouping is desirable, this
series intentionally focuses on establishing the basic process-based
grouping first, with alternative grouping mechanisms to be explored
in a follow-on series. As a step in that direction, cache-aware
scheduling statistics have been separated from the mm structure into a
new sched_cache_stats structure. Thanks for all the useful feedback
at LPC 2025 and on v2; we'd like to start a separate thread to
discuss possible user interfaces.
The load balancing algorithms remain largely unchanged. The main
changes in v3 are:
1. Cache-aware scheduling is skipped after repeated load balance
failures (up to cache_nice_tries). This avoids repeatedly attempting
cache-aware migrations when no movable tasks prefer the destination
LLC.
2. The busiest runqueue is no longer sorted to select tasks that prefer
the destination LLC. This sorting was costly, and equivalent
behavior can be achieved by skipping tasks that do not prefer the
destination LLC during cache-aware migrations.
3. The calculation of the LLC ID switches to using
   sched_domain_topology_level data directly, which simplifies
   the ID derivation.
4. Accounting of the number of tasks preferring each LLC is now kept in
the lowest-level sched domain per CPU. This simplifies handling of
LLC resizing and changes in the number of LLC domains.
Test results:
The patch series was applied and tested on v6.19-rc3.
See: https://github.com/timcchen1298/linux/commits/cache_aware_v3
The first test platform is a 2-socket Intel Sapphire Rapids with 30
cores per socket. DRAM interleaving is enabled in the BIOS, so it
essentially has one NUMA node with two last level caches. There are 60
CPUs associated with each last level cache.
The second test platform is an AMD Genoa. There are 4 nodes and 32 CPUs
per node. Each node has 2 CCXs, and each CCX has 16 CPUs.
hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched
on these two platforms.
[TL;DR]
Sapphire Rapids:
hackbench shows significant improvement when the number of
different active threads is below the capacity of a LLC.
schbench shows overall wakeup latency improvement.
ChaCha20-xiangshan (a RISC-V simulator) shows good throughput
improvement. No obvious difference was observed in
netperf/stream/stress-ng Hmean.
Genoa:
Significant improvement is observed in hackbench when
the number of active threads is lower than the number
of CPUs within one LLC. On v2, Aaron reported improvements
in hackbench/redis when the system is underloaded.
ChaCha20-xiangshan shows a huge throughput improvement.
Phoronix tested v1 and shows good improvements in 30+
cases [2]. No obvious difference was observed in
netperf/stream/stress-ng Hmean.
Detail:
Due to length constraints, data without much difference from baseline is not
presented.
Sapphire Rapids:
[hackbench pipe]
case load baseline(std%) compare%( std%)
threads-pipe-2 1-groups 1.00 ( 3.19) +29.06 ( 3.31)*
threads-pipe-2 2-groups 1.00 ( 9.61) +19.19 ( 0.55)*
threads-pipe-2 4-groups 1.00 ( 6.69) +15.02 ( 1.34)*
threads-pipe-2 8-groups 1.00 ( 1.83) +25.59 ( 1.46)*
threads-pipe-4 1-groups 1.00 ( 3.41) +28.63 ( 1.17)*
threads-pipe-4 2-groups 1.00 ( 15.62) +19.51 ( 0.82)
threads-pipe-4 4-groups 1.00 ( 0.19) +27.05 ( 0.74)*
threads-pipe-4 8-groups 1.00 ( 4.32) +5.64 ( 3.18)
threads-pipe-8 1-groups 1.00 ( 0.44) +24.68 ( 0.49)*
threads-pipe-8 2-groups 1.00 ( 2.03) +23.76 ( 0.52)*
threads-pipe-8 4-groups 1.00 ( 3.77) +7.16 ( 1.58)
threads-pipe-8 8-groups 1.00 ( 4.53) +6.88 ( 2.36)
threads-pipe-16 1-groups 1.00 ( 1.71) +28.46 ( 0.68)*
threads-pipe-16 2-groups 1.00 ( 4.25) -0.23 ( 0.97)
threads-pipe-16 4-groups 1.00 ( 0.64) -0.95 ( 3.74)
threads-pipe-16 8-groups 1.00 ( 1.23) +1.77 ( 0.31)
Note: The default number of fds in hackbench is changed from 20 to various
values to ensure that threads fit within a single LLC, especially on AMD
systems. Take "threads-pipe-8, 2-groups" for example: the number of fds
is 8, and 2 groups are created.
[schbench]
The 99th percentile wakeup latency shows overall improvements, while
the 99th percentile request latency exhibits increased run-to-run
variance. The cache-aware scheduling logic, which scans all online CPUs
to identify the hottest LLC, may be the root cause of the elevated
request latency: it delays the task from returning to user space
due to the costly task_cache_work(). This issue should be mitigated by
restricting the scan to a limited set of NUMA nodes [3], and that fix is
planned to be integrated once the current version is in good shape.
99th Wakeup Latencies Base (mean±std) Compare (mean±std) Change
--------------------------------------------------------------------------------
thread = 2 13.33(1.15) 13.00(1.73) +2.48%
thread = 4 12.33(1.53) 9.67(1.53) +21.57%
thread = 8 10.00(0.00) 10.67(0.58) -6.70%
thread = 16 10.00(1.00) 9.33(0.58) +6.70%
thread = 32 10.33(0.58) 9.67(1.53) +6.39%
thread = 64 10.33(0.58) 9.33(1.53) +9.68%
thread = 128 12.67(0.58) 12.00(0.00) +5.29%
Run-to-run variance regresses at 1 messenger + 8 workers:
Request Latencies 99.0th 3981.33(260.16) 4877.33(1880.57) -22.51%
[chacha20]
Time reduced by 20%
Genoa:
[hackbench pipe]
The default number of fds is 20, which exceeds the number of CPUs
in an LLC, so the number of fds is adjusted to 2, 4, 8 and 16 respectively.
Excluding results with large run-to-run variance, 20% ~ 50%
improvement is observed when the system is underloaded:
case load baseline(std%) compare%( std%)
threads-pipe-2 1-groups 1.00 ( 4.04) +47.22 ( 4.77)*
threads-pipe-2 2-groups 1.00 ( 5.04) +33.79 ( 8.92)*
threads-pipe-2 4-groups 1.00 ( 5.82) +5.93 ( 7.97)
threads-pipe-2 8-groups 1.00 ( 16.15) -4.11 ( 6.85)
threads-pipe-4 1-groups 1.00 ( 7.28) +50.43 ( 2.39)*
threads-pipe-4 2-groups 1.00 ( 10.77) -4.31 ( 7.71)
threads-pipe-4 4-groups 1.00 ( 11.16) +8.12 ( 11.21)
threads-pipe-4 8-groups 1.00 ( 12.79) -10.10 ( 12.92)
threads-pipe-8 1-groups 1.00 ( 5.57) -1.50 ( 6.55)
threads-pipe-8 2-groups 1.00 ( 10.72) +0.69 ( 6.38)
threads-pipe-8 4-groups 1.00 ( 7.04) +19.70 ( 5.58)*
threads-pipe-8 8-groups 1.00 ( 7.11) +27.46 ( 2.34)*
threads-pipe-16 1-groups 1.00 ( 2.86) -12.82 ( 8.97)
threads-pipe-16 2-groups 1.00 ( 8.55) +2.96 ( 1.65)
threads-pipe-16 4-groups 1.00 ( 5.12) +20.49 ( 5.33)*
threads-pipe-16 8-groups 1.00 ( 3.23) +9.06 ( 2.87)
[chacha20]
baseline:
Host time spent: 51432ms
sched_cache:
Host time spent: 28664ms
Time reduced by 45%
[1] https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@linux.intel.com/
[2] https://www.phoronix.com/review/cache-aware-scheduling-amd-turin
[3] https://lore.kernel.org/all/865b852e3fdef6561c9e0a5be9a94aec8a68cdea.1760206683.git.tim.c.chen@linux.intel.com/
Change history:
**v3 Changes:**
1. Cache-aware scheduling is skipped after repeated load balance
failures (up to cache_nice_tries). This avoids repeatedly attempting
cache-aware migrations when no movable tasks prefer the destination
LLC.
2. The busiest runqueue is no longer sorted to select tasks that prefer
the destination LLC. This sorting was costly, and equivalent
behavior can be achieved by skipping tasks that do not prefer the
destination LLC during cache-aware migrations.
3. Accounting of the number of tasks preferring each LLC is now kept in
the lowest-level sched domain per CPU. This simplifies handling of
LLC resizing and changes in the number of LLC domains.
4. Other changes from v2 are detailed in each patch's change log.
**v2 Changes:**
v2 link: https://lore.kernel.org/all/cover.1764801860.git.tim.c.chen@linux.intel.com/
1. Align NUMA balancing and cache affinity by
prioritizing NUMA balancing when their decisions differ.
2. Dynamically resize per-LLC statistics structures based on the LLC
size.
3. Switch to a contiguous LLC-ID space so these IDs can be used
directly as array indices for LLC statistics.
4. Add clarification comments.
5. Add 3 debug patches (not meant for merging).
6. Other changes to address feedbacks from review of v1 patch set
(see individual patch change log).
**v1**
v1 link: https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@linux.intel.com/
Chen Yu (10):
sched/cache: Record per LLC utilization to guide cache aware
scheduling decisions
sched/cache: Introduce helper functions to enforce LLC migration
policy
sched/cache: Make LLC id continuous
sched/cache: Disable cache aware scheduling for processes with high
thread counts
sched/cache: Avoid cache-aware scheduling for memory-heavy processes
sched/cache: Enable cache aware scheduling for multi LLCs NUMA node
sched/cache: Allow the user space to turn on and off cache aware
scheduling
sched/cache: Add user control to adjust the aggressiveness of
cache-aware scheduling
-- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy
for each process via proc fs
-- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load
balance statistics
Peter Zijlstra (Intel) (1):
sched/cache: Introduce infrastructure for cache-aware load balancing
Tim Chen (10):
sched/cache: Assign preferred LLC ID to processes
sched/cache: Track LLC-preferred tasks per runqueue
sched/cache: Introduce per CPU's tasks LLC preference counter
sched/cache: Calculate the percpu sd task LLC preference
sched/cache: Count tasks preferring destination LLC in a sched group
sched/cache: Check local_group only once in update_sg_lb_stats()
sched/cache: Prioritize tasks preferring destination LLC during
balancing
sched/cache: Add migrate_llc_task migration type for cache-aware
balancing
sched/cache: Handle moving single tasks to/from their preferred LLC
sched/cache: Respect LLC preference in task migration and detach
fs/proc/base.c | 31 +
include/linux/cacheinfo.h | 21 +-
include/linux/mm_types.h | 43 ++
include/linux/sched.h | 32 +
include/linux/sched/topology.h | 8 +
include/trace/events/sched.h | 79 +++
init/Kconfig | 11 +
init/init_task.c | 3 +
kernel/fork.c | 6 +
kernel/sched/core.c | 11 +
kernel/sched/debug.c | 55 ++
kernel/sched/fair.c | 1088 +++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 44 ++
kernel/sched/topology.c | 194 +++++-
14 files changed, 1598 insertions(+), 28 deletions(-)
--
2.32.0
On 02/10/26 14:18, Tim Chen wrote:
> This patch series introduces infrastructure for cache-aware load
> balancing, with the goal of co-locating tasks that share data within
> the same Last Level Cache (LLC) domain.
[ ... ]
> This initial implementation treats threads within the same process as
> entities that are likely to share data. During load balancing, the

This is a very aggressive assumption. From what I've seen, only a few tasks
truly share data. Lumping everything in a process together is an easy way to
classify, but I think we can do better.

> scheduler attempts to aggregate such threads onto the same LLC domain
> whenever possible.

I admit I have yet to look fully at the series. But I must ask, why are you
deferring to load balance and not looking at the wakeup path? LB should be
for corrections. When the wakeup path is making the wrong decision all the
time, isn't LB (which is super slow to react) too late to start grouping
tasks? What am I missing?

In my head Core Scheduling is already doing what we want. We just need to
extend it to be a bit more relaxed (best effort rather than completely strict
for security reasons today). This will be a lot more flexible and will allow
tasks to be co-located from the get-go. And it will defer the responsibility
of tagging to userspace. If they do better or worse, it's on them :) It seems
you already hit a corner case where the grouping was a bad idea and are doing
some magic with thread numbers to alleviate it.

FWIW I have come across cases in the mobile world where co-locating on a
cluster or a 'big' core with a big L2 cache can benefit a small group of
tasks. So the concept is generally beneficial, as cache hierarchies are not
symmetrical in many systems now. Even on symmetrical systems, a case can be
made that two small data-dependent tasks can benefit from packing on a single
CPU.

I know this changes the direction being taken here; but I strongly believe
the right way is to extend the wakeup path rather than lump it solely in LB
(IIUC).

Note I am looking at NETLINK to enable our proposed Sched QoS library to
listen to critical events like a process being created and tasks being forked
to auto-tag them. Userspace would easily be able to tag individual tasks as
co-dependent, or ask for a whole process to be tagged as such (assign the
same cookie to all forked tasks for that process). We should not need to do
any magic in the kernel then, other than provide the mechanisms for them to
shoot themselves in the foot (or do better ;-))
On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
> On 02/10/26 14:18, Tim Chen wrote:
[ ... ]
> This is a very aggressive assumption. From what I've seen, only few tasks
> truly share data. Lumping everything in a process together is an easy way
> to classify, but I think we can do better.

Not without more information. And that is something we can always add later.
But like you well know, it is an uphill battle to get programs to
explain/annotate themselves.

The alternative is sampling things using the PMU, see which process is trying
to access which data, but that too is non-trivial, not to mention it will get
people really upset for consuming PMU resources.

Starting things with a simple assumption is fine. This can always be
extended. Gotta start somewhere and all that. It currently groups things by
mm_struct, but it would be fairly straight forward to allow userspace to
group tasks manually.

> I admit yet to look fully at the series. But I must ask, why are you
> deferring to load balance and not looking at wake up path? LB should be for
> corrections.
[ ... ]

There used to be wakeup steering, but I'm not sure that still exists in this
version (still need to read beyond the first few patches). It isn't hard to
add.

But I think Tim and Chen have mostly been looking at 'enterprise' workloads.

> In my head Core Scheduling is already doing what we want. We just need to
> extend it to be a bit more relaxed (best effort rather than completely
> strict for security reasons today).
[ ... ]

No, Core scheduling does completely the wrong thing. Core scheduling is set
up to do co-scheduling, because that's what was required for that whole
speculation trainwreck. And that is very much not what you want or need here.

You simply want a preference to co-locate things that use the same data.
Which really is a completely different thing.

> FWIW I have come across cases on mobile world were co-locating on a cluster
> or a 'big' core with big L2 cache can benefit a small group of tasks.
[ ... ]

Sure, we all know this. pipe-bench is a prime example, it flies if you
co-locate them on the same CPU. It tanks if you pull them apart (except SMT
siblings, those are mostly good too).

> I know this changes the direction being made here; but I strongly believe
> the right way is to extend wake up path rather than lump it solely in LB
> (IIUC).

You're really going to need both, and LB really is the more complicated part.
On a busy/loaded system, LB will completely wreck things for you if it
doesn't play ball.
On 02/19/26 15:41, Peter Zijlstra wrote:
> On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
[ ... ]
> Not without more information. And that is something we can always add
> later. But like you well know, it is an uphill battle to get programs to
> explain/annotate themselves.

Yes. I think we should be able to come up with a daemon to profile a workload
on a machine and come up with a recommendation of tasks that have data
co-dependency.

Note I am strongly against programs specifying this themselves. We need to
provide a service that helps with the correct tagging - ie: it is an
admin-only operation.

> The alternative is sampling things using the PMU, see which process is
> trying to access which data, but that too is non-trivial, not to mention it
> will get people really upset for consuming PMU resources.

I was hoping we could tell which data structures are shared between tasks
with perf?

I am thinking this is not something that needs to run continuously, but is
discovered one time on a machine or once every update. The profiling can be
done once (on demand) I believe.

Still, if someone really wants to tag all the tasks of a process to stay
together, I think this is fine if that's what they want.

> No, Core scheduling does completely the wrong thing. Core scheduling is set
> up to do co-scheduling, because that's what was required for that whole
> speculation trainwreck. And that is very much not what you want or need
> here.
>
> You simply want a preference to co-locate things that use the same data.
> Which really is a completely different thing.

Hmm. Isn't the infra the same? We have a group of tasks tagged with a cookie
that needs to be co-located. Core scheduling is strict to keep them on the
same physical core, but the concept can be extended to co-locate on an LLC or
the closest cache?

> Sure, we all know this. pipe-bench is a prime example, it flies if you
> co-locate them on the same CPU. It tanks if you pull them apart (except SMT
> siblings, those are mostly good too).

+1

> You're really going to need both, and LB really is the more complicated
> part. On a busy/loaded system, LB will completely wreck things for you if
> it doesn't play ball.

Yes, I wasn't advocating for wakeup only, of course. But I haven't read all
the details, and I saw no wakeup handling done.

And generally, as I think I have been indicating here and there, we do need
to unify the wakeup and LB decision trees. With push LB this unification
becomes a piece of cake if the wakeup path already handles the case. The
current LB is a big beast. And it will be slow to react for many systems.
On Thu, 2026-02-19 at 19:48 +0000, Qais Yousef wrote:
> On 02/19/26 15:41, Peter Zijlstra wrote:
[ ... ]
> Still if someone really wants to tag all the tasks for a process to stay
> together, I think this is fine if that's what they want.

I can envision tagging tasks with the same cookie, analogous to what we are
doing for core scheduling. Or grouping tasks by tagging a cgroup.

> Hmm. Isn't the infra the same? We have a group of tasks tagged with a
> cookie that needs to be co-located. Core scheduling is strict to keep them
> on the same physical core, but the concept can be extended to co-locate on
> LLC or closest cache?

In my understanding, core scheduling doesn't try to place the tasks with the
same cookie on the same core, but rather ensures the tasks can safely be
scheduled together in the SMT siblings of a core. However, we can certainly
use a similar cookie mechanism to indicate tasks should be scheduled close to
each other cache-wise.

> And generally as I think I have been indicating here and there; we do need
> to unify the wakeup and LB decision tree. With push lb this unification
> become a piece of cake if the wakeup path already handles the case. The
> current LB is a big beast. And will be slow to react for many systems.

I think as long as we have up-to-date information on load at the time of push
in push LB, so we don't cause over-aggregation and too much load imbalance,
it will be viable to make such aggregation at wakeup.

Tim
On 02/19/26 13:47, Tim Chen wrote:
[ ... ]
> I think as long as we have up to date information on load at the time of
> push in push lb, so we don't cause over aggregation and too much load
> imbalance, it will be viable to make such aggregation at wake up.

IMHO I see people constantly tripping over task placement being too simple
and needing a smarter decision-making process. I think Vincent's proposal is
spot on to help us handle all these situations simply, with the added bonus
of being a lot more reactive. Going down this rabbit hole is worthwhile and
will benefit us in the long run to handle more cases.
On Fri, Feb 20, 2026 at 03:41:27AM +0000, Qais Yousef wrote:
> IMHO I see people are constantly tripping over task placement being too
> simple and need smarter decision making process.

So at the same time we're always having trouble because it's too expensive
for some.
On 02/20/26 09:45, Peter Zijlstra wrote:
> So at the same time we're always having trouble because its too expensive
> for some.

If they don't want it, they can turn it off with a simple debugfs/sched_feat
toggle? I think our way out of this dilemma is to make it their choice. You
know, many problems can disappear if you make them another person's problem
:-)

Joking aside, I am trying to implement scheduler profiles in Sched QoS so
that users can pick throughput, interactive, etc. and toggle a few debugfs
knobs on their behalf. Hopefully this will help abstract the problem while
still keeping our kernel development mostly as-is. I don't think we are
forced into a choice in many cases (at the kernel level). But what do I know
:-)
Hi Peter, Qais,
On 2/19/2026 10:41 PM, Peter Zijlstra wrote:
> On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
>> On 02/10/26 14:18, Tim Chen wrote:
[ ... ]
>>
>> I admit yet to look fully at the series. But I must ask, why are you deferring
>> to load balance and not looking at wake up path? LB should be for corrections.
>> When wake up path is doing wrong decision all the time, LB (which is super slow
>> to react) is too late to start grouping tasks? What am I missing?
>
> There used to be wakeup steering, but I'm not sure that still exists in
> this version (still need to read beyond the first few patches). It isn't
> hard to add.
>
Please let me explain a little more about why we did this in the
load balance path. Yes, the original version implemented cache-aware
scheduling only in the wakeup path. According to our testing, this appeared
to cause some task bouncing issues across LLCs. This was due to conflicts
with the legacy load balancer, which tries to spread tasks to different
LLCs.

So, as Peter said, the load balancer has to be taken care of anyway. Later,
we kept the cache aware logic only in the load balancer, and the test
results became much more stable, so we kept it as is. The wakeup path more
or less aggregates the wakees (threads within the same process) within the
LLC in the wakeup fast path, so we have not changed it for now.
Let me copy the changelog from the previous patch version:
"
In previous versions, aggregation of tasks were done in the
wake up path, without making load balancing paths aware of
LLC (Last-Level-Cache) preference. This led to the following
problems:
1) Aggregation of tasks during wake up led to load imbalance
between LLCs
2) Load balancing tried to even out the load between LLCs
3) Wake up tasks aggregation happened at a faster rate and
load balancing moved tasks in opposite directions, leading
to continuous and excessive task migrations and regressions
in benchmarks like schbench.
In this version, load balancing is made cache-aware. The main
idea of cache-aware load balancing consists of two parts:
1) Identify tasks that prefer to run on their hottest LLC and
move them there.
2) Prevent generic load balancing from moving a task out of
its hottest LLC.
"
thanks,
Chenyu
On 02/19/26 23:07, Chen, Yu C wrote:
> Hi Peter, Qais,
>
> On 2/19/2026 10:41 PM, Peter Zijlstra wrote:
> > On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
> > > On 02/10/26 14:18, Tim Chen wrote:

[ ... ]

> Please let me explain a little more about why we did this in the
> load balance path. Yes, the original version implemented cache-aware
> scheduling only in the wakeup path. According to our testing, this appeared
> to cause some task bouncing issues across LLCs. This was due to conflicts
> with the legacy load balancer, which tries to spread tasks to different
> LLCs.
> So as Peter said, the load balancer should be taken care of anyway. Later,
> we kept only the cache aware logic in the load balancer, and the test
> results

Yes, we need both. My concern is that the wake up path was originally meant
to keep tasks placed correctly, as most tasks wake up and sleep often and
this is the common case. If the decision tree is not unified, we will have
problems. And this is not a problem specific to doing placement based on
memory dependency. We need to extend the wake up path to do placement based
on latency. Placement based on energy (EAS) has the same problem too. It
disabled LB altogether, which is a problem we are trying to fix, if you saw
the other discussion about overutilized handling. The load balancer can
destroy energy balance easily, and it has no notion of how to distribute
based on energy.

This is a recurring theme for any new task placement decision that is not
purely based on load. The LB will wreak havoc.

> became much more stable, so we kept it as is. The wakeup path more or less
> aggregates the wakees (threads within the same process) within the LLC in the
> wakeup fast path, so we have not changed it for now.

How expensive is it to use the new push lb, which unifies the decision with
the wake up path, to detect these bad task placements and steer them back
to the right LLC? I think if we can construct the trigger right, we can
make the load balance job of keeping tagged tasks within the same LLC much
easier. In my view this bad task placement is just a new type of misfit,
where a task has strayed from its group for whatever reason at wake up and
is not sleeping and waking up again to be placed back with its clan -
assuming the conditions have changed to warrant the move - which the wake
up path should handle anyway.

FWIW, I have been experimenting with using push lb to keep regular LB off
and relying solely on it to manage the important corner cases (including
the overloaded one) - and seeing *very* promising results. But the systems
I work with are small compared to yours.

But essentially, if we can construct the system so that the wakeup path
(via the regular sleep/wakeup cycle and push lb) keeps the system
relatively balanced, and delay regular LB for when we need a large
intervention, we can simplify the problem space significantly IMHO. If the
LB had to kick in, then the delays of not finding enough bandwidth to run
are larger than the delays of not sharing the hottest LLC. IOW, keep the
regular LB as-is for true load balance and handle the small exceptions via
the natural sleep/wakeup cycle or push lb.

> Let me copy the changelog from the previous patch version:
>
> "
> In previous versions, aggregation of tasks were done in the
> wake up path, without making load balancing paths aware of
> LLC (Last-Level-Cache) preference. This led to the following
> problems:
>
> 1) Aggregation of tasks during wake up led to load imbalance
>    between LLCs
> 2) Load balancing tried to even out the load between LLCs
> 3) Wake up tasks aggregation happened at a faster rate and
>    load balancing moved tasks in opposite directions, leading
>    to continuous and excessive task migrations and regressions
>    in benchmarks like schbench.

Note this is an artefact of tagging all tasks belonging to the process as
co-dependent. So somehow this is a case of shooting oneself in the foot,
because processes with a large number of tasks will create large imbalances
and will start to require special handling. I guess the question is: were
they really that packed, which means the steering logic needed to relax a
little bit and say hey, this is an overcommit, I must spill to the other
LLCs; or was it really okay to pack them all in one LLC, and LB was
overzealous to kick in and needed to be aware the new case is not really a
problem that requires its intervention?

> In this version, load balancing is made cache-aware. The main
> idea of cache-aware load balancing consists of two parts:

I think this might work under the conditions you care about. But it will be
hard to generalize. But I might need to go and read more.

Note I am mainly concerned because the wake up path can't stay based purely
on load forever and needs to be able to make smarter decisions (latency
being the most important one on the horizon). And they all will hit this
problem. I think we need to find a good recipe for how to handle these
problems in general. I don't think we can extend the LB to be energy aware,
latency aware, cache aware etc. without hitting a lot of hurdles. And it is
too slow to react.

> 1) Identify tasks that prefer to run on their hottest LLC and
>    move them there.
> 2) Prevent generic load balancing from moving a task out of
>    its hottest LLC.

Isn't this 2nd part the fix to the wake up problem you faced? 1 should
naturally be happening at wake up. And for random long running strayed
tasks, I believe push lb is an easier way to manage them.
On 2/20/2026 11:25 AM, Qais Yousef wrote:
> On 02/19/26 23:07, Chen, Yu C wrote:
> > Hi Peter, Qais,

[ ... ]

> How expensive is it to use the new push lb, which unifies the decision with
> wake up path, to detect these bad task placement and steer them back to the
> right LLC? I think if we can construct the trigger right, we can simplify the
> load balance to keep tagged tasks within the same LLC much easier.

Leveraging push-lb for cache-aware task placement is interesting, and we
considered it during LPC when Vincent and Prateek presented it. It could be
an enhancement to the basic cache-aware scheduling, IMO. Tim mentioned in
https://lore.kernel.org/all/4514b6aef56d0ae144ebd56df9211c6599744633.camel@linux.intel.com/
that a bouncing issue needs to be resolved if task wakeup and push-lb are
leveraged for cache-aware scheduling. They are very fast - so for
cache-aware scheduling, it is possible that multiple invocations of
select_idle_sibling() will find the same LLC suitable. Then multiple wakees
are woken up on that LLC, causing over-aggregation. Later, when
over-aggregation is detected, several tasks are migrated out of the LLC,
which makes the LLC eligible again - and the pattern repeats back and
forth.

> Note this is an artefact of tagging all tasks belonging to the process as
> co-dependent. So somehow this is a case of shooting one self in the foot
> because processes with large number of tasks will create large imbalances and
> will start to require special handling.

[ ... ]

> > 1) Identify tasks that prefer to run on their hottest LLC and
> >    move them there.
> > 2) Prevent generic load balancing from moving a task out of
> >    its hottest LLC.
>
> Isn't this 2nd part the fix to the wake up problem you faced? 1 should
> naturally be happening at wake up. And for random long running strayed tasks,
> I believe push lb is an easier way to manage them.

This is doable, and some logic needs to be added in wakeup/push lb to avoid
the bouncing issue mentioned above. Considering both where to do it (task
wakeup / push lb / generic lb) and the task tagging, I was thinking that
threads within one process appear to be a special case of tagging. If the
user chooses to create threads rather than forking new processes, is there
a higher potential for data sharing among those threads? However, we agree
that fine-grained tagging is necessary.

How about this: if the user explicitly tags tasks into a single group, the
kernel can perform aggressive task aggregation - for instance, in the
wakeup/fair-push path - and let the user accept the corresponding risks.
For the default model, generic load balancing can perform per-process task
aggregation at a slower pace to reduce the risk of false decisions and
over-aggregation. We intended to discuss this in a separate thread, though.

Thanks,
Chenyu
On 02/21/26 10:48, Chen, Yu C wrote:
> Leveraging push-lb for cache-aware task placement is interesting,
> and we have considered it during LPC when Vincent and Prateek presented it.
> It could be an enhancement to the basic cache-aware scheduling, IMO.
> Tim has mentioned that in
> https://lore.kernel.org/all/4514b6aef56d0ae144ebd56df9211c6599744633.camel@linux.intel.com/
> a bouncing issue needs to be resolved if task wakeup and push-lb are
> leveraged for cache-aware scheduling. They are very fast - so for
> cache-aware scheduling, it is possible that multiple invocations of
> select_idle_sibling() will find the same LLC suitable. Then multiple
> wakees are woken up on that LLC, causing over-aggregation. Later, when
> over-aggregation is detected, several tasks are migrated out of the LLC,
> which makes the LLC eligible again - and the pattern repeats back and
> forth.

I believe this is a symptom of how tagging is currently happening. I think
if we have a more conservative tagging approach this will be less of a
problem. But the proof is in the pudding, as they say :-)
On Thu, 2026-02-19 at 23:07 +0800, Chen, Yu C wrote:
> Hi Peter, Qais,
>
> On 2/19/2026 10:41 PM, Peter Zijlstra wrote:
> > On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
> > > On 02/10/26 14:18, Tim Chen wrote:

[ ... ]

> In this version, load balancing is made cache-aware. The main
> idea of cache-aware load balancing consists of two parts:
>
> 1) Identify tasks that prefer to run on their hottest LLC and
>    move them there.
> 2) Prevent generic load balancing from moving a task out of
>    its hottest LLC.
> "

Another reason why we moved away from doing things in the wake up path is
the load imbalance consideration. The wake up path does not have the most
up to date load information on the LLC sched domains that the load balance
path has. So you may actually have everyone rush into their own favorite
LLC and cause LLC overload. And load balance will have to undo this. This
led to frequent task migrations that hurt performance.

It is better to consider LLC preference in the load balance path so we can
aggregate tasks while still keeping load imbalance under control.

Tim
On 02/19/26 10:11, Tim Chen wrote:
> On Thu, 2026-02-19 at 23:07 +0800, Chen, Yu C wrote:

[ ... ]

> Another reason why we moved away from doing things in the wake up
> path is load imbalance consideration. Wake up path does not have
> the most up to date load information in the LLC sched domains as
> in the load balance path. So you may actually have everyone rushed

What's the reason wake up doesn't have the latest info? Is this a
limitation of these large systems where stats updates are too expensive to
do? Is it not fixable at all?

> into each's favorite LLC and causes LLC overload. And load balance
> will have to undo this. This led to frequent task migrations that
> hurts performance.
>
> It is better to consider LLC preference in the load balance path
> so we can aggregate tasks while still keeping load imbalance under
> control.
>
> Tim
On Fri, 2026-02-20 at 03:29 +0000, Qais Yousef wrote:
> On 02/19/26 10:11, Tim Chen wrote:

[ ... ]

> > Another reason why we moved away from doing things in the wake up
> > path is load imbalance consideration. Wake up path does not have
> > the most up to date load information in the LLC sched domains as
> > in the load balance path. So you may actually have everyone rushed
>
> What's the reason wake up doesn't have the latest info? Is this a limitation of
> these large systems where stats updates are too expensive to do? Is it not
> fixable at all?

You would need to sum the load of each run queue in each LLC to get an
accurate picture. That would be too expensive on the wake up path.

Tim

> > into each's favorite LLC and causes LLC overload. And load balance
> > will have to undo this. This led to frequent task migrations that
> > hurts performance.
> >
> > It is better to consider LLC preference in the load balance path
> > so we can aggregate tasks while still keeping load imbalance under
> > control.
On 02/20/26 10:14, Tim Chen wrote:
> > > Another reason why we moved away from doing things in the wake up
> > > path is load imbalance consideration. Wake up path does not have
> > > the most up to date load information in the LLC sched domains as
> > > in the load balance path. So you may actually have everyone rushed
> >
> > What's the reason wake up doesn't have the latest info? Is this a limitation of
> > these large systems where stats updates are too expensive to do? Is it not
> > fixable at all?
>
> You will need to sum the load for each run queue for each LLC to get
> an accurate picture. That will be too expensive on the wake up path.

I am probably missing something obvious. But it seems enqueue/dequeue +
TICK are not keeping the stats up-to-date enough for the wakeup path to
rely on. I need to read this code more.

I could be wrong, but as I was trying to highlight in other places, I think
the fact that we tag all tasks belonging to a process as needing to stay
together is exaggerating this problem. First, every process is assumed to
need to stay within the same LLC, and every task within the process. The
wake up path by design now has a more difficult job and needs to look
harder compared to if the tagging were more conservative. And I can
appreciate that defining, and teaching regular LB, that some imbalances are
okay under these situations is hard. It is sort of an overcommit situation
by design.

Anyway. As I was trying to tell Peter, I am trying to think how we can tie
all these similar stories together. I hope once we can provide a sensible
way to tag tasks, we can get the wake up path + push lb to work easily, as
then we should have a handful of tasks asking to co-locate, which is much
easier to manage.
On Fri, Feb 20, 2026 at 03:29:41AM +0000, Qais Yousef wrote:

> What's the reason wake up doesn't have the latest info? Is this a limitation of
> these large systems where stats updates are too expensive to do? Is it not
> fixable at all?

Scalability is indeed the main problem. The periodic load-balancer, by
virtue of being 'slow', has two advantages:

 - the cost of aggregating the numbers is amortized by the relatively low
   frequency of aggregation

 - it can work with averages; it is less concerned with immediate
   spikes.

This obviously has the exact inverse set of problems, in that it is not
able to deal with immediate/short term issues.

Anyway, we're already at the point where the EAS wakeup path is getting far
too expensive for the current set of hardware. While we started with a
handful of asymmetric CPUs, we're now pushing 32 CPUs or so.

(Look at Intel Nova Lake speculation online; that's supposedly going to
get us 2 dies of 8P+16E with another 4 bonus weaklings on the south
bridge or something, for a grand total of 52 asymmetric CPUs of 3 kinds)

Then consider:

 - Intel Granite Rapids-SP at 8*86 cores for 688 cores / 1376 threads.

 - AMD Prometheus at 2*192 cores with 384 cores / 768 threads. These
   are silly numbers of CPUs.

 - Power10: it is something like 16 sockets, 16 cores per socket, 8
   threads per core, for a mere 2048 threads.

Now, these are the extreme end of the spectrum; 'nobody' will actually
have them, but in a few generations they'll seem small again.

So whatever we build now will have to deal with silly numbers of CPUs.
On 02/20/26 10:43, Peter Zijlstra wrote: > On Fri, Feb 20, 2026 at 03:29:41AM +0000, Qais Yousef wrote: > > > What's the reason wake up doesn't have the latest info? Is this a limitation of > > these large systems where stats updates are too expensive to do? Is it not > > fixable at all? > > Scalability is indeed the main problem. The periodic load-balancer, by > virtue of being 'slow' has two advantages: > > - the cost of aggregating the numbers is amortized by the relative low > frequency of aggregation > > - it can work with averages; it is less concerned with immediate > spikes. > > This obviously has the exact inverse set of problems in that it is not > able to deal with immediate/short term issues. Yes. And if we are to focus on providing better task placement based on QoS (which what I think this is essentially is), we have a constant problem of two paths producing results that are incompatible. Which is why I am trying to stress the importance of the wake up path. I understand for this initial drop we don't have a way to provide specific hints for tasks, but this is why we end up with this difficult choices always - which I think we don't have to. More on this at the bottom. > > > Anyway, we're already at the point where EAS wakeup path is getting far > too expensive for the current set of hardware. While we started with a > handful of asymmetric CPUs, we're now pushing 32 CPUs or so. Is this 32 perf domains? Expensive for what workloads? Folks can still use performance governor and plug it to a wall if they want ;-) > > (Look at Intel Nova Lake speculation online, that's supposedly going to > get us 2 dies of 8P+16E with another 4 bonus weaklings on the south > bridge or something, for a grand total of 52 asymmetric CPUs of 3 kinds) Not sure if my experience matters for whatever this is supposed to be used for, but the cost of wrong decision is really high on these topologies. 
It is bloody worthwhile spending more time to select a better CPU and worthwhile to have the push lb do frequent corrections. Not sure if you saw the other thread on one of Vincent's patches - but I am trying to completely disable overutilized (or regular LB) and rely on wakeup + push lb and seeing great success (and gains). But I am carrying a number of improvements that I discussed in various places on the list that makes this effective setup. Hopefully I'll share full findings properly at some point. > > > Then consider: > > - Intel Granite Rapids-SP at 8*86 cores for 688 cores / 1376 threads. > > - AMD Prometheus at 2*192 cores with 384 cores / 768 threads. These > are silly number of CPUs. > > - Power10, it is something like 16 sockets, 16 cores per socket, 8 > threads per core for a mere 2048 threads. > > Now, these are the extreme end of the spectrum systems, 'nobody' will > actually have them, but in a few generations they'll seem small again. > > > So whatever we build now, will have to deal with silly numbers of CPUs. True, but I think we ought to bite the bullet at some point. My line of thought is that we don't have to (and actually shouldn't) make the compromise at the kernel level. We can define the problem such that it is opt-in/opt-out where users who find the benefit can opt-in or find a disadvantage opt-out. Now the difficulty is that we don't have a way to describe such things, and this is what I am trying to solve with Sched QoS library. I am writing this now, but I think I should be able to help with this use case so that users can describe which workload wants to benefit from co-locating and these tasks will take the hit of harder task placement and frequent migration under loaded scenarios - the contract being that being co-located has significant performance impact they are happy to pay the price. Things that didn't subscribe will work as-is. 
Anyway, my major goal is to find how we can tie all these stories together,
as we need to add the ability to do task placement based on special
requirements, and the conflict with LB is one major issue for which I think
Vincent's proposal for push lb is quite neat and spot on. I am not sure if
you saw our LPC talk about Sched QoS where we expanded on our overall
thoughts.

In my view, this problem belongs to the same class of problems of placement
based on special requirements (latency, energy, cache, etc.) and hopefully
we can address it along the way. But if not, it would be good to know more
so we can think about how to better incorporate it as part of the bigger
story. So far I think that if this can be made to go through the wake up
path and rely on push lb, it is part of the same story. If not, then we need
to think harder about how to connect things together for a coherent
approach.

If I can successfully give you a way to describe the requirement that tasks
need to be co-located, so that we don't have to make the assumption in the
kernel that tasks belonging to the same process need to stay in the same
LLC, do you think wake up + push lb works? If not, how do you see it
evolving? And more importantly, how do you view the role of regular LB in
these cases?

The way I see it is that it should trigger less, for the reasons you
mentioned at the top; and when it does trigger it means heavy intervention
is required, and whatever special task placement requirements exist will
need to be dropped at this stage, since the push lb clearly failed to keep
up and we are at a point where we need to do heavy-handed balancing work.
I think these activities are more relevant to multi-LLC systems - which have
the added problem of defining when some imbalances are okay, which I believe
is the difficulty being hit here with the wakeup-path-based approach. For
single-LLC systems I think this heavy-handed approach can be made
unnecessary if we do it correctly. Sorry, a bit of a digression.
But I am interested in how we can all move the ship in the same direction. I
think this is all part of making the wake up path multi-modal and improving
its coordination with LB.