This patch series introduces infrastructure for cache-aware load
balancing, with the goal of co-locating tasks that share data within
the same Last Level Cache (LLC) domain. By improving cache locality,
the scheduler can reduce cache bouncing and cache misses, ultimately
improving data access efficiency. The design builds on the initial
prototype from Peter [1].
This initial implementation treats threads within the same process as
entities that are likely to share data. During load balancing, the
scheduler attempts to aggregate such threads onto the same LLC domain
whenever possible.
Most of the feedback received on v2 has been addressed. There were
discussions around grouping tasks using mechanisms other than process
membership. While we agree that more flexible grouping is desirable, this
series intentionally focuses on establishing the basic process-based
grouping first, with alternative grouping mechanisms to be explored
in a follow-on series. As a step in that direction, cache-aware
scheduling statistics have been separated from the mm structure into a
new sched_cache_stats structure. Thanks for all the useful feedback
at LPC 2025 and on v2; we'd like to start a separate thread to
discuss possible user interfaces.
The load balancing algorithms remain largely unchanged. The main
changes in v3 are:
1. Cache-aware scheduling is skipped after repeated load balance
failures (up to cache_nice_tries). This avoids repeatedly attempting
cache-aware migrations when no movable tasks prefer the destination
LLC.
2. The busiest runqueue is no longer sorted to select tasks that prefer
the destination LLC. This sorting was costly, and equivalent
behavior can be achieved by skipping tasks that do not prefer the
destination LLC during cache-aware migrations.
3. The calculation of the LLC ID switches to using
   sched_domain_topology_level data directly, which simplifies
   the ID derivation.
4. Accounting of the number of tasks preferring each LLC is now kept in
the lowest-level sched domain per CPU. This simplifies handling of
LLC resizing and changes in the number of LLC domains.
Test results:
The patch series was applied and tested on v6.19-rc3.
See: https://github.com/timcchen1298/linux/commits/cache_aware_v3
The first test platform is a 2-socket Intel Sapphire Rapids with 30
cores per socket. DRAM interleaving is enabled in the BIOS, so it
essentially has one NUMA node with two last level caches. There are 60
CPUs associated with each last level cache.
The second test platform is an AMD Genoa. There are 4 nodes and 32 CPUs
per node. Each node has 2 CCXs, and each CCX has 16 CPUs.
hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched
on these two platforms.
[TL;DR]
Sapphire Rapids:
hackbench shows significant improvement when the number of
different active threads is below the capacity of a LLC.
schbench shows overall wakeup latency improvement.
ChaCha20-xiangshan (a RISC-V simulator) shows good throughput
improvement. No obvious difference was observed in
netperf/stream/stress-ng Hmean.
Genoa:
Significant improvement is observed in hackbench when
the number of active threads is lower than the number
of CPUs within one LLC. On v2, Aaron reported improvements
in hackbench/redis when the system is underloaded.
ChaCha20-xiangshan shows a huge throughput improvement.
Phoronix tested v1 and shows good improvements in 30+
cases [2]. No obvious difference was observed in
netperf/stream/stress-ng Hmean.
Detail:
Due to length constraints, data without much difference from baseline is not
presented.
Sapphire Rapids:
[hackbench pipe]
case load baseline(std%) compare%( std%)
threads-pipe-2 1-groups 1.00 ( 3.19) +29.06 ( 3.31)*
threads-pipe-2 2-groups 1.00 ( 9.61) +19.19 ( 0.55)*
threads-pipe-2 4-groups 1.00 ( 6.69) +15.02 ( 1.34)*
threads-pipe-2 8-groups 1.00 ( 1.83) +25.59 ( 1.46)*
threads-pipe-4 1-groups 1.00 ( 3.41) +28.63 ( 1.17)*
threads-pipe-4 2-groups 1.00 ( 15.62) +19.51 ( 0.82)
threads-pipe-4 4-groups 1.00 ( 0.19) +27.05 ( 0.74)*
threads-pipe-4 8-groups 1.00 ( 4.32) +5.64 ( 3.18)
threads-pipe-8 1-groups 1.00 ( 0.44) +24.68 ( 0.49)*
threads-pipe-8 2-groups 1.00 ( 2.03) +23.76 ( 0.52)*
threads-pipe-8 4-groups 1.00 ( 3.77) +7.16 ( 1.58)
threads-pipe-8 8-groups 1.00 ( 4.53) +6.88 ( 2.36)
threads-pipe-16 1-groups 1.00 ( 1.71) +28.46 ( 0.68)*
threads-pipe-16 2-groups 1.00 ( 4.25) -0.23 ( 0.97)
threads-pipe-16 4-groups 1.00 ( 0.64) -0.95 ( 3.74)
threads-pipe-16 8-groups 1.00 ( 1.23) +1.77 ( 0.31)
Note: The default number of fds in hackbench is changed from 20 to various
values to ensure that threads fit within a single LLC, especially on AMD
systems. Take "threads-pipe-8, 2-groups" for example: the number of fds
is 8, and 2 groups are created.
[schbench]
The 99th percentile wakeup latency shows overall improvements, while
the 99th percentile request latency exhibits increased run-to-run
variance. The cache-aware scheduling logic, which scans all online CPUs
to identify the hottest LLC, may be the root cause of the elevated
request latency: it delays the task from returning to user space
due to the costly task_cache_work(). This issue should be mitigated by
restricting the scan to a limited set of NUMA nodes [3], and that fix is
planned to be integrated once the current version is in good shape.
99th Wakeup Latencies Base (mean±std) Compare (mean±std) Change
--------------------------------------------------------------------------------
thread = 2 13.33(1.15) 13.00(1.73) +2.48%
thread = 4 12.33(1.53) 9.67(1.53) +21.57%
thread = 8 10.00(0.00) 10.67(0.58) -6.70%
thread = 16 10.00(1.00) 9.33(0.58) +6.70%
thread = 32 10.33(0.58) 9.67(1.53) +6.39%
thread = 64 10.33(0.58) 9.33(1.53) +9.68%
thread = 128 12.67(0.58) 12.00(0.00) +5.29%
Run-to-run variance regresses at 1 messenger + 8 workers:
Request Latencies 99.0th 3981.33(260.16) 4877.33(1880.57) -22.51%
[chacha20]
Time reduced by 20%
Genoa:
[hackbench pipe]
The default number of fds is 20, which exceeds the number of CPUs
in an LLC, so the number of fds is adjusted to 2, 4, 8 and 16 respectively.
Excluding results with large run-to-run variance, 20% ~ 50%
improvement is observed when the system is underloaded:
case load baseline(std%) compare%( std%)
threads-pipe-2 1-groups 1.00 ( 4.04) +47.22 ( 4.77)*
threads-pipe-2 2-groups 1.00 ( 5.04) +33.79 ( 8.92)*
threads-pipe-2 4-groups 1.00 ( 5.82) +5.93 ( 7.97)
threads-pipe-2 8-groups 1.00 ( 16.15) -4.11 ( 6.85)
threads-pipe-4 1-groups 1.00 ( 7.28) +50.43 ( 2.39)*
threads-pipe-4 2-groups 1.00 ( 10.77) -4.31 ( 7.71)
threads-pipe-4 4-groups 1.00 ( 11.16) +8.12 ( 11.21)
threads-pipe-4 8-groups 1.00 ( 12.79) -10.10 ( 12.92)
threads-pipe-8 1-groups 1.00 ( 5.57) -1.50 ( 6.55)
threads-pipe-8 2-groups 1.00 ( 10.72) +0.69 ( 6.38)
threads-pipe-8 4-groups 1.00 ( 7.04) +19.70 ( 5.58)*
threads-pipe-8 8-groups 1.00 ( 7.11) +27.46 ( 2.34)*
threads-pipe-16 1-groups 1.00 ( 2.86) -12.82 ( 8.97)
threads-pipe-16 2-groups 1.00 ( 8.55) +2.96 ( 1.65)
threads-pipe-16 4-groups 1.00 ( 5.12) +20.49 ( 5.33)*
threads-pipe-16 8-groups 1.00 ( 3.23) +9.06 ( 2.87)
[chacha20]
baseline:
Host time spent: 51432ms
sched_cache:
Host time spent: 28664ms
Time reduced by 45%
[1] https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@linux.intel.com/
[2] https://www.phoronix.com/review/cache-aware-scheduling-amd-turin
[3] https://lore.kernel.org/all/865b852e3fdef6561c9e0a5be9a94aec8a68cdea.1760206683.git.tim.c.chen@linux.intel.com/
Change history:
**v3 Changes:**
1. Cache-aware scheduling is skipped after repeated load balance
failures (up to cache_nice_tries). This avoids repeatedly attempting
cache-aware migrations when no movable tasks prefer the destination
LLC.
2. The busiest runqueue is no longer sorted to select tasks that prefer
the destination LLC. This sorting was costly, and equivalent
behavior can be achieved by skipping tasks that do not prefer the
destination LLC during cache-aware migrations.
3. Accounting of the number of tasks preferring each LLC is now kept in
the lowest-level sched domain per CPU. This simplifies handling of
LLC resizing and changes in the number of LLC domains.
4. Other changes from v2 are detailed in each patch's change log.
**v2 Changes:**
v2 link: https://lore.kernel.org/all/cover.1764801860.git.tim.c.chen@linux.intel.com/
1. Align NUMA balancing and cache affinity by
prioritizing NUMA balancing when their decisions differ.
2. Dynamically resize per-LLC statistics structures based on the LLC
size.
3. Switch to a contiguous LLC-ID space so these IDs can be used
directly as array indices for LLC statistics.
4. Add clarification comments.
5. Add 3 debug patches (not meant for merging).
6. Other changes to address feedbacks from review of v1 patch set
(see individual patch change log).
**v1**
v1 link: https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@linux.intel.com/
Chen Yu (10):
sched/cache: Record per LLC utilization to guide cache aware
scheduling decisions
sched/cache: Introduce helper functions to enforce LLC migration
policy
sched/cache: Make LLC id continuous
sched/cache: Disable cache aware scheduling for processes with high
thread counts
sched/cache: Avoid cache-aware scheduling for memory-heavy processes
sched/cache: Enable cache aware scheduling for multi LLCs NUMA node
sched/cache: Allow the user space to turn on and off cache aware
scheduling
sched/cache: Add user control to adjust the aggressiveness of
cache-aware scheduling
-- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy
for each process via proc fs
-- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load
balance statistics
Peter Zijlstra (Intel) (1):
sched/cache: Introduce infrastructure for cache-aware load balancing
Tim Chen (10):
sched/cache: Assign preferred LLC ID to processes
sched/cache: Track LLC-preferred tasks per runqueue
sched/cache: Introduce per CPU's tasks LLC preference counter
sched/cache: Calculate the percpu sd task LLC preference
sched/cache: Count tasks preferring destination LLC in a sched group
sched/cache: Check local_group only once in update_sg_lb_stats()
sched/cache: Prioritize tasks preferring destination LLC during
balancing
sched/cache: Add migrate_llc_task migration type for cache-aware
balancing
sched/cache: Handle moving single tasks to/from their preferred LLC
sched/cache: Respect LLC preference in task migration and detach
fs/proc/base.c | 31 +
include/linux/cacheinfo.h | 21 +-
include/linux/mm_types.h | 43 ++
include/linux/sched.h | 32 +
include/linux/sched/topology.h | 8 +
include/trace/events/sched.h | 79 +++
init/Kconfig | 11 +
init/init_task.c | 3 +
kernel/fork.c | 6 +
kernel/sched/core.c | 11 +
kernel/sched/debug.c | 55 ++
kernel/sched/fair.c | 1088 +++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 44 ++
kernel/sched/topology.c | 194 +++++-
14 files changed, 1598 insertions(+), 28 deletions(-)
--
2.32.0
On 02/10/26 14:18, Tim Chen wrote:
> This patch series introduces infrastructure for cache-aware load
> balancing, with the goal of co-locating tasks that share data within
> the same Last Level Cache (LLC) domain.
[ ... ]
> This initial implementation treats threads within the same process as
> entities that are likely to share data. During load balancing, the

This is a very aggressive assumption. From what I've seen, only a few tasks
truly share data. Lumping everything in a process together is an easy way to
classify, but I think we can do better.

> scheduler attempts to aggregate such threads onto the same LLC domain
> whenever possible.

I admit I have yet to look fully at the series. But I must ask, why are you
deferring to load balance and not looking at the wakeup path? LB should be
for corrections. When the wakeup path is making the wrong decision all the
time, isn't LB (which is super slow to react) too late to start grouping
tasks? What am I missing?

In my head Core Scheduling is already doing what we want. We just need to
extend it to be a bit more relaxed (best effort rather than completely strict
for security reasons today). This will be a lot more flexible and will allow
tasks to be co-located from the get-go. And it will defer the responsibility
of tagging to userspace. If they do better or worse, it's on them :) It seems
you already hit a corner case where the grouping was a bad idea and are doing
some magic with thread numbers to alleviate it.

FWIW I have come across cases in the mobile world where co-locating on a
cluster or a 'big' core with a big L2 cache can benefit a small group of
tasks. So the concept is generally beneficial, as cache hierarchies are not
symmetrical in many systems now. Even on symmetrical systems, a case can be
made that two small data-dependent tasks can benefit from packing on a single
CPU.

I know this changes the direction being taken here; but I strongly believe
the right way is to extend the wakeup path rather than lump it solely in LB
(IIUC).

Note I am looking at NETLINK to enable our proposed Sched QoS library to
listen to critical events like a process being created and tasks being forked
to auto-tag them. Userspace would easily be able to tag individual tasks as
co-dependent, or ask for a whole process to be tagged as such (assign the
same cookie to all forked tasks for that process). We should not need to do
any magic in the kernel then, other than provide the mechanisms for them to
shoot themselves in the foot (or do better ;-))
On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
> On 02/10/26 14:18, Tim Chen wrote:
[ ... ]
> This is a very aggressive assumption. From what I've seen, only few tasks
> truly share data. Lumping everything in a process together is an easy way
> to classify, but I think we can do better.

Not without more information. And that is something we can always add later.
But like you well know, it is an uphill battle to get programs to
explain/annotate themselves.

The alternative is sampling things using the PMU, see which process is trying
to access which data, but that too is non-trivial, not to mention it will get
people really upset for consuming PMU resources.

Starting things with a simple assumption is fine. This can always be
extended. Gotta start somewhere and all that. It currently groups things by
mm_struct, but it would be fairly straight forward to allow userspace to
group tasks manually.

> I admit yet to look fully at the series. But I must ask, why are you
> deferring to load balance and not looking at wake up path? LB should be for
> corrections.
[ ... ]

There used to be wakeup steering, but I'm not sure that still exists in this
version (still need to read beyond the first few patches). It isn't hard to
add.

But I think Tim and Chen have mostly been looking at 'enterprise' workloads.

> In my head Core Scheduling is already doing what we want. We just need to
> extend it to be a bit more relaxed (best effort rather than completely
> strict for security reasons today).
[ ... ]

No, Core scheduling does completely the wrong thing. Core scheduling is set
up to do co-scheduling, because that's what was required for that whole
speculation trainwreck. And that is very much not what you want or need here.

You simply want a preference to co-locate things that use the same data.
Which really is a completely different thing.

> FWIW I have come across cases on mobile world were co-locating on a cluster
> or a 'big' core with big L2 cache can benefit a small group of tasks.
[ ... ]

Sure, we all know this. pipe-bench is a prime example, it flies if you
co-locate them on the same CPU. It tanks if you pull them apart (except SMT
siblings, those are mostly good too).

> I know this changes the direction being made here; but I strongly believe
> the right way is to extend wake up path rather than lump it solely in LB
> (IIUC).

You're really going to need both, and LB really is the more complicated part.
On a busy/loaded system, LB will completely wreck things for you if it
doesn't play ball.
On 02/19/26 15:41, Peter Zijlstra wrote:
> On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
[ ... ]
> Not without more information. And that is something we can always add
> later. But like you well know, it is an uphill battle to get programs to
> explain/annotate themselves.

Yes. I think we should be able to come up with a daemon to profile a workload
on a machine and come up with a recommendation of tasks that have data
co-dependency.

Note I am strongly against programs specifying this themselves. We need to
provide a service that helps with the correct tagging - ie: it is an
admin-only operation.

> The alternative is sampling things using the PMU, see which process is
> trying to access which data, but that too is non-trivial, not to mention it
> will get people really upset for consuming PMU resources.

I was hoping we could tell which data structures are shared between tasks
with perf?

I am thinking this is not something that needs to run continuously, but is
discovered one time on a machine or once every update. The profiling can be
done once (on demand) I believe.

Still, if someone really wants to tag all the tasks of a process to stay
together, I think this is fine if that's what they want.

> No, Core scheduling does completely the wrong thing. Core scheduling is set
> up to do co-scheduling, because that's what was required for that whole
> speculation trainwreck. And that is very much not what you want or need
> here.
>
> You simply want a preference to co-locate things that use the same data.
> Which really is a completely different thing.

Hmm. Isn't the infra the same? We have a group of tasks tagged with a cookie
that needs to be co-located. Core scheduling is strict to keep them on the
same physical core, but the concept can be extended to co-locate on an LLC or
the closest cache?

> Sure, we all know this. pipe-bench is a prime example, it flies if you
> co-locate them on the same CPU. It tanks if you pull them apart (except SMT
> siblings, those are mostly good too).

+1

> You're really going to need both, and LB really is the more complicated
> part. On a busy/loaded system, LB will completely wreck things for you if
> it doesn't play ball.

Yes, I wasn't advocating for wakeup only, of course. But I haven't read all
the details, and I saw no wakeup handling done.

And generally, as I think I have been indicating here and there, we do need
to unify the wakeup and LB decision trees. With push LB this unification
becomes a piece of cake if the wakeup path already handles the case. The
current LB is a big beast. And it will be slow to react for many systems.
On Thu, 2026-02-19 at 19:48 +0000, Qais Yousef wrote:
> On 02/19/26 15:41, Peter Zijlstra wrote:
[ ... ]
> Still if someone really wants to tag all the tasks for a process to stay
> together, I think this is fine if that's what they want.

I can envision tagging tasks with the same cookie, analogous to what we are
doing for core scheduling. Or grouping tasks by tagging a cgroup.

> Hmm. Isn't the infra the same? We have a group of tasks tagged with a
> cookie that needs to be co-located. Core scheduling is strict to keep them
> on the same physical core, but the concept can be extended to co-locate on
> LLC or closest cache?

In my understanding, core scheduling doesn't try to place the tasks with the
same cookie on the same core, but rather ensures the tasks can safely be
scheduled together in the SMT siblings of a core. However, we can certainly
use a similar cookie mechanism to indicate tasks should be scheduled close to
each other cache-wise.

> And generally as I think I have been indicating here and there; we do need
> to unify the wakeup and LB decision tree. With push lb this unification
> become a piece of cake if the wakeup path already handles the case. The
> current LB is a big beast. And will be slow to react for many systems.

I think as long as we have up-to-date information on load at the time of push
in push LB, so we don't cause over-aggregation and too much load imbalance,
it will be viable to make such aggregation at wakeup.

Tim
On 02/19/26 13:47, Tim Chen wrote:
[ ... ]
> I think as long as we have up to date information on load at the time of
> push in push lb, so we don't cause over aggregation and too much load
> imbalance, it will be viable to make such aggregation at wake up.

IMHO I see people constantly tripping over task placement being too simple
and needing a smarter decision-making process. I think Vincent's proposal is
spot on to help us handle all these situations simply, with the added bonus
of being a lot more reactive. Going down this rabbit hole is worthwhile and
will benefit us in the long run to handle more cases.
On Fri, Feb 20, 2026 at 03:41:27AM +0000, Qais Yousef wrote:
> IMHO I see people are constantly tripping over task placement being too
> simple and need smarter decision making process.

So at the same time we're always having trouble because it's too expensive
for some.
On 02/20/26 09:45, Peter Zijlstra wrote:
> So at the same time we're always having trouble because its too expensive
> for some.

If they don't want it, they can turn it off with a simple debugfs/sched_feat
toggle? I think our way out of this dilemma is to make it their choice. You
know, many problems can disappear if you make them another person's problem
:-)

Joking aside, I am trying to implement scheduler profiles in Sched QoS so
that users can pick throughput, interactive, etc. and toggle a few debugfs
knobs on their behalf. Hopefully this will help abstract the problem while
still keeping our kernel development mostly as-is. I don't think we are
forced into a choice in many cases (at the kernel level). But what do I know
:-)
Hi Peter, Qais,
On 2/19/2026 10:41 PM, Peter Zijlstra wrote:
> On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
>> On 02/10/26 14:18, Tim Chen wrote:
[ ... ]
>>
>> I admit yet to look fully at the series. But I must ask, why are you deferring
>> to load balance and not looking at wake up path? LB should be for corrections.
>> When wake up path is doing wrong decision all the time, LB (which is super slow
>> to react) is too late to start grouping tasks? What am I missing?
>
> There used to be wakeup steering, but I'm not sure that still exists in
> this version (still need to read beyond the first few patches). It isn't
> hard to add.
>
Please let me explain a little more about why we did this in the
load balance path. Yes, the original version implemented cache-aware
scheduling only in the wakeup path. According to our testing, this appeared
to cause some task bouncing issues across LLCs. This was due to conflicts
with the legacy load balancer, which tries to spread tasks to different
LLCs.

So, as Peter said, the load balancer has to be taken care of anyway. Later,
we kept the cache aware logic only in the load balancer, and the test
results became much more stable, so we kept it as is. The wakeup path more
or less aggregates the wakees (threads within the same process) within the
LLC in the wakeup fast path, so we have not changed it for now.
Let me copy the changelog from the previous patch version:
"
In previous versions, aggregation of tasks were done in the
wake up path, without making load balancing paths aware of
LLC (Last-Level-Cache) preference. This led to the following
problems:
1) Aggregation of tasks during wake up led to load imbalance
between LLCs
2) Load balancing tried to even out the load between LLCs
3) Wake up tasks aggregation happened at a faster rate and
load balancing moved tasks in opposite directions, leading
to continuous and excessive task migrations and regressions
in benchmarks like schbench.
In this version, load balancing is made cache-aware. The main
idea of cache-aware load balancing consists of two parts:
1) Identify tasks that prefer to run on their hottest LLC and
move them there.
2) Prevent generic load balancing from moving a task out of
its hottest LLC.
"
thanks,
Chenyu
On 02/19/26 23:07, Chen, Yu C wrote:
> Hi Peter, Qais,
>
> On 2/19/2026 10:41 PM, Peter Zijlstra wrote:
> > On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
> > > On 02/10/26 14:18, Tim Chen wrote:

[ ... ]

> Please let me explain a little more about why we did this in the
> load balance path. Yes, the original version implemented cache-aware
> scheduling only in the wakeup path. According to our testing, this appeared
> to cause some task bouncing issues across LLCs. This was due to conflicts
> with the legacy load balancer, which tries to spread tasks to different
> LLCs.
> So as Peter said, the load balancer should be taken care of anyway. Later,
> we kept only the cache aware logic in the load balancer, and the test
> results

Yes, we need both. My concern is that the wake up path was originally meant
to keep tasks placed correctly, as most tasks wake up and sleep often and
this is the common case. If the decision tree is not unified, we will have
problems. And this is not a problem specific to doing placement based on
memory dependency. We need to extend the wake up path to do placement based
on latency. Placement based on energy (EAS) has the same problem too. It
disabled LB altogether, which is a problem we are trying to fix, if you saw
the other discussion about overutilized handling. The load balancer can
destroy energy balance easily, and it has no notion of how to distribute
based on energy.

This is a recurring theme for any new task placement decision that is not
purely based on load. The LB will wreak havoc.

> became much more stable, so we kept it as is. The wakeup path more or less
> aggregates the wakees (threads within the same process) within the LLC in the
> wakeup fast path, so we have not changed it for now.

How expensive is it to use the new push lb, which unifies the decision with
the wake up path, to detect these bad task placements and steer them back
to the right LLC? I think if we can construct the trigger right, we can
make the load balance job of keeping tagged tasks within the same LLC much
easier. In my view this bad task placement is just a new type of misfit,
where a task has strayed from its group for whatever reason at wake up and
is not sleeping and waking up again to be placed back with its clan -
assuming the conditions have changed to warrant the move - which the wake
up path should handle anyway.

FWIW, I have been experimenting with using push lb to keep regular LB off
and relying solely on it to manage the important corner cases (including
the overloaded one) - and seeing *very* promising results. But the systems
I work with are small compared to yours.

But essentially, if we can construct the system so that the wakeup path
(via the regular sleep/wakeup cycle and push lb) keeps the system
relatively balanced, and delay regular LB for when we need a large
intervention, we can simplify the problem space significantly IMHO. If the
LB had to kick in, then the delays of not finding enough bandwidth to run
are larger than the delays of not sharing the hottest LLC. IOW, keep the
regular LB as-is for true load balance and handle the small exceptions via
the natural sleep/wakeup cycle or push lb.

> Let me copy the changelog from the previous patch version:
>
> "
> In previous versions, aggregation of tasks were done in the
> wake up path, without making load balancing paths aware of
> LLC (Last-Level-Cache) preference. This led to the following
> problems:
>
> 1) Aggregation of tasks during wake up led to load imbalance
>    between LLCs
> 2) Load balancing tried to even out the load between LLCs
> 3) Wake up tasks aggregation happened at a faster rate and
>    load balancing moved tasks in opposite directions, leading
>    to continuous and excessive task migrations and regressions
>    in benchmarks like schbench.

Note this is an artefact of tagging all tasks belonging to the process as
co-dependent. So somehow this is a case of shooting oneself in the foot,
because processes with a large number of tasks will create large imbalances
and will start to require special handling. I guess the question is: were
they really that packed, which means the steering logic needed to relax a
little bit and say hey, this is an overcommit, I must spill to the other
LLCs; or was it really okay to pack them all in one LLC, and LB was
overzealous to kick in and needed to be aware the new case is not really a
problem that requires its intervention?

> In this version, load balancing is made cache-aware. The main
> idea of cache-aware load balancing consists of two parts:

I think this might work under the conditions you care about. But it will be
hard to generalize. But I might need to go and read more.

Note I am mainly concerned because the wake up path can't stay based purely
on load forever and needs to be able to make smarter decisions (latency
being the most important one on the horizon). And they all will hit this
problem. I think we need to find a good recipe for how to handle these
problems in general. I don't think we can extend the LB to be energy aware,
latency aware, cache aware etc. without hitting a lot of hurdles. And it is
too slow to react.

> 1) Identify tasks that prefer to run on their hottest LLC and
>    move them there.
> 2) Prevent generic load balancing from moving a task out of
>    its hottest LLC.

Isn't this 2nd part the fix to the wake up problem you faced? 1 should
naturally be happening at wake up. And for random long running strayed
tasks, I believe push lb is an easier way to manage them.
On 2/20/2026 11:25 AM, Qais Yousef wrote:
> On 02/19/26 23:07, Chen, Yu C wrote:
> > Hi Peter, Qais,

[ ... ]

> How expensive is it to use the new push lb, which unifies the decision with
> wake up path, to detect these bad task placement and steer them back to the
> right LLC? I think if we can construct the trigger right, we can simplify the
> load balance to keep tagged tasks within the same LLC much easier.

Leveraging push-lb for cache-aware task placement is interesting, and we
considered it during LPC when Vincent and Prateek presented it. It could be
an enhancement to the basic cache-aware scheduling, IMO. Tim mentioned in
https://lore.kernel.org/all/4514b6aef56d0ae144ebd56df9211c6599744633.camel@linux.intel.com/
that a bouncing issue needs to be resolved if task wakeup and push-lb are
leveraged for cache-aware scheduling. They are very fast - so for
cache-aware scheduling, it is possible that multiple invocations of
select_idle_sibling() will find the same LLC suitable. Then multiple wakees
are woken up on that LLC, causing over-aggregation. Later, when
over-aggregation is detected, several tasks are migrated out of the LLC,
which makes the LLC eligible again - and the pattern repeats back and
forth.

> Note this is an artefact of tagging all tasks belonging to the process as
> co-dependent. So somehow this is a case of shooting one self in the foot
> because processes with large number of tasks will create large imbalances and
> will start to require special handling.

[ ... ]

> > 1) Identify tasks that prefer to run on their hottest LLC and
> >    move them there.
> > 2) Prevent generic load balancing from moving a task out of
> >    its hottest LLC.
>
> Isn't this 2nd part the fix to the wake up problem you faced? 1 should
> naturally be happening at wake up. And for random long running strayed tasks,
> I believe push lb is an easier way to manage them.

This is doable, and some logic needs to be added in wakeup/push lb to avoid
the bouncing issue mentioned above. Considering both where to do it (task
wakeup / push lb / generic lb) and the task tagging, I was thinking that
threads within one process appear to be a special case of tagging. If the
user chooses to create threads rather than forking new processes, is there
a higher potential for data sharing among those threads? However, we agree
that fine-grained tagging is necessary.

How about this: if the user explicitly tags tasks into a single group, the
kernel can perform aggressive task aggregation - for instance, in the
wakeup/fair-push path - and let the user accept the corresponding risks.
For the default model, generic load balancing can perform per-process task
aggregation at a slower pace to reduce the risk of false decisions and
over-aggregation. We intended to discuss this in a separate thread, though.

Thanks,
Chenyu
On 02/21/26 10:48, Chen, Yu C wrote:
> Leveraging push-lb for cache-aware task placement is interesting,
> and we have considered it during LPC when Vincent and Prateek presented it.
> It could be an enhancement to the basic cache-aware scheduling, IMO.
> Tim has mentioned that in
> https://lore.kernel.org/all/4514b6aef56d0ae144ebd56df9211c6599744633.camel@linux.intel.com/
> a bouncing issue needs to be resolved if task wakeup and push-lb are
> leveraged for cache-aware scheduling. They are very fast - so for
> cache-aware scheduling, it is possible that multiple invocations of
> select_idle_sibling() will find the same LLC suitable. Then multiple
> wakees are woken up on that LLC, causing over-aggregation. Later, when
> over-aggregation is detected, several tasks are migrated out of the LLC,
> which makes the LLC eligible again - and the pattern repeats back and
> forth.

I believe this is a symptom of how tagging is currently happening. I think
if we have a more conservative tagging approach this will be less of a
problem. But the proof is in the pudding, as they say :-)
On Thu, 2026-02-19 at 23:07 +0800, Chen, Yu C wrote:
> Hi Peter, Qais,
>
> On 2/19/2026 10:41 PM, Peter Zijlstra wrote:
> > On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
> > > On 02/10/26 14:18, Tim Chen wrote:

[ ... ]

> In this version, load balancing is made cache-aware. The main
> idea of cache-aware load balancing consists of two parts:
>
> 1) Identify tasks that prefer to run on their hottest LLC and
>    move them there.
> 2) Prevent generic load balancing from moving a task out of
>    its hottest LLC.
> "

Another reason why we moved away from doing things in the wake up path is
the load imbalance consideration. The wake up path does not have the most
up to date load information on the LLC sched domains that the load balance
path has. So you may actually have everyone rush into their own favorite
LLC and cause LLC overload. And load balance will have to undo this. This
led to frequent task migrations that hurt performance.

It is better to consider LLC preference in the load balance path so we can
aggregate tasks while still keeping load imbalance under control.

Tim
On 02/19/26 10:11, Tim Chen wrote:
> On Thu, 2026-02-19 at 23:07 +0800, Chen, Yu C wrote:

[ ... ]

> Another reason why we moved away from doing things in the wake up
> path is load imbalance consideration. Wake up path does not have
> the most up to date load information in the LLC sched domains as
> in the load balance path. So you may actually have everyone rushed

What's the reason wake up doesn't have the latest info? Is this a
limitation of these large systems where stats updates are too expensive to
do? Is it not fixable at all?

> into each's favorite LLC and causes LLC overload. And load balance
> will have to undo this. This led to frequent task migrations that
> hurts performance.
>
> It is better to consider LLC preference in the load balance path
> so we can aggregate tasks while still keeping load imbalance under
> control.
>
> Tim
On Fri, 2026-02-20 at 03:29 +0000, Qais Yousef wrote:
> On 02/19/26 10:11, Tim Chen wrote:

[ ... ]

> > Another reason why we moved away from doing things in the wake up
> > path is load imbalance consideration. Wake up path does not have
> > the most up to date load information in the LLC sched domains as
> > in the load balance path. So you may actually have everyone rushed
>
> What's the reason wake up doesn't have the latest info? Is this a limitation of
> these large systems where stats updates are too expensive to do? Is it not
> fixable at all?

You would need to sum the load of each run queue in each LLC to get an
accurate picture. That would be too expensive on the wake up path.

Tim

> > into each's favorite LLC and causes LLC overload. And load balance
> > will have to undo this. This led to frequent task migrations that
> > hurts performance.
> >
> > It is better to consider LLC preference in the load balance path
> > so we can aggregate tasks while still keeping load imbalance under
> > control.
On 02/20/26 10:14, Tim Chen wrote:
> > > Another reason why we moved away from doing things in the wake up
> > > path is load imbalance consideration. Wake up path does not have
> > > the most up to date load information in the LLC sched domains as
> > > in the load balance path. So you may actually have everyone rushed
> >
> > What's the reason wake up doesn't have the latest info? Is this a limitation of
> > these large systems where stats updates are too expensive to do? Is it not
> > fixable at all?
>
> You will need to sum the load for each run queue for each LLC to get
> an accurate picture. That will be too expensive on the wake up path.

I am probably missing something obvious. But it seems enqueue/dequeue +
TICK are not keeping the stats up-to-date enough for the wakeup path to
rely on. I need to read this code more.

I could be wrong, but as I was trying to highlight in other places, I think
the fact that we tag all tasks belonging to a process as needing to stay
together is exaggerating this problem. First, every process is assumed to
need to stay within the same LLC, and every task within the process. The
wake up path by design now has a more difficult job and needs to look
harder compared to if the tagging were more conservative. And I can
appreciate that defining, and teaching regular LB, that some imbalances are
okay under these situations is hard. It is sort of an overcommit situation
by design.

Anyway. As I was trying to tell Peter, I am trying to think how we can tie
all these similar stories together. I hope once we can provide a sensible
way to tag tasks, we can get the wake up path + push lb to work easily, as
then we should have a handful of tasks asking to co-locate, which is much
easier to manage.
On Fri, Feb 20, 2026 at 03:29:41AM +0000, Qais Yousef wrote:

> What's the reason wake up doesn't have the latest info? Is this a limitation of
> these large systems where stats updates are too expensive to do? Is it not
> fixable at all?

Scalability is indeed the main problem. The periodic load-balancer, by
virtue of being 'slow', has two advantages:

 - the cost of aggregating the numbers is amortized by the relatively low
   frequency of aggregation

 - it can work with averages; it is less concerned with immediate
   spikes.

This obviously has the exact inverse set of problems, in that it is not
able to deal with immediate/short term issues.

Anyway, we're already at the point where the EAS wakeup path is getting far
too expensive for the current set of hardware. While we started with a
handful of asymmetric CPUs, we're now pushing 32 CPUs or so.

(Look at Intel Nova Lake speculation online; that's supposedly going to
get us 2 dies of 8P+16E with another 4 bonus weaklings on the south
bridge or something, for a grand total of 52 asymmetric CPUs of 3 kinds)

Then consider:

 - Intel Granite Rapids-SP at 8*86 cores for 688 cores / 1376 threads.

 - AMD Prometheus at 2*192 cores with 384 cores / 768 threads. These
   are silly numbers of CPUs.

 - Power10: it is something like 16 sockets, 16 cores per socket, 8
   threads per core, for a mere 2048 threads.

Now, these are the extreme end of the spectrum; 'nobody' will actually
have them, but in a few generations they'll seem small again.

So whatever we build now will have to deal with silly numbers of CPUs.
On 02/20/26 10:43, Peter Zijlstra wrote: > On Fri, Feb 20, 2026 at 03:29:41AM +0000, Qais Yousef wrote: > > > What's the reason wake up doesn't have the latest info? Is this a limitation of > > these large systems where stats updates are too expensive to do? Is it not > > fixable at all? > > Scalability is indeed the main problem. The periodic load-balancer, by > virtue of being 'slow' has two advantages: > > - the cost of aggregating the numbers is amortized by the relative low > frequency of aggregation > > - it can work with averages; it is less concerned with immediate > spikes. > > This obviously has the exact inverse set of problems in that it is not > able to deal with immediate/short term issues. Yes. And if we are to focus on providing better task placement based on QoS (which what I think this is essentially is), we have a constant problem of two paths producing results that are incompatible. Which is why I am trying to stress the importance of the wake up path. I understand for this initial drop we don't have a way to provide specific hints for tasks, but this is why we end up with this difficult choices always - which I think we don't have to. More on this at the bottom. > > > Anyway, we're already at the point where EAS wakeup path is getting far > too expensive for the current set of hardware. While we started with a > handful of asymmetric CPUs, we're now pushing 32 CPUs or so. Is this 32 perf domains? Expensive for what workloads? Folks can still use performance governor and plug it to a wall if they want ;-) > > (Look at Intel Nova Lake speculation online, that's supposedly going to > get us 2 dies of 8P+16E with another 4 bonus weaklings on the south > bridge or something, for a grand total of 52 asymmetric CPUs of 3 kinds) Not sure if my experience matters for whatever this is supposed to be used for, but the cost of wrong decision is really high on these topologies. 
It is bloody worthwhile spending more time to select a better CPU and worthwhile to have the push lb do frequent corrections. Not sure if you saw the other thread on one of Vincent's patches - but I am trying to completely disable overutilized (or regular LB) and rely on wakeup + push lb and seeing great success (and gains). But I am carrying a number of improvements that I discussed in various places on the list that makes this effective setup. Hopefully I'll share full findings properly at some point. > > > Then consider: > > - Intel Granite Rapids-SP at 8*86 cores for 688 cores / 1376 threads. > > - AMD Prometheus at 2*192 cores with 384 cores / 768 threads. These > are silly number of CPUs. > > - Power10, it is something like 16 sockets, 16 cores per socket, 8 > threads per core for a mere 2048 threads. > > Now, these are the extreme end of the spectrum systems, 'nobody' will > actually have them, but in a few generations they'll seem small again. > > > So whatever we build now, will have to deal with silly numbers of CPUs. True, but I think we ought to bite the bullet at some point. My line of thought is that we don't have to (and actually shouldn't) make the compromise at the kernel level. We can define the problem such that it is opt-in/opt-out where users who find the benefit can opt-in or find a disadvantage opt-out. Now the difficulty is that we don't have a way to describe such things, and this is what I am trying to solve with Sched QoS library. I am writing this now, but I think I should be able to help with this use case so that users can describe which workload wants to benefit from co-locating and these tasks will take the hit of harder task placement and frequent migration under loaded scenarios - the contract being that being co-located has significant performance impact they are happy to pay the price. Things that didn't subscribe will work as-is. 
Anyway, my major goal is to find how we can tie all these stories together,
as we need to add the ability to do task placement based on special
requirements, and the conflict with LB is one major issue for which I think
Vincent's proposal for push lb is quite neat and spot on. I am not sure if
you saw our LPC talk about Sched QoS where we expanded on our overall
thoughts.

In my view, this problem belongs to the same class of problems of placement
based on special requirements (latency, energy, cache, etc.) and hopefully
we can address it along the way. But if not, it would be good to know more
so we can think about how to better incorporate it as part of the bigger
story. So far I think that if this can be made to go through the wake up
path and rely on push lb, it is part of the same story. If not, then we need
to think harder about how to connect things together for a coherent
approach.

If I can successfully give you a way to describe the requirement that tasks
need to be co-located, so that we don't have to make the assumption in the
kernel that tasks belonging to the same process need to stay in the same
LLC, do you think wake up + push lb works? If not, how do you see it
evolving? And more importantly, how do you view the role of regular LB in
these cases?

The way I see it is that it should trigger less, for the reasons you
mentioned at the top; and when it does trigger it means heavy intervention
is required, and whatever special task placement requirements exist will
need to be dropped at this stage, since the push lb clearly failed to keep
up and we are at a point where we need to do heavy-handed balancing work.
I think these activities are more relevant to multi-LLC systems - which have
the added problem of defining when some imbalances are okay, which I believe
is the difficulty being hit here with the wakeup-path-based approach. For
single-LLC systems I think this heavy-handed approach can be made
unnecessary if we do it correctly. Sorry, a bit of a digression.
But I am interested in how we can all move the ship in the same direction. I
think this is all part of making the wake up path multi-modal and improving
its coordination with LB.