This is the long-delayed follow-up to the series sent back in August 2024 [1].
Life got in the way to some extent (I had a baby, and the time I used to spend
on upstream work late at night was stolen :). Apologies to those who replied
and to whom I didn't get a chance to respond.
The series is now rebased on top of tip/sched/core 78cde54ea5f0. I removed
a number of optimization patches that are not necessary for this initial merge
and can be treated as their own separate topics once this is hopefully
accepted.
I discussed the problem at LPC 2024 [2] and the initial cover letter
contains all the details. I hope all the key parties are up-to-date on the
problem details by now.
As a brief recap, there are some hardcoded constants in the kernel that
introduce a bias that frequently fails to deliver the best outcome on various
systems. It turns out these constants seem to help somewhat against a bigger
problem: utilization signal distortion caused by utilization invariance, which
I call the black hole effect. The lower the capacity, the harder it is to
accumulate runtime and make the signal rise; it acts like a gravitational pull
causing time dilation.
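A rough way to picture the effect described above is a toy model of the PELT
geometric series. The sketch below is an illustration under simplifying
assumptions, not code from the series: it treats one PELT period as 1 ms,
assumes the task runs continuously starting from a utilization of 0, and holds
the relative capacity (frequency and CPU scale combined) fixed for the whole
ramp. The function names are made up for this example.

```python
import math

HALF_LIFE_MS = 32                  # a PELT contribution halves every 32 periods
Y = 0.5 ** (1 / HALF_LIFE_MS)      # per-period decay factor
MAX_UTIL = 1024                    # utilization of the biggest CPU at max freq


def util_after(ms_running, rel_capacity):
    """Toy util_avg of an always-running task after ms_running periods.

    rel_capacity is the current capacity as a fraction (0..1] of the
    maximum; invariance scales every contribution by it.
    """
    return MAX_UTIL * rel_capacity * (1 - Y ** ms_running)


def ms_to_reach(target_util, rel_capacity):
    """Periods of continuous running needed to reach target_util.

    Returns math.inf when the target sits at or above the ceiling the
    signal can reach at this capacity.
    """
    ceiling = MAX_UTIL * rel_capacity
    if target_util >= ceiling:
        return math.inf
    return math.log(1 - target_util / ceiling) / math.log(Y)


# Same target utilization, progressively lower capacity.
for cap in (1.0, 0.5, 0.3):
    print(f"capacity {cap:.1f}: {ms_to_reach(256, cap):.1f} ms to reach util 256")
```

In this model the time to reach a given utilization grows as capacity drops and
becomes unbounded once the target exceeds what that capacity can represent,
which is the distortion the cover letter describes.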
One of the major difficulties we will face is that this distortion turns out
to be bad for performance but good for power. The fix will inevitably rebalance
the system in the right way, but also in a surprising way that may leave some
unhappy. sched_features were added to ensure those unhappy folks can revert the
system to the old behavior while still allowing us to make the right progress.
That is, to retain the older behavior one must:
echo 0 | sudo tee /proc/sys/kernel/sched_qos_default_rampup_multiplier
echo CONST_DVFS_HEADROOM NO_UTIL_EST_RAMPUP_ZERO UTIL_EST_FORCE_POST_INIT > /sys/kernel/debug/sched/features
Note that for the migration margin there is no sched feature, since I think
the old behavior was worse for both perf and power and doesn't need reverting to.
The system is going to be a lot faster now by default with
sched_qos_default_rampup_multiplier=1 since it fixes the distortion issue and
provides a constant rise time regardless of DVFS latencies.
The desired behavior is for the default rampup_multiplier to be 0, with only
interactive tasks requesting a higher rampup multiplier. Preliminary
integration with schedqos is available [3] for those who want to see the full
benefit of fine-grained control to manage perf and power.
Open questions:
* The details of the QoS interface are the biggest one.
* Would debugfs be better for setting the default rampup multiplier instead of sysctl?
* Patch 13 makes updating load_avg unconditional rather than only on period boundaries.
Patches 1-3 are preparatory patches renaming a function and introducing new ones.
Patches 4-5 handle the magic margin problem by making the margins dynamic based
on actual hardware limitations.
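For context, the two hardcoded constants that the patch subjects call "magic"
are, in mainline kernels before this series, the roughly 20% margin in
fits_capacity() and the 1.25x DVFS headroom. The Python sketch below only
mirrors that integer arithmetic for illustration; it is not kernel code, and it
does not show what the series replaces the constants with.

```python
def fits_capacity(util, capacity):
    """Historical check: util fits only if it leaves ~20% of capacity spare.

    Mirrors the integer comparison util * 1280 < capacity * 1024.
    """
    return util * 1280 < capacity * 1024


def apply_dvfs_headroom(util):
    """Historical fixed headroom: request 1.25x the utilization (util + util/4)."""
    return util + (util >> 2)


# A task at util 800 fits a CPU of capacity 1024; at 820 it no longer does.
print(fits_capacity(800, 1024), fits_capacity(820, 1024))
# The frequency request for util 800 is made as if util were 1000.
print(apply_dvfs_headroom(800))
```

Both constants apply the same fixed percentage regardless of how quickly the
hardware can change frequency, which is the bias the cover letter refers to.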
Patches 6-7 fix the black hole problem and teach the scheduler how to handle
bursty and periodic tasks by extending util_est.
Patches 8-9 are where I expect most of the discussion, as I introduce a new
sched_qos interface to support the new rampup_multiplier to help manage DVFS.
Patches 10-11 introduce a couple of necessary optimizations to counter the
power impact of increased responsiveness by disabling some features that we now
know how to handle better.
Patches 12-13 fix a couple of issues causing the util_est and util_avg values
to swing for a periodic task. Patch 12 must go via stable.
My Mac mini M1 system, on which I did the earlier testing, is down and it has
proven difficult to revive before sending this series. I will revive it and
repeat the testing to ensure all is okay after the rebase.
I did test on an AMD system, but it has only 3 freqs so there are no real perf
numbers to report since it just whizzes by these 3 freqs anyway. But I did
spend enough time to verify that util_est behaves as expected under different
scenarios. More testing would still be appreciated :)
[1] https://lore.kernel.org/lkml/20240820163512.1096301-1-qyousef@layalina.io/
[2] https://lpc.events/event/18/contributions/1880/
[3] https://github.com/qais-yousef/schedqos/compare/main...schedqos
Qais Yousef (13):
  sched: cpufreq: Rename map_util_perf to sugov_apply_dvfs_headroom
  sched/pelt: Add a new function to approximate the future util_avg
    value
  sched/pelt: Add a new function to approximate runtime to reach given
    util
  sched/fair: Remove magic hardcoded margin in fits_capacity()
  sched: cpufreq: Remove magic 1.25 headroom from
    sugov_apply_dvfs_headroom()
  sched/fair: Extend util_est to improve rampup time
  sched/fair: util_est: Take into account periodic tasks
  sched/qos: Add a new sched-qos interface
  sched/qos: Add rampup multiplier QoS
  sched/fair: Disable util_est when rampup_multiplier is 0
  sched/fair: Don't mess with util_avg post init
  sched/fair: Call update_util_est() after dequeue_entities()
  sched/pelt: Always allow load updates
Documentation/scheduler/index.rst | 1 +
Documentation/scheduler/sched-qos.rst | 66 ++++++++++
include/linux/sched.h | 10 ++
include/linux/sched/cpufreq.h | 5 -
include/uapi/linux/sched.h | 10 +-
include/uapi/linux/sched/types.h | 46 +++++++
kernel/sched/core.c | 71 ++++++++++
kernel/sched/cpufreq_schedutil.c | 49 ++++++-
kernel/sched/debug.c | 1 +
kernel/sched/fair.c | 124 ++++++++++++++++--
kernel/sched/features.h | 21 +++
kernel/sched/pelt.c | 44 ++++++-
kernel/sched/sched.h | 12 ++
kernel/sched/syscalls.c | 61 +++++++++
.../trace/beauty/include/uapi/linux/sched.h | 4 +
15 files changed, 501 insertions(+), 24 deletions(-)
create mode 100644 Documentation/scheduler/sched-qos.rst
--
2.34.1
Hi Qais,

I tested your v2 12/13 (sched/fair: Call update_util_est() after
dequeue_entities()) and RFC 13/13 (sched/pelt: Always allow load updates)
on ARM (Raspberry Pi 5, Cortex-A76, 4-core), combined with Peter
Zijlstra's ttwu series (rebased to 7.0.y by marioroy).

Both patches applied cleanly on top of rpi-7.0.y + 10 ttwu patches
without conflicts.

Results using stress-ng 0.15.06 pipe stressor (4 workers, 20s):

Kernel                              Clock     pipe bogo ops/s  delta vs. 6.6
----------------------------------  --------  ---------------  -------------
6.6.78-v8-16k+                      2800 MHz  2 487 746        +/-0% (ref)
7.0.0-v8-16k+ stock                 2400 MHz  1 694 011        -31.9%
7.0.0-v8-16k+ stock                 2800 MHz  1 851 567        -25.6%
7.0.0 + ttwu only (10 patches)      2400 MHz  1 836 006        -26.2%
7.0.0 + ttwu only (10 patches)      2800 MHz  1 934 076        -22.3%
7.0.0 + ttwu + your 2 Qais patches  2400 MHz  1 996 002        -19.8%
7.0.0 + ttwu + your 2 Qais patches  2800 MHz  2 342 144        -5.9%

The ttwu-only set recovers ~3-4% of the regression on ARM. Adding your
two patches brings a much larger improvement -- especially under
overclocking, where the combined set recovers roughly 94% of the 6.6
baseline. The remaining ~6% gap may be related to ARM-specific
DELAY_DEQUEUE interactions.

Device: Raspberry Pi 5 (8 GB, C1-stepping), Bookworm arm64, rpi-7.0.y.
Background: https://github.com/raspberrypi/linux/issues/7308

Thanks for the series -- the ARM results look very promising.

Tom
Hi Tom

On 05/13/26 17:09, Tom Gebhardt wrote:
> Hi Qais,
>
> I tested your v2 12/13 (sched/fair: Call update_util_est() after
> dequeue_entities()) and RFC 13/13 (sched/pelt: Always allow load updates)
> on ARM (Raspberry Pi 5, Cortex-A76, 4-core), combined with Peter
> Zijlstra's ttwu series (rebased to 7.0.y by marioroy).
>
> Both patches applied cleanly on top of rpi-7.0.y + 10 ttwu patches
> without conflicts.
>
> Results using stress-ng 0.15.06 pipe stressor (4 workers, 20s):
>
> Kernel                              Clock     pipe bogo ops/s  delta vs. 6.6
> ----------------------------------  --------  ---------------  -------------
> 6.6.78-v8-16k+                      2800 MHz  2 487 746        +/-0% (ref)
> 7.0.0-v8-16k+ stock                 2400 MHz  1 694 011        -31.9%
> 7.0.0-v8-16k+ stock                 2800 MHz  1 851 567        -25.6%
> 7.0.0 + ttwu only (10 patches)      2400 MHz  1 836 006        -26.2%
> 7.0.0 + ttwu only (10 patches)      2800 MHz  1 934 076        -22.3%
> 7.0.0 + ttwu + your 2 Qais patches  2400 MHz  1 996 002        -19.8%
> 7.0.0 + ttwu + your 2 Qais patches  2800 MHz  2 342 144        -5.9%
>
> The ttwu-only set recovers ~3-4% of the regression on ARM. Adding your
> two patches brings a much larger improvement -- especially under
> overclocking, where the combined set recovers roughly 94% of the 6.6
> baseline. The remaining ~6% gap may be related to ARM-specific
> DELAY_DEQUEUE interactions.

Hmm this is an interesting impact. Did you get a chance to verify if you need
the 2 patches or only one of them is enough? Only 12/13 is actually a fix for
a change in behavior from 6.6. The last patch is a new addition for a behavior
that has always been there.

You have SMP system, so utilization can't be impacting your task placement to
potentially being stuck on a little core. And looking at raspberry pi code, it
seems they ship with ondemand governor as the default cpufreq governor. Are you
using the default one? Assuming yes and you're not using schedutil, then these
patches making things better is not expected.

Are you familiar with perfetto? Can you use sched-analyzer [1] to capture a
trace and inspect how the pattern changes when things are good and bad?

Output of

	sched-analyzer-pp --sched-states $TASK_NAME --freq-residency-task $TASK_NAME \
		sched-analyzer.perfetto-trace

would be useful to share.

I suspect you have a subtle change of sched pattern that I hope you might be
able to visualize directly in ui.perfetto.dev, but the above stats might be
a good way to see potential difference between good and bad runs.

Thanks!

[1] https://github.com/qais-yousef/sched-analyzer

>
> Device: Raspberry Pi 5 (8 GB, C1-stepping), Bookworm arm64, rpi-7.0.y.
> Background: https://github.com/raspberrypi/linux/issues/7308
>
> Thanks for the series -- the ARM results look very promising.
>
> Tom
Hi Qais,

Thanks for the follow-up. Here are the patch isolation results and answers to
your questions.

Regarding the governor:

Yes, I'm running `ondemand`, not `schedutil`. My mistake for not mentioning
that upfront - I assumed the improvement was due to the util_est path being
triggered regardless of the governor. The improvement is clearly measurable
even with `ondemand`, which is surprising given that your patches specifically
target `schedutil`.

Patch isolation -- 12/13 only vs. both:

I re-ran the benchmarks with patch 13/13 (`sched/pelt: Always allow load
updates`) reverted, keeping only patch 12/13 (`sched/fair: Call
update_util_est() after dequeue_entities()`).

Results using stress-ng 0.15.06 pipe stressor (4 workers, 20s):

Kernel                               Clock     pipe bogo ops/s  delta vs. 6.6
-----------------------------------  --------  ---------------  -------------
6.6.78-v8-16k+                       2400 MHz  2 129 330        +/-0% (ref)
6.6.78-v8-16k+                       2800 MHz  2 487 746        +/-0% (ref)
7.0.0-v8-16k+ stock                  2400 MHz  1 694 011        -20.5%
7.0.0-v8-16k+ stock                  2800 MHz  1 851 567        -25.6%
7.0.0 + ttwu only (10 patches)       2400 MHz  1 836 006        -13.8%
7.0.0 + ttwu only (10 patches)       2800 MHz  1 934 076        -22.3%
7.0.0 + ttwu + patch 12/13 only      2400 MHz  2 054 879        -3.5%
7.0.0 + ttwu + patch 12/13 only      2800 MHz  2 415 617        -2.9%
7.0.0 + ttwu + patches 12+13         2400 MHz  1 996 002        -6.3%
7.0.0 + ttwu + patches 12+13         2800 MHz  2 342 144        -5.9%

The key finding: patch 12/13 alone outperforms the combined set on ARM. Adding
patch 13/13 actually hurts performance slightly -- about 3 percentage points --
at both clock speeds. This suggests that `sched/pelt: Always allow load
updates` has a negative interaction on ARM/Cortex-A76, possibly related to how
PELT decay is handled without `schedutil` active, or an ARM-specific
DELAY_DEQUEUE interaction.

Patch 12/13 alone closes the gap to just -2.9% vs. 6.6 at 2800 MHz (OC), and
-3.5% at nominal 2400 MHz. That is a remarkable recovery from the -31.9%
regression in 7.0 stock.

Regarding Perfetto traces:

Unfortunately I cannot provide sched-analyzer traces at this time -- the kernel
is not compiled with CONFIG_DEBUG_INFO_BTF=y (pahole/dwarves not available in
this build environment), which is required for BPF CO-RE. I can try to arrange
that for a future run if it would still be useful.

Device: Raspberry Pi 5 (8 GB, C1-stepping), Bookworm arm64, kernel rpi-7.0.y.
Background: https://github.com/raspberrypi/linux/issues/7308

Tom
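The delta column in the patch isolation table above can be re-derived from the
raw bogo ops/s figures, using the 6.6.78 result at the same clock as the
reference. The Python sketch below does that for the patched rows; the helper
name and the dictionary layout are made up for this example, and the numbers
are copied from the table.

```python
def delta_pct(result, baseline):
    """Percentage change of result relative to baseline, to one decimal."""
    return round((result / baseline - 1) * 100, 1)


# 6.6.78 reference throughput (bogo ops/s) per clock in MHz.
BASELINE_6_6 = {2400: 2_129_330, 2800: 2_487_746}

# Patched 7.0.0 results from the table, keyed by (kernel, clock in MHz).
RESULTS = {
    ("ttwu + patch 12/13 only", 2400): 2_054_879,
    ("ttwu + patch 12/13 only", 2800): 2_415_617,
    ("ttwu + patches 12+13", 2400): 1_996_002,
    ("ttwu + patches 12+13", 2800): 2_342_144,
}

for (kernel, mhz), ops in RESULTS.items():
    print(f"{kernel:24s} {mhz} MHz {delta_pct(ops, BASELINE_6_6[mhz]):+.1f}%")
```

Comparing each result against the baseline at the same clock keeps the
overclocking gain out of the regression figure.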
On 05/15/26 10:24, Tom Gebhardt wrote:
> Hi Qais,
>
> Thanks for the follow-up. Here are the patch isolation results and answers to your questions.
>
> Regarding the governor:
>
> Yes, I'm running `ondemand`, not `schedutil`. My mistake for not mentioning that upfront - I
> assumed the improvement was due to the util_est path being triggered regardless of the governor.
> The improvement is clearly measurable even with `ondemand`, which is surprising given that your
> patches specifically target `schedutil`.
>
> Patch isolation -- 12/13 only vs. both:
>
> I re-ran the benchmarks with patch 13/13 (`sched/pelt: Always allow load updates`) reverted,
> keeping only patch 12/13 (`sched/fair: Call update_util_est() after dequeue_entities()`).
>
> Results using stress-ng 0.15.06 pipe stressor (4 workers, 20s):
>
> Kernel                               Clock     pipe bogo ops/s  delta vs. 6.6
> -----------------------------------  --------  ---------------  -------------
> 6.6.78-v8-16k+                       2400 MHz  2 129 330        +/-0% (ref)
> 6.6.78-v8-16k+                       2800 MHz  2 487 746        +/-0% (ref)
> 7.0.0-v8-16k+ stock                  2400 MHz  1 694 011        -20.5%
> 7.0.0-v8-16k+ stock                  2800 MHz  1 851 567        -25.6%
> 7.0.0 + ttwu only (10 patches)       2400 MHz  1 836 006        -13.8%
> 7.0.0 + ttwu only (10 patches)       2800 MHz  1 934 076        -22.3%
> 7.0.0 + ttwu + patch 12/13 only      2400 MHz  2 054 879        -3.5%
> 7.0.0 + ttwu + patch 12/13 only      2800 MHz  2 415 617        -2.9%
> 7.0.0 + ttwu + patches 12+13         2400 MHz  1 996 002        -6.3%
> 7.0.0 + ttwu + patches 12+13         2800 MHz  2 342 144        -5.9%
>
> The key finding: patch 12/13 alone outperforms the combined set on ARM. Adding patch 13/13
> actually hurts performance slightly -- about 3 percentage points -- at both clock speeds. This
> suggests that `sched/pelt: Always allow load updates` has a negative interaction on ARM/Cortex-A76,
> possibly related to how PELT decay is handled without `schedutil` active, or an ARM-specific
> DELAY_DEQUEUE interaction.
>
> Patch 12/13 alone closes the gap to just -2.9% vs. 6.6 at 2800 MHz (OC), and -3.5% at nominal
> 2400 MHz. That is a remarkable recovery from the -31.9% regression in 7.0 stock.
>
> Regarding Perfetto traces:
>
> Unfortunately I cannot provide sched-analyzer traces at this time -- the kernel is not compiled
> with CONFIG_DEBUG_INFO_BTF=y (pahole/dwarves not available in this build environment), which
> is required for BPF CO-RE. I can try to arrange that for a future run if it would still be useful.

You don't need to have it enabled in the kernel. I don't need util info; by
default, if you don't pass any arg, it should not cause BPF to be loaded.

Note there are binaries in the release page on github, so you don't have to
compile it.

You can also use the regular perfetto command to record a trace and visualize
it, and sched-analyzer-pp would be able to analyze it.

I am looking to see if the task placement and running/runnable time pattern
has changed significantly to cause the big difference.

It'd be good to perf it too. You might be hitting weird contention that the
patch just happens to accidentally hide.

>
> Device: Raspberry Pi 5 (8 GB, C1-stepping), Bookworm arm64, kernel rpi-7.0.y.
> Background: https://github.com/raspberrypi/linux/issues/7308
>
> Tom
Hi Qais,

Thanks for the clarification on sched-analyzer -- I'll look at the perfetto
approach for task placement traces.

In the meantime, I ran `perf stat` and `perf record -g` across three kernels
at OC (2800 MHz) with `ondemand` governor, using the same stress-ng pipe
workload (4 workers, 20s).

Device: Raspberry Pi 5 (8 GB, C1-stepping, Cortex-A76), Bookworm arm64.

perf stat results:

Metric              6.6.78      7.0 stock   7.0+ttwu+vincent
------------------  ----------  ----------  ----------------
bogo ops/s          2 222 639   1 855 066   2 298 965
IPC                 1.72        1.47        1.76
branch-misses       625M        1 270M      1 018M
context-switches    15 145 738  22 750 121  18 905 924
cache-miss rate     1.58%       1.74%       1.38%

Key observations:

1. IPC drops 14% on 7.0 stock (1.72 -> 1.47). ttwu+vincent recovers it
   almost completely (1.76, slightly above 6.6). This is a genuine
   efficiency loss in the scheduler path, not a throughput/clock artifact.

2. Branch mispredictions double on 7.0 stock (+103% vs 6.6). ttwu+vincent
   reduces them by ~20% vs stock, but +63% above 6.6 remains -- this
   likely explains the residual ~1% gap after patching.

3. Context switches increase 50% on 7.0 stock. ttwu+vincent brings this
   down to +25% vs 6.6.

perf report (-g) highlights:

On 6.6, `finish_task_switch` is barely visible in call graphs. On 7.0
(both stock and patched), it appears prominently at 5-8% of samples,
alongside elevated `_raw_spin_unlock_irqrestore` time. This points to
genuine overhead in the context switch completion path, not lock contention
between worker tasks.

Regarding the "weird contention accidentally hidden" concern: I don't see
evidence for that. The branch miss explosion and IPC drop on 7.0 stock are
consistent with more complex/harder-to-predict scheduler control flow
(EEVDF decision tree vs. CFS), not with a workload contention pattern that
happens to be masked by task placement changes. ttwu+vincent genuinely
reduces branch misses and restores IPC -- it doesn't just move the problem.

I'll try to get perfetto traces for the task placement / running vs.
runnable time breakdown. Happy to provide the raw perf.data files if
useful.

Tom
On 05/28/26 14:50, Tom Gebhardt wrote:
> Hi Qais,
>
> Thanks for the clarification on sched-analyzer -- I'll look at the perfetto
> approach for task placement traces.
>
> In the meantime, I ran `perf stat` and `perf record -g` across three kernels
> at OC (2800 MHz) with `ondemand` governor, using the same stress-ng pipe
> workload (4 workers, 20s).
>
> Device: Raspberry Pi 5 (8 GB, C1-stepping, Cortex-A76), Bookworm arm64.
>
> perf stat results:
>
> Metric              6.6.78      7.0 stock   7.0+ttwu+vincent
> ------------------  ----------  ----------  ----------------
> bogo ops/s          2 222 639   1 855 066   2 298 965

7.0+ttwu+vincent is the best, right?

Have you verified your actual workload is seeing benefit? I think when
I scanned the github bug you referenced, the original report was observing
a regression in some real setup, not these stress-ng tests.

I am wary that some of these stress tests don't necessarily represent real
cases, as they can over stress a particular scenario and amplify minor
problems that have no noticeable impact in practice.

> IPC                 1.72        1.47        1.76
> branch-misses       625M        1 270M      1 018M
> context-switches    15 145 738  22 750 121  18 905 924
> cache-miss rate     1.58%       1.74%       1.38%
>
> Key observations:
>
> 1. IPC drops 14% on 7.0 stock (1.72 -> 1.47). ttwu+vincent recovers it
> almost completely (1.76, slightly above 6.6). This is a genuine
> efficiency loss in the scheduler path, not a throughput/clock artifact.

Due to stalling you reckon?

>
> 2. Branch mispredictions double on 7.0 stock (+103% vs 6.6). ttwu+vincent
> reduces them by ~20% vs stock, but +63% above 6.6 remains -- this
> likely explains the residual ~1% gap after patching.

I might not be reading the numbers correctly but they seem higher

>
> 3. Context switches increase 50% on 7.0 stock. ttwu+vincent brings this
> down to +25% vs 6.6.

I hope that is something a perfetto trace will help with: visualizing the
pattern that led to this higher context switching

>
> perf report (-g) highlights:
>
> On 6.6, `finish_task_switch` is barely visible in call graphs. On 7.0
> (both stock and patched), it appears prominently at 5-8% of samples,
> alongside elevated `_raw_spin_unlock_irqrestore` time. This points to
> genuine overhead in the context switch completion path, not lock contention
> between worker tasks.

Do you have the full (well, most relevant parts of it) output? It would be
interesting to use perf diff to see the difference of 7.0 stock vs 6.6 and
7.0+ttwu+vincent vs 6.6.

Maybe there's higher rq lock contention. But this finish_task_switch and
__raw_spin_unlock_irqrestore are common to see, especially when there's
high context switch rate. It might not necessarily indicate there's a problem.

>
> Regarding the "weird contention accidentally hidden" concern: I don't see
> evidence for that. The branch miss explosion and IPC drop on 7.0 stock are
> consistent with more complex/harder-to-predict scheduler control flow
> (EEVDF decision tree vs. CFS), not with a workload contention pattern that
> happens to be masked by task placement changes. ttwu+vincent genuinely
> reduces branch misses and restores IPC -- it doesn't just move the problem.

Not necessarily a workload contention but a scheduler lock or cache related
contention on a 'hot variable'. See [1] for example.

I am hoping perf diff will help see which part has gotten noticeably worse;
then you can inspect this function to see where in the code it has gotten
slower. Hopefully this can shed some light on how this unrelated patch is
helping..

[1] https://lore.kernel.org/all/20240307085725.444486-2-sshegde@linux.ibm.com/

>
> I'll try to get perfetto traces for the task placement / running vs.
> runnable time breakdown. Happy to provide the raw perf.data files if
> useful.
>
> Tom
On 05/29/26 02:43, Qais Yousef wrote:
> 7.0+ttwu+vincent is the best, right?
Yes.
> Have you verified your actual workload is seeing benefit? I think when
> I scanned the github bug you references the original report was observing
> a regression in some real setup, not this stressng tests.
I am the reporter. The original issue (#7308) was observed as a drop in
camera frame rate when running two parallel IMX477 streams via libcamera on
RPi5 under kernel 6.12+. The camera pipeline is pipe-IPC-heavy (GStreamer /
libcamera internal queues), so the regression surfaced there first in a real
workload. To isolate the cause I moved to synthetic pipe benchmarks
(stress-ng), which confirmed and quantified the regression cleanly.
A Raspberry Pi developer (popcornmix) also posted IPC benchmark results on
the issue, independently confirming the trend across kernel versions
(6.6=2065 > 6.18=1805 > 6.12=1662 > 7.0=1570 Kops/s).
The stress-ng pipe stressor is therefore not an artificial worst-case -- it
directly exercises the code path that causes the real-world camera regression.
That said, I agree stress-ng amplifies the effect, and I cannot give you an
exact frame-rate number yet with the ttwu+vincent patches applied.
> IPC drops 14% on 7.0 stock. Due to stalling you reckon?
Yes. The branch misprediction rate explains most of it. On Cortex-A76 a
branch mispredict costs ~13 cycles. Normalised by instruction count:
Kernel               branch-miss rate   vs 6.6
-------------------  -----------------  ------
6.6.78               0.178%             ref
7.0.0 stock          0.427%             +140%
7.0.0+ttwu+vincent   0.271%             +52%
The raw counts I reported yesterday were misleading because the instruction
counts differ between kernels (different amounts of useful work). Apologies
for not normalising upfront. The rate tells a cleaner story: stock EEVDF
causes 2.4× more mispredictions per instruction than CFS, and ttwu+vincent
brings that down to 1.5× -- significant improvement but not full recovery.
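The 2.4× and 1.5× figures follow directly from the normalised rates in the
table above. A minimal Python sketch of that arithmetic, with the rates copied
from the table and a helper name chosen for this example:

```python
# Branch-miss rate (percent of instructions) per kernel, from the table.
BRANCH_MISS_RATE = {
    "6.6.78": 0.178,
    "7.0.0 stock": 0.427,
    "7.0.0+ttwu+vincent": 0.271,
}


def ratio_vs_ref(rates, ref="6.6.78"):
    """Return each kernel's miss rate as a multiple of the reference kernel's."""
    return {kernel: rate / rates[ref] for kernel, rate in rates.items()}


for kernel, ratio in ratio_vs_ref(BRANCH_MISS_RATE).items():
    print(f"{kernel:20s} {ratio:.1f}x  ({(ratio - 1) * 100:+.0f}%)")
```

Normalising by instruction count first matters here because the three kernels
retire different numbers of instructions in the same 20 s run.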
> Do you have the full output? It would be interesting to use perf diff.
A proper perf diff with resolved kernel symbols requires running against the
matching kernel. I ran `perf report --no-children -s symbol` on each .data
file while booted on the corresponding kernel. Key findings:
7.0.0 stock (flat, self-overhead):
  12.98%  finish_task_switch.isra.0
          -> __schedule -> schedule
             -> anon_pipe_read   5.72%
             -> anon_pipe_write  1.38%
7.0.0+ttwu+vincent (flat, self-overhead):
  19.62%  finish_task_switch.isra.0
          -> __schedule -> schedule
             -> anon_pipe_read   8.22%
             -> anon_pipe_write  4.34%
The striking difference is in the pipe_write -> schedule() path: 1.38% on
stock vs 4.34% with ttwu+vincent. The ttwu patches make pipe writers yield
the CPU far more aggressively after each write, allowing the reader to run
immediately. Stock EEVDF leaves this to the scheduler's own timing, which
results in more latency and lower throughput.
The higher absolute percentage in finish_task_switch for vincent is expected:
vincent completes ~24% more pipe operations in the same wall time, so there
are proportionally more context switches completing.
On 6.6 (from the call-graph profile recorded separately), finish_task_switch
is not visible as a top-level hotspot at all -- consistent with CFS handling
this path much more efficiently.
> Maybe there's higher rq lock contention. But this finish_task_switch and
> __raw_spin_unlock_irqrestore are common to see, especially when there's
> high context switch rate.
Agreed -- I cannot rule out rq lock contention without perf diff with
matched build-IDs. The pattern I see (finish_task_switch dominant, driven
by pipe_read/write) is consistent with high context switch rate rather than
a pathological lock. But your point about a 'hot variable' like rq->clock
is noted -- I cannot confirm or deny that from flat profiles alone.
> I hope perfetto trace will help visualize the pattern that led to this
> higher context switching.
I will work on getting a perfetto trace. Expecting to have that in a
follow-up.
Tom
On 05/29/26 09:53, Tom Gebhardt wrote:
> On 05/29/26 02:43, Qais Yousef wrote:
> > 7.0+ttwu+vincent is the best, right?
>
> Yes.
>
> > Have you verified your actual workload is seeing benefit? I think when
> > I scanned the github bug you references the original report was observing
> > a regression in some real setup, not this stressng tests.
>
> I am the reporter. The original issue (#7308) was observed as a drop in

Yes, I realized, thanks for taking the time to chase all of this :)

> camera frame rate when running two parallel IMX477 streams via libcamera on
> RPi5 under kernel 6.12+. The camera pipeline is pipe-IPC-heavy (GStreamer /
> libcamera internal queues), so the regression surfaced there first in a real
> workload. To isolate the cause I moved to synthetic pipe benchmarks
> (stress-ng), which confirmed and quantified the regression cleanly.
> A Raspberry Pi developer (popcornmix) also posted IPC benchmark results on
> the issue, independently confirming the trend across kernel versions
> (6.6=2065 > 6.18=1805 > 6.12=1662 > 7.0=1570 Kops/s).

I was wondering if the real workload is as sensitive

>
> The stress-ng pipe stressor is therefore not an artificial worst-case -- it
> directly exercises the code path that causes the real-world camera regression.
> That said, I agree stress-ng amplifies the effect, and I cannot give you an
> exact frame-rate number yet with the ttwu+vincent patches applied.

No worries, don't want to ask you to do more work ;-)

>
> > IPC drops 14% on 7.0 stock. Due to stalling you reckon?
>
> Yes. The branch misprediction rate explains most of it. On Cortex-A76 a
> branch mispredict costs ~13 cycles. Normalised by instruction count:
>
> Kernel               branch-miss rate   vs 6.6
> -------------------  -----------------  ------
> 6.6.78               0.178%             ref
> 7.0.0 stock          0.427%             +140%
> 7.0.0+ttwu+vincent   0.271%             +52%

This could potentially be due to the higher ctx switches

>
> The raw counts I reported yesterday were misleading because the instruction
> counts differ between kernels (different amounts of useful work). Apologies
> for not normalising upfront. The rate tells a cleaner story: stock EEVDF
> causes 2.4× more mispredictions per instruction than CFS, and ttwu+vincent
> brings that down to 1.5× -- significant improvement but not full recovery.

Note 6.6 kernels are EEVDF too.

>
> > Do you have the full output? It would be interesting to use perf diff.
>
> A proper perf diff with resolved kernel symbols requires running against the
> matching kernel. I ran `perf report --no-children -s symbol` on each .data
> file while booted on the corresponding kernel. Key findings:
>
> 7.0.0 stock (flat, self-overhead):
>
> 12.98% finish_task_switch.isra.0
> -> __schedule -> schedule
> -> anon_pipe_read 5.72%
> -> anon_pipe_write 1.38%
>
> 7.0.0+ttwu+vincent (flat, self-overhead):
>
> 19.62% finish_task_switch.isra.0
> -> __schedule -> schedule
> -> anon_pipe_read 8.22%
> -> anon_pipe_write 4.34%
>
> The striking difference is in the pipe_write -> schedule() path: 1.38% on
> stock vs 4.34% with ttwu+vincent. The ttwu patches make pipe writers yield
> the CPU far more aggressively after each write, allowing the reader to run
> immediately. Stock EEVDF leaves this to the scheduler's own timing, which
> results in more latency and lower throughput.

I collected a trace for stress-ng --pipe 2 on a 2 CPU system (6.8 kernel) and
I can see it ends up with 4 tasks, 2 almost always running and 2 that sleep
and wake up, rather rapidly.

	stress-ng 1: 58% RUNNING, 41% RUNNABLE, ~1% sleeping
	stress-ng 2: 41.5% RUNNING, 58.5% RUNNABLE, ~1% sleeping
	stress-ng 3: 40.1% RUNNING, 22.2% RUNNABLE, ~37.6% sleeping
	stress-ng 4: 59.9% RUNNING, 21.7% RUNNABLE, ~18.5% sleeping

The avg RUNNING time of these tasks is a few 10s of us and the min is 100s of
ns.. It seems the tasks are pinned too, 2 per cpu.

I hope your real workload doesn't behave this way, this is very inefficient :)

>
> The higher absolute percentage in finish_task_switch for vincent is expected:
> vincent completes ~24% more pipe operations in the same wall time, so there
> are proportionally more context switches completing.
>
> On 6.6 (from the call-graph profile recorded separately), finish_task_switch
> is not visible as a top-level hotspot at all -- consistent with CFS handling
> this path much more efficiently.

It could also be about the wakeup preemption pattern. The pattern I see is
that one task wakes up and runs for a bit before the other task wakes up
rapidly 4 times. The first 3 times it preempts with ~0.5us, but the last time
it waits behind the original task until it sleeps, which takes ~9us.

If I do

	echo NO_WAKEUP_PREEMPTION | sudo tee /sys/kernel/debug/sched/features

I can see the bogo ops/s jump by 16%. The two tasks now interleave equally and
the tasks that had ~1% sleeping time now go up to 7 and 10% of sleeping time.

You can achieve the same outcome by running as SCHED_BATCH

	chrt -b 0 stress-ng --pipe 4 --timeout 20s --metrics-brief

>
> > Maybe there's higher rq lock contention. But this finish_task_switch and
> > __raw_spin_unlock_irqrestore are common to see, especially when there's
> > high context switch rate.
>
> Agreed -- I cannot rule out rq lock contention without perf diff with
> matched build-IDs. The pattern I see (finish_task_switch dominant, driven
> by pipe_read/write) is consistent with high context switch rate rather than
> a pathological lock. But your point about a 'hot variable' like rq->clock
> is noted -- I cannot confirm or deny that from flat profiles alone.
>
> > I hope perfetto trace will help visualize the pattern that led to this
> > higher context switching.
>
> I will work on getting a perfetto trace. Expecting to have that in a
> follow-up.
>
> Tom
On 5/15/26 09:24, Tom Gebhardt wrote:
> Hi Qais,
>
> Thanks for the follow-up. Here are the patch isolation results and answers to your questions.
>
> Regarding the governor:
>
> Yes, I'm running `ondemand`, not `schedutil`. My mistake for not mentioning that upfront - I
> assumed the improvement was due to the util_est path being triggered regardless of the governor.
> The improvement is clearly measurable even with `ondemand`, which is surprising given that your
> patches specifically target `schedutil`.

Something is wrong, as Qais mentioned raspberry pi (SMP) with ondemand
shouldn't be affected by util_est changes. In particular with 4 mostly-running
workers and 4 CPUs.

Does patch 12 also show similar effects with powersave/performance cpufreq
governor?

Qais also split patch 12 out separately and Vincent posted a fix, care to give
that a try?
https://lore.kernel.org/lkml/agRyoe1wHyZ-vMk9@vingu-cube/

Thanks for testing these, I'll try to reproduce what you're seeing, too.

> [snip]
Hi Christian,

Good point -- I ran additional tests with `performance` and `ondemand` governors
side by side on the same kernel (7.0.0 + ttwu + patch 12 only):

  Clock     Governor      pipe bogo ops/s
  --------  ------------  ----------------
  2400 MHz  performance   2 095 187
  2400 MHz  ondemand      2 093 221
  2800 MHz  performance   2 415 817
  2800 MHz  ondemand      2 415 617

The difference between governors is <0.1% -- well within noise. So you are
right: the effect is not cpufreq-related. Whatever patch 12 changes, it
affects the scheduler path directly, not through frequency selection.

I also applied Vincent's fix [1] and benchmarked it:

  Kernel                  Clock     pipe bogo ops/s   Δ vs. 6.6.78
  ----------------------  --------  ----------------  ------------
  6.6.78                  2400 MHz  2 129 330         ±0%
  6.6.78                  2800 MHz  2 487 746         ±0%
  7.0 + ttwu + patch 12   2400 MHz  2 093 221         −1.7%
  7.0 + ttwu + patch 12   2800 MHz  2 415 617         −2.9%
  7.0 + ttwu + Vincent    2400 MHz  2 077 526         −2.4%
  7.0 + ttwu + Vincent    2800 MHz  2 458 151         −1.2%

Vincent's fix gets very close to 6.6 at 2800 MHz (−1.2%) and is similar to
patch 12 at 2400 MHz. Both are a large improvement over vanilla 7.0+ttwu
(−22% at 2800 MHz) and plain 7.0 stock (−26% at 2800 MHz).

Note: [1] applied with a manual context fixup for the DELAY_DEQUEUE hunk --
the rpi-7.0.y tree's dequeue_entity() differs slightly from mainline in that
block (no update_entity_lag() call inside the DELAY_DEQUEUE early-return).
The semantic intent of the hunk was preserved.

[1] https://lore.kernel.org/lkml/agRyoe1wHyZ-vMk9@vingu-cube/

Thanks for catching that and for offering to reproduce it.

Tom
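For anyone wanting to recompute the Δ column and the "<0.1%" governor gap from the raw figures in the mail above, a minimal sketch follows. The names `delta_pct`, `deltas` and `governor_gap_pct` are illustrative only and not part of any tooling mentioned in the thread; the numbers are copied from the two tables.

```python
# Raw pipe bogo ops/s figures copied from the tables in the mail above.
# All function and variable names here are illustrative assumptions.
BASELINE = {2400: 2_129_330, 2800: 2_487_746}  # 6.6.78, keyed by clock in MHz

RESULTS = {
    ("7.0 + ttwu + patch 12", 2400): 2_093_221,
    ("7.0 + ttwu + patch 12", 2800): 2_415_617,
    ("7.0 + ttwu + Vincent", 2400): 2_077_526,
    ("7.0 + ttwu + Vincent", 2800): 2_458_151,
}

# Same kernel (7.0.0 + ttwu + patch 12), two governors: (performance, ondemand).
GOVERNORS = {2400: (2_095_187, 2_093_221), 2800: (2_415_817, 2_415_617)}


def delta_pct(value, baseline):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


def deltas():
    """Delta vs. 6.6.78 at the same clock, rounded to one decimal place."""
    return {
        key: round(delta_pct(ops, BASELINE[key[1]]), 1)
        for key, ops in RESULTS.items()
    }


def governor_gap_pct(mhz):
    """Absolute performance-vs-ondemand gap at one clock, in percent."""
    performance, ondemand = GOVERNORS[mhz]
    return abs(delta_pct(ondemand, performance))


if __name__ == "__main__":
    for (kernel, mhz), d in deltas().items():
        print(f"{kernel:<24}{mhz} MHz  {d:+.1f}%")
    for mhz in GOVERNORS:
        print(f"governor gap at {mhz} MHz: {governor_gap_pct(mhz):.3f}%")
```

Each delta is taken against the 6.6.78 result at the same clock, which is how the table appears to be built.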
On 5/15/26 14:57, Tom Gebhardt wrote:
> Hi Christian,
>
> Good point -- I ran additional tests with `performance` and `ondemand` governors
> side by side on the same kernel (7.0.0 + ttwu + patch 12 only):
>
>   Clock     Governor      pipe bogo ops/s
>   --------  ------------  ----------------
>   2400 MHz  performance   2 095 187
>   2400 MHz  ondemand      2 093 221
>   2800 MHz  performance   2 415 817
>   2800 MHz  ondemand      2 415 617
>
> The difference between governors is <0.1% -- well within noise. So you are
> right: the effect is not cpufreq-related. Whatever patch 12 changes, it
> affects the scheduler path directly, not through frequency selection.
>
> I also applied Vincent's fix [1] and benchmarked it:
>
>   Kernel                  Clock     pipe bogo ops/s   Δ vs. 6.6.78
>   ----------------------  --------  ----------------  ------------
>   6.6.78                  2400 MHz  2 129 330         ±0%
>   6.6.78                  2800 MHz  2 487 746         ±0%
>   7.0 + ttwu + patch 12   2400 MHz  2 093 221         −1.7%
>   7.0 + ttwu + patch 12   2800 MHz  2 415 617         −2.9%
>   7.0 + ttwu + Vincent    2400 MHz  2 077 526         −2.4%
>   7.0 + ttwu + Vincent    2800 MHz  2 458 151         −1.2%
>
> Vincent's fix gets very close to 6.6 at 2800 MHz (−1.2%) and is similar to
> patch 12 at 2400 MHz. Both are a large improvement over vanilla 7.0+ttwu
> (−22% at 2800 MHz) and plain 7.0 stock (−26% at 2800 MHz).

I tried to replicate using orion o6 and offlining all big CPUs leaving
4 little CPUs and an SMP system.

Workload:
for i in $(seq 0 19); do stress-ng --pipe 4 --pipe-ops 5000000 --metrics-brief --timeout 60 ; sleep 60 ; done

Results: (bogo ops/s real time)
7.1-rc3 powersave:         27186.17 ± 813.42
7.1-rc3vingu powersave:    26866.67 ± 899.51
7.1-rc3 performance:       78223.83 ± 4344.88
7.1-rc3vingu performance:  77289.57 ± 3321.10

As expected there's no significant change with Vincent's patch.
I didn't notice anything suspicious in the patch either, looks fine to me.
Next suspect is of course some interaction with Peter's ttwu series you've
applied.
Alternatively you could also push your exact tree somewhere and I'll go and
use that myself.

>
> Note: [1] applied with a manual context fixup for the DELAY_DEQUEUE hunk --
> the rpi-7.0.y tree's dequeue_entity() differs slightly from mainline in that
> block (no update_entity_lag() call inside the DELAY_DEQUEUE early-return).
> The semantic intent of the hunk was preserved.
>
> [1] https://lore.kernel.org/lkml/agRyoe1wHyZ-vMk9@vingu-cube/
>
> Thanks for catching that and for offering to reproduce it.
>
> Tom
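As a rough cross-check of the "no significant change" reading of the orion o6 results in the reply above, the sketch below compares each pair of means against the reported spread. It assumes the value after "±" is one standard deviation over the 20 runs, which the thread does not state, and it uses a deliberately crude criterion (the means differ by no more than the larger spread) rather than a proper statistical test. The names `within_noise` and `summary` are illustrative.

```python
# (mean, spread) in bogo ops/s real time, copied from the results above.
# Assumption: the number after "±" is one standard deviation; the thread
# does not say which statistic it is.
RESULTS = {
    "powersave": {
        "7.1-rc3": (27186.17, 813.42),
        "7.1-rc3vingu": (26866.67, 899.51),
    },
    "performance": {
        "7.1-rc3": (78223.83, 4344.88),
        "7.1-rc3vingu": (77289.57, 3321.10),
    },
}


def within_noise(base, patched):
    """Crude check: do the two means differ by no more than the larger spread?"""
    (mean_a, spread_a), (mean_b, spread_b) = base, patched
    return abs(mean_a - mean_b) <= max(spread_a, spread_b)


def summary():
    """Map each governor to whether base and patched kernels are within noise."""
    return {
        governor: within_noise(runs["7.1-rc3"], runs["7.1-rc3vingu"])
        for governor, runs in RESULTS.items()
    }


if __name__ == "__main__":
    for governor, ok in summary().items():
        print(f"{governor}: {'within noise' if ok else 'differs beyond noise'}")
```

With the raw per-run samples, a two-sample test would be the better tool; only the summary figures are available in the thread.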
Hi Christian,

Here is the patch I applied on top of rpi-7.0.y:

https://gist.github.com/Kletternaut/640445f82d2c1f50d5b19d2f6803b395

It is a single combined patch containing marioroy's 10-patch ttwu series plus
Vincent's util_est refactoring (with a manual context fixup in the
DELAY_DEQUEUE hunk -- see previous mail for details).

To reproduce:

git clone https://github.com/raspberrypi/linux.git -b rpi-7.0.y --depth=1
cd linux
curl -L https://gist.githubusercontent.com/Kletternaut/640445f82d2c1f50d5b19d2f6803b395/raw/0001-sched-ttwu-patch-series-Vincent-s-util_est-refactori.patch | git apply

The benchmark I used:

stress-ng --pipe 4 --metrics-brief --timeout 20
(stress-ng 0.15.06, 4 workers, 20 s, bogo ops/s real time)

Tom
On 5/25/26 08:25, Tom Gebhardt wrote:
> Hi Christian,
>
> Here is the patch I applied on top of rpi-7.0.y:
>
> https://gist.github.com/Kletternaut/640445f82d2c1f50d5b19d2f6803b395
>
> It is a single combined patch containing marioroy's 10-patch ttwu series plus Vincent's util_est refactoring (with a manual context fixup in the
> DELAY_DEQUEUE hunk -- see previous mail for details).
>
> To reproduce:
>
> git clone https://github.com/raspberrypi/linux.git -b rpi-7.0.y --depth=1
> cd linux
> curl -L https://gist.githubusercontent.com/Kletternaut/640445f82d2c1f50d5b19d2f6803b395/raw/0001-sched-ttwu-patch-series-Vincent-s-util_est-refactori.patch | git apply
>

Unfortunately that branch is not fixed and so the patch doesn't apply :/
What's your base commit?
Hi Christian,
The base commit is:
bb31e96fee23a474a0504a15097d7ee55bed678e
("DTS: set default nvme Host Memory Buffer size to 32MB on BCM2711/2")
So:
git clone https://github.com/raspberrypi/linux.git -b rpi-7.0.y
cd linux
git checkout bb31e96fee23a474a0504a15097d7ee55bed678e
curl -L https://gist.githubusercontent.com/Kletternaut/640445f82d2c1f50d5b19d2f6803b395/raw/0001-sched-ttwu-patch-series-Vincent-s-util_est-refactori.patch | git apply
Sorry for the inconvenience.
Tom
Hi Christian,

Thanks for trying to replicate -- and your result actually confirms the
picture: without Peter's ttwu series, Vincent's fix has no measurable effect.
That's consistent with what I see: the improvement from patch 12 / Vincent's
fix only shows up *on top of* the ttwu patches, not standalone.

So the interaction seems to be:

  ttwu patches alone      → −22% vs. 6.6 at OC
  ttwu + util_est fix     → −1.2% vs. 6.6 at OC (large recovery)
  vanilla 7.0 (no ttwu)   → −26% vs. 6.6 at OC
  vanilla 7.0 + util_est  → no significant change (your result)

This suggests the ttwu series changes something in the dequeue path that
exposes the util_est timing issue, and the fix only matters in that context.

I would be happy to push the exact tree so you can reproduce it directly.
However, I currently have a hardware issue with the Pi and cannot test or
prepare the tree right now. I'll push it to GitHub as soon as I'm back up
and let you know here.

Tom
On Sun, May 3, 2026 at 7:00 PM Qais Yousef <qyousef@layalina.io> wrote:
>
> This is the long delayed follow up to the series sent back in August 2024 [1].
> Life got in the way to some extent (I had a baby, and now my time that I used
> to do upstream work late at night was stolen :). Apologies for those who
> replied and I didn't get a chance to respond back.
>
...
> Open questions:
>
> * The details of the QoS interface is the biggest one.
> * Would debugfs be better for setting the default rampup multiplier instead of sysctl?
> * Patch 13 makes updating load_avg unconditional not on period boundaries.
>
> Patches 1-3 are prepatory patches renaming a function and introducing new ones.
>
> Patches 4-5 handle the magic margin problem but making them dynamic based on
> actual hardware limitations.
>
> Patches 6-7 fix the black hole problem and teaches the scheduler how to handle
> bursty and periodic tasks via extending util_est.
>
> Patches 8-9 is where I expect most of the discussion on as I introduce a new
> sched_qos interface to support the new rampup_multiplier to help manage DVFS.
>
> Patches 10-11 introduces a couple of necessary optimizations to counter the
> power impact of increased responsiveness by disabling some features that we now
> know how to handle better.
>
> Patches 12-13 fix a couple of issues causing util_est and util_avg value to
> swing for a periodic task. Patch 12 must go via stable.

Just a minor nit, If 12/13 are fixes, should they not be at the front
of the series (or possibly sent separately) so they can potentially
move forward while the bigger changes in this series are discussed?

thanks
-john
On 05/11/26 10:58, John Stultz wrote:
> On Sun, May 3, 2026 at 7:00 PM Qais Yousef <qyousef@layalina.io> wrote:
> >
> > This is the long delayed follow up to the series sent back in August 2024 [1].
> > Life got in the way to some extent (I had a baby, and now my time that I used
> > to do upstream work late at night was stolen :). Apologies for those who
> > replied and I didn't get a chance to respond back.
> >
> ...
> > Open questions:
> >
> > * The details of the QoS interface is the biggest one.
> > * Would debugfs be better for setting the default rampup multiplier instead of sysctl?
> > * Patch 13 makes updating load_avg unconditional not on period boundaries.
> >
> > Patches 1-3 are prepatory patches renaming a function and introducing new ones.
> >
> > Patches 4-5 handle the magic margin problem but making them dynamic based on
> > actual hardware limitations.
> >
> > Patches 6-7 fix the black hole problem and teaches the scheduler how to handle
> > bursty and periodic tasks via extending util_est.
> >
> > Patches 8-9 is where I expect most of the discussion on as I introduce a new
> > sched_qos interface to support the new rampup_multiplier to help manage DVFS.
> >
> > Patches 10-11 introduces a couple of necessary optimizations to counter the
> > power impact of increased responsiveness by disabling some features that we now
> > know how to handle better.
> >
> > Patches 12-13 fix a couple of issues causing util_est and util_avg value to
> > swing for a periodic task. Patch 12 must go via stable.
>
> Just a minor nit, If 12/13 are fixes, should they not be at the front
> of the series (or possibly sent separately) so they can potentially
> move forward while the bigger changes in this series are discussed?

Yeah my plan was to split it and repost it with proper Fixes tag. I found it
while verifying my patches so lumped it at the end. Will repost as soon as
I can.