[v2] sched/fair/schedutil: Better manage system response time

[PATCH v2 00/13] sched/fair/schedutil: Better manage system response time

Posted by Qais Yousef 1 month, 1 week ago

This is the long-delayed follow-up to the series sent back in August 2024 [1].
Life got in the way to some extent (I had a baby, and the late-night time I used
to spend on upstream work was stolen :). Apologies to those who replied and to
whom I didn't get a chance to respond.

The series is now rebased on top of tip/sched/core 78cde54ea5f0. I removed
a number of optimization patches that are not necessary for this initial merge
and can be treated as their own separate topics once this is hopefully
accepted.

I discussed the problem at LPC 2024 [2], and the initial cover letter
contains all the details. I hope all the key parties are up to date on the
problem details by now.

As a brief recap, there are some hardcoded constants in the kernel that
introduce a bias that frequently fails to deliver the best outcome on various
systems. It turns out these constants seem to help somewhat against a bigger
problem: utilization signal distortion caused by utilization invariance, which
produces what I call the black hole effect. The lower the capacity, the harder
it is to accumulate enough runtime for the signal to rise, which acts like
a gravitational pull causing time dilation.
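
To make the effect concrete, here is a minimal userspace sketch of it. It is
a simplified PELT-like model, not the kernel implementation: the signal decays
geometrically with a 32ms half-life, and each 1ms of runtime contributes in
proportion to the capacity the task runs at. The names model_ms_to_reach(),
MODEL_Y and the 1ms step are assumptions made for illustration only.

```c
/*
 * Simplified model of an invariant utilization signal (illustration only,
 * this is NOT the kernel's PELT code).
 *
 * MODEL_Y is the per-millisecond decay factor, chosen so that
 * MODEL_Y^32 ~= 0.5, i.e. a 32ms half-life.
 */
#define MODEL_Y		0.97857206

/*
 * Return how many milliseconds of continuous running are needed for the
 * modeled signal to reach 'target', starting from 0, when the task runs at
 * 'capacity' (on the usual 0..1024 scale).
 *
 * Returns -1 if the target can never be reached at that capacity, since the
 * modeled signal saturates at 'capacity'.
 */
static int model_ms_to_reach(double target, double capacity)
{
	double util = 0.0;
	int ms = 0;

	if (target >= capacity)
		return -1;

	while (util < target) {
		/* Decay the old value, then add 1ms of capacity-scaled runtime. */
		util = util * MODEL_Y + capacity * (1.0 - MODEL_Y);
		ms++;
	}

	return ms;
}
```

In this model, reaching a utilization of 200 takes roughly 11ms at capacity
1024, roughly 23ms at capacity 512 and roughly 71ms at capacity 256. The lower
the capacity, the longer the same task needs to look equally busy.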

One of the major difficulties we will face is that this distortion turns out to
be bad for performance but good for power. The fix will inevitably rebalance
the system in the right way, but also in a potentially surprising way that may
leave some people unhappy. sched_features were added to ensure those unhappy
folks can revert the system to the old behavior while still allowing us to make
the right progress.

That is, to retain the older behavior one must:

echo 0 | sudo tee /proc/sys/kernel/sched_qos_default_rampup_multiplier
echo CONST_DVFS_HEADROOM NO_UTIL_EST_RAMPUP_ZERO UTIL_EST_FORCE_POST_INIT > /sys/kernel/debug/sched/features

Note that there is no sched feature for the migration margin, since I think the
old behavior was worse for both perf and power and there is no need to revert
to it.

The system is going to be a lot faster now by default with
sched_qos_default_rampup_multiplier=1 since it fixes the distortion issue and
provides a constant rise time regardless of DVFS latencies.

The desired behavior is for the default rampup_multiplier to be 0 and for only
interactive tasks to request a higher rampup multiplier. Preliminary
integration with schedqos is available [3] for those who want to see the full
benefit of fine-grained control to manage perf and power.

Open questions:

* The details of the QoS interface are the biggest one.
* Would debugfs be better for setting the default rampup multiplier instead of sysctl?
* Patch 13 makes load_avg updates unconditional rather than only on period boundaries.

Patches 1-3 are preparatory patches renaming a function and introducing new ones.

Patches 4-5 handle the magic margin problem by making the margins dynamic based
on actual hardware limitations.
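
For reference, the two fixed margins in question can be sketched in userspace
as below. The arithmetic mirrors the existing mainline fits_capacity() check
and the 1.25 headroom applied by map_util_perf(); the *_fixed function names
are made up for this illustration. How the series derives the dynamic values
is not shown here.

```c
/*
 * Userspace copies of the two hardcoded margins (illustration only; the
 * *_fixed names are hypothetical, the arithmetic follows mainline).
 */

/*
 * Migration margin: a utilization fits a CPU only if it stays below ~80%
 * of that CPU's capacity (1024 / 1280 = 0.8).
 */
static inline int fits_capacity_fixed(unsigned long util,
				      unsigned long capacity)
{
	return util * 1280 < capacity * 1024;
}

/*
 * DVFS headroom: the frequency request is made for util * 1.25,
 * regardless of how quickly the hardware can actually change frequency.
 */
static inline unsigned long dvfs_headroom_fixed(unsigned long util)
{
	return util + (util >> 2);
}
```

With these constants, a utilization of 819 still fits a CPU of capacity 1024
while 820 does not, and a utilization of 800 results in a request for 1000,
whatever the underlying hardware is capable of.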

Patches 6-7 fix the black hole problem and teach the scheduler how to handle
bursty and periodic tasks by extending util_est.

Patches 8-9 are where I expect most of the discussion, as they introduce a new
sched_qos interface to support the new rampup_multiplier to help manage DVFS.

Patches 10-11 introduce a couple of necessary optimizations to counter the
power impact of increased responsiveness by disabling some features that we now
know how to handle better.

Patches 12-13 fix a couple of issues causing the util_est and util_avg values
to swing for a periodic task. Patch 12 must go via stable.

The Mac mini M1 system I previously tested on is down, and it has proven
difficult to revive before sending this series. I will revive it and repeat the
testing to ensure all is okay after the rebase.

I did test it on an AMD system, but it has only 3 freqs so there are no real
perf numbers to report since it just whizzes by these 3 freqs anyway. But I did
spend enough time to verify that util_est behaves as expected under different
scenarios. More testing would still be appreciated :)

[1] https://lore.kernel.org/lkml/20240820163512.1096301-1-qyousef@layalina.io/
[2] https://lpc.events/event/18/contributions/1880/
[3] https://github.com/qais-yousef/schedqos/compare/main...schedqos

Qais Yousef (13):
  sched: cpufreq: Rename map_util_perf to sugov_apply_dvfs_headroom
  sched/pelt: Add a new function to approximate the future util_avg
    value
  sched/pelt: Add a new function to approximate runtime to reach given
    util
  sched/fair: Remove magic hardcoded margin in fits_capacity()
  sched: cpufreq: Remove magic 1.25 headroom from
    sugov_apply_dvfs_headroom()
  sched/fair: Extend util_est to improve rampup time
  sched/fair: util_est: Take into account periodic tasks
  sched/qos: Add a new sched-qos interface
  sched/qos: Add rampup multiplier QoS
  sched/fair: Disable util_est when rampup_multiplier is 0
  sched/fair: Don't mess with util_avg post init
  sched/fair: Call update_util_est() after dequeue_entities()
  sched/pelt: Always allow load updates

--
2.34.1