[v1] sched/fair: Allow account_cfs_rq_runtime() to throttle current hierarchy

[PATCH 0/5] sched/fair: Allow account_cfs_rq_runtime() to throttle current hierarchy

Posted by K Prateek Nayak 1 week, 4 days ago

Today, the current hierarchy is always throttled in __schedule() during
the pick, after update_curr() detects a cfs_rq running out of bandwidth
and issues a resched.

This was necessary prior to per-task throttling, where the entire
throttled hierarchy was dequeued at the point of first throttle during
the pick. With per-task throttling, tasks continue to run as usual
until they exit to userspace and dequeue themselves one by one, until
the hierarchy is deemed fully throttled and PELT is frozen.

throttle_cfs_rq() is now simply a propagator of throttle indicators and
nothing more.

Unify the throttling for the current hierarchy under
account_cfs_rq_runtime(), which is responsible for the time accounting.
If the bandwidth runs out, account_cfs_rq_runtime() will request
sched_cfs_bandwidth_slice() worth of runtime and mark the hierarchy as
throttled if it fails to grab bandwidth.
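
To make the flow above concrete, here is a minimal userspace sketch of
"account, try to refill one slice, throttle on failure". It is a toy
model and not the code from the series: every toy_* name and struct
field is a hypothetical stand-in, and locking, period timers and the
hierarchy walk are left out on purpose.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the global bandwidth pool (hypothetical fields). */
struct toy_cfs_bandwidth {
	int64_t runtime;	/* runtime left in the global pool, in ns */
	int64_t slice;		/* amount requested per refill, in ns */
};

/* Toy stand-in for a per-CPU cfs_rq (hypothetical fields). */
struct toy_cfs_rq {
	int64_t runtime_remaining;	/* local runtime left, in ns */
	bool throttled;			/* set once the refill fails */
};

/*
 * Try to top up the local pool from the global one. Ask for one slice
 * plus whatever deficit has already been run up, but never take more
 * than the global pool holds. Returns true if the local pool is
 * positive afterwards.
 */
static bool toy_assign_runtime(struct toy_cfs_bandwidth *b,
			       struct toy_cfs_rq *rq)
{
	int64_t want = b->slice - rq->runtime_remaining;
	int64_t got = want < b->runtime ? want : b->runtime;

	b->runtime -= got;
	rq->runtime_remaining += got;

	return rq->runtime_remaining > 0;
}

/*
 * Charge delta_exec to the local pool. If the pool runs dry and the
 * refill fails, mark the cfs_rq as throttled right here instead of
 * waiting for the next pick.
 */
static void toy_account_runtime(struct toy_cfs_bandwidth *b,
				struct toy_cfs_rq *rq, int64_t delta_exec)
{
	rq->runtime_remaining -= delta_exec;
	if (rq->runtime_remaining > 0)
		return;

	if (!toy_assign_runtime(b, rq))
		rq->throttled = true;
}
```

In this model the throttle decision is taken at accounting time, which
is the unification described above; what happens to the running task
afterwards is a separate step.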

throttle_cfs_rq() will call task_throttle_setup_work() if it finds the
current task to be on a throttled hierarchy, and the task will naturally
dequeue itself when it exits to userspace without needing an explicit
resched.
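
The deferred dequeue can be pictured with a similarly small toy model.
Again, this is only an illustration under stated assumptions: the
toy_task fields and both toy_* helpers are hypothetical, and the real
mechanism relies on task work rather than a plain flag.

```c
#include <stdbool.h>

/* Toy stand-in for a task (hypothetical fields). */
struct toy_task {
	bool on_throttled_hierarchy;	/* some ancestor cfs_rq is throttled */
	bool throttle_work_pending;	/* deferred dequeue has been armed */
	bool queued;			/* still on the runqueue */
};

/*
 * Called for the current task when its hierarchy gets throttled.
 * Nothing is dequeued here and no resched is requested; the work is
 * only armed so the task keeps running until it leaves the kernel.
 */
static void toy_throttle_setup_work(struct toy_task *curr)
{
	if (curr->on_throttled_hierarchy && !curr->throttle_work_pending)
		curr->throttle_work_pending = true;
}

/*
 * Called on the way back to userspace. If the work was armed, the task
 * dequeues itself at this point.
 */
static void toy_exit_to_user(struct toy_task *t)
{
	if (!t->throttle_work_pending)
		return;

	t->throttle_work_pending = false;
	t->queued = false;
}
```

The point of the model is the ordering: the task stays queued between
toy_throttle_setup_work() and toy_exit_to_user(), so kernel-mode work
finishes before the task is taken off the runqueue.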

The first four patches are cleanups and preparation for the final
patch, provided by Peter in [1], which switches over to using
account_cfs_rq_runtime() for throttling.

Following are the results of running hackbench 3 levels deep with the
setup from the "Testing" section of [2], compared to tip:sched/core:

  kernel        :  tip        tip + series

  Min           : 207.33        202.20
  Max           : 210.20        222.47
  Median        : 207.83        218.33
  AMean         : 208.29        215.36
  GMean         : 208.29        215.25
  HMean         : 208.29        215.13
  AMean Stddev  : 1.02          7.37
  AMean CoefVar : 0.49 pct      3.42 pct

  All numbers are in seconds.
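
For readers who want to check the derived rows, the arithmetic is
sketched below. This assumes "AMean CoefVar" is the standard deviation
divided by the arithmetic mean, expressed in percent; the helper names
are made up for this example.

```c
/* Coefficient of variation in percent: 100 * stddev / mean. */
static double coef_var_pct(double mean, double stddev)
{
	return 100.0 * stddev / mean;
}

/* Relative change of head against base in percent. */
static double rel_change_pct(double base, double head)
{
	return 100.0 * (head - base) / base;
}
```

With the numbers from the table, coef_var_pct(208.29, 1.02) and
coef_var_pct(215.36, 7.37) round to the 0.49 pct and 3.42 pct shown
above, and rel_change_pct(208.29, 215.36) puts the AMean with the
series roughly 3.4 pct above tip.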

There is a slight boot-to-boot variation for this benchmark, but the
utilization numbers in top are more or less similar between the two.
Additional testing and feedback are always appreciated as usual :-)

Patches are based on tip:sched/core at commit 9e005ed21152
("sched/topology: Allow multiple domains to claim sched_domain_shared")
All testing was done on a dual-socket 4th Generation EPYC system (2 x
128C/256T). CONFIG_CFS_BANDWIDTH=n was only build-tested.

Patches also cleanly apply on top of Zecheng's optimization from [3]
when applied on top of the same base. Peter, there is only one trivial
conflict with sched/flat, and Zecheng's optimization is generally
beneficial for deep hierarchies even with flattened pick.

References
==========

[1] https://lore.kernel.org/lkml/20260512110932.GB1889694@noisy.programming.kicks-ass.net/
[2] https://lore.kernel.org/lkml/20250220093257.9380-1-kprateek.nayak@amd.com/
[3] https://lore.kernel.org/lkml/20260522141623.600235-4-zli94@ncsu.edu/

---
K Prateek Nayak (4):
  sched/fair: Convert cfs bandwidth throttling to use guards
  sched/fair: Use throttled_csd_list for local unthrottle
  sched/fair: Call update_curr() before unthrottling the hierarchy
  sched/fair: Move the throttled tasks to a local list in
    tg_unthrottle_up()

Peter Zijlstra (1):
  sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime()

 kernel/sched/fair.c | 342 +++++++++++++++++++++++---------------------
 1 file changed, 181 insertions(+), 161 deletions(-)


base-commit: 9e005ed21152d4a4bb0ceea71045ff8a642a6feb
-- 
2.34.1

Re: [PATCH 0/5] sched/fair: Allow account_cfs_rq_runtime() to throttle current hierarchy

Posted by Peter Zijlstra 1 week, 4 days ago

On Thu, May 28, 2026 at 09:48:25AM +0000, K Prateek Nayak wrote:
> Patches also cleanly apply on top of Zecheng's optimization from [3]
> when applied on top of the same base. Peter, there is only one trivial
> conflict with sched/flat, and Zecheng's optimization is generally
> beneficial for deep hierarchies even with flattened pick.

Ah, yes, I had meant to go look at how bad that conflict was, but just
hadn't gotten around to it.

Let me go have a play.

Re: [PATCH 0/5] sched/fair: Allow account_cfs_rq_runtime() to throttle current hierarchy

Posted by Aaron Lu 1 week ago

Hi Prateek,

On Thu, May 28, 2026 at 09:48:25AM +0000, K Prateek Nayak wrote:
> There is a slight boot to boot variation for this benchmark but the
> utilization numbers in top is more or less similar between the two.
> Additional testing and feedback is always appreciated as usual :-)

I tested hackbench and netperf with quota set on a 2-socket Intel EMR
and the results are within the noise range.

Hackbench (in seconds, lower is better)
base:  176.114420±2
head:  176.214394±3

Netperf (throughput, higher is better)
base:  14071, min/max: 13376/15261
head:  14769, min/max: 14095/15588

Feel free to add my Tested-by tag after the clock warning is fixed in
patch 3.