Task based throttle follow ups

[PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq

Posted by Aaron Lu 3 weeks, 1 day ago

Before task based throttle model, propagating load will stop at a
throttled cfs_rq and that propagate will happen on unthrottle time by
update_load_avg().

Now that there is no update_load_avg() on unthrottle for throttled
cfs_rq and all load tracking is done by task related operations, let the
propagate happen immediately.

While at it, add a comment to explain why cfs_rqs that are not affected
by throttle have to be added to leaf cfs_rq list in
propagate_entity_cfs_rq() per my understanding of commit 0258bdfaff5b
("sched/fair: Fix unfairness caused by missing load decay").

Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
 kernel/sched/fair.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df8dc389af8e1..f993de30e1466 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5729,6 +5729,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 	return cfs_bandwidth_used() && cfs_rq->throttled;
 }
 
+static inline bool cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
+{
+	return cfs_bandwidth_used() && cfs_rq->pelt_clock_throttled;
+}
+
 /* check whether cfs_rq, or any parent, is throttled */
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
@@ -6721,6 +6726,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 	return 0;
 }
 
+static inline bool cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
+{
+	return false;
+}
+
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
 	return 0;
@@ -13151,10 +13161,13 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-	if (cfs_rq_throttled(cfs_rq))
-		return;
-
-	if (!throttled_hierarchy(cfs_rq))
+	/*
+	 * If a task gets attached to this cfs_rq and before being queued,
+	 * it gets migrated to another CPU due to reasons like affinity
+	 * change, make sure this cfs_rq stays on leaf cfs_rq list to have
+	 * that removed load decayed or it can cause faireness problem.
+	 */
+	if (!cfs_rq_pelt_clock_throttled(cfs_rq))
 		list_add_leaf_cfs_rq(cfs_rq);
 
 	/* Start to propagate at parent */
@@ -13165,10 +13178,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
 
-		if (cfs_rq_throttled(cfs_rq))
-			break;
-
-		if (!throttled_hierarchy(cfs_rq))
+		if (!cfs_rq_pelt_clock_throttled(cfs_rq))
 			list_add_leaf_cfs_rq(cfs_rq);
 	}
 }
-- 
2.39.5
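
The behaviour change described in the commit message can be pictured with a small userspace toy model. This is only a sketch under stated assumptions, not kernel code: `toy_cfs_rq`, `toy_propagate_old()` and `toy_propagate_new()` are hypothetical names, the hierarchy is flattened into an array (index 0 is the cfs_rq the entity sits on, the last index is the root), and the counter merely stands in for calls to update_load_avg().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cfs_rq: only the throttled flag is modelled. */
struct toy_cfs_rq {
	bool throttled;
};

/*
 * Old walk: return early if the first cfs_rq is throttled, and stop the
 * upward walk at the first throttled ancestor.
 * Returns how many ancestors would have had their load updated.
 */
size_t toy_propagate_old(const struct toy_cfs_rq *h, size_t n)
{
	size_t updated = 0;

	if (n == 0 || h[0].throttled)
		return 0;

	for (size_t i = 1; i < n; i++) {
		updated++;		/* stands in for update_load_avg() */
		if (h[i].throttled)
			break;
	}
	return updated;
}

/* New walk: throttled state no longer stops the propagation. */
size_t toy_propagate_new(const struct toy_cfs_rq *h, size_t n)
{
	size_t updated = 0;

	(void)h;			/* throttled flags are ignored here */
	for (size_t i = 1; i < n; i++)
		updated++;		/* stands in for update_load_avg() */
	return updated;
}
```

With a four-level hierarchy whose second level is throttled, the old walk touches one ancestor and the new walk touches all three. The toy deliberately leaves out the leaf list handling, which the patch gates on cfs_rq_pelt_clock_throttled().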

Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq

Posted by Matteo Martelli 1 week, 2 days ago

Hi Aaron,

On Wed, 10 Sep 2025 17:50:41 +0800, Aaron Lu <ziqianlu@bytedance.com> wrote:
> Before task based throttle model, propagating load will stop at a
> throttled cfs_rq and that propagate will happen on unthrottle time by
> update_load_avg().
> 
> Now that there is no update_load_avg() on unthrottle for throttled
> cfs_rq and all load tracking is done by task related operations, let the
> propagate happen immediately.
> 
> While at it, add a comment to explain why cfs_rqs that are not affected
> by throttle have to be added to leaf cfs_rq list in
> propagate_entity_cfs_rq() per my understanding of commit 0258bdfaff5b
> ("sched/fair: Fix unfairness caused by missing load decay").
> 
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> ---

I have been testing again the patch set "[PATCH v4 0/5] Defer throttle
when task exits to user" [1] together with these follow up patches. I
found out that with this patch the kernel sometimes produces the warning
WARN_ON_ONCE(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list); in
assert_list_leaf_cfs_rq() called by enqueue_task_fair(). I could
reproduce this systematically by applying both [1] and this patch on top
of tag v6.17-rc6 and also by directly testing at commit fe8d238e646e
from sched/core branch of tip tree. I couldn't reproduce the warning by
testing at commit 5b726e9bf954 ("sched/fair: Get rid of
throttled_lb_pair()").

The test setup is the same used in my previous testing for v3 [2], where
the CFS throttling events are mostly triggered by the first ssh logins
into the system as the systemd user slice is configured with CPUQuota of
25%. Also note that the same systemd user slice is configured with CPU
affinity set to only one core. I added some tracing to trace functions
throttle_cfs_rq, tg_throttle_down, unthrottle_cfs_rq, tg_unthrottle_up,
and it looks like the warning is triggered after the last unthrottle
event, however I'm not sure the warning is actually related to the
printed trace below or not. See the following logs, which contain both
the traced function events and the kernel warning.

[   17.859264]  systemd-xdg-aut-1006    [000] dN.2.    17.865040: throttle_cfs_rq <-pick_task_fair
[   17.859264]  systemd-xdg-aut-1006    [000] dN.2.    17.865042: tg_throttle_down <-walk_tg_tree_from
[   17.859264]  systemd-xdg-aut-1006    [000] dN.2.    17.865042: tg_throttle_down <-walk_tg_tree_from
[   17.859264]  systemd-xdg-aut-1006    [000] dN.2.    17.865043: tg_throttle_down <-walk_tg_tree_from
[   17.876999]        ktimers/0-15      [000] d.s13    17.882601: unthrottle_cfs_rq <-distribute_cfs_runtime
[   17.876999]        ktimers/0-15      [000] d.s13    17.882603: tg_unthrottle_up <-walk_tg_tree_from
[   17.876999]        ktimers/0-15      [000] d.s13    17.882605: tg_unthrottle_up <-walk_tg_tree_from
[   17.876999]        ktimers/0-15      [000] d.s13    17.882605: tg_unthrottle_up <-walk_tg_tree_from
[   17.910250]          systemd-999     [000] dN.2.    17.916019: throttle_cfs_rq <-put_prev_entity
[   17.910250]          systemd-999     [000] dN.2.    17.916025: tg_throttle_down <-walk_tg_tree_from
[   17.910250]          systemd-999     [000] dN.2.    17.916025: tg_throttle_down <-walk_tg_tree_from
[   17.910250]          systemd-999     [000] dN.2.    17.916025: tg_throttle_down <-walk_tg_tree_from
[   17.977245]        ktimers/0-15      [000] d.s13    17.982575: unthrottle_cfs_rq <-distribute_cfs_runtime
[   17.977245]        ktimers/0-15      [000] d.s13    17.982578: tg_unthrottle_up <-walk_tg_tree_from
[   17.977245]        ktimers/0-15      [000] d.s13    17.982579: tg_unthrottle_up <-walk_tg_tree_from
[   17.977245]        ktimers/0-15      [000] d.s13    17.982580: tg_unthrottle_up <-walk_tg_tree_from
[   18.009244]          systemd-999     [000] dN.2.    18.015030: throttle_cfs_rq <-pick_task_fair
[   18.009244]          systemd-999     [000] dN.2.    18.015033: tg_throttle_down <-walk_tg_tree_from
[   18.009244]          systemd-999     [000] dN.2.    18.015033: tg_throttle_down <-walk_tg_tree_from
[   18.009244]          systemd-999     [000] dN.2.    18.015033: tg_throttle_down <-walk_tg_tree_from
[   18.076822]        ktimers/0-15      [000] d.s13    18.082607: unthrottle_cfs_rq <-distribute_cfs_runtime
[   18.076822]        ktimers/0-15      [000] d.s13    18.082609: tg_unthrottle_up <-walk_tg_tree_from
[   18.076822]        ktimers/0-15      [000] d.s13    18.082611: tg_unthrottle_up <-walk_tg_tree_from
[   18.076822]        ktimers/0-15      [000] d.s13    18.082611: tg_unthrottle_up <-walk_tg_tree_from
[   18.109820]          systemd-999     [000] dN.2.    18.115604: throttle_cfs_rq <-put_prev_entity
[   18.109820]          systemd-999     [000] dN.2.    18.115609: tg_throttle_down <-walk_tg_tree_from
[   18.109820]          systemd-999     [000] dN.2.    18.115609: tg_throttle_down <-walk_tg_tree_from
[   18.109820]          systemd-999     [000] dN.2.    18.115609: tg_throttle_down <-walk_tg_tree_from
[   18.177167]        ktimers/0-15      [000] d.s13    18.182630: unthrottle_cfs_rq <-distribute_cfs_runtime
[   18.177167]        ktimers/0-15      [000] d.s13    18.182632: tg_unthrottle_up <-walk_tg_tree_from
[   18.177167]        ktimers/0-15      [000] d.s13    18.182633: tg_unthrottle_up <-walk_tg_tree_from
[   18.177167]        ktimers/0-15      [000] d.s13    18.182634: tg_unthrottle_up <-walk_tg_tree_from
[   18.220827]          systemd-999     [000] dN.2.    18.226594: throttle_cfs_rq <-pick_task_fair
[   18.220827]          systemd-999     [000] dN.2.    18.226597: tg_throttle_down <-walk_tg_tree_from
[   18.220827]          systemd-999     [000] dN.2.    18.226597: tg_throttle_down <-walk_tg_tree_from
[   18.220827]          systemd-999     [000] dN.2.    18.226597: tg_throttle_down <-walk_tg_tree_from
[   18.220827]          systemd-999     [000] dN.2.    18.226598: tg_throttle_down <-walk_tg_tree_from
[   18.220827]          systemd-999     [000] dN.2.    18.226598: tg_throttle_down <-walk_tg_tree_from
[   18.220827]          systemd-999     [000] dN.2.    18.226598: tg_throttle_down <-walk_tg_tree_from
[   18.220827]          systemd-999     [000] dN.2.    18.226598: tg_throttle_down <-walk_tg_tree_from
[   18.276886]        ktimers/0-15      [000] d.s13    18.282606: unthrottle_cfs_rq <-distribute_cfs_runtime
[   18.276886]        ktimers/0-15      [000] d.s13    18.282608: tg_unthrottle_up <-walk_tg_tree_from
[   18.276886]        ktimers/0-15      [000] d.s13    18.282610: tg_unthrottle_up <-walk_tg_tree_from
[   18.276886]        ktimers/0-15      [000] d.s13    18.282610: tg_unthrottle_up <-walk_tg_tree_from
[   18.276886]        ktimers/0-15      [000] d.s13    18.282611: tg_unthrottle_up <-walk_tg_tree_from
[   18.276886]        ktimers/0-15      [000] d.s13    18.282611: tg_unthrottle_up <-walk_tg_tree_from
[   18.276886]        ktimers/0-15      [000] d.s13    18.282611: tg_unthrottle_up <-walk_tg_tree_from
[   18.276886]        ktimers/0-15      [000] d.s13    18.282611: tg_unthrottle_up <-walk_tg_tree_from
[   18.421349] ------------[ cut here ]------------
[   18.421350] WARNING: CPU: 0 PID: 1 at kernel/sched/fair.c:400 enqueue_task_fair+0x925/0x980
[   18.421355] Modules linked in: efivarfs
[   18.421360] CPU: 0 UID: 0 PID: 1 Comm: systemd Not tainted 6.17.0-rc4-00010-gfe8d238e646e #2 PREEMPT_{RT,(full)}
[   18.421362] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
[   18.421364] RIP: 0010:enqueue_task_fair+0x925/0x980
[   18.421366] Code: b5 48 01 00 00 49 89 95 48 01 00 00 49 89 bd 50 01 00 00 48 89 37 48 89 b0 70 0a 00 00 48 89 90 78 0a 00 00 e9 49 fa ff ff 90 <0f> 0b 90 e9 1c f9 ff ff 90 0f 0b 90 e9 59 fa ff ff 48 8b b0 88 0a
[   18.421367] RSP: 0018:ffff9c7c8001fa20 EFLAGS: 00010087
[   18.421369] RAX: ffff9358fdc29da8 RBX: 0000000000000003 RCX: ffff9358fdc29340
[   18.421370] RDX: ffff935881a89000 RSI: 0000000000000000 RDI: 0000000000000003
[   18.421371] RBP: ffff9358fdc293c0 R08: 0000000000000000 R09: 00000000b808a33f
[   18.421371] R10: 0000000000200b20 R11: 0000000011659969 R12: 0000000000000001
[   18.421372] R13: ffff93588214fe00 R14: 0000000000000000 R15: 0000000000200b20
[   18.421375] FS:  00007fb07deddd80(0000) GS:ffff935945f6d000(0000) knlGS:0000000000000000
[   18.421376] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   18.421377] CR2: 00005571bafe12a0 CR3: 00000000024e6000 CR4: 00000000000006f0
[   18.421377] Call Trace:
[   18.421383]  <TASK>
[   18.421387]  enqueue_task+0x31/0x70
[   18.421389]  ttwu_do_activate+0x73/0x220
[   18.421391]  try_to_wake_up+0x2b1/0x7a0
[   18.421393]  ? kmem_cache_alloc_node_noprof+0x7f/0x210
[   18.421396]  ep_autoremove_wake_function+0x12/0x40
[   18.421400]  __wake_up_common+0x72/0xa0
[   18.421402]  __wake_up_sync+0x38/0x50
[   18.421404]  ep_poll_callback+0xd2/0x240
[   18.421406]  __wake_up_common+0x72/0xa0
[   18.421407]  __wake_up_sync_key+0x3f/0x60
[   18.421409]  sock_def_readable+0x42/0xc0
[   18.421414]  unix_dgram_sendmsg+0x48f/0x840
[   18.421420]  ____sys_sendmsg+0x31c/0x350
[   18.421423]  ___sys_sendmsg+0x99/0xe0
[   18.421425]  __sys_sendmsg+0x8a/0xf0
[   18.421429]  do_syscall_64+0xa4/0x260
[   18.421434]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[   18.421438] RIP: 0033:0x7fb07e8d4d94
[   18.421439] Code: 15 91 10 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00 f3 0f 1e fa 80 3d d5 92 0d 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
[   18.421440] RSP: 002b:00007ffff30e4d08 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
[   18.421442] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb07e8d4d94
[   18.421442] RDX: 0000000000004000 RSI: 00007ffff30e4e80 RDI: 0000000000000031
[   18.421443] RBP: 00007ffff30e5ff0 R08: 00000000000000c0 R09: 0000000000000000
[   18.421443] R10: 00007fb07deddc08 R11: 0000000000000202 R12: 00007ffff30e6070
[   18.421444] R13: 00007ffff30e4f00 R14: 00007ffff30e4d10 R15: 000000000000000f
[   18.421445]  </TASK>
[   18.421446] ---[ end trace 0000000000000000 ]---

[1]: https://lore-kernel.gnuweeb.org/lkml/20250829081120.806-1-ziqianlu@bytedance.com/
[2]: https://lore.kernel.org/lkml/d37fcac575ee94c3fe605e08e6297986@codethink.co.uk/

I hope this is helpful. I'm happy to provide more information or run
additional tests if needed.

Best regards,
Matteo Martelli

Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq

Posted by Aaron Lu 3 days, 14 hours ago

On Tue, Sep 23, 2025 at 03:05:29PM +0200, Matteo Martelli wrote:
> Hi Aaron,
> 
> On Wed, 10 Sep 2025 17:50:41 +0800, Aaron Lu <ziqianlu@bytedance.com> wrote:
> > Before task based throttle model, propagating load will stop at a
> > throttled cfs_rq and that propagate will happen on unthrottle time by
> > update_load_avg().
> > 
> > Now that there is no update_load_avg() on unthrottle for throttled
> > cfs_rq and all load tracking is done by task related operations, let the
> > propagate happen immediately.
> > 
> > While at it, add a comment to explain why cfs_rqs that are not affected
> > by throttle have to be added to leaf cfs_rq list in
> > propagate_entity_cfs_rq() per my understanding of commit 0258bdfaff5b
> > ("sched/fair: Fix unfairness caused by missing load decay").
> > 
> > Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> > ---
> 
> I have been testing again the patch set "[PATCH v4 0/5] Defer throttle
> when task exits to user" [1] together with these follow up patches. I
> found out that with this patch the kernel sometimes produces the warning
> WARN_ON_ONCE(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list); in
> assert_list_leaf_cfs_rq() called by enqueue_task_fair(). I could
> reproduce this systematically by applying both [1] and this patch on top
> of tag v6.17-rc6 and also by directly testing at commit fe8d238e646e
> from sched/core branch of tip tree. I couldn't reproduce the warning by
> testing at commit 5b726e9bf954 ("sched/fair: Get rid of
> throttled_lb_pair()").

Just a note that while trying to reproduce this problem, I noticed a
warn triggered in tg_throttle_down() with one setup. It's a different
problem and I've sent a patch to address that:
https://lore.kernel.org/lkml/20250929074645.416-1-ziqianlu@bytedance.com/

Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq

Posted by Aaron Lu 1 week, 1 day ago

Hi Matteo,

On Tue, Sep 23, 2025 at 03:05:29PM +0200, Matteo Martelli wrote:
> Hi Aaron,
> 
> On Wed, 10 Sep 2025 17:50:41 +0800, Aaron Lu <ziqianlu@bytedance.com> wrote:
> > Before task based throttle model, propagating load will stop at a
> > throttled cfs_rq and that propagate will happen on unthrottle time by
> > update_load_avg().
> > 
> > Now that there is no update_load_avg() on unthrottle for throttled
> > cfs_rq and all load tracking is done by task related operations, let the
> > propagate happen immediately.
> > 
> > While at it, add a comment to explain why cfs_rqs that are not affected
> > by throttle have to be added to leaf cfs_rq list in
> > propagate_entity_cfs_rq() per my understanding of commit 0258bdfaff5b
> > ("sched/fair: Fix unfairness caused by missing load decay").
> > 
> > Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> > ---
> 
> I have been testing again the patch set "[PATCH v4 0/5] Defer throttle
> when task exits to user" [1] together with these follow up patches. I
> found out that with this patch the kernel sometimes produces the warning
> WARN_ON_ONCE(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list); in
> assert_list_leaf_cfs_rq() called by enqueue_task_fair(). I could
> reproduce this systematically by applying both [1] and this patch on top
> of tag v6.17-rc6 and also by directly testing at commit fe8d238e646e
> from sched/core branch of tip tree. I couldn't reproduce the warning by
> testing at commit 5b726e9bf954 ("sched/fair: Get rid of
> throttled_lb_pair()").
>

Thanks a lot for the test.

> The test setup is the same used in my previous testing for v3 [2], where
> the CFS throttling events are mostly triggered by the first ssh logins
> into the system as the systemd user slice is configured with CPUQuota of
> 25%. Also note that the same systemd user slice is configured with CPU

I tried to replicate this setup. Below is my setup, using a 4-CPU VM
and an RT kernel at commit fe8d238e646e ("sched/fair: Propagate load for
throttled cfs_rq"):
# pwd
/sys/fs/cgroup/user.slice
# cat cpu.max
25000 100000
# cat cpuset.cpus
0

I then logged in over ssh as a normal user and could see throttling
happen, but couldn't hit this warning. Do you have to do something
special to trigger it?
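
For readers unfamiliar with the cgroup v2 cpu.max format shown above: the file holds "<quota> <period>" in microseconds, or "max <period>" when no limit is set, so "25000 100000" corresponds to the 25% CPUQuota mentioned in the report. A minimal sketch of that arithmetic follows; the function name `cpu_max_percent()` and its return conventions are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one line in cgroup v2 cpu.max format and return the quota as an
 * integer percentage of one CPU.
 * Returns -1 for "max" (no limit) and -2 if the line cannot be parsed.
 */
int cpu_max_percent(const char *line)
{
	long quota, period;

	if (strncmp(line, "max", 3) == 0)
		return -1;

	if (sscanf(line, "%ld %ld", &quota, &period) != 2 ||
	    period <= 0 || quota < 0)
		return -2;

	return (int)(quota * 100 / period);
}
```

Calling `cpu_max_percent("25000 100000")` yields 25, matching the quota used in this test setup.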

> affinity set to only one core. I added some tracing to trace functions
> throttle_cfs_rq, tg_throttle_down, unthrottle_cfs_rq, tg_unthrottle_up,
> and it looks like the warning is triggered after the last unthrottle
> event, however I'm not sure the warning is actually related to the
> printed trace below or not. See the following logs, which contain both
> the traced function events and the kernel warning.
> 
> [   17.859264]  systemd-xdg-aut-1006    [000] dN.2.    17.865040: throttle_cfs_rq <-pick_task_fair
> [   17.859264]  systemd-xdg-aut-1006    [000] dN.2.    17.865042: tg_throttle_down <-walk_tg_tree_from
> [   17.859264]  systemd-xdg-aut-1006    [000] dN.2.    17.865042: tg_throttle_down <-walk_tg_tree_from
> [   17.859264]  systemd-xdg-aut-1006    [000] dN.2.    17.865043: tg_throttle_down <-walk_tg_tree_from
> [   17.876999]        ktimers/0-15      [000] d.s13    17.882601: unthrottle_cfs_rq <-distribute_cfs_runtime
> [   17.876999]        ktimers/0-15      [000] d.s13    17.882603: tg_unthrottle_up <-walk_tg_tree_from
> [   17.876999]        ktimers/0-15      [000] d.s13    17.882605: tg_unthrottle_up <-walk_tg_tree_from
> [   17.876999]        ktimers/0-15      [000] d.s13    17.882605: tg_unthrottle_up <-walk_tg_tree_from
> [   17.910250]          systemd-999     [000] dN.2.    17.916019: throttle_cfs_rq <-put_prev_entity
> [   17.910250]          systemd-999     [000] dN.2.    17.916025: tg_throttle_down <-walk_tg_tree_from
> [   17.910250]          systemd-999     [000] dN.2.    17.916025: tg_throttle_down <-walk_tg_tree_from
> [   17.910250]          systemd-999     [000] dN.2.    17.916025: tg_throttle_down <-walk_tg_tree_from
> [   17.977245]        ktimers/0-15      [000] d.s13    17.982575: unthrottle_cfs_rq <-distribute_cfs_runtime
> [   17.977245]        ktimers/0-15      [000] d.s13    17.982578: tg_unthrottle_up <-walk_tg_tree_from
> [   17.977245]        ktimers/0-15      [000] d.s13    17.982579: tg_unthrottle_up <-walk_tg_tree_from
> [   17.977245]        ktimers/0-15      [000] d.s13    17.982580: tg_unthrottle_up <-walk_tg_tree_from
> [   18.009244]          systemd-999     [000] dN.2.    18.015030: throttle_cfs_rq <-pick_task_fair
> [   18.009244]          systemd-999     [000] dN.2.    18.015033: tg_throttle_down <-walk_tg_tree_from
> [   18.009244]          systemd-999     [000] dN.2.    18.015033: tg_throttle_down <-walk_tg_tree_from
> [   18.009244]          systemd-999     [000] dN.2.    18.015033: tg_throttle_down <-walk_tg_tree_from
> [   18.076822]        ktimers/0-15      [000] d.s13    18.082607: unthrottle_cfs_rq <-distribute_cfs_runtime
> [   18.076822]        ktimers/0-15      [000] d.s13    18.082609: tg_unthrottle_up <-walk_tg_tree_from
> [   18.076822]        ktimers/0-15      [000] d.s13    18.082611: tg_unthrottle_up <-walk_tg_tree_from
> [   18.076822]        ktimers/0-15      [000] d.s13    18.082611: tg_unthrottle_up <-walk_tg_tree_from
> [   18.109820]          systemd-999     [000] dN.2.    18.115604: throttle_cfs_rq <-put_prev_entity
> [   18.109820]          systemd-999     [000] dN.2.    18.115609: tg_throttle_down <-walk_tg_tree_from
> [   18.109820]          systemd-999     [000] dN.2.    18.115609: tg_throttle_down <-walk_tg_tree_from
> [   18.109820]          systemd-999     [000] dN.2.    18.115609: tg_throttle_down <-walk_tg_tree_from
> [   18.177167]        ktimers/0-15      [000] d.s13    18.182630: unthrottle_cfs_rq <-distribute_cfs_runtime
> [   18.177167]        ktimers/0-15      [000] d.s13    18.182632: tg_unthrottle_up <-walk_tg_tree_from
> [   18.177167]        ktimers/0-15      [000] d.s13    18.182633: tg_unthrottle_up <-walk_tg_tree_from
> [   18.177167]        ktimers/0-15      [000] d.s13    18.182634: tg_unthrottle_up <-walk_tg_tree_from
> [   18.220827]          systemd-999     [000] dN.2.    18.226594: throttle_cfs_rq <-pick_task_fair
> [   18.220827]          systemd-999     [000] dN.2.    18.226597: tg_throttle_down <-walk_tg_tree_from
> [   18.220827]          systemd-999     [000] dN.2.    18.226597: tg_throttle_down <-walk_tg_tree_from
> [   18.220827]          systemd-999     [000] dN.2.    18.226597: tg_throttle_down <-walk_tg_tree_from
> [   18.220827]          systemd-999     [000] dN.2.    18.226598: tg_throttle_down <-walk_tg_tree_from
> [   18.220827]          systemd-999     [000] dN.2.    18.226598: tg_throttle_down <-walk_tg_tree_from
> [   18.220827]          systemd-999     [000] dN.2.    18.226598: tg_throttle_down <-walk_tg_tree_from
> [   18.220827]          systemd-999     [000] dN.2.    18.226598: tg_throttle_down <-walk_tg_tree_from
> [   18.276886]        ktimers/0-15      [000] d.s13    18.282606: unthrottle_cfs_rq <-distribute_cfs_runtime
> [   18.276886]        ktimers/0-15      [000] d.s13    18.282608: tg_unthrottle_up <-walk_tg_tree_from
> [   18.276886]        ktimers/0-15      [000] d.s13    18.282610: tg_unthrottle_up <-walk_tg_tree_from
> [   18.276886]        ktimers/0-15      [000] d.s13    18.282610: tg_unthrottle_up <-walk_tg_tree_from
> [   18.276886]        ktimers/0-15      [000] d.s13    18.282611: tg_unthrottle_up <-walk_tg_tree_from
> [   18.276886]        ktimers/0-15      [000] d.s13    18.282611: tg_unthrottle_up <-walk_tg_tree_from
> [   18.276886]        ktimers/0-15      [000] d.s13    18.282611: tg_unthrottle_up <-walk_tg_tree_from
> [   18.276886]        ktimers/0-15      [000] d.s13    18.282611: tg_unthrottle_up <-walk_tg_tree_from
> [   18.421349] ------------[ cut here ]------------
> [   18.421350] WARNING: CPU: 0 PID: 1 at kernel/sched/fair.c:400 enqueue_task_fair+0x925/0x980

I stared at the code and haven't been able to figure out when
enqueue_task_fair() would end up with a broken leaf cfs_rq list.

No matter what the culprit commit did, enqueue_task_fair() should always
get all the non-queued cfs_rqs on the list in a bottom up way. It should
either add the whole hierarchy to rq's leaf cfs_rq list, or stop at one
of the ancestor cfs_rqs which is already on the list. Either way, the
list should not be broken.
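
To make the argument above concrete, here is a userspace toy model of the bookkeeping that the warning checks. It is a sketch under assumptions, not the kernel implementation: `toy_rq`, `toy_cfs_rq`, `toy_list_add_leaf()` and `toy_add_branch()` are hypothetical names, and the single flag `branch_open` stands in for the condition rq->tmp_alone_branch != &rq->leaf_cfs_rq_list (a partially added branch that is not yet connected to the rest of the list).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_cfs_rq {
	struct toy_cfs_rq *parent;	/* NULL for the root cfs_rq */
	bool on_list;
};

struct toy_rq {
	bool branch_open;	/* true while a branch is left unconnected */
};

/* Returns true once the branch is connected to the rest of the list. */
bool toy_list_add_leaf(struct toy_rq *rq, struct toy_cfs_rq *cfs_rq)
{
	if (cfs_rq->on_list)
		return !rq->branch_open;

	cfs_rq->on_list = true;

	/* Root, or parent already listed: the branch is now connected. */
	if (!cfs_rq->parent || cfs_rq->parent->on_list) {
		rq->branch_open = false;
		return true;
	}

	/* Parent not listed yet: the branch stays open until it is. */
	rq->branch_open = true;
	return false;
}

/* Bottom-up walk: stop at the first point where the branch is connected. */
void toy_add_branch(struct toy_rq *rq, struct toy_cfs_rq *cfs_rq)
{
	for (; cfs_rq; cfs_rq = cfs_rq->parent) {
		if (toy_list_add_leaf(rq, cfs_rq))
			break;
	}
}
```

In this model a complete bottom-up walk always ends with `branch_open` cleared, whether it reaches the root or stops at an ancestor that is already listed. Only a walk that adds a child and then stops before reaching a listed ancestor leaves the flag set, which is the state the WARN_ON_ONCE() reports.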

> [   18.421355] Modules linked in: efivarfs
> [   18.421360] CPU: 0 UID: 0 PID: 1 Comm: systemd Not tainted 6.17.0-rc4-00010-gfe8d238e646e #2 PREEMPT_{RT,(full)}
> [   18.421362] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
> [   18.421364] RIP: 0010:enqueue_task_fair+0x925/0x980
> [   18.421366] Code: b5 48 01 00 00 49 89 95 48 01 00 00 49 89 bd 50 01 00 00 48 89 37 48 89 b0 70 0a 00 00 48 89 90 78 0a 00 00 e9 49 fa ff ff 90 <0f> 0b 90 e9 1c f9 ff ff 90 0f 0b 90 e9 59 fa ff ff 48 8b b0 88 0a
> [   18.421367] RSP: 0018:ffff9c7c8001fa20 EFLAGS: 00010087
> [   18.421369] RAX: ffff9358fdc29da8 RBX: 0000000000000003 RCX: ffff9358fdc29340
> [   18.421370] RDX: ffff935881a89000 RSI: 0000000000000000 RDI: 0000000000000003
> [   18.421371] RBP: ffff9358fdc293c0 R08: 0000000000000000 R09: 00000000b808a33f
> [   18.421371] R10: 0000000000200b20 R11: 0000000011659969 R12: 0000000000000001
> [   18.421372] R13: ffff93588214fe00 R14: 0000000000000000 R15: 0000000000200b20
> [   18.421375] FS:  00007fb07deddd80(0000) GS:ffff935945f6d000(0000) knlGS:0000000000000000
> [   18.421376] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   18.421377] CR2: 00005571bafe12a0 CR3: 00000000024e6000 CR4: 00000000000006f0
> [   18.421377] Call Trace:
> [   18.421383]  <TASK>
> [   18.421387]  enqueue_task+0x31/0x70
> [   18.421389]  ttwu_do_activate+0x73/0x220
> [   18.421391]  try_to_wake_up+0x2b1/0x7a0
> [   18.421393]  ? kmem_cache_alloc_node_noprof+0x7f/0x210
> [   18.421396]  ep_autoremove_wake_function+0x12/0x40
> [   18.421400]  __wake_up_common+0x72/0xa0
> [   18.421402]  __wake_up_sync+0x38/0x50
> [   18.421404]  ep_poll_callback+0xd2/0x240
> [   18.421406]  __wake_up_common+0x72/0xa0
> [   18.421407]  __wake_up_sync_key+0x3f/0x60
> [   18.421409]  sock_def_readable+0x42/0xc0
> [   18.421414]  unix_dgram_sendmsg+0x48f/0x840
> [   18.421420]  ____sys_sendmsg+0x31c/0x350
> [   18.421423]  ___sys_sendmsg+0x99/0xe0
> [   18.421425]  __sys_sendmsg+0x8a/0xf0
> [   18.421429]  do_syscall_64+0xa4/0x260
> [   18.421434]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [   18.421438] RIP: 0033:0x7fb07e8d4d94
> [   18.421439] Code: 15 91 10 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00 f3 0f 1e fa 80 3d d5 92 0d 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
> [   18.421440] RSP: 002b:00007ffff30e4d08 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
> [   18.421442] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb07e8d4d94
> [   18.421442] RDX: 0000000000004000 RSI: 00007ffff30e4e80 RDI: 0000000000000031
> [   18.421443] RBP: 00007ffff30e5ff0 R08: 00000000000000c0 R09: 0000000000000000
> [   18.421443] R10: 00007fb07deddc08 R11: 0000000000000202 R12: 00007ffff30e6070
> [   18.421444] R13: 00007ffff30e4f00 R14: 00007ffff30e4d10 R15: 000000000000000f
> [   18.421445]  </TASK>
> [   18.421446] ---[ end trace 0000000000000000 ]---
> 
> [1]: https://lore-kernel.gnuweeb.org/lkml/20250829081120.806-1-ziqianlu@bytedance.com/
> [2]: https://lore.kernel.org/lkml/d37fcac575ee94c3fe605e08e6297986@codethink.co.uk/
> 
> I hope this is helpful. I'm happy to provide more information or run
> additional tests if needed.

Yeah, definitely helpful, thanks.

While looking at this commit, I'm thinking maybe we shouldn't use
cfs_rq_pelt_clock_throttled() to decide if a cfs_rq should be added
to rq's leaf list. The reason is that a cfs_rq in a throttled hierarchy
can be removed from that leaf list in dequeue_entity() when it has no
entities left. So being on the list now doesn't mean it will still be
on the list at unthrottle time.

Considering that the purpose is to have the cfs_rq and its ancestors
added to the list in case this cfs_rq has some removed load that
needs to be decayed later, as described in commit 0258bdfaff5b
("sched/fair: Fix unfairness caused by missing load decay"), I'm thinking
maybe we should deal with cfs_rqs differently according to whether they
are in a throttled hierarchy or not:
- for cfs_rqs not in throttled hierarchy, add it and its ancestors to
  the list so that the removed load can be decayed;
- for cfs_rqs in throttled hierarchy, check on unthrottle time whether
  it has any removed load that needs to be decayed.
  The case in my mind is: a blocked task @p gets attached to a throttled
  cfs_rq by attaching a pid to a cgroup. Assume the cfs_rq was empty and had
  no tasks throttled or queued underneath it. Then @p is migrated to
  another cpu before being queued on it, so this cfs_rq now has some
  removed load on it. On unthrottle, this cfs_rq is considered fully
  decayed and isn't added to the leaf cfs_rq list. Then we have a problem.

With the above said, I'm thinking of the diff below. No idea if this can
fix Matteo's problem though; it's just something I think can fix the
issue I described above, if I understand things correctly...

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f993de30e1466..444f0eb2df71d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4062,6 +4062,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (child_cfs_rq_on_list(cfs_rq))
 		return false;
 
+	if (cfs_rq->removed.nr)
+		return false;
+
 	return true;
 }
 
@@ -13167,7 +13170,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
 	 * change, make sure this cfs_rq stays on leaf cfs_rq list to have
 	 * that removed load decayed or it can cause faireness problem.
 	 */
-	if (!cfs_rq_pelt_clock_throttled(cfs_rq))
+	if (!throttled_hierarchy(cfs_rq))
 		list_add_leaf_cfs_rq(cfs_rq);
 
 	/* Start to propagate at parent */
@@ -13178,7 +13181,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
 
-		if (!cfs_rq_pelt_clock_throttled(cfs_rq))
+		if (!throttled_hierarchy(cfs_rq))
 			list_add_leaf_cfs_rq(cfs_rq);
 	}
 }
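
The effect of the extra removed.nr check in the first hunk can be shown with a toy predicate. This is a simplified sketch, not the kernel's cfs_rq_is_decayed(): `toy_cfs_rq`, `toy_is_decayed_before()` and `toy_is_decayed_after()` are hypothetical names, `load_sum` stands in for all of the load sums the real function inspects, and the child-on-list check is left out.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_cfs_rq {
	unsigned long load_sum;	/* stands in for the tracked load sums */
	int removed_nr;		/* stands in for cfs_rq->removed.nr */
};

/* Without the extra check: pending removed load is not looked at. */
bool toy_is_decayed_before(const struct toy_cfs_rq *cfs_rq)
{
	return cfs_rq->load_sum == 0;
}

/* With the extra check: pending removed load keeps the cfs_rq "alive". */
bool toy_is_decayed_after(const struct toy_cfs_rq *cfs_rq)
{
	if (cfs_rq->load_sum)
		return false;

	if (cfs_rq->removed_nr)
		return false;

	return true;
}
```

For a cfs_rq with no remaining load sums but one pending removal, the first predicate reports it as decayed while the second does not, so in the scenario described above it would still be considered for the leaf list at unthrottle time. Whether that state is actually reachable in the kernel is the open question in this mail; the sketch only illustrates the intended effect of the check.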

Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq

Posted by Chengming Zhou 3 weeks, 1 day ago

On 2025/9/10 17:50, Aaron Lu wrote:
> Before task based throttle model, propagating load will stop at a
> throttled cfs_rq and that propagate will happen on unthrottle time by
> update_load_avg().
> 
> Now that there is no update_load_avg() on unthrottle for throttled
> cfs_rq and all load tracking is done by task related operations, let the
> propagate happen immediately.
> 
> While at it, add a comment to explain why cfs_rqs that are not affected
> by throttle have to be added to leaf cfs_rq list in
> propagate_entity_cfs_rq() per my understanding of commit 0258bdfaff5b
> ("sched/fair: Fix unfairness caused by missing load decay").
> 
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>

LGTM!

Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>

Thanks.

> ---
>   kernel/sched/fair.c | 26 ++++++++++++++++++--------
>   1 file changed, 18 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index df8dc389af8e1..f993de30e1466 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5729,6 +5729,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
>   	return cfs_bandwidth_used() && cfs_rq->throttled;
>   }
>   
> +static inline bool cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
> +{
> +	return cfs_bandwidth_used() && cfs_rq->pelt_clock_throttled;
> +}
> +
>   /* check whether cfs_rq, or any parent, is throttled */
>   static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
>   {
> @@ -6721,6 +6726,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
>   	return 0;
>   }
>   
> +static inline bool cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
> +{
> +	return false;
> +}
> +
>   static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
>   {
>   	return 0;
> @@ -13151,10 +13161,13 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
>   {
>   	struct cfs_rq *cfs_rq = cfs_rq_of(se);
>   
> -	if (cfs_rq_throttled(cfs_rq))
> -		return;
> -
> -	if (!throttled_hierarchy(cfs_rq))
> +	/*
> +	 * If a task gets attached to this cfs_rq and before being queued,
> +	 * it gets migrated to another CPU due to reasons like affinity
> +	 * change, make sure this cfs_rq stays on leaf cfs_rq list to have
> +	 * that removed load decayed or it can cause faireness problem.
> +	 */
> +	if (!cfs_rq_pelt_clock_throttled(cfs_rq))
>   		list_add_leaf_cfs_rq(cfs_rq);
>   
>   	/* Start to propagate at parent */
> @@ -13165,10 +13178,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
>   
>   		update_load_avg(cfs_rq, se, UPDATE_TG);
>   
> -		if (cfs_rq_throttled(cfs_rq))
> -			break;
> -
> -		if (!throttled_hierarchy(cfs_rq))
> +		if (!cfs_rq_pelt_clock_throttled(cfs_rq))
>   			list_add_leaf_cfs_rq(cfs_rq);
>   	}
>   }

[tip: sched/core] sched/fair: Propagate load for throttled cfs_rq

Posted by tip-bot2 for Aaron Lu 2 weeks, 2 days ago

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     fe8d238e646e16cc431b7a5899f8dda690258ee9
Gitweb:        https://git.kernel.org/tip/fe8d238e646e16cc431b7a5899f8dda690258ee9
Author:        Aaron Lu <ziqianlu@bytedance.com>
AuthorDate:    Wed, 10 Sep 2025 17:50:41 +08:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 15 Sep 2025 09:38:37 +02:00

sched/fair: Propagate load for throttled cfs_rq

Before task based throttle model, propagating load will stop at a
throttled cfs_rq and that propagate will happen on unthrottle time by
update_load_avg().

Now that there is no update_load_avg() on unthrottle for throttled
cfs_rq and all load tracking is done by task related operations, let the
propagate happen immediately.

While at it, add a comment to explain why cfs_rqs that are not affected
by throttle have to be added to leaf cfs_rq list in
propagate_entity_cfs_rq() per my understanding of commit 0258bdfaff5b
("sched/fair: Fix unfairness caused by missing load decay").

Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
---
 kernel/sched/fair.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df8dc38..f993de3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5729,6 +5729,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 	return cfs_bandwidth_used() && cfs_rq->throttled;
 }
 
+static inline bool cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
+{
+	return cfs_bandwidth_used() && cfs_rq->pelt_clock_throttled;
+}
+
 /* check whether cfs_rq, or any parent, is throttled */
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
@@ -6721,6 +6726,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 	return 0;
 }
 
+static inline bool cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
+{
+	return false;
+}
+
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
 	return 0;
@@ -13151,10 +13161,13 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-	if (cfs_rq_throttled(cfs_rq))
-		return;
-
-	if (!throttled_hierarchy(cfs_rq))
+	/*
+	 * If a task gets attached to this cfs_rq and before being queued,
+	 * it gets migrated to another CPU due to reasons like affinity
+	 * change, make sure this cfs_rq stays on leaf cfs_rq list to have
+	 * that removed load decayed or it can cause faireness problem.
+	 */
+	if (!cfs_rq_pelt_clock_throttled(cfs_rq))
 		list_add_leaf_cfs_rq(cfs_rq);
 
 	/* Start to propagate at parent */
@@ -13165,10 +13178,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
 
-		if (cfs_rq_throttled(cfs_rq))
-			break;
-
-		if (!throttled_hierarchy(cfs_rq))
+		if (!cfs_rq_pelt_clock_throttled(cfs_rq))
 			list_add_leaf_cfs_rq(cfs_rq);
 	}
 }

[PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq
[PATCH 2/4] sched/fair: update_cfs_group() for throttled cfs_rqs
[PATCH 3/4] sched/fair: Do not special case tasks in throttled hierarchy
[PATCH 4/4] sched/fair: Do not balance task to a throttled cfs_rq