Matteo reported hitting the assert_list_leaf_cfs_rq() warning from
enqueue_task_fair() post commit fe8d238e646e ("sched/fair: Propagate
load for throttled cfs_rq") which transitioned to using
cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in
propagate_entity_cfs_rq().
The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the
hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all
its descendants should have their PELT frozen too or weird things can
happen as a result of children accumulating PELT signals when the
parents have their PELT clock stopped.
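As a rough standalone sketch of that rule (toy types and an invented
checker, not kernel code; the kernel tracks these fields per-CPU in
struct cfs_rq and has no such walker), the invariant is:

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy cfs_rq: each node only records its parent and whether its
 * PELT clock is currently frozen (throttled). */
struct toy_cfs_rq {
	struct toy_cfs_rq *parent;
	bool pelt_clock_throttled;
};

/* The invariant: if any ancestor has its PELT clock frozen, this
 * cfs_rq must have it frozen too, or the child keeps accumulating
 * PELT signals while the parent's clock is stopped. */
static void check_pelt_invariant(const struct toy_cfs_rq *cfs_rq)
{
	const struct toy_cfs_rq *p;

	for (p = cfs_rq->parent; p; p = p->parent) {
		if (p->pelt_clock_throttled)
			assert(cfs_rq->pelt_clock_throttled);
	}
}

int main(void)
{
	struct toy_cfs_rq parent = { NULL, true };	/* PELT frozen */
	struct toy_cfs_rq child = { &parent, false };	/* violation */

	check_pelt_invariant(&child);			/* fires */
	return 0;
}

Both scenarios described in this mail are, in essence, ways for a newly
created cfs_rq to end up violating this invariant.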
Another side effect of this is the loss of integrity of the leaf cfs_rq
list. As debugged by Aaron, consider the following hierarchy:
    root(#)
    /    \
  A(#)   B(*)
          |
          C <--- new cgroup
          |
          D <--- new cgroup

# - Already on leaf cfs_rq list
* - Throttled with PELT frozen
The newly created cgroups don't have their "pelt_clock_throttled" signal
synced with cgroup B. Next, the following series of events occur:
1. online_fair_sched_group() for cgroup D will call
propagate_entity_cfs_rq(). (Same can happen if a throttled task is
moved to cgroup C and enqueue_task_fair() returns early.)
propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to
"rq->tmp_alone_branch" since its PELT clock is not marked throttled
and cfs_rq of cgroup B is not on the list.
cfs_rq of cgroup B is skipped since its PELT is throttled.
The root cfs_rq already exists on the leaf cfs_rq list, leading to
list_add_leaf_cfs_rq() returning early.
The cfs_rq of cgroup C is left dangling on the
"rq->tmp_alone_branch".
2. A new task wakes up on cgroup A. Since the whole hierarchy is already
on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps returning early
without any modifications to "rq->tmp_alone_branch".
The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the
dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch".
!!! Splat !!!
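The list bookkeeping in the sequence above can be condensed into a small
standalone model. This is only a sketch: the real list_add_leaf_cfs_rq()
splices RCU lists, while here "tmp_alone_branch" is reduced to a single
pointer, with NULL standing in for the connected state
"tmp_alone_branch == &rq->leaf_cfs_rq_list", and the final assert plays
the role of assert_list_leaf_cfs_rq():

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct toy_cfs_rq {
	const char *name;
	struct toy_cfs_rq *parent;
	bool on_list;			/* already on the leaf cfs_rq list */
	bool pelt_clock_throttled;	/* PELT clock frozen */
};

/* NULL models "tmp_alone_branch == &rq->leaf_cfs_rq_list" (connected). */
static struct toy_cfs_rq *tmp_alone_branch;

/* Condensed list_add_leaf_cfs_rq(): true when the branch is connected. */
static bool toy_list_add_leaf_cfs_rq(struct toy_cfs_rq *cfs_rq)
{
	if (cfs_rq->on_list)		/* already queued: early return */
		return tmp_alone_branch == NULL;

	cfs_rq->on_list = true;
	if (!cfs_rq->parent || cfs_rq->parent->on_list) {
		tmp_alone_branch = NULL;	/* branch is now connected */
		return true;
	}
	/* Parent not on the list yet: branch dangles until parent is added. */
	tmp_alone_branch = cfs_rq;
	return false;
}

int main(void)
{
	struct toy_cfs_rq root = { "root", NULL,  true,  false };
	struct toy_cfs_rq A    = { "A",    &root, true,  false };
	struct toy_cfs_rq B    = { "B",    &root, false, true  };
	struct toy_cfs_rq C    = { "C",    &B,    false, false }; /* never synced */

	/* Step 1: propagate_entity_cfs_rq() walks up from cgroup C. */
	if (!C.pelt_clock_throttled)
		toy_list_add_leaf_cfs_rq(&C);	/* dangles: B not on list */
	/* B is skipped: its PELT clock is throttled. */
	toy_list_add_leaf_cfs_rq(&root);	/* on_list: returns early */

	/* Step 2: a task wakes on A; the whole path is already queued. */
	bool connected = toy_list_add_leaf_cfs_rq(&A);
	printf("dangling: %s\n", tmp_alone_branch ? tmp_alone_branch->name
						  : "none");
	assert(connected);			/* fires: models the splat */
	return 0;
}

With the fix below, C starts with "pelt_clock_throttled" set (synced from
"throttle_count" in sync_throttle()), so step 1 skips it entirely,
"tmp_alone_branch" is never touched, and the branch stays connected.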
Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not
enough since the new cfs_rq is not yet enqueued on the hierarchy. A
dequeue on another subtree of the throttled hierarchy can freeze the PELT
clock for the parent hierarchy without setting the indicators for this
newly added cfs_rq which was never enqueued.
Since there are no tasks on the new hierarchy, start a cfs_rq on a
throttled hierarchy with its PELT clock throttled. The first enqueue, or
the distribution (whichever happens first) will unfreeze the PELT clock
and queue the cfs_rq on the leaf cfs_rq list.
While at it, add an assert_list_leaf_cfs_rq() in
propagate_entity_cfs_rq() to catch such cases in the future.
Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
Reported-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/
Fixes: eb962f251fbb ("sched/fair: Task based throttle time accounting")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Stress testing included running sched-messaging in a nested hierarchy
with various quotas set, alongside a continuous loop of cgroup creation
and deletion, as well as another loop continuously moving a busy-looping
task between cgroups.
No splats have been observed yet with this patch.
Aaron, Matteo,
I've not added any "Tested-by" tags since the final diff is slightly
different from the diff shared previously. The previous diff led to this
splat during my test with the newly added assert:
------------[ cut here ]------------
WARNING: CPU: 6 PID: 22912 at kernel/sched/fair.c:400 propagate_entity_cfs_rq+0x1f9/0x270
CPU: 6 UID: 0 PID: 22912 Comm: bash Not tainted 6.17.0-rc4-test+ #50 PREEMPT_{RT,(full)}
Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
RIP: 0010:propagate_entity_cfs_rq+0x1f9/0x270
Code: ...
RSP: 0018:ffffd4a203713928 EFLAGS: 00010087
RAX: ffff8ea4714b1e00 RBX: 0000000000000000 RCX: 0000000000000001
RDX: ffff8ea4714b28e8 RSI: 0000003570150cf8 RDI: ffff8e65abc69400
RBP: ffff8ea4714b1f00 R08: 0000000000006160 R09: 0000000000016d5a
R10: 000000000000009c R11: 0000000000000000 R12: ffff8ea4714b1e00
R13: ffff8e6586df1a00 R14: 0000000000000001 R15: 0000000000000246
FS: 00007f73a6a7f740(0000) GS:ffff8e64d0689000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000562b9791d461 CR3: 000000010d875003 CR4: 0000000000f70ef0
PKRU: 55555554
Call Trace:
<TASK>
sched_move_task+0xc2/0x280
cpu_cgroup_attach+0x37/0x70
cgroup_migrate_execute+0x3b8/0x560
? srso_alias_return_thunk+0x5/0xfbef5
? rt_spin_unlock+0x60/0xa0
cgroup_attach_task+0x15b/0x200
? cgroup_attach_permissions+0x17e/0x1e0
__cgroup_procs_write+0x105/0x170
cgroup_procs_write+0x17/0x30
kernfs_fop_write_iter+0x127/0x1c0
vfs_write+0x305/0x420
ksys_write+0x6b/0xe0
do_syscall_64+0x85/0xb30
? __x64_sys_close+0x3d/0x80
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0xbe/0xb30
? srso_alias_return_thunk+0x5/0xfbef5
? __x64_sys_fcntl+0x80/0x110
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0xbe/0xb30
? srso_alias_return_thunk+0x5/0xfbef5
? __x64_sys_fcntl+0x80/0x110
? srso_alias_return_thunk+0x5/0xfbef5
? rt_spin_unlock+0x60/0xa0
? srso_alias_return_thunk+0x5/0xfbef5
? wp_page_reuse.constprop.0+0x7c/0x90
? srso_alias_return_thunk+0x5/0xfbef5
? do_wp_page+0x3df/0xd70
? set_close_on_exec+0x31/0x70
? srso_alias_return_thunk+0x5/0xfbef5
? rt_spin_unlock+0x60/0xa0
? srso_alias_return_thunk+0x5/0xfbef5
? do_fcntl+0x3f6/0x760
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0xbe/0xb30
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? __handle_mm_fault+0x375/0x380
? __x64_sys_fcntl+0x80/0x110
? srso_alias_return_thunk+0x5/0xfbef5
? count_memcg_events+0xd9/0x1c0
? srso_alias_return_thunk+0x5/0xfbef5
? handle_mm_fault+0x1af/0x290
? srso_alias_return_thunk+0x5/0xfbef5
? do_user_addr_fault+0x2d0/0x8c0
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f73a6b968f7
Code: ...
RSP: 002b:00007ffebcfdc688 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f73a6b968f7
RDX: 0000000000000006 RSI: 0000562b9791d460 RDI: 0000000000000001
RBP: 0000562b9791d460 R08: 00007f73a6c53460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000006
R13: 00007f73a6c9d780 R14: 00007f73a6c99600 R15: 00007f73a6c98a00
</TASK>
---[ end trace 0000000000000000 ]---
... which happens because we can have:
          A (*throttled; PELT is not frozen)
         / \
  (new) C   B (contains runnable tasks)
C is a new cgroup that is not yet enqueued on the cfs_rq of A. B is
another cfs_rq with tasks and is enqueued on the cfs_rq of A. The entire
hierarchy shown above is throttled.
When all tasks on B leave, the PELT clocks of A and B are frozen, but
since C was never queued on A, and A's PELT wasn't frozen when C was
created, C still has "pelt_clock_throttled" set to 0.
A subsequent propagate_entity_cfs_rq() on C then sees that C does not
have its PELT clock throttled and adds it to "rq->tmp_alone_branch", but
it next finds A with its PELT clock throttled, which leads to the
dangling reference, or worse, incorrect ordering in the leaf cfs_rq list.
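In the same toy-model terms as the sketch in the changelog (again purely
illustrative; in the kernel the PELT freeze happens in the dequeue path
and the check is the newly added assert_list_leaf_cfs_rq()):

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_cfs_rq {
	struct toy_cfs_rq *parent;
	bool on_list, pelt_clock_throttled;
};

static struct toy_cfs_rq *tmp_alone_branch;	/* NULL == connected */

static void toy_list_add_leaf_cfs_rq(struct toy_cfs_rq *cfs_rq)
{
	if (cfs_rq->on_list)
		return;
	cfs_rq->on_list = true;
	if (!cfs_rq->parent || cfs_rq->parent->on_list)
		tmp_alone_branch = NULL;	/* branch connected */
	else
		tmp_alone_branch = cfs_rq;	/* dangles until parent added */
}

int main(void)
{
	/* A: throttled, PELT clock still running, on the leaf list. */
	struct toy_cfs_rq A = { NULL, true, false };
	/* C: created under A later; "pelt_clock_throttled" never synced. */
	struct toy_cfs_rq C = { &A, false, false };

	/* All tasks below A leave: A's PELT freezes and it drops off the
	 * leaf list; nobody updates C since it was never enqueued. */
	A.pelt_clock_throttled = true;
	A.on_list = false;

	/* propagate_entity_cfs_rq() on C: C's flag is still 0 ... */
	if (!C.pelt_clock_throttled)
		toy_list_add_leaf_cfs_rq(&C);	/* dangles: A not on list */
	/* ... and A is skipped because its PELT clock is throttled. */

	assert(tmp_alone_branch == NULL);	/* the new assert fires */
	return 0;
}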
Changes are prepared on top of tip:sched/core at commit 45b7f780739a
("sched: Fix some typos in include/linux/preempt.h")
---
kernel/sched/fair.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 18a30ae35441..9af82e214312 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6437,6 +6437,16 @@ static void sync_throttle(struct task_group *tg, int cpu)
cfs_rq->throttle_count = pcfs_rq->throttle_count;
cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
+
+ /*
+ * It is not enough to sync the "pelt_clock_throttled" indicator
+ * with the parent cfs_rq when the hierarchy is not queued.
+ * Always join a throttled hierarchy with PELT clock throttled
+ * and leave it to the first enqueue, or distribution to
+ * unthrottle the PELT clock and add us on the leaf_cfs_rq_list.
+ */
+ if (cfs_rq->throttle_count)
+ cfs_rq->pelt_clock_throttled = 1;
}
/* conditionally throttle active cfs_rq's from put_prev_entity() */
@@ -13192,6 +13202,8 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
if (!cfs_rq_pelt_clock_throttled(cfs_rq))
list_add_leaf_cfs_rq(cfs_rq);
}
+
+ assert_list_leaf_cfs_rq(rq_of(cfs_rq));
}
#else /* !CONFIG_FAIR_GROUP_SCHED: */
static void propagate_entity_cfs_rq(struct sched_entity *se) { }
base-commit: 45b7f780739a3145aeef24d2dfa02517a6c82ed6
--
2.34.1
Hi Prateek,

On Fri, 26 Sep 2025 08:19:17 +0000, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> Matteo reported hitting the assert_list_leaf_cfs_rq() warning from
> enqueue_task_fair() post commit fe8d238e646e ("sched/fair: Propagate
> load for throttled cfs_rq") which transitioned to using
> cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in
> propagate_entity_cfs_rq().
> [...]
> I've not added any "Tested-by" tags since the final diff is slightly
> different from the diff shared previously. ...

I applied this patch on top of commit 45b7f780739a ("sched: Fix some
typos in include/linux/preempt.h") from the sched/core branch of the tip
tree, and tested it with exactly the same setup I described in my
previous email[1]. With the patch applied, I couldn't reproduce the
warning in 5 hours of testing, while before the patch the issue was
systematically reproducible and the warning was being triggered at least
once per minute.

Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>

[1]: https://lore.kernel.org/all/e2e558b863c929c5019264b2ddefd4c0@codethink.co.uk/

Thanks to you and Aaron for addressing this!

Best regards,
Matteo Martelli
On Fri, Sep 26, 2025 at 08:19:17AM +0000, K Prateek Nayak wrote:
> Matteo reported hitting the assert_list_leaf_cfs_rq() warning from
> enqueue_task_fair() post commit fe8d238e646e ("sched/fair: Propagate
> load for throttled cfs_rq") which transitioned to using
> cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in
> propagate_entity_cfs_rq().
> [...]
> Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not
> enough since the new cfs_rq is not yet enqueued on the hierarchy. A
> dequeue on another subtree of the throttled hierarchy can freeze the
> PELT clock for the parent hierarchy without setting the indicators for
> this newly added cfs_rq which was never enqueued.

Sigh...

> Since there are no tasks on the new hierarchy, start a cfs_rq on a
> throttled hierarchy with its PELT clock throttled. The first enqueue, or
> the distribution (whichever happens first) will unfreeze the PELT clock
> and queue the cfs_rq on the leaf cfs_rq list.

Makes sense.

> While at it, add an assert_list_leaf_cfs_rq() in
> propagate_entity_cfs_rq() to catch such cases in the future.
>
> Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
> Reported-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
> Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/
> Fixes: eb962f251fbb ("sched/fair: Task based throttle time accounting")

Should be Fixes: e1fad12dcb66 ("sched/fair: Switch to task based
throttle model")? "Task based throttle time accounting" doesn't touch
pelt bits.

> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>

Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>

Thanks for the fix.
BTW, I'm thinking in propagate_entity_cfs_rq(), we shouldn't check the
ancestor cfs_rq's pelt clock throttled status but only the first level
cfs_rq's, because the purpose is to have the first level cfs_rq stay on
the leaf list, and all those list_add_leaf_cfs_rq() for its ancestors
are just to make sure the list is fully connected. I mean something like
this:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 75c615f5ed640..6a6d9200ab93c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13170,6 +13170,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
 static void propagate_entity_cfs_rq(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	bool add = !cfs_rq_pelt_clock_throttled(cfs_rq);
 
 	/*
 	 * If a task gets attached to this cfs_rq and before being queued,
@@ -13177,7 +13178,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
 	 * change, make sure this cfs_rq stays on leaf cfs_rq list to have
 	 * that removed load decayed or it can cause faireness problem.
 	 */
-	if (!cfs_rq_pelt_clock_throttled(cfs_rq))
+	if (add)
 		list_add_leaf_cfs_rq(cfs_rq);
 
 	/* Start to propagate at parent */
@@ -13188,7 +13189,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
 
-		if (!cfs_rq_pelt_clock_throttled(cfs_rq))
+		if (add)
 			list_add_leaf_cfs_rq(cfs_rq);
 	}

But this is a different thing and can be taken care of if necessary
later. Current logic doesn't have a problem, it's just not as clear as
the above diff to me.
Aaron Lu <ziqianlu@bytedance.com> writes:

> BTW, I'm thinking in propagate_entity_cfs_rq(), we shouldn't check the
> ancestor cfs_rq's pelt clock throttled status but only the first level
> cfs_rq's, because the purpose is to have the first level cfs_rq stay on
> the leaf list, and all those list_add_leaf_cfs_rq() for its ancestors
> are just to make sure the list is fully connected. I mean something like
> this:
> [...]
> But this is a different thing and can be taken care of if necessary
> later. Current logic doesn't have a problem, it's just not as clear as
> the above diff to me.

So the question is if we need to call list_add_leaf_cfs_rq in cases
where the immediate cfs_rq clock was stopped but the parents might not
have. It certainly doesn't seem to me like it should be necessary, but
I'm insufficiently clear on the exact details of the leaf_cfs_rq list
to swear to it.
(+Chengming for the discussion on the PELT bits)

Hello Aaron,

On 9/26/2025 3:08 PM, Aaron Lu wrote:
>> Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not
>> enough since the new cfs_rq is not yet enqueued on the hierarchy.
>> [...]
>
> Sigh...

I had the exact reaction when I managed to trigger that again (T_T)

>> Since there are no tasks on the new hierarchy, start a cfs_rq on a
>> throttled hierarchy with its PELT clock throttled. The first enqueue, or
>> the distribution (whichever happens first) will unfreeze the PELT clock
>> and queue the cfs_rq on the leaf cfs_rq list.
>
> Makes sense.
>
>> [...]
>> Fixes: eb962f251fbb ("sched/fair: Task based throttle time accounting")
>
> Should be Fixes: e1fad12dcb66 ("sched/fair: Switch to task based
> throttle model")? "Task based throttle time accounting" doesn't touch
> pelt bits.

True that the issue is seen after e1fad12dcb66, but I thought this could
be considered a fix for the initial introduction of the indicator. I
don't have any strong feelings on this. Peter, can you update the commit
in the fixes tag if and when you pick this?

>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>
> Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
> Tested-by: Aaron Lu <ziqianlu@bytedance.com>
>
> Thanks for the fix.

Thank you for helping with the debug and testing :)

> BTW, I'm thinking in propagate_entity_cfs_rq(), we shouldn't check the
> ancestor cfs_rq's pelt clock throttled status but only the first level
> cfs_rq's, because the purpose is to have the first level cfs_rq stay on
> the leaf list, and all those list_add_leaf_cfs_rq() for its ancestors
> are just to make sure the list is fully connected. I mean something like
> this:
> [...]
> +	bool add = !cfs_rq_pelt_clock_throttled(cfs_rq);

I actually think what it is doing now is correct already. The part of
the hierarchy between the root and the first throttled cfs_rq gets the
load via propagate, and this needs to be decayed, otherwise the CPU
might appear busy. Consider:

    root (decayed) -> A (decayed) -> B (*throttled)

If a heavy task is migrated to B, the load is propagated to B, A, and
then to root. If we don't queue them on the leaf_cfs_rq_list until
distribution, the root cfs_rq will appear to have some load and the
load balancer may get confused during find_busiest_group().

Now, if we place A and the root cfs_rq back on the leaf_cfs_rq_list,
each ILB stats kick will decay the load, and eventually the CPU will
appear idle; post unthrottle, the load will start accumulating again.
> [...]
> But this is a different thing and can be taken care of if necessary
> later. Current logic doesn't have a problem, it's just not as clear as
> the above diff to me.

Ack! Ben and Chengming have a better idea of these bits and perhaps
they can shed some light on the intended purpose and what is the right
thing to do.

--
Thanks and Regards,
Prateek