[PATCH v5] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out.

xupengbo posted 1 patch 5 months, 2 weeks ago
kernel/sched/fair.c | 3 +++
1 file changed, 3 insertions(+)
When a task is migrated out, there is a probability that the tg->load_avg
value will become abnormal. The reason is as follows.

1. Due to the 1ms update period limitation in update_tg_load_avg(), there
is a possibility that the reduced load_avg is not updated to tg->load_avg
when a task migrates out.
2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
function cfs_rq_is_decayed() does not check whether
cfs_rq->tg_load_avg_contrib is zero. Consequently, in some cases,
__update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
updated to tg->load_avg.

Add a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed(),
which fixes the case (2.) mentioned above.
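The interaction between the two steps above can be sketched with a minimal
userspace model. This is not the kernel code: the structs and the `_model`
suffixed functions are invented for illustration, and only the two pieces of
logic that matter here (the 1ms ratelimit and the decayed-check the patch adds)
are kept.

```c
#include <assert.h>
#include <stdbool.h>

#define NSEC_PER_MSEC 1000000L

/* Hypothetical, heavily reduced stand-ins for struct task_group / cfs_rq. */
struct tg_model {
	long load_avg;
	long last_update_time;
};

struct cfs_rq_model {
	struct tg_model *tg;
	long load_avg;            /* models cfs_rq->avg.load_avg */
	long tg_load_avg_contrib; /* last value folded into tg->load_avg */
};

/*
 * Models the 1ms ratelimit in update_tg_load_avg(): the delta between the
 * cfs_rq's load_avg and its last contribution is only folded into
 * tg->load_avg if at least 1ms has passed; otherwise the contribution
 * goes stale.
 */
static void update_tg_load_avg_model(struct cfs_rq_model *c, long now)
{
	long delta = c->load_avg - c->tg_load_avg_contrib;

	if (now - c->tg->last_update_time < NSEC_PER_MSEC)
		return; /* ratelimited: tg_load_avg_contrib stays stale */

	c->tg->load_avg += delta;
	c->tg_load_avg_contrib = c->load_avg;
	c->tg->last_update_time = now;
}

/*
 * Models the fixed predicate: a cfs_rq whose contribution has not been
 * flushed back to tg->load_avg is not "decayed", so __update_blocked_fair()
 * keeps it on leaf_cfs_rq_list and gets another chance to flush it.
 */
static bool cfs_rq_is_decayed_model(const struct cfs_rq_model *c)
{
	if (c->load_avg)
		return false;

	if (c->tg_load_avg_contrib) /* the check this patch adds */
		return false;

	return true;
}
```

Walking through the bug with this model: a task contributes load, the
contribution is flushed, then the task migrates out less than 1ms later. The
flush is ratelimited, so `tg->load_avg` keeps the stale 100 while
`avg.load_avg` is already 0. Without the added check the cfs_rq would be
declared decayed and removed from the list, leaving `tg->load_avg` stuck; with
it, the cfs_rq survives until a later, non-ratelimited update flushes the
delta.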

Fixes: 1528c661c24b ("sched/fair: Ratelimit update to tg->load_avg")
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: xupengbo <xupengbo@oppo.com>
---
Changes:
v1 -> v2: 
- Take a different approach to fix the bug: check cfs_rq->tg_load_avg_contrib
in cfs_rq_is_decayed() to avoid early removal from the leaf_cfs_rq_list.
- Link to v1 : https://lore.kernel.org/cgroups/20250804130326.57523-1-xupengbo@oppo.com/
v2 -> v3:
- Check if cfs_rq->tg_load_avg_contrib is 0 directly.
- Link to v2 : https://lore.kernel.org/cgroups/20250805144121.14871-1-xupengbo@oppo.com/
v3 -> v4:
- Fix typo
- Link to v3 : https://lore.kernel.org/cgroups/20250826075743.19106-1-xupengbo@oppo.com/
v4 -> v5:
- Amend the commit message
- Link to v4 : https://lore.kernel.org/cgroups/20250826084854.25956-1-xupengbo@oppo.com/

After some preliminary discussion and analysis, I think it is feasible to
directly check if cfs_rq->tg_load_avg_contrib is 0 in cfs_rq_is_decayed(),
so patch v3 was submitted.

Please send emails to a different address, <xupengbo1029@163.com>, after
September 3, 2025; after that date <xupengbo@oppo.com> will expire for
personal reasons.

Thanks,
Xu Pengbo
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c..81b7df87f1ce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4062,6 +4062,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (child_cfs_rq_on_list(cfs_rq))
 		return false;
 
+	if (cfs_rq->tg_load_avg_contrib)
+		return false;
+
 	return true;
 }
 

base-commit: fab1beda7597fac1cecc01707d55eadb6bbe773c
-- 
2.43.0
Re: [PATCH v5] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out.
Posted by Aaron Lu 2 months, 1 week ago
Hello,

On Wed, Aug 27, 2025 at 10:22:07AM +0800, xupengbo wrote:
> When a task is migrated out, there is a probability that the tg->load_avg
> value will become abnormal. The reason is as follows.
> 
> 1. Due to the 1ms update period limitation in update_tg_load_avg(), there
> is a possibility that the reduced load_avg is not updated to tg->load_avg
> when a task migrates out.
> 2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
> calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
> function cfs_rq_is_decayed() does not check whether
> cfs->tg_load_avg_contrib is null. Consequently, in some cases,
> __update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
> updated to tg->load_avg.
> 
> Add a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed(),
> which fixes the case (2.) mentioned above.
> 
> Fixes: 1528c661c24b ("sched/fair: Ratelimit update to tg->load_avg")
> Tested-by: Aaron Lu <ziqianlu@bytedance.com>
> Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: xupengbo <xupengbo@oppo.com>

I wonder if there are any more concerns about this patch? If no, I hope
this fix can be merged. It's a rare case but it does happen for some
specific setup.

Sorry if this is a bad timing, but I just hit an oncall where this exact
problem occurred so I suppose it's worth a ping :)

Best regards,
Aaron
Re: [PATCH v5] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out.
Posted by Peter Zijlstra 2 months, 1 week ago
On Fri, Nov 28, 2025 at 07:54:45PM +0800, Aaron Lu wrote:
> Hello,
> 
> On Wed, Aug 27, 2025 at 10:22:07AM +0800, xupengbo wrote:
> > When a task is migrated out, there is a probability that the tg->load_avg
> > value will become abnormal. The reason is as follows.
> > 
> > 1. Due to the 1ms update period limitation in update_tg_load_avg(), there
> > is a possibility that the reduced load_avg is not updated to tg->load_avg
> > when a task migrates out.
> > 2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
> > calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
> > function cfs_rq_is_decayed() does not check whether
> > cfs->tg_load_avg_contrib is null. Consequently, in some cases,
> > __update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
> > updated to tg->load_avg.
> > 
> > Add a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed(),
> > which fixes the case (2.) mentioned above.
> > 
> > Fixes: 1528c661c24b ("sched/fair: Ratelimit update to tg->load_avg")
> > Tested-by: Aaron Lu <ziqianlu@bytedance.com>
> > Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
> > Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> > Signed-off-by: xupengbo <xupengbo@oppo.com>
> 
> I wonder if there are any more concerns about this patch? If no, I hope
> this fix can be merged. It's a rare case but it does happen for some
> specific setup.
> 
> Sorry if this is a bad timing, but I just hit an oncall where this exact
> problem occurred so I suppose it's worth a ping :)

Totally missed it. Seems okay, let me go queue the thing.
Re: [PATCH v5] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out.
Posted by Aaron Lu 2 months, 1 week ago
On Fri, Nov 28, 2025 at 02:40:17PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 28, 2025 at 07:54:45PM +0800, Aaron Lu wrote:
> > Hello,
> > 
> > On Wed, Aug 27, 2025 at 10:22:07AM +0800, xupengbo wrote:
> > > When a task is migrated out, there is a probability that the tg->load_avg
> > > value will become abnormal. The reason is as follows.
> > > 
> > > 1. Due to the 1ms update period limitation in update_tg_load_avg(), there
> > > is a possibility that the reduced load_avg is not updated to tg->load_avg
> > > when a task migrates out.
> > > 2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
> > > calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
> > > function cfs_rq_is_decayed() does not check whether
> > > cfs->tg_load_avg_contrib is null. Consequently, in some cases,
> > > __update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
> > > updated to tg->load_avg.
> > > 
> > > Add a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed(),
> > > which fixes the case (2.) mentioned above.
> > > 
> > > Fixes: 1528c661c24b ("sched/fair: Ratelimit update to tg->load_avg")
> > > Tested-by: Aaron Lu <ziqianlu@bytedance.com>
> > > Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
> > > Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> > > Signed-off-by: xupengbo <xupengbo@oppo.com>
> > 
> > I wonder if there are any more concerns about this patch? If no, I hope
> > this fix can be merged. It's a rare case but it does happen for some
> > specific setup.
> > 
> > Sorry if this is a bad timing, but I just hit an oncall where this exact
> > problem occurred so I suppose it's worth a ping :)
> 
> Totally missed it. Seems okay, let me go queue the thing.

Thanks Peter!
[tip: sched/urgent] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out
Posted by tip-bot2 for xupengbo 2 months ago
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     ca125231dd29fc0678dd3622e9cdea80a51dffe4
Gitweb:        https://git.kernel.org/tip/ca125231dd29fc0678dd3622e9cdea80a51dffe4
Author:        xupengbo <xupengbo@oppo.com>
AuthorDate:    Wed, 27 Aug 2025 10:22:07 +08:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Sat, 06 Dec 2025 10:03:13 +01:00

sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out

When a task is migrated out, there is a probability that the tg->load_avg
value will become abnormal. The reason is as follows:

1. Due to the 1ms update period limitation in update_tg_load_avg(), there
   is a possibility that the reduced load_avg is not updated to tg->load_avg
   when a task migrates out.

2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
   calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
   function cfs_rq_is_decayed() does not check whether
   cfs->tg_load_avg_contrib is null. Consequently, in some cases,
   __update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
   updated to tg->load_avg.

Add a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed(),
which fixes the case (2.) mentioned above.

Fixes: 1528c661c24b ("sched/fair: Ratelimit update to tg->load_avg")
Signed-off-by: xupengbo <xupengbo@oppo.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Link: https://patch.msgid.link/20250827022208.14487-1-xupengbo@oppo.com
---
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 769d7b7..da46c31 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4034,6 +4034,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (child_cfs_rq_on_list(cfs_rq))
 		return false;
 
+	if (cfs_rq->tg_load_avg_contrib)
+		return false;
+
 	return true;
 }
 
[tip: sched/urgent] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out
Posted by tip-bot2 for xupengbo 2 months ago
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     3dc7ae575aa1a32971565d9aaf784e6050dae959
Gitweb:        https://git.kernel.org/tip/3dc7ae575aa1a32971565d9aaf784e6050dae959
Author:        xupengbo <xupengbo@oppo.com>
AuthorDate:    Wed, 27 Aug 2025 10:22:07 +08:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 03 Dec 2025 19:26:22 +01:00

sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out

When a task is migrated out, there is a probability that the tg->load_avg
value will become abnormal. The reason is as follows:

1. Due to the 1ms update period limitation in update_tg_load_avg(), there
   is a possibility that the reduced load_avg is not updated to tg->load_avg
   when a task migrates out.

2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
   calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
   function cfs_rq_is_decayed() does not check whether
   cfs->tg_load_avg_contrib is null. Consequently, in some cases,
   __update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
   updated to tg->load_avg.

Add a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed(),
which fixes the case (2.) mentioned above.

Fixes: 1528c661c24b ("sched/fair: Ratelimit update to tg->load_avg")
Signed-off-by: xupengbo <xupengbo@oppo.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Link: https://patch.msgid.link/20250827022208.14487-1-xupengbo@oppo.com
---
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 00a32c9..a31d88e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4034,6 +4034,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (child_cfs_rq_on_list(cfs_rq))
 		return false;
 
+	if (cfs_rq->tg_load_avg_contrib)
+		return false;
+
 	return true;
 }
 
[tip: sched/urgent] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out
Posted by tip-bot2 for xupengbo 2 months ago
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     36c26a1f1f510b23ab81db176c90921305fae669
Gitweb:        https://git.kernel.org/tip/36c26a1f1f510b23ab81db176c90921305fae669
Author:        xupengbo <xupengbo@oppo.com>
AuthorDate:    Wed, 27 Aug 2025 10:22:07 +08:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 02 Dec 2025 15:25:00 +01:00

sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out

When a task is migrated out, there is a probability that the tg->load_avg
value will become abnormal. The reason is as follows:

1. Due to the 1ms update period limitation in update_tg_load_avg(), there
   is a possibility that the reduced load_avg is not updated to tg->load_avg
   when a task migrates out.

2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
   calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
   function cfs_rq_is_decayed() does not check whether
   cfs->tg_load_avg_contrib is null. Consequently, in some cases,
   __update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
   updated to tg->load_avg.

Add a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed(),
which fixes the case (2.) mentioned above.

Fixes: 1528c661c24b ("sched/fair: Ratelimit update to tg->load_avg")
Signed-off-by: xupengbo <xupengbo@oppo.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Link: https://patch.msgid.link/20250827022208.14487-1-xupengbo@oppo.com
---
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 769d7b7..da46c31 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4034,6 +4034,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (child_cfs_rq_on_list(cfs_rq))
 		return false;
 
+	if (cfs_rq->tg_load_avg_contrib)
+		return false;
+
 	return true;
 }