[PATCH] sched/fair: Fix DELAY_DEQUEUE issue related to cgroup throttling

Han Guangjiang posted 1 patch 4 weeks, 1 day ago
kernel/sched/fair.c | 21 ++++++---------------
1 file changed, 6 insertions(+), 15 deletions(-)
[PATCH] sched/fair: Fix DELAY_DEQUEUE issue related to cgroup throttling
Posted by Han Guangjiang 4 weeks, 1 day ago
From: Han Guangjiang <hanguangjiang@lixiang.com>

When both CPU cgroup and memory cgroup are enabled with parent cgroup
resource limits much smaller than child cgroup's, the system frequently
hangs with NULL pointer dereference:

Unable to handle kernel NULL pointer dereference
at virtual address 0000000000000051
Internal error: Oops: 0000000096000006 [#1] PREEMPT_RT SMP
pc : pick_task_fair+0x68/0x150
Call trace:
 pick_task_fair+0x68/0x150
 pick_next_task_fair+0x30/0x3b8
 __schedule+0x180/0xb98
 preempt_schedule+0x48/0x60
 rt_mutex_slowunlock+0x298/0x340
 rt_spin_unlock+0x84/0xa0
 page_vma_mapped_walk+0x1c8/0x478
 folio_referenced_one+0xdc/0x490
 rmap_walk_file+0x11c/0x200
 folio_referenced+0x160/0x1e8
 shrink_folio_list+0x5c4/0xc60
 shrink_lruvec+0x5f8/0xb88
 shrink_node+0x308/0x940
 do_try_to_free_pages+0xd4/0x540
 try_to_free_mem_cgroup_pages+0x12c/0x2c0

The issue can be mitigated by increasing the parent cgroup's CPU
resources, or avoided entirely by disabling the DELAY_DEQUEUE feature:

SCHED_FEAT(DELAY_DEQUEUE, false)

With CONFIG_SCHED_DEBUG enabled, the following warning appears:

WARNING: CPU: 1 PID: 27 at kernel/sched/fair.c:704 update_entity_lag+0xa8/0xd0
!se->on_rq
Call trace:
 update_entity_lag+0xa8/0xd0
 dequeue_entity+0x90/0x538
 dequeue_entities+0xd0/0x490
 dequeue_task_fair+0xcc/0x230
 rt_mutex_setprio+0x2ec/0x4d8
 rtlock_slowlock_locked+0x6c8/0xce8

The warning shows that se->on_rq is already 0 when update_entity_lag()
runs, meaning dequeue_entity() was entered a second time for an entity
that had already been dequeued.

Root cause analysis:
In rt_mutex_setprio(), there are two dequeue_task() calls:
1. First call: dequeue immediately if task is delay-dequeued
2. Second call: dequeue running tasks

Through debugging, we observed that for the same task, both dequeue_task()
calls are actually executed. The task is a sched_delayed task on cfs_rq,
which confirms our analysis that dequeue_entity() is entered at least
twice.

Semantically, rt_mutex handles scheduling and priority inheritance, and
should only dequeue/enqueue running tasks. A sched_delayed task is
essentially non-running, so the second dequeue_task() should not execute.

Further analysis of dequeue_entities() shows multiple cfs_rq_throttled()
checks. At the end of the function, __block_task() marks sched_delayed
tasks as non-running. However, when a cgroup is throttled, the function
returns early without calling __block_task(), so p->on_rq stays set even
though the entity has already left the cfs_rq. This triggers the
unexpected second dequeue_task() in rt_mutex_setprio() and crashes the
system.

We initially tried modifying the two cfs_rq_throttled() return points in
dequeue_entities() to jump to the __block_task() condition check, which
resolved the issue completely.

This patch takes a cleaner approach by moving the __block_task()
operation from dequeue_entities() to finish_delayed_dequeue_entity(),
ensuring sched_delayed tasks are properly marked as non-running
regardless of cgroup throttling status.
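
The sequence described above can be modelled in a few lines of user-space
C. This is only a toy sketch, not kernel code: the toy_* types and
functions are hypothetical stand-ins, the "throttled" and "patched" flags
are simplifications, and the first dequeue is assumed to depend only on
sched_delayed. It illustrates how an early return can leave p->on_rq set
while se->on_rq is already clear, and how doing the block step inside the
finish helper avoids that.

```c
#include <assert.h>	/* for callers that check the model with assert() */
#include <stdbool.h>

/* Hypothetical stand-ins; only the fields discussed in this thread. */
struct toy_se {
	bool on_rq;		/* models se->on_rq */
	bool sched_delayed;	/* models se->sched_delayed */
};

struct toy_task {
	struct toy_se se;
	int on_rq;		/* models p->on_rq: 1 = queued, 0 = blocked */
};

/* A task whose dequeue was delayed: still on the runqueue. */
static struct toy_task toy_delayed_task(void)
{
	struct toy_task p = {
		.se = { .on_rq = true, .sched_delayed = true },
		.on_rq = 1,
	};
	return p;
}

/* Models finish_delayed_dequeue_entity(); block_here is the patched path. */
static void toy_finish_delayed_dequeue(struct toy_task *p, bool block_here)
{
	p->se.sched_delayed = false;
	if (block_here)
		p->on_rq = 0;	/* stands in for __block_task() */
}

/*
 * Models dequeue_entities() completing a delayed dequeue.
 * Returns false when the entity is already off the runqueue, which is
 * the condition behind the !se->on_rq warning.
 */
static bool toy_dequeue_entities(struct toy_task *p, bool throttled,
				 bool patched)
{
	if (!p->se.on_rq)
		return false;

	p->se.on_rq = false;
	toy_finish_delayed_dequeue(p, patched);

	if (throttled)
		return true;	/* early return: the tail below is skipped */

	if (!patched)
		p->on_rq = 0;	/* unpatched: block step at the very end */
	return true;
}

/*
 * Models the two dequeue_task() calls in rt_mutex_setprio().
 * Returns true if no double dequeue happened.
 */
static bool toy_setprio(struct toy_task *p, bool throttled, bool patched)
{
	bool ok = true;

	if (p->se.sched_delayed)
		ok = toy_dequeue_entities(p, throttled, patched);
	if (p->on_rq)		/* task still looks queued */
		ok = toy_dequeue_entities(p, throttled, patched) && ok;
	return ok;
}
```

In this model, calling toy_setprio() on a fresh toy_delayed_task() with
throttled set and patched clear returns false, which corresponds to the
double dequeue. With patched set it returns true in both the throttled
and the unthrottled case.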

Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Signed-off-by: Han Guangjiang <hanguangjiang@lixiang.com>
---
 kernel/sched/fair.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c..d6c2a604358f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5373,6 +5373,12 @@ static inline void finish_delayed_dequeue_entity(struct sched_entity *se)
 	clear_delayed(se);
 	if (sched_feat(DELAY_ZERO) && se->vlag > 0)
 		se->vlag = 0;
+
+	if (entity_is_task(se)) {
+		struct task_struct *p = task_of(se);
+
+		__block_task(task_rq(p), p);
+	}
 }
 
 static bool
@@ -7048,21 +7054,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
 		rq->next_balance = jiffies;
 
-	if (p && task_delayed) {
-		WARN_ON_ONCE(!task_sleep);
-		WARN_ON_ONCE(p->on_rq != 1);
-
-		/* Fix-up what dequeue_task_fair() skipped */
-		hrtick_update(rq);
-
-		/*
-		 * Fix-up what block_task() skipped.
-		 *
-		 * Must be last, @p might not be valid after this.
-		 */
-		__block_task(rq, p);
-	}
-
 	return 1;
 }
 
-- 
2.25.1
Re: [PATCH] sched/fair: Fix DELAY_DEQUEUE issue related to cgroup throttling
Posted by Peter Zijlstra 2 days, 17 hours ago
On Thu, Sep 04, 2025 at 09:51:50AM +0800, Han Guangjiang wrote:
> From: Han Guangjiang <hanguangjiang@lixiang.com>
> 
> When both CPU cgroup and memory cgroup are enabled with parent cgroup
> resource limits much smaller than child cgroup's, the system frequently
> hangs with NULL pointer dereference:

Is this the same issue as here:

  https://lore.kernel.org/all/105ae6f1-f629-4fe7-9644-4242c3bed035@amd.com/T/#u

  ?
Re: [PATCH] sched/fair: Fix DELAY_DEQUEUE issue related to cgroup throttling
Posted by Han Guangjiang 2 days, 5 hours ago
>> From: Han Guangjiang <hanguangjiang@lixiang.com>
>>
>> When both CPU cgroup and memory cgroup are enabled with parent cgroup
>> resource limits much smaller than child cgroup's, the system frequently
>> hangs with NULL pointer dereference:
>>
> Is this the same issue as here:
>
>   https://lore.kernel.org/all/105ae6f1-f629-4fe7-9644-4242c3bed035@amd.com/T/#u
>
>   ?

Yes, based on the patch modifications, I believe this is the same issue.
When dequeue_entities() is executed on a delay_dequeued task while the
cgroup is being throttled, it returns early and misses the
__block_task() operation on the task. This leads to inconsistency
between p->on_rq and se->on_rq.

When PI or scheduler switching occurs, the second dequeue_entities()
call assumes the task is still in the CFS scheduler, but in reality
it is no longer there.

By the way, I have a question about the hrtick_update() in
dequeue_entities(). Should it be changed to:

dequeue_entities()
{
    ...
    if (p) {
        hrtick_update(rq);
    }
    ...
}

And should hrtick_update() then be removed from dequeue_task_fair()?
As it stands, for delay-dequeued tasks, hrtick_update() is executed
twice in this process.

Also, should the return type of dequeue_entities() be changed to
match dequeue_task_fair(), where true means the task was actually
removed from the queue, and false means it was delay dequeued?

Thanks,
Han Guangjiang
Re: [PATCH] sched/fair: Fix DELAY_DEQUEUE issue related to cgroup throttling
Posted by Pierre Gondois 2 days, 18 hours ago
Hello Han,

On 9/4/25 03:51, Han Guangjiang wrote:
> From: Han Guangjiang <hanguangjiang@lixiang.com>
>
> When both CPU cgroup and memory cgroup are enabled with parent cgroup
> resource limits much smaller than child cgroup's, the system frequently
> hangs with NULL pointer dereference:
Is it happening while running a specific workload ?
Would it be possible to provide a reproducer ?

Re: [PATCH] sched/fair: Fix DELAY_DEQUEUE issue related to cgroup throttling
Posted by Han Guangjiang 2 days, 5 hours ago
>> When both CPU cgroup and memory cgroup are enabled with parent cgroup
>> resource limits much smaller than child cgroup's, the system frequently
>> hangs with NULL pointer dereference:
> Is it happening while running a specific workload ?
> Would it be possible to provide a reproducer ?

Hi,

Yes, this happens with our complex workload. We are using the PREEMPT_RT
option, and the error log shows that an rt_mutex PI operation is in
progress and needs to switch the scheduling class of a delay-dequeued
task. At that moment, the parent group of this task is being throttled by
the cgroup. We do not currently have a minimal reproducer.

Similar issue: https://lore.kernel.org/all/87254ef1-fa58-4747-b2e1-5c85ecde15bf@windriver.com/
 
Thanks,
Han Guangjiang