From: Han Guangjiang <hanguangjiang@lixiang.com>
When both CPU cgroup and memory cgroup are enabled with parent cgroup
resource limits much smaller than child cgroup's, the system frequently
hangs with NULL pointer dereference:
Unable to handle kernel NULL pointer dereference
at virtual address 0000000000000051
Internal error: Oops: 0000000096000006 [#1] PREEMPT_RT SMP
pc : pick_task_fair+0x68/0x150
Call trace:
pick_task_fair+0x68/0x150
pick_next_task_fair+0x30/0x3b8
__schedule+0x180/0xb98
preempt_schedule+0x48/0x60
rt_mutex_slowunlock+0x298/0x340
rt_spin_unlock+0x84/0xa0
page_vma_mapped_walk+0x1c8/0x478
folio_referenced_one+0xdc/0x490
rmap_walk_file+0x11c/0x200
folio_referenced+0x160/0x1e8
shrink_folio_list+0x5c4/0xc60
shrink_lruvec+0x5f8/0xb88
shrink_node+0x308/0x940
do_try_to_free_pages+0xd4/0x540
try_to_free_mem_cgroup_pages+0x12c/0x2c0
The issue can be mitigated by increasing the parent cgroup's CPU resources,
or resolved entirely by disabling the DELAY_DEQUEUE feature:
SCHED_FEAT(DELAY_DEQUEUE, false)
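(For reference, on kernels built with CONFIG_SCHED_DEBUG the feature can
also be toggled at runtime through the scheduler debugfs interface -- a
sketch, assuming debugfs is mounted at the usual location:)

```shell
# Disable delayed dequeue at runtime (assumes debugfs is mounted at
# /sys/kernel/debug and the kernel exposes the sched features file):
echo NO_DELAY_DEQUEUE > /sys/kernel/debug/sched/features

# The features file lists NO_DELAY_DEQUEUE once the feature is off:
grep -o NO_DELAY_DEQUEUE /sys/kernel/debug/sched/features
```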
With CONFIG_SCHED_DEBUG enabled, the following warning appears:
WARNING: CPU: 1 PID: 27 at kernel/sched/fair.c:704 update_entity_lag+0xa8/0xd0
!se->on_rq
Call trace:
update_entity_lag+0xa8/0xd0
dequeue_entity+0x90/0x538
dequeue_entities+0xd0/0x490
dequeue_task_fair+0xcc/0x230
rt_mutex_setprio+0x2ec/0x4d8
rtlock_slowlock_locked+0x6c8/0xce8
The warning indicates se->on_rq is 0, meaning dequeue_entity() was
entered at least twice for the same entity, so update_entity_lag() ran
on an entity that is no longer on the runqueue.
Root cause analysis:
In rt_mutex_setprio(), there are two dequeue_task() calls:
1. First call: dequeue immediately if task is delay-dequeued
2. Second call: dequeue running tasks
Through debugging, we observed that for the same task, both dequeue_task()
calls are actually executed. The task is a sched_delayed task on cfs_rq,
which confirms our analysis that dequeue_entity() is entered at least
twice.
Semantically, rt_mutex handles scheduling and priority inheritance, and
should only dequeue/enqueue running tasks. A sched_delayed task is
essentially non-running, so the second dequeue_task() should not execute.
Further analysis of dequeue_entities() shows multiple cfs_rq_throttled()
checks. At the function's end, __block_task() transitions sched_delayed
tasks to the non-running state. However, when cgroup throttling occurs,
the function returns early without executing __block_task(), leaving the
sched_delayed task in the running state. This causes the unexpected
second dequeue_task() in rt_mutex_setprio(), leading to the system
crash.
We initially tried modifying the two cfs_rq_throttled() return points in
dequeue_entities() to jump to the __block_task() condition check, which
resolved the issue completely.
This patch takes a cleaner approach by moving the __block_task()
operation from dequeue_entities() to finish_delayed_dequeue_entity(),
ensuring sched_delayed tasks are properly marked as non-running
regardless of cgroup throttling status.
Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Signed-off-by: Han Guangjiang <hanguangjiang@lixiang.com>
---
kernel/sched/fair.c | 21 ++++++---------------
1 file changed, 6 insertions(+), 15 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c..d6c2a604358f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5373,6 +5373,12 @@ static inline void finish_delayed_dequeue_entity(struct sched_entity *se)
clear_delayed(se);
if (sched_feat(DELAY_ZERO) && se->vlag > 0)
se->vlag = 0;
+
+ if (entity_is_task(se)) {
+ struct task_struct *p = task_of(se);
+
+ __block_task(task_rq(p), p);
+ }
}
static bool
@@ -7048,21 +7054,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
rq->next_balance = jiffies;
- if (p && task_delayed) {
- WARN_ON_ONCE(!task_sleep);
- WARN_ON_ONCE(p->on_rq != 1);
-
- /* Fix-up what dequeue_task_fair() skipped */
- hrtick_update(rq);
-
- /*
- * Fix-up what block_task() skipped.
- *
- * Must be last, @p might not be valid after this.
- */
- __block_task(rq, p);
- }
-
return 1;
}
--
2.25.1
On Thu, Sep 04, 2025 at 09:51:50AM +0800, Han Guangjiang wrote:
> From: Han Guangjiang <hanguangjiang@lixiang.com>
>
> When both CPU cgroup and memory cgroup are enabled with parent cgroup
> resource limits much smaller than child cgroup's, the system frequently
> hangs with NULL pointer dereference:

Is this the same issue as here:

https://lore.kernel.org/all/105ae6f1-f629-4fe7-9644-4242c3bed035@amd.com/T/#u

?
>> From: Han Guangjiang <hanguangjiang@lixiang.com>
>>
>> When both CPU cgroup and memory cgroup are enabled with parent cgroup
>> resource limits much smaller than child cgroup's, the system frequently
>> hangs with NULL pointer dereference:
>
> Is this the same issue as here:
>
> https://lore.kernel.org/all/105ae6f1-f629-4fe7-9644-4242c3bed035@amd.com/T/#u
>
> ?

Yes, based on the patch modifications, I believe this is the same issue.

When dequeue_entities() is executed on a delay-dequeued task while the
cgroup is being throttled, it returns early and misses the
__block_task() operation on the task. This leads to an inconsistency
between p->on_rq and se->on_rq. When PI or a scheduling-class switch
occurs, the second dequeue_entities() call assumes the task is still in
the CFS scheduler, but in reality it is no longer there.

By the way, I have a question about the hrtick_update() in
dequeue_entities(). Should it be changed to:

dequeue_entities()
{
	...
	if (p) {
		hrtick_update(rq);
	}
	...
}

And remove hrtick_update() from dequeue_task_fair()? Because for
delay-dequeued tasks, hrtick_update() will be executed twice in this
process.

Also, should the return type of dequeue_entities() be changed to match
dequeue_task_fair(), where true means the task was actually removed from
the queue, and false means it was delay-dequeued?

Thanks,
Han Guangjiang
Hello Han,

On 9/4/25 03:51, Han Guangjiang wrote:
> From: Han Guangjiang <hanguangjiang@lixiang.com>
>
> When both CPU cgroup and memory cgroup are enabled with parent cgroup
> resource limits much smaller than child cgroup's, the system frequently
> hangs with NULL pointer dereference:

Is it happening while running a specific workload?
Would it be possible to provide a reproducer?

[...]
>> When both CPU cgroup and memory cgroup are enabled with parent cgroup
>> resource limits much smaller than child cgroup's, the system frequently
>> hangs with NULL pointer dereference:
>
> Is it happening while running a specific workload?
> Would it be possible to provide a reproducer?

Hi,

Yes, this happens on our complex workload. We are using the PREEMPT_RT
option, and from the error log we can see that an rt_mutex PI operation
is being executed, and it needs to switch the scheduling class of a
delay-dequeued task. The parent group of this delay-dequeued task is
being throttled by cgroup at that time. We currently do not have a
minimal reproducer constructed.

Similar issue:
https://lore.kernel.org/all/87254ef1-fa58-4747-b2e1-5c85ecde15bf@windriver.com/

Thanks,
Han Guangjiang