Commit 8671bad873eb ("sched: Do not call __put_task_struct() on rt
if pi_blocked_on is set") changed the behavior of put_task_struct()
unconditionally, even when PREEMPT_RT was not enabled, in clear mismatch
with the commit description.
Restore the previous behavior of put_task_struct() for the PREEMPT_RT
disabled case.
Fixes: 8671bad873eb ("sched: Do not call __put_task_struct() on rt if pi_blocked_on is set")
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
---
Note: This patch is a fix motivated by Oleg Nesterov's question at
https://lore.kernel.org/linux-rt-devel/20250728201441.GA4690@redhat.com/
Also, the kernel test robot reported a problem found in a x86 (32 bit) VM
test that was bisected to the original fix being amended here:
https://lkml.org/lkml/2025/9/5/147
Though I was not able the reproduce the reported problem, this patch here
would minimize the problem by isolating the behavior to PREEMPT_RT-enabled
kernels.
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ea41795a352b..51678a541477 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -130,6 +133,16 @@ static inline void put_task_struct(struct task_struct *t)
if (!refcount_dec_and_test(&t->usage))
return;
+ /* In !RT, it is always safe to call __put_task_struct(). */
+ if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
+ static DEFINE_WAIT_OVERRIDE_MAP(put_task_map, LD_WAIT_SLEEP);
+
+ lock_map_acquire_try(&put_task_map);
+ __put_task_struct(t);
+ lock_map_release(&put_task_map);
+ return;
+ }
+
/*
* Under PREEMPT_RT, we can't call __put_task_struct
* in atomic context because it will indirectly
@@ -137,10 +150,6 @@ static inline void put_task_struct(struct task_struct *t)
* current process has a mutex enqueued (blocked on
* a PI chain).
*
- * In !RT, it is always safe to call __put_task_struct().
- * Though, in order to simplify the code, resort to the
- * deferred call too.
- *
* call_rcu() will schedule __put_task_struct_rcu_cb()
* to be called in process context.
*
On Mon, Sep 15, 2025 at 08:15:19AM -0300, Luis Claudio R. Goncalves wrote: > Commit 8671bad873eb ("sched: Do not call __put_task_struct() on rt > if pi_blocked_on is set") changed the behavior of put_task_struct() > unconditionally, even when PREEMPT_RT was not enabled, in clear mismatch > with the commit description. > > Restore the previous behavior of put_task_struct() for the PREEMPT_RT > disabled case. > > Fixes: 8671bad873eb ("sched: Do not call __put_task_struct() on rt if pi_blocked_on is set") > Acked-by: Oleg Nesterov <oleg@redhat.com> > Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com> > --- > > Note: This patch is a fix motivated by Oleg Nesterov's question at > https://lore.kernel.org/linux-rt-devel/20250728201441.GA4690@redhat.com/ Right, but I thought we did want to make this behaviour consistent. Why have !RT behave differently? That is, why isn't this simply a 'buggy' comment/changelog issue? > Also, the kernel test robot reported a problem found in a x86 (32 bit) VM > test that was bisected to the original fix being amended here: > > https://lkml.org/lkml/2025/9/5/147 > > Though I was not able the reproduce the reported problem, this patch here > would minimize the problem by isolating the behavior to PREEMPT_RT-enabled > kernels. Yeah, robot also had trouble reproducing that if I read that %reproduction thing right. And IIRC RT has been running with this for ages, and never seen a problem.
On 09/15, Peter Zijlstra wrote: > > On Mon, Sep 15, 2025 at 08:15:19AM -0300, Luis Claudio R. Goncalves wrote: > > Commit 8671bad873eb ("sched: Do not call __put_task_struct() on rt > > if pi_blocked_on is set") changed the behavior of put_task_struct() > > unconditionally, even when PREEMPT_RT was not enabled, in clear mismatch > > with the commit description. > > > > Restore the previous behavior of put_task_struct() for the PREEMPT_RT > > disabled case. > > > > Fixes: 8671bad873eb ("sched: Do not call __put_task_struct() on rt if pi_blocked_on is set") > > Acked-by: Oleg Nesterov <oleg@redhat.com> > > Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com> > > --- > > > > Note: This patch is a fix motivated by Oleg Nesterov's question at > > https://lore.kernel.org/linux-rt-devel/20250728201441.GA4690@redhat.com/ > > Right, but I thought we did want to make this behaviour consistent. > > Why have !RT behave differently? That is, why isn't this simply a > 'buggy' comment/changelog issue? Well, this was discussed several times, in particular see https://lore.kernel.org/all/CAHk-=whtj+aSYftniMRG4xvFE8dmmYyrqcJyPmzStsfj5w9r=w@mail.gmail.com/ And task_struct->rcu_users was introduced to avoid RCU call in put_task_struct() ... But I won't really argue if you decide to remove this !RT optimization, although I think it would be better to do this in a separate patch. Oleg.
On Mon, Sep 15, 2025 at 02:35:40PM +0200, Oleg Nesterov wrote: > On 09/15, Peter Zijlstra wrote: > > > > On Mon, Sep 15, 2025 at 08:15:19AM -0300, Luis Claudio R. Goncalves wrote: > > > Commit 8671bad873eb ("sched: Do not call __put_task_struct() on rt > > > if pi_blocked_on is set") changed the behavior of put_task_struct() > > > unconditionally, even when PREEMPT_RT was not enabled, in clear mismatch > > > with the commit description. > > > > > > Restore the previous behavior of put_task_struct() for the PREEMPT_RT > > > disabled case. > > > > > > Fixes: 8671bad873eb ("sched: Do not call __put_task_struct() on rt if pi_blocked_on is set") > > > Acked-by: Oleg Nesterov <oleg@redhat.com> > > > Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com> > > > --- > > > > > > Note: This patch is a fix motivated by Oleg Nesterov's question at > > > https://lore.kernel.org/linux-rt-devel/20250728201441.GA4690@redhat.com/ > > > > Right, but I thought we did want to make this behaviour consistent. > > > > Why have !RT behave differently? That is, why isn't this simply a > > 'buggy' comment/changelog issue? > > Well, this was discussed several times, in particular see > https://lore.kernel.org/all/CAHk-=whtj+aSYftniMRG4xvFE8dmmYyrqcJyPmzStsfj5w9r=w@mail.gmail.com/ > > And task_struct->rcu_users was introduced to avoid RCU call in put_task_struct() ... Ah, I forgot about that thing.. Although I had vague memories of that argument on rcu_assign_pointer() vs RCU_INIT_POINTER(). > But I won't really argue if you decide to remove this !RT optimization, although > I think it would be better to do this in a separate patch. Right. So when they wanted to remove that preemptible() clause, I was like why again do we have this !RT exception at all, and can't we get rid of that. If git isn't confusing me again, this got merged in this cycle. But so far no benchmark came and told us this was a bad idea. So what do we do now... do we restore the !RT exception (so far there aren't any numbers to suggest this mattered) or do we let it be for a bit and then go and clean things up? --- include/linux/sched.h | 1 - include/linux/sched/task.h | 37 +++---------------------------------- kernel/bpf/helpers.c | 6 +++--- kernel/exit.c | 21 ++------------------- kernel/fork.c | 14 +++++++++----- kernel/sched/core.c | 3 +-- 6 files changed, 18 insertions(+), 64 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 644a01bdae70..fd6586c04ed3 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1576,7 +1576,6 @@ struct task_struct { # endif #endif struct rcu_head rcu; - refcount_t rcu_users; int pagefault_disabled; #ifdef CONFIG_MMU struct task_struct *oom_reaper_list; diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index ea41795a352b..1125c44b205a 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -122,41 +122,12 @@ static inline struct task_struct *tryget_task_struct(struct task_struct *t) return refcount_inc_not_zero(&t->usage) ? t : NULL; } -extern void __put_task_struct(struct task_struct *t); extern void __put_task_struct_rcu_cb(struct rcu_head *rhp); static inline void put_task_struct(struct task_struct *t) { - if (!refcount_dec_and_test(&t->usage)) - return; - - /* - * Under PREEMPT_RT, we can't call __put_task_struct - * in atomic context because it will indirectly - * acquire sleeping locks. The same is true if the - * current process has a mutex enqueued (blocked on - * a PI chain). - * - * In !RT, it is always safe to call __put_task_struct(). - * Though, in order to simplify the code, resort to the - * deferred call too. - * - * call_rcu() will schedule __put_task_struct_rcu_cb() - * to be called in process context. - * - * __put_task_struct() is called when - * refcount_dec_and_test(&t->usage) succeeds. - * - * This means that it can't "conflict" with - * put_task_struct_rcu_user() which abuses ->rcu the same - * way; rcu_users has a reference so task->usage can't be - * zero after rcu_users 1 -> 0 transition. - * - * delayed_free_task() also uses ->rcu, but it is only called - * when it fails to fork a process. Therefore, there is no - * way it can conflict with __put_task_struct(). - */ - call_rcu(&t->rcu, __put_task_struct_rcu_cb); + if (refcount_dec_and_test(&t->usage)) + call_rcu(&t->rcu, __put_task_struct_rcu_cb); } DEFINE_FREE(put_task, struct task_struct *, if (_T) put_task_struct(_T)) @@ -164,11 +135,9 @@ DEFINE_FREE(put_task, struct task_struct *, if (_T) put_task_struct(_T)) static inline void put_task_struct_many(struct task_struct *t, int nr) { if (refcount_sub_and_test(nr, &t->usage)) - __put_task_struct(t); + call_rcu(&t->rcu, __put_task_struct_rcu_cb); } -void put_task_struct_rcu_user(struct task_struct *task); - /* Free all architecture-specific resources held by a thread. */ void release_thread(struct task_struct *dead_task); diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 8af62cb243d9..bdc4e65bca5c 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2459,7 +2459,7 @@ __bpf_kfunc struct bpf_rb_node *bpf_rbtree_right(struct bpf_rb_root *root, struc */ __bpf_kfunc struct task_struct *bpf_task_acquire(struct task_struct *p) { - if (refcount_inc_not_zero(&p->rcu_users)) + if (refcount_inc_not_zero(&p->usage)) return p; return NULL; } @@ -2470,12 +2470,12 @@ __bpf_kfunc struct task_struct *bpf_task_acquire(struct task_struct *p) */ __bpf_kfunc void bpf_task_release(struct task_struct *p) { - put_task_struct_rcu_user(p); + put_task_struct(p); } __bpf_kfunc void bpf_task_release_dtor(void *p) { - put_task_struct_rcu_user(p); + put_task_struct(p); } CFI_NOSEAL(bpf_task_release_dtor); diff --git a/kernel/exit.c b/kernel/exit.c index 343eb97543d5..240a6faa0e26 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -220,23 +220,6 @@ static void __exit_signal(struct release_task_post *post, struct task_struct *ts tty_kref_put(tty); } -static void delayed_put_task_struct(struct rcu_head *rhp) -{ - struct task_struct *tsk = container_of(rhp, struct task_struct, rcu); - - kprobe_flush_task(tsk); - rethook_flush_task(tsk); - perf_event_delayed_put(tsk); - trace_sched_process_free(tsk); - put_task_struct(tsk); -} - -void put_task_struct_rcu_user(struct task_struct *task) -{ - if (refcount_dec_and_test(&task->rcu_users)) - call_rcu(&task->rcu, delayed_put_task_struct); -} - void __weak release_thread(struct task_struct *dead_task) { } @@ -305,7 +288,7 @@ void release_task(struct task_struct *p) if (thread_group_leader(p)) flush_sigqueue(&p->signal->shared_pending); - put_task_struct_rcu_user(p); + put_task_struct(p); p = leader; if (unlikely(zap_leader)) @@ -1057,7 +1040,7 @@ void __noreturn make_task_dead(int signr) pr_alert("Fixing recursive fault but reboot is needed!\n"); futex_exit_recursive(tsk); tsk->exit_state = EXIT_DEAD; - refcount_inc(&tsk->rcu_users); + get_task_struct(tsk); do_task_dead(); } diff --git a/kernel/fork.c b/kernel/fork.c index 33dfb82af25b..383c811626a3 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -106,6 +106,7 @@ #include <linux/pidfs.h> #include <linux/tick.h> #include <linux/unwind_deferred.h> +#include <linux/kprobes.h> #include <asm/pgalloc.h> #include <linux/uaccess.h> @@ -729,12 +730,17 @@ static inline void put_signal_struct(struct signal_struct *sig) free_signal_struct(sig); } -void __put_task_struct(struct task_struct *tsk) +static void __put_task_struct(struct task_struct *tsk) { WARN_ON(!tsk->exit_state); WARN_ON(refcount_read(&tsk->usage)); WARN_ON(tsk == current); + kprobe_flush_task(tsk); + rethook_flush_task(tsk); + perf_event_delayed_put(tsk); + trace_sched_process_free(tsk); + unwind_task_free(tsk); sched_ext_free(tsk); io_uring_free(tsk); @@ -915,12 +921,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node) dup_user_cpus_ptr(tsk, orig, node); /* - * One for the user space visible state that goes away when reaped. + * One for user space visible state that goes away when reaped. * One for the scheduler. */ - refcount_set(&tsk->rcu_users, 2); - /* One for the rcu users */ - refcount_set(&tsk->usage, 1); + refcount_set(&tsk->usage, 2); #ifdef CONFIG_BLK_DEV_IO_TRACE tsk->btrace_seq = 0; #endif diff --git a/kernel/sched/core.c b/kernel/sched/core.c index da2062de97a2..ad3c0e9f3eb4 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5259,8 +5259,7 @@ static struct rq *finish_task_switch(struct task_struct *prev) /* Task is done with its stack. */ put_task_stack(prev); - - put_task_struct_rcu_user(prev); + put_task_struct(prev); } return rq;
On 09/16, Peter Zijlstra wrote: > > On Mon, Sep 15, 2025 at 02:35:40PM +0200, Oleg Nesterov wrote: > > On 09/15, Peter Zijlstra wrote: > > > > > > Why have !RT behave differently? That is, why isn't this simply a > > > 'buggy' comment/changelog issue? > > > > Well, this was discussed several times, in particular see > > https://lore.kernel.org/all/CAHk-=whtj+aSYftniMRG4xvFE8dmmYyrqcJyPmzStsfj5w9r=w@mail.gmail.com/ > > > > And task_struct->rcu_users was introduced to avoid RCU call in put_task_struct() ... > > Ah, I forgot about that thing.. Although I had vague memories of that > argument on rcu_assign_pointer() vs RCU_INIT_POINTER(). > > > But I won't really argue if you decide to remove this !RT optimization, although > > I think it would be better to do this in a separate patch. > > Right. So when they wanted to remove that preemptible() clause, I was > like why again do we have this !RT exception at all, and can't we get > rid of that. > > If git isn't confusing me again, this got merged in this cycle. But so > far no benchmark came and told us this was a bad idea. I still think it would be safer to merge this patch from Luis before v6.17, then possibly remove this special case in a separate patch... > So what do we do now... do we restore the !RT exception (so far there > aren't any numbers to suggest this mattered) or do we let it be for a > bit and then go and clean things up? It is not that simple. Please note that put_task_struct_rcu_user() delays put(tsk->usage), not free(tsk). So for example with this change > @@ -305,7 +288,7 @@ void release_task(struct task_struct *p) > if (thread_group_leader(p)) > flush_sigqueue(&p->signal->shared_pending); > > - put_task_struct_rcu_user(p); > + put_task_struct(p); > > p = leader; > if (unlikely(zap_leader)) This code rcu_read_lock(); tsk = find_task_by_vpid(...); if (tsk) get_task_struct(tsk); rcu_read_unlock(); becomes wrong, get_task_struct(tsk) can increment tsk->usage == 0. Oleg.
On 2025-09-15 13:38:12 [+0200], Peter Zijlstra wrote: > Right, but I thought we did want to make this behaviour consistent. That is correct, that is what I asked for (either consistent or a compelling reason why not). Oleg pointed out that the patch description does not match the change. That is the only complaint. > And IIRC RT has been running with this for ages, and never seen a > problem. The syz-bot picked up RT recently. Other than that, yes, RT had it for ages. Sebastian
On Mon, Sep 15, 2025 at 02:24:44PM +0200, Sebastian Andrzej Siewior wrote: > On 2025-09-15 13:38:12 [+0200], Peter Zijlstra wrote: > > Right, but I thought we did want to make this behaviour consistent. > > That is correct, that is what I asked for (either consistent or a > compelling reason why not). I received a ping from colleagues investigating a problem possible caused by excessive pressure on RCU and this change could make the problem worse. But last I heard from them, sounds like the problem they are investigating lies elsewhere. That said, I don't have problems either way, limiting the change to RT or keeping it general. I will follow whatever guidance I get from you all. Luis > Oleg pointed out that the patch description does not match the change. > That is the only complaint. > > > And IIRC RT has been running with this for ages, and never seen a > > problem. > > The syz-bot picked up RT recently. Other than that, yes, RT had it for > ages. > > Sebastian > ---end quoted text---
On 2025-09-15 11:49:37 [-0300], Luis Claudio R. Goncalves wrote: > On Mon, Sep 15, 2025 at 02:24:44PM +0200, Sebastian Andrzej Siewior wrote: > > On 2025-09-15 13:38:12 [+0200], Peter Zijlstra wrote: > > > Right, but I thought we did want to make this behaviour consistent. > > > > That is correct, that is what I asked for (either consistent or a > > compelling reason why not). > > I received a ping from colleagues investigating a problem possible caused > by excessive pressure on RCU and this change could make the problem worse. > But last I heard from them, sounds like the problem they are investigating > lies elsewhere. Either way, if this change introduces a problem then it affects RT as today. So we should deal with this. If it is not a problem and it just put pressure on an existing problem then it shouldn't be an issue. > Luis Sebastian
© 2016 - 2025 Red Hat, Inc.