There is a deadlock scenario in kprobe_optimizer():

pid A                            pid B                     pid C
kprobe_optimizer()               do_exit()                 perf_kprobe_init()
mutex_lock(&kprobe_mutex)        exit_tasks_rcu_start()    mutex_lock(&kprobe_mutex)
synchronize_rcu_tasks()          zap_pid_ns_processes()    // waiting kprobe_mutex
// waiting tasks_rcu_exit_srcu   kernel_wait4()
                                 // waiting pid C exit
To break this deadlock cycle, use synchronize_rcu_tasks_rude() in
kprobe_optimizer() rather than synchronize_rcu_tasks().
synchronize_rcu_tasks_rude() likewise guarantees that all preempted tasks
have been scheduled, but it does not wait on tasks_rcu_exit_srcu, so it
cannot be blocked by a task that is exiting.
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
---
arch/Kconfig | 2 +-
kernel/kprobes.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index f4b210ab0612..dc6a18854017 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -104,7 +104,7 @@ config STATIC_CALL_SELFTEST
config OPTPROBES
def_bool y
depends on KPROBES && HAVE_OPTPROBES
- select TASKS_RCU if PREEMPTION
+ select TASKS_RUDE_RCU
config KPROBES_ON_FTRACE
def_bool y
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index d5a0ee40bf66..09056ae50c58 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -623,7 +623,7 @@ static void kprobe_optimizer(struct work_struct *work)
* Note that on non-preemptive kernel, this is transparently converted
* to synchronoze_sched() to wait for all interrupts to have completed.
*/
- synchronize_rcu_tasks();
+ synchronize_rcu_tasks_rude();
/* Step 3: Optimize kprobes after quiesence period */
do_optimize_kprobes();
--
2.25.1
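
To make the cycle explicit, the three paths in the scenario above reduce to
roughly the following sketch. The *_sketch wrapper names are invented for
illustration only; the calls inside them are the real kernel entry points
named in the commit message. A holds kprobe_mutex and waits for B's
tasks_rcu_exit_srcu read-side section to end, B waits for C to exit, and C
waits for A's kprobe_mutex.

#include <linux/mutex.h>
#include <linux/pid_namespace.h>
#include <linux/rcupdate.h>

static DEFINE_MUTEX(kprobe_mutex);

/* pid A: the optimizer worker. */
static void kprobe_optimizer_sketch(void)
{
	mutex_lock(&kprobe_mutex);	/* A now holds kprobe_mutex. */
	synchronize_rcu_tasks();	/* Blocks: B is still inside its
					 * tasks_rcu_exit_srcu read-side
					 * section. */
	mutex_unlock(&kprobe_mutex);
}

/* pid B: the exiting init task of a pid namespace. */
static void do_exit_sketch(void)
{
	exit_tasks_rcu_start();		/* Enters tasks_rcu_exit_srcu. */
	zap_pid_ns_processes();		/* kernel_wait4() blocks: C has not
					 * exited yet. */
	exit_tasks_rcu_finish();	/* Never reached while C is stuck. */
}

/* pid C: a task creating a perf kprobe event. */
static void perf_kprobe_init_sketch(void)
{
	mutex_lock(&kprobe_mutex);	/* Blocks: A holds kprobe_mutex. */
	/* ... register the probe ... */
	mutex_unlock(&kprobe_mutex);
}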
On Wed, 17 Jan 2024 06:16:36 +0000 Chen Zhongjin <chenzhongjin@huawei.com> wrote:

> There is a deadlock scenario in kprobe_optimizer():
>
> pid A                            pid B                     pid C
> kprobe_optimizer()               do_exit()                 perf_kprobe_init()
> mutex_lock(&kprobe_mutex)        exit_tasks_rcu_start()    mutex_lock(&kprobe_mutex)
> synchronize_rcu_tasks()          zap_pid_ns_processes()    // waiting kprobe_mutex
> // waiting tasks_rcu_exit_srcu   kernel_wait4()
>                                  // waiting pid C exit
>
> To break this deadlock cycle, use synchronize_rcu_tasks_rude() in
> kprobe_optimizer() rather than synchronize_rcu_tasks().
> synchronize_rcu_tasks_rude() likewise guarantees that all preempted tasks
> have been scheduled, but it does not wait on tasks_rcu_exit_srcu, so it
> cannot be blocked by a task that is exiting.
>
> Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>

Thanks. Should we backport this fix into earlier kernels? If so, are we
able to identify a suitable Fixes: target?
On Wed, Jan 17, 2024 at 12:31:33PM -0800, Andrew Morton wrote:
> On Wed, 17 Jan 2024 06:16:36 +0000 Chen Zhongjin <chenzhongjin@huawei.com> wrote:
>
> > There is a deadlock scenario in kprobe_optimizer():
> >
> > pid A                            pid B                     pid C
> > kprobe_optimizer()               do_exit()                 perf_kprobe_init()
> > mutex_lock(&kprobe_mutex)        exit_tasks_rcu_start()    mutex_lock(&kprobe_mutex)
> > synchronize_rcu_tasks()          zap_pid_ns_processes()    // waiting kprobe_mutex
> > // waiting tasks_rcu_exit_srcu   kernel_wait4()
> >                                  // waiting pid C exit
> >
> > To break this deadlock cycle, use synchronize_rcu_tasks_rude() in
> > kprobe_optimizer() rather than synchronize_rcu_tasks().
> > synchronize_rcu_tasks_rude() likewise guarantees that all preempted tasks
> > have been scheduled, but it does not wait on tasks_rcu_exit_srcu, so it
> > cannot be blocked by a task that is exiting.

Hello, Chen Zhongjin,

Relying on the non-preemptibility of the last portion of the do_exit()
function does appear to be the basis for a much better solution than what
we currently have. So at the very least, thank you for the idea!!! I feel
a bit silly for not having thought of it myself. ;-)

However, just invoking synchronize_rcu_tasks_rude() will be bad for both
energy efficiency and real-time response, because synchronize_rcu_tasks_rude()
sends an IPI to each and every online CPU, almost none of which will be in
the non-preemptible tail of do_exit() at any given time. These unnecessary
IPIs will drain battery when they hit idle CPUs and degrade real-time
response when they hit CPUs running aggressive real-time applications.
Which might not make people happy.

So, would you be willing to use RCU's do_exit() hooks and RCU's hook into
the scheduler (either rcu_note_context_switch() or rcu_tasks_qs(), whichever
would work better) to maintain a per-CPU variable that is a pointer to the
task in the non-preemptible tail of do_exit() if there is one, or NULL
otherwise? This would get us the deadlock-avoidance simplicity of your
underlying idea, while avoiding (almost all of) the energy-efficiency and
real-time-response issues with your patch.

This does require a bit of memory-ordering care, so if you would prefer
that I do the implementation, just let me know. (The memory-ordering trick
is to use "smp_mb(); smp_load_acquire();" to sample the pointer and
"smp_store_release(); smp_mb();" to update it.)

							Thanx, Paul

> > Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
>
> Thanks. Should we backport this fix into earlier kernels? If so, are
> we able to identify a suitable Fixes: target?
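
For concreteness, the per-CPU tracking Paul describes might look roughly
like the sketch below. Every name here (rcu_exit_task, rcu_task_exit_start(),
rcu_task_exit_end(), rcu_task_exit_sample()) is invented for illustration and
is not an existing kernel API; only the barrier and per-CPU primitives are
real. The update side follows his "smp_store_release(); smp_mb();" recipe and
the sample side his "smp_mb(); smp_load_acquire();" recipe.

#include <linux/percpu.h>
#include <linux/sched.h>

/* Task in the non-preemptible tail of do_exit() on this CPU, or NULL. */
static DEFINE_PER_CPU(struct task_struct *, rcu_exit_task);

/* Update side: would be called from RCU's do_exit() hook at tail entry. */
static void rcu_task_exit_start(void)
{
	/* Publish the exiting task; pairs with the acquire load below. */
	smp_store_release(this_cpu_ptr(&rcu_exit_task), current);
	smp_mb();	/* Order the publish before later exit-path accesses. */
}

/* Update side: would be called once the task no longer needs tracking. */
static void rcu_task_exit_end(void)
{
	smp_store_release(this_cpu_ptr(&rcu_exit_task), NULL);
	smp_mb();
}

/* Sample side: would be called by the Tasks RCU grace-period machinery. */
static struct task_struct *rcu_task_exit_sample(int cpu)
{
	smp_mb();	/* Order prior grace-period bookkeeping before the load. */
	return smp_load_acquire(per_cpu_ptr(&rcu_exit_task, cpu));
}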