There is a deadlock scenario in kprobe_optimizer():

pid A                            pid B                     pid C
kprobe_optimizer()               do_exit()                 perf_kprobe_init()
mutex_lock(&kprobe_mutex)        exit_tasks_rcu_start()    mutex_lock(&kprobe_mutex)
synchronize_rcu_tasks()          zap_pid_ns_processes()    // waiting kprobe_mutex
// waiting tasks_rcu_exit_srcu   kernel_wait4()
                                 // waiting pid C exit
To break this deadlock cycle, use synchronize_rcu_tasks_rude() in
kprobe_optimizer() rather than synchronize_rcu_tasks().
synchronize_rcu_tasks_rude() likewise guarantees that all preempted tasks
have been scheduled, but it does not wait on tasks_rcu_exit_srcu, so it
cannot be blocked by a task that is exiting.
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
---
arch/Kconfig | 2 +-
kernel/kprobes.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index f4b210ab0612..dc6a18854017 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -104,7 +104,7 @@ config STATIC_CALL_SELFTEST
config OPTPROBES
def_bool y
depends on KPROBES && HAVE_OPTPROBES
- select TASKS_RCU if PREEMPTION
+ select TASKS_RUDE_RCU
config KPROBES_ON_FTRACE
def_bool y
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index d5a0ee40bf66..09056ae50c58 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -623,7 +623,7 @@ static void kprobe_optimizer(struct work_struct *work)
* Note that on non-preemptive kernel, this is transparently converted
* to synchronoze_sched() to wait for all interrupts to have completed.
*/
- synchronize_rcu_tasks();
+ synchronize_rcu_tasks_rude();
/* Step 3: Optimize kprobes after quiesence period */
do_optimize_kprobes();
--
2.25.1
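
To make the cycle explicit, the three paths in the scenario above reduce to
roughly the following sketch. The *_sketch wrapper names are invented for
illustration only; the calls inside them are the real kernel entry points
named in the commit message. A holds kprobe_mutex and waits for B's
tasks_rcu_exit_srcu read-side section to end, B waits for C to exit, and C
waits for A's kprobe_mutex.

#include <linux/mutex.h>
#include <linux/pid_namespace.h>
#include <linux/rcupdate.h>

static DEFINE_MUTEX(kprobe_mutex);

/* pid A: the optimizer worker. */
static void kprobe_optimizer_sketch(void)
{
	mutex_lock(&kprobe_mutex);	/* A now holds kprobe_mutex. */
	synchronize_rcu_tasks();	/* Blocks: B is still inside its
					 * tasks_rcu_exit_srcu read-side
					 * section. */
	mutex_unlock(&kprobe_mutex);
}

/* pid B: the exiting init task of a pid namespace. */
static void do_exit_sketch(void)
{
	exit_tasks_rcu_start();		/* Enters tasks_rcu_exit_srcu. */
	zap_pid_ns_processes();		/* kernel_wait4() blocks: C has not
					 * exited yet. */
	exit_tasks_rcu_finish();	/* Never reached while C is stuck. */
}

/* pid C: a task creating a perf kprobe event. */
static void perf_kprobe_init_sketch(void)
{
	mutex_lock(&kprobe_mutex);	/* Blocks: A holds kprobe_mutex. */
	/* ... register the probe ... */
	mutex_unlock(&kprobe_mutex);
}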
On Wed, 17 Jan 2024 06:16:36 +0000 Chen Zhongjin <chenzhongjin@huawei.com> wrote:

> There is a deadlock scenario in kprobe_optimizer():
>
> pid A                            pid B                     pid C
> kprobe_optimizer()               do_exit()                 perf_kprobe_init()
> mutex_lock(&kprobe_mutex)        exit_tasks_rcu_start()    mutex_lock(&kprobe_mutex)
> synchronize_rcu_tasks()          zap_pid_ns_processes()    // waiting kprobe_mutex
> // waiting tasks_rcu_exit_srcu   kernel_wait4()
>                                  // waiting pid C exit
>
> To break this deadlock cycle, use synchronize_rcu_tasks_rude() in
> kprobe_optimizer() rather than synchronize_rcu_tasks().
> synchronize_rcu_tasks_rude() likewise guarantees that all preempted tasks
> have been scheduled, but it does not wait on tasks_rcu_exit_srcu, so it
> cannot be blocked by a task that is exiting.
>
> Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>

Thanks. Should we backport this fix into earlier kernels? If so, are we
able to identify a suitable Fixes: target?
On Wed, Jan 17, 2024 at 12:31:33PM -0800, Andrew Morton wrote:
> On Wed, 17 Jan 2024 06:16:36 +0000 Chen Zhongjin <chenzhongjin@huawei.com> wrote:
>
> > There is a deadlock scenario in kprobe_optimizer():
> >
> > pid A                            pid B                     pid C
> > kprobe_optimizer()               do_exit()                 perf_kprobe_init()
> > mutex_lock(&kprobe_mutex)        exit_tasks_rcu_start()    mutex_lock(&kprobe_mutex)
> > synchronize_rcu_tasks()          zap_pid_ns_processes()    // waiting kprobe_mutex
> > // waiting tasks_rcu_exit_srcu   kernel_wait4()
> >                                  // waiting pid C exit
> >
> > To break this deadlock cycle, use synchronize_rcu_tasks_rude() in
> > kprobe_optimizer() rather than synchronize_rcu_tasks().
> > synchronize_rcu_tasks_rude() likewise guarantees that all preempted tasks
> > have been scheduled, but it does not wait on tasks_rcu_exit_srcu, so it
> > cannot be blocked by a task that is exiting.

Hello, Chen Zhongjin,

Relying on the non-preemptibility of the last portion of the do_exit()
function does appear to be the basis for a much better solution than what
we currently have. So at the very least, thank you for the idea!!! I feel
a bit silly for not having thought of it myself. ;-)

However, just invoking synchronize_rcu_tasks_rude() will be bad for both
energy efficiency and real-time response, because synchronize_rcu_tasks_rude()
sends an IPI to each and every online CPU, almost none of which will be in
the non-preemptible tail of do_exit() at any given time. These unnecessary
IPIs will drain battery when they hit idle CPUs and degrade real-time
response when they hit CPUs running aggressive real-time applications.
Which might not make people happy.

So, would you be willing to use RCU's do_exit() hooks and RCU's hook into
the scheduler (either rcu_note_context_switch() or rcu_tasks_qs(), whichever
would work better) to maintain a per-CPU variable that is a pointer to the
task in the non-preemptible tail of do_exit() if there is one, or NULL
otherwise? This would get us the deadlock-avoidance simplicity of your
underlying idea, while avoiding (almost all of) the energy-efficiency and
real-time-response issues with your patch.

This does require a bit of memory-ordering care, so if you would prefer
that I do the implementation, just let me know. (The memory-ordering trick
is to use "smp_mb(); smp_load_acquire();" to sample the pointer and
"smp_store_release(); smp_mb();" to update it.)

							Thanx, Paul

> > Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
>
> Thanks. Should we backport this fix into earlier kernels? If so, are
> we able to identify a suitable Fixes: target?
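
For concreteness, the per-CPU tracking Paul describes might look roughly
like the sketch below. Every name here (rcu_exit_task, rcu_task_exit_start(),
rcu_task_exit_end(), rcu_task_exit_sample()) is invented for illustration and
is not an existing kernel API; only the barrier and per-CPU primitives are
real. The update side follows his "smp_store_release(); smp_mb();" recipe and
the sample side his "smp_mb(); smp_load_acquire();" recipe.

#include <linux/percpu.h>
#include <linux/sched.h>

/* Task in the non-preemptible tail of do_exit() on this CPU, or NULL. */
static DEFINE_PER_CPU(struct task_struct *, rcu_exit_task);

/* Update side: would be called from RCU's do_exit() hook at tail entry. */
static void rcu_task_exit_start(void)
{
	/* Publish the exiting task; pairs with the acquire load below. */
	smp_store_release(this_cpu_ptr(&rcu_exit_task), current);
	smp_mb();	/* Order the publish before later exit-path accesses. */
}

/* Update side: would be called once the task no longer needs tracking. */
static void rcu_task_exit_end(void)
{
	smp_store_release(this_cpu_ptr(&rcu_exit_task), NULL);
	smp_mb();
}

/* Sample side: would be called by the Tasks RCU grace-period machinery. */
static struct task_struct *rcu_task_exit_sample(int cpu)
{
	smp_mb();	/* Order prior grace-period bookkeeping before the load. */
	return smp_load_acquire(per_cpu_ptr(&rcu_exit_task, cpu));
}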