[v3] perf/core: Fix pending work re-queued in __perf_event_overflow

[PATCH v3] perf/core: Fix pending work re-queued in __perf_event_overflow

Posted by Liangyan 2 months, 3 weeks ago

We got the warning below during a perf test:
[  467.100914] [      T1] WARNING: CPU: 0 PID: 1 at kernel/events/core.c:5147 put_pmu_ctx+0x2ef/0x3c0
[  467.107702] [      T1] CPU: 0 UID: 0 PID: 1 Comm: systemd Kdump: loaded Tainted: G            E       6.18.0-rc4-dirty #114 PREEMPT(voluntary)
[  467.109835] [      T1] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1
[  467.111027] [      T1] RIP: 0010:put_pmu_ctx+0x2ef/0x3c0
[  467.122081] [      T1] Call Trace:
[  467.122463] [      T1]  <TASK>
[  467.124822] [      T1]  __free_event+0x337/0xa50
[  467.125306] [      T1]  perf_pending_task+0x10f/0x3b0
[  467.125824] [      T1]  task_work_run+0x140/0x210
[  467.127413] [      T1]  exit_to_user_mode_loop+0x10e/0x130
[  467.127965] [      T1]  do_syscall_64+0x26d/0x2e0
[  467.128453] [      T1]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  467.129025] [      T1] RIP: 0033:0x7f01d22349ca
[  467.135157] [      T1]  </TASK>

A race condition occurs between task context and IRQ context when
handling sigtrap tracepoint event overflows:

1. In task context, the event overflows and its pending work is
   queued to task->task_works.
2. Before event->pending_work is set, the same event overflows in IRQ
   context.
3. Both contexts queue the same perf pending work to task->task_works.

This double queuing causes:
- task_work_run() to enter an infinite loop calling perf_pending_task()
- potential warnings and a use-after-free when the event is freed in
  perf_pending_task()

Fix the race by disabling interrupts during queuing of perf pending work.

Fixes: c5d93d23a260 ("perf: Enqueue SIGTRAP always via task_work.")
Reported-by: Xianjun Zeng <zengxianjun@bytedance.com>
Signed-off-by: Liangyan <liangyan.peng@bytedance.com>
---
v3: Refine commit log as suggested by Sebastian.
---
 kernel/events/core.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index cae921f4d137..7c63e5fdd334 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10433,6 +10433,16 @@ static int __perf_event_overflow(struct perf_event *event,
 
 		notify_mode = in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME;
 
+		/*
+		 * Task context queues the work via task_work_add() but has not yet
+		 * set event->pending_work when the same event overflows in
+		 * IRQ context. The IRQ path, seeing !event->pending_work,
+		 * queues the work again.
+		 * The double queuing causes corruption in task->task_works.
+		 * Prevent this by disabling interrupts around the critical section.
+		 */
+		guard(irqsave)();
+
 		if (!event->pending_work &&
 		    !task_work_add(current, &event->pending_task, notify_mode)) {
 			event->pending_work = pending_id;
-- 
2.39.3 (Apple Git-145)
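
The infinite loop described in the commit message can be reproduced in a
small userspace model. This is only a sketch: struct work_node, push(),
walk() and push_n_times_and_walk() are hypothetical names, not kernel
code. It assumes the list behaves like a singly linked LIFO where a push
links the new node to the old head, which is the property that matters
here. Pushing the same node twice makes it point at itself, so a walker
never reaches NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the queued work item; not the kernel struct. */
struct work_node {
	struct work_node *next;
};

/* LIFO push: the new node is linked in front of the current head. */
static void push(struct work_node **head, struct work_node *node)
{
	assert(head && node);
	node->next = *head;
	*head = node;
}

/*
 * Walk the list and count nodes. Returns -1 once more than 'limit'
 * nodes have been visited, which here indicates a cycle.
 */
static int walk(const struct work_node *head, int limit)
{
	int visited = 0;

	for (const struct work_node *w = head; w; w = w->next) {
		if (++visited > limit)
			return -1;
	}
	return visited;
}

/* Push the same node 'pushes' times, then walk with a bound of 16. */
static int push_n_times_and_walk(int pushes)
{
	struct work_node *head = NULL;
	struct work_node work = { NULL };

	for (int i = 0; i < pushes; i++)
		push(&head, &work);
	return walk(head, 16);
}
```

With one push the walk ends after a single node. With two pushes of the
same node, the second push sets work.next to &work, so the bounded walk
gives up. An unbounded walker would spin forever.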

Re: [PATCH v3] perf/core: Fix pending work re-queued in __perf_event_overflow

Posted by Sebastian Andrzej Siewior 2 months, 3 weeks ago

On 2025-11-14 11:33:49 [+0800], Liangyan wrote:
> We got the warning below during a perf test:
> [  467.100914] [      T1] WARNING: CPU: 0 PID: 1 at kernel/events/core.c:5147 put_pmu_ctx+0x2ef/0x3c0
> [  467.107702] [      T1] CPU: 0 UID: 0 PID: 1 Comm: systemd Kdump: loaded Tainted: G            E       6.18.0-rc4-dirty #114 PREEMPT(voluntary)
> [  467.109835] [      T1] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1
> [  467.111027] [      T1] RIP: 0010:put_pmu_ctx+0x2ef/0x3c0
> [  467.122081] [      T1] Call Trace:
> [  467.122463] [      T1]  <TASK>
> [  467.124822] [      T1]  __free_event+0x337/0xa50
> [  467.125306] [      T1]  perf_pending_task+0x10f/0x3b0
> [  467.125824] [      T1]  task_work_run+0x140/0x210
> [  467.127413] [      T1]  exit_to_user_mode_loop+0x10e/0x130
> [  467.127965] [      T1]  do_syscall_64+0x26d/0x2e0
> [  467.128453] [      T1]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  467.129025] [      T1] RIP: 0033:0x7f01d22349ca
> [  467.135157] [      T1]  </TASK>
> 
> A race condition occurs between task context and IRQ context when
> handling sigtrap tracepoint event overflows:
> 
> 1. In task context, the event overflows and its pending work is
>    queued to task->task_works.
> 2. Before event->pending_work is set, the same event overflows in IRQ
>    context.
> 3. Both contexts queue the same perf pending work to task->task_works.
> 
> This double queuing causes:
> - task_work_run() to enter an infinite loop calling perf_pending_task()
> - potential warnings and a use-after-free when the event is freed in
>   perf_pending_task()
> 
> Fix the race by disabling interrupts during queuing of perf pending work.
> 
> Fixes: c5d93d23a260 ("perf: Enqueue SIGTRAP always via task_work.")
> Reported-by: Xianjun Zeng <zengxianjun@bytedance.com>
> Signed-off-by: Liangyan <liangyan.peng@bytedance.com>
> ---
> v3: Refine commit log as suggested by Sebastian.

I assumed you would get rid of the warning backtrace, as it adds no value,
but instead you added the whole thing, including timestamps and so on.

> ---
>  kernel/events/core.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index cae921f4d137..7c63e5fdd334 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -10433,6 +10433,16 @@ static int __perf_event_overflow(struct perf_event *event,
>  
>  		notify_mode = in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME;
>  
> +		/*
> +		 * Task context queues the work via task_work_add() but has not yet
> +		 * set event->pending_work when the same event overflows in
> +		 * IRQ context. The IRQ path, seeing !event->pending_work,
> +		 * queues the work again.
> +		 * The double queuing causes corruption in task->task_works.

    The same event can be enqueued in TASK and IRQ context because
    assigning perf_event::pending_work is not atomic with regard to the
    enqueue. task_work_add() does not prevent double enqueue.

The above should be enough, if it is not already self-explanatory :)
However, I did think that we had per-context events here. But it seems
those are not used in this case.

> +		 * Prevent this by disabling interrupts around the critical section.
> +		 */
> +		guard(irqsave)();
> +
>  		if (!event->pending_work &&
>  		    !task_work_add(current, &event->pending_task, notify_mode)) {
>  			event->pending_work = pending_id;

Sebastian
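
The check-then-set window discussed in this thread can be illustrated
with a userspace analogy. This is a sketch under stated assumptions, not
the kernel mechanism: a POSIX signal stands in for the IRQ, blocking
that signal with sigprocmask() stands in for guard(irqsave)(), and
queue_work(), on_signal() and run_scenario() are hypothetical names. The
"interrupt" is injected with raise() exactly between the enqueue and the
assignment of the pending flag.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

struct work_node {
	struct work_node *next;
};

static struct work_node *head;                /* models task->task_works */
static struct work_node work;                 /* models event->pending_task */
static volatile sig_atomic_t pending;         /* models event->pending_work */
static volatile sig_atomic_t enqueue_count;   /* how often work was pushed */

/* Check, enqueue, then mark as pending: the non-atomic sequence. */
static void queue_work(int raise_in_window)
{
	if (!pending) {
		work.next = head;
		head = &work;
		enqueue_count++;
		if (raise_in_window)
			raise(SIGUSR1);   /* "IRQ" lands before pending is set */
		pending = 1;
	}
}

/* The "IRQ context" runs the same queuing path. */
static void on_signal(int sig)
{
	(void)sig;
	queue_work(0);
}

/* Returns how many times the work item was enqueued. */
static int run_scenario(int protect)
{
	sigset_t set;
	int rc;

	head = NULL;
	pending = 0;
	enqueue_count = 0;
	signal(SIGUSR1, on_signal);
	sigemptyset(&set);
	sigaddset(&set, SIGUSR1);

	/* Start from a known state, then optionally close the window. */
	rc = sigprocmask(protect ? SIG_BLOCK : SIG_UNBLOCK, &set, NULL);
	assert(rc == 0);
	queue_work(1);
	/* Unblocking delivers a deferred signal; pending is already set. */
	rc = sigprocmask(SIG_UNBLOCK, &set, NULL);
	assert(rc == 0);
	return enqueue_count;
}
```

In this model run_scenario(0) enqueues the item twice, because the
handler runs inside the window and still sees the flag clear.
run_scenario(1) enqueues it once: the signal is deferred until after the
flag is set, so the handler finds it set and does nothing.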