From: Yuan Chen <chenyuan@kylinos.cn>
There is a critical race condition in kprobe initialization that can
lead to a NULL pointer dereference and kernel crash.
[1135630.084782] Unable to handle kernel paging request at virtual address 0000710a04630000
...
[1135630.260314] pstate: 404003c9 (nZcv DAIF +PAN -UAO)
[1135630.269239] pc : kprobe_perf_func+0x30/0x260
[1135630.277643] lr : kprobe_dispatcher+0x44/0x60
[1135630.286041] sp : ffffaeff4977fa40
[1135630.293441] x29: ffffaeff4977fa40 x28: ffffaf015340e400
[1135630.302837] x27: 0000000000000000 x26: 0000000000000000
[1135630.312257] x25: ffffaf029ed108a8 x24: ffffaf015340e528
[1135630.321705] x23: ffffaeff4977fc50 x22: ffffaeff4977fc50
[1135630.331154] x21: 0000000000000000 x20: ffffaeff4977fc50
[1135630.340586] x19: ffffaf015340e400 x18: 0000000000000000
[1135630.349985] x17: 0000000000000000 x16: 0000000000000000
[1135630.359285] x15: 0000000000000000 x14: 0000000000000000
[1135630.368445] x13: 0000000000000000 x12: 0000000000000000
[1135630.377473] x11: 0000000000000000 x10: 0000000000000000
[1135630.386411] x9 : 0000000000000000 x8 : 0000000000000000
[1135630.395252] x7 : 0000000000000000 x6 : 0000000000000000
[1135630.403963] x5 : 0000000000000000 x4 : 0000000000000000
[1135630.412545] x3 : 0000710a04630000 x2 : 0000000000000006
[1135630.421021] x1 : ffffaeff4977fc50 x0 : 0000710a04630000
[1135630.429410] Call trace:
[1135630.434828] kprobe_perf_func+0x30/0x260
[1135630.441661] kprobe_dispatcher+0x44/0x60
[1135630.448396] aggr_pre_handler+0x70/0xc8
[1135630.454959] kprobe_breakpoint_handler+0x140/0x1e0
[1135630.462435] brk_handler+0xbc/0xd8
[1135630.468437] do_debug_exception+0x84/0x138
[1135630.475074] el1_dbg+0x18/0x8c
[1135630.480582] security_file_permission+0x0/0xd0
[1135630.487426] vfs_write+0x70/0x1c0
[1135630.493059] ksys_write+0x5c/0xc8
[1135630.498638] __arm64_sys_write+0x24/0x30
[1135630.504821] el0_svc_common+0x78/0x130
[1135630.510838] el0_svc_handler+0x38/0x78
[1135630.516834] el0_svc+0x8/0x1b0
kernel/trace/trace_kprobe.c: 1308
0xffff3df8995039ec <kprobe_perf_func+0x2c>: ldr x21, [x24,#120]
include/linux/compiler.h: 294
0xffff3df8995039f0 <kprobe_perf_func+0x30>: ldr x1, [x21,x0]
kernel/trace/trace_kprobe.c
1308: head = this_cpu_ptr(call->perf_events);
1309: if (hlist_empty(head))
1310: return 0;
crash> struct trace_event_call -o
struct trace_event_call {
...
[120] struct hlist_head *perf_events; //(call->perf_event)
...
}
crash> struct trace_event_call ffffaf015340e528
struct trace_event_call {
...
perf_events = 0xffff0ad5fa89f088, //this value is correct, but x21 = 0
...
}
Race Condition Analysis:
The race occurs between kprobe activation and perf_events initialization:
CPU0                                CPU1
====                                ====
perf_kprobe_init
  perf_trace_event_init
    tp_event->perf_events = list;  (1)
    tp_event->class->reg           (2) ← KPROBE ACTIVE
                                    Debug exception triggers
                                    ...
                                    kprobe_dispatcher
                                      kprobe_perf_func (tk->tp.flags & TP_FLAG_PROFILE)
                                        head = this_cpu_ptr(call->perf_events)  (3)
                                        (perf_events is still NULL)
Problem:
1. CPU0 executes (1) assigning tp_event->perf_events = list
2. CPU0 executes (2) enabling kprobe functionality via class->reg()
3. CPU1 triggers and reaches kprobe_dispatcher
4. CPU1 checks TP_FLAG_PROFILE - condition passes (step 2 completed)
5. CPU1 calls kprobe_perf_func() and crashes at (3) because
call->perf_events is still NULL
The issue: the assignment in step 1 may not be visible to CPU1 due to
missing memory barriers before step 2 sets the TP_FLAG_PROFILE flag.
Add smp_mb() barrier between perf_events assignment and enabling
profile functionality to ensure visibility ordering across CPUs.
v1->v2: Fix race analysis (per Masami) - kprobe arms in class->reg().
Signed-off-by: Yuan Chen <chenyuan@kylinos.cn>
---
kernel/trace/trace_event_perf.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index a6bb7577e8c5..6eff8c9d6bae 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -113,6 +113,11 @@ static int perf_trace_event_reg(struct trace_event_call *tp_event,
tp_event->perf_events = list;
+ /* Ensure perf_events assignment is visible to all CPUs before enabling
+ * profile functionality
+ */
+ smp_mb();
+
if (!total_ref_count) {
char __percpu *buf;
int i;
--
2.39.5
On Mon, 29 Sep 2025 07:57:31 +0100 chenyuan_fl@163.com wrote:

> From: Yuan Chen <chenyuan@kylinos.cn>
>
> There is a critical race condition in kprobe initialization that can
> lead to a NULL pointer dereference and kernel crash.

[ ... quoted crash log, race analysis, and diffstat trimmed ... ]

> --- a/kernel/trace/trace_event_perf.c
> +++ b/kernel/trace/trace_event_perf.c
> @@ -113,6 +113,11 @@ static int perf_trace_event_reg(struct trace_event_call *tp_event,
>
> 	tp_event->perf_events = list;
>
> +	/* Ensure perf_events assignment is visible to all CPUs before enabling
> +	 * profile functionality
> +	 */
> +	smp_mb();

So from other discussions I had with John and Sebastian (both Cc'd),
memory barriers are not for "making memory visible", but instead are
for interactions between memory and two different CPUs, where both
CPUs have memory barriers.

John or Sebastian, thoughts?

-- Steve

> +
> 	if (!total_ref_count) {
> 		char __percpu *buf;
> 		int i;
On 2025-09-29, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 29 Sep 2025 07:57:31 +0100 chenyuan_fl@163.com wrote:
>> From: Yuan Chen <chenyuan@kylinos.cn>
>>
>> There is a critical race condition in kprobe initialization that can
>> lead to a NULL pointer dereference and kernel crash.

[ ... quoted crash log and race diagram trimmed ... ]

I do not know anything about the kprobe and perf internals. This email
should hopefully help to act as a guide of where you need to place the
memory barrier _pair_. If I understand the problem description
correctly, you would need:

>> Problem:
>> 1. CPU0 executes (1) assigning tp_event->perf_events = list

smp_wmb()

>> 2. CPU0 executes (2) enabling kprobe functionality via class->reg()
>> 3. CPU1 triggers and reaches kprobe_dispatcher
>> 4. CPU1 checks TP_FLAG_PROFILE - condition passes (step 2 completed)

smp_rmb()

>> 5. CPU1 calls kprobe_perf_func() and crashes at (3) because
>>    call->perf_events is still NULL
>>
>> The issue: the assignment in step 1 may not be visible to CPU1 due to
>> missing memory barriers before step 2 sets the TP_FLAG_PROFILE flag.

A better explanation of the issue would be: CPU1 sees that kprobe
functionality is enabled but does not see that perf_events has been
assigned.

Add pairing read and write memory barriers to guarantee that if CPU1
sees that kprobe functionality is enabled, it must also see that
perf_events has been assigned.

Note that this could also be done more efficiently using a
store_release when setting the flag (in step 2) and a load_acquire
when loading the flag (in step 4).

John Ogness
On Mon, Sep 29, 2025 at 11:38:08AM +0206, John Ogness wrote:

> >> Problem:
> >> 1. CPU0 executes (1) assigning tp_event->perf_events = list
>
> smp_wmb()
>
> >> 2. CPU0 executes (2) enabling kprobe functionality via class->reg()
> >> 3. CPU1 triggers and reaches kprobe_dispatcher
> >> 4. CPU1 checks TP_FLAG_PROFILE - condition passes (step 2 completed)
>
> smp_rmb()
>
> >> 5. CPU1 calls kprobe_perf_func() and crashes at (3) because
> >>    call->perf_events is still NULL
>
> A better explanation of the issue would be: CPU1 sees that kprobe
> functionality is enabled but does not see that perf_events has been
> assigned.
>
> Add pairing read and write memory barriers to guarantee that if CPU1
> sees that kprobe functionality is enabled, it must also see that
> perf_events has been assigned.
>
> Note that this could also be done more efficiently using a
> store_release when setting the flag (in step 2) and a load_acquire
> when loading the flag (in step 4).

The RELEASE+ACQUIRE is a better pattern for these cases.

And I'll argue the barrier should be in 2 not 1, since it is 2 that
sets the flag checked in 4. Any store before that flag might be
affected, not just the ->perf_events list.
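Mapped onto this report, Peter's suggestion comes down to roughly the
following (a sketch assuming the flag store happens in the enable path
reached via class->reg(); the later revisions of the patch implement
essentially this):

	/* CPU0, enable path: the store-release orders *all* prior setup
	 * stores (tp_event->perf_events included) before the flag store;
	 * pairs with the load-acquire in the dispatcher.
	 */
	smp_store_release(&tp->event->flags, tp->event->flags | TP_FLAG_PROFILE);

	/* CPU1, dispatcher: if the acquire observes the flag set, every
	 * store that preceded the release on CPU0 is guaranteed visible
	 * here, so call->perf_events cannot be seen as NULL.
	 */
	if (smp_load_acquire(&tp->event->flags) & TP_FLAG_PROFILE)
		head = this_cpu_ptr(call->perf_events);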
On Mon, 29 Sep 2025 12:12:59 +0200 Peter Zijlstra <peterz@infradead.org> wrote:

> The RELEASE+ACQUIRE is a better pattern for these cases.
>
> And I'll argue the barrier should be in 2 not 1, since it is 2 that
> sets the flag checked in 4. Any store before that flag might be
> affected, not just the ->perf_events list.

RELEASE+ACQUIRE ensures the memory ordering on the `same` CPU, so do we
still need smp_rmb() and smp_wmb()? e.g.

perf_trace_event_init()
-----
	tp_event->perf_events = list;			/* (1) */
	smp_store_release(&tp->event->flags, newflag);
-----

kprobe_dispatcher()
-----
	smp_load_acquire(&tk->tp.flags);		/* (2) */
	this_cpu_ptr(call->perf_events);
-----

This ensures
- the flags update is shown on other CPUs
- perf_events update is done before the flags update on the same CPU
- perf_events load is done after the flags load on the same CPU

But we are still not sure the perf_events update is shown on other
CPUs? To ensure that we still need smp_wmb() and smp_rmb() at (1) and
(2). Or we don't need smp_*mb()?

Thank you,

-- Masami Hiramatsu (Google) <mhiramat@kernel.org>
On Tue, Sep 30, 2025 at 05:58:26PM +0900, Masami Hiramatsu wrote:

> RELEASE+ACQUIRE ensures the memory ordering on the `same` CPU, so do
> we still need smp_rmb() and smp_wmb()? e.g.

Eh, no, that's wrong. RELEASE and ACQUIRE are SMP barriers.
On Tue, 30 Sep 2025 12:10:52 +0200 Peter Zijlstra <peterz@infradead.org> wrote:

> > RELEASE+ACQUIRE ensures the memory ordering on the `same` CPU, so
> > do we still need smp_rmb() and smp_wmb()? e.g.
>
> Eh, no, that's wrong. RELEASE and ACQUIRE are SMP barriers.

OK, thanks for confirmation!

-- Masami Hiramatsu (Google) <mhiramat@kernel.org>
From: Yuan Chen <chenyuan@kylinos.cn>
There is a critical race condition in kprobe initialization that can
lead to a NULL pointer dereference and kernel crash.
[1135630.084782] Unable to handle kernel paging request at virtual address 0000710a04630000
...
[1135630.260314] pstate: 404003c9 (nZcv DAIF +PAN -UAO)
[1135630.269239] pc : kprobe_perf_func+0x30/0x260
[1135630.277643] lr : kprobe_dispatcher+0x44/0x60
[1135630.286041] sp : ffffaeff4977fa40
[1135630.293441] x29: ffffaeff4977fa40 x28: ffffaf015340e400
[1135630.302837] x27: 0000000000000000 x26: 0000000000000000
[1135630.312257] x25: ffffaf029ed108a8 x24: ffffaf015340e528
[1135630.321705] x23: ffffaeff4977fc50 x22: ffffaeff4977fc50
[1135630.331154] x21: 0000000000000000 x20: ffffaeff4977fc50
[1135630.340586] x19: ffffaf015340e400 x18: 0000000000000000
[1135630.349985] x17: 0000000000000000 x16: 0000000000000000
[1135630.359285] x15: 0000000000000000 x14: 0000000000000000
[1135630.368445] x13: 0000000000000000 x12: 0000000000000000
[1135630.377473] x11: 0000000000000000 x10: 0000000000000000
[1135630.386411] x9 : 0000000000000000 x8 : 0000000000000000
[1135630.395252] x7 : 0000000000000000 x6 : 0000000000000000
[1135630.403963] x5 : 0000000000000000 x4 : 0000000000000000
[1135630.412545] x3 : 0000710a04630000 x2 : 0000000000000006
[1135630.421021] x1 : ffffaeff4977fc50 x0 : 0000710a04630000
[1135630.429410] Call trace:
[1135630.434828] kprobe_perf_func+0x30/0x260
[1135630.441661] kprobe_dispatcher+0x44/0x60
[1135630.448396] aggr_pre_handler+0x70/0xc8
[1135630.454959] kprobe_breakpoint_handler+0x140/0x1e0
[1135630.462435] brk_handler+0xbc/0xd8
[1135630.468437] do_debug_exception+0x84/0x138
[1135630.475074] el1_dbg+0x18/0x8c
[1135630.480582] security_file_permission+0x0/0xd0
[1135630.487426] vfs_write+0x70/0x1c0
[1135630.493059] ksys_write+0x5c/0xc8
[1135630.498638] __arm64_sys_write+0x24/0x30
[1135630.504821] el0_svc_common+0x78/0x130
[1135630.510838] el0_svc_handler+0x38/0x78
[1135630.516834] el0_svc+0x8/0x1b0
kernel/trace/trace_kprobe.c: 1308
0xffff3df8995039ec <kprobe_perf_func+0x2c>: ldr x21, [x24,#120]
include/linux/compiler.h: 294
0xffff3df8995039f0 <kprobe_perf_func+0x30>: ldr x1, [x21,x0]
kernel/trace/trace_kprobe.c
1308: head = this_cpu_ptr(call->perf_events);
1309: if (hlist_empty(head))
1310: return 0;
crash> struct trace_event_call -o
struct trace_event_call {
...
[120] struct hlist_head *perf_events; //(call->perf_event)
...
}
crash> struct trace_event_call ffffaf015340e528
struct trace_event_call {
...
perf_events = 0xffff0ad5fa89f088, //this value is correct, but x21 = 0
...
}
Race Condition Analysis:
The race occurs between kprobe activation and perf_events initialization:
CPU0                                CPU1
====                                ====
perf_kprobe_init
  perf_trace_event_init
    tp_event->perf_events = list;  (1)
    tp_event->class->reg           (2) ← KPROBE ACTIVE
                                    Debug exception triggers
                                    ...
                                    kprobe_dispatcher
                                      kprobe_perf_func (tk->tp.flags & TP_FLAG_PROFILE)
                                        head = this_cpu_ptr(call->perf_events)  (3)
                                        (perf_events is still NULL)
Problem:
1. CPU0 executes (1) assigning tp_event->perf_events = list
2. CPU0 executes (2) enabling kprobe functionality via class->reg()
3. CPU1 triggers and reaches kprobe_dispatcher
4. CPU1 checks TP_FLAG_PROFILE - condition passes (step 2 completed)
5. CPU1 calls kprobe_perf_func() and crashes at (3) because
call->perf_events is still NULL
CPU1 sees that kprobe functionality is enabled but does not see that
perf_events has been assigned.
Add pairing read and write memory barriers to guarantee that if CPU1
sees that kprobe functionality is enabled, it must also see that
perf_events has been assigned.
v1->v2: Fix race analysis (per Masami) - kprobe arms in class->reg().
v2->v3: Adopt RELEASE/ACQUIRE semantics per Peter/John's suggestions,
aligning with Steven's clarification on barrier purposes.
Signed-off-by: Yuan Chen <chenyuan@kylinos.cn>
---
kernel/trace/trace_probe.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 842383fbc03b..98b838591edc 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -274,19 +274,19 @@ struct event_file_link {
static inline bool trace_probe_test_flag(struct trace_probe *tp,
unsigned int flag)
{
- return !!(tp->event->flags & flag);
+ return !!(smp_load_acquire(&tp->event->flags) & flag);
}
static inline void trace_probe_set_flag(struct trace_probe *tp,
unsigned int flag)
{
- tp->event->flags |= flag;
+ smp_store_release(&tp->event->flags, tp->event->flags | flag);
}
static inline void trace_probe_clear_flag(struct trace_probe *tp,
unsigned int flag)
{
- tp->event->flags &= ~flag;
+ smp_store_release(&tp->event->flags, tp->event->flags & ~flag);
}
static inline bool trace_probe_is_enabled(struct trace_probe *tp)
--
2.39.5
On 2025-09-30, chenyuan_fl@163.com wrote:
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 842383fbc03b..98b838591edc 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -274,19 +274,19 @@ struct event_file_link {
>  static inline bool trace_probe_test_flag(struct trace_probe *tp,
>  					 unsigned int flag)
>  {
> -	return !!(tp->event->flags & flag);
> +	return !!(smp_load_acquire(&tp->event->flags) & flag);
>  }
>
>  static inline void trace_probe_set_flag(struct trace_probe *tp,
>  					unsigned int flag)
>  {
> -	tp->event->flags |= flag;
> +	smp_store_release(&tp->event->flags, tp->event->flags | flag);
>  }

[ ... rest of quoted diff trimmed ... ]

I don't have any feedback about the correctness with regards to tracing
and kprobes. However, I recommend writing a comment at each necessary
memory barrier site. The comment should mention the pairing memory
barrier and the ordering it is guaranteeing.

John
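For illustration, the helpers quoted above could carry comments along
these lines (a sketch of the style John is requesting; the comment
wording is hypothetical):

	static inline bool trace_probe_test_flag(struct trace_probe *tp,
						 unsigned int flag)
	{
		/*
		 * Pairs with the smp_store_release() in
		 * trace_probe_set_flag(): if the flag is observed set,
		 * all stores made before it was set (e.g.
		 * tp_event->perf_events) are visible to this CPU.
		 */
		return !!(smp_load_acquire(&tp->event->flags) & flag);
	}

	static inline void trace_probe_set_flag(struct trace_probe *tp,
						unsigned int flag)
	{
		/*
		 * Pairs with the smp_load_acquire() in
		 * trace_probe_test_flag(): order all prior event-setup
		 * stores before publishing the flag.
		 */
		smp_store_release(&tp->event->flags, tp->event->flags | flag);
	}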
On Tue, Sep 30, 2025 at 09:18:48AM +0100, chenyuan_fl@163.com wrote: > diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h > index 842383fbc03b..98b838591edc 100644 > --- a/kernel/trace/trace_probe.h > +++ b/kernel/trace/trace_probe.h > @@ -274,19 +274,19 @@ struct event_file_link { > static inline bool trace_probe_test_flag(struct trace_probe *tp, > unsigned int flag) > { > - return !!(tp->event->flags & flag); > + return !!(smp_load_acquire(&tp->event->flags) & flag); > } > > static inline void trace_probe_set_flag(struct trace_probe *tp, > unsigned int flag) > { > - tp->event->flags |= flag; > + smp_store_release(&tp->event->flags, tp->event->flags | flag); > } > > static inline void trace_probe_clear_flag(struct trace_probe *tp, > unsigned int flag) > { > - tp->event->flags &= ~flag; > + smp_store_release(&tp->event->flags, tp->event->flags & ~flag); > } I _think_ the clear one is superfluous. Is there anything that cares about stores done before the clear when the flag is found not set? Also, code like: static int fentry_dispatcher(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct ftrace_regs *fregs, void *entry_data) { struct trace_fprobe *tf = container_of(fp, struct trace_fprobe, fp); int ret = 0; if (trace_probe_test_flag(&tf->tp, TP_FLAG_TRACE)) fentry_trace_func(tf, entry_ip, fregs); #ifdef CONFIG_PERF_EVENTS if (trace_probe_test_flag(&tf->tp, TP_FLAG_PROFILE)) ret = fentry_perf_func(tf, entry_ip, fregs); #endif return ret; } Will now have two barriers; where one would suffice, eg. flags = smp_load_acquire(&tp->event->flags); if (flags & TP_FLAG_TRACE) fentry_trace_func(...); if (flags & TP_FLAG_PROFILE) fentry_perf_func(...); Should be just fine afaict. Is this something anybody cares about?
On Tue, 30 Sep 2025 10:46:45 +0200 Peter Zijlstra <peterz@infradead.org> wrote:

> Also, code like:
>
> [ ... fentry_dispatcher() with two trace_probe_test_flag() calls ... ]
>
> will now have two barriers, where one would suffice, e.g.:
>
> 	flags = smp_load_acquire(&tp->event->flags);
>
> 	if (flags & TP_FLAG_TRACE)
> 		fentry_trace_func(...);
>
> 	if (flags & TP_FLAG_PROFILE)
> 		fentry_perf_func(...);
>
> Should be just fine afaict.

Looks good to me. We should replace trace_probe_test_flag() with
trace_probe_load_flag().

Thanks,

-- Masami Hiramatsu (Google) <mhiramat@kernel.org>
From: Yuan Chen <chenyuan@kylinos.cn>
There is a critical race condition in kprobe initialization that can
lead to a NULL pointer dereference and kernel crash.
[1135630.084782] Unable to handle kernel paging request at virtual address 0000710a04630000
...
[1135630.260314] pstate: 404003c9 (nZcv DAIF +PAN -UAO)
[1135630.269239] pc : kprobe_perf_func+0x30/0x260
[1135630.277643] lr : kprobe_dispatcher+0x44/0x60
[1135630.286041] sp : ffffaeff4977fa40
[1135630.293441] x29: ffffaeff4977fa40 x28: ffffaf015340e400
[1135630.302837] x27: 0000000000000000 x26: 0000000000000000
[1135630.312257] x25: ffffaf029ed108a8 x24: ffffaf015340e528
[1135630.321705] x23: ffffaeff4977fc50 x22: ffffaeff4977fc50
[1135630.331154] x21: 0000000000000000 x20: ffffaeff4977fc50
[1135630.340586] x19: ffffaf015340e400 x18: 0000000000000000
[1135630.349985] x17: 0000000000000000 x16: 0000000000000000
[1135630.359285] x15: 0000000000000000 x14: 0000000000000000
[1135630.368445] x13: 0000000000000000 x12: 0000000000000000
[1135630.377473] x11: 0000000000000000 x10: 0000000000000000
[1135630.386411] x9 : 0000000000000000 x8 : 0000000000000000
[1135630.395252] x7 : 0000000000000000 x6 : 0000000000000000
[1135630.403963] x5 : 0000000000000000 x4 : 0000000000000000
[1135630.412545] x3 : 0000710a04630000 x2 : 0000000000000006
[1135630.421021] x1 : ffffaeff4977fc50 x0 : 0000710a04630000
[1135630.429410] Call trace:
[1135630.434828] kprobe_perf_func+0x30/0x260
[1135630.441661] kprobe_dispatcher+0x44/0x60
[1135630.448396] aggr_pre_handler+0x70/0xc8
[1135630.454959] kprobe_breakpoint_handler+0x140/0x1e0
[1135630.462435] brk_handler+0xbc/0xd8
[1135630.468437] do_debug_exception+0x84/0x138
[1135630.475074] el1_dbg+0x18/0x8c
[1135630.480582] security_file_permission+0x0/0xd0
[1135630.487426] vfs_write+0x70/0x1c0
[1135630.493059] ksys_write+0x5c/0xc8
[1135630.498638] __arm64_sys_write+0x24/0x30
[1135630.504821] el0_svc_common+0x78/0x130
[1135630.510838] el0_svc_handler+0x38/0x78
[1135630.516834] el0_svc+0x8/0x1b0
kernel/trace/trace_kprobe.c: 1308
0xffff3df8995039ec <kprobe_perf_func+0x2c>: ldr x21, [x24,#120]
include/linux/compiler.h: 294
0xffff3df8995039f0 <kprobe_perf_func+0x30>: ldr x1, [x21,x0]
kernel/trace/trace_kprobe.c
1308: head = this_cpu_ptr(call->perf_events);
1309: if (hlist_empty(head))
1310: return 0;
crash> struct trace_event_call -o
struct trace_event_call {
...
[120] struct hlist_head *perf_events; //(call->perf_event)
...
}
crash> struct trace_event_call ffffaf015340e528
struct trace_event_call {
...
perf_events = 0xffff0ad5fa89f088, //this value is correct, but x21 = 0
...
}
Race Condition Analysis:
The race occurs between kprobe activation and perf_events initialization:
CPU0                                CPU1
====                                ====
perf_kprobe_init
  perf_trace_event_init
    tp_event->perf_events = list;  (1)
    tp_event->class->reg           (2) ← KPROBE ACTIVE
                                    Debug exception triggers
                                    ...
                                    kprobe_dispatcher
                                      kprobe_perf_func (tk->tp.flags & TP_FLAG_PROFILE)
                                        head = this_cpu_ptr(call->perf_events)  (3)
                                        (perf_events is still NULL)
Problem:
1. CPU0 executes (1) assigning tp_event->perf_events = list
2. CPU0 executes (2) enabling kprobe functionality via class->reg()
3. CPU1 triggers and reaches kprobe_dispatcher
4. CPU1 checks TP_FLAG_PROFILE - condition passes (step 2 completed)
5. CPU1 calls kprobe_perf_func() and crashes at (3) because
call->perf_events is still NULL
CPU1 sees that kprobe functionality is enabled but does not see that
perf_events has been assigned.
Add pairing read and write memory barriers to guarantee that if CPU1
sees that kprobe functionality is enabled, it must also see that
perf_events has been assigned.
v1->v2: Fix race analysis (per Masami) - kprobe arms in class->reg().
v2->v3: Adopt RELEASE/ACQUIRE semantics per Peter/John's suggestions,
aligning with Steven's clarification on barrier purposes.
v3->v4: Introduce load_flag() (Masami) and optimize barrier usage in
checks/clear (Peter).
Signed-off-by: Yuan Chen <chenyuan@kylinos.cn>
---
kernel/trace/trace_fprobe.c | 10 ++++++----
kernel/trace/trace_kprobe.c | 11 +++++++----
kernel/trace/trace_probe.h | 9 +++++++--
kernel/trace/trace_uprobe.c | 12 ++++++++----
4 files changed, 28 insertions(+), 14 deletions(-)
diff --git a/kernel/trace/trace_fprobe.c b/kernel/trace/trace_fprobe.c
index b36ade43d4b3..ad9d6347b5fa 100644
--- a/kernel/trace/trace_fprobe.c
+++ b/kernel/trace/trace_fprobe.c
@@ -522,13 +522,14 @@ static int fentry_dispatcher(struct fprobe *fp, unsigned long entry_ip,
void *entry_data)
{
struct trace_fprobe *tf = container_of(fp, struct trace_fprobe, fp);
+ unsigned int flags = trace_probe_load_flag(&tf->tp);
int ret = 0;
- if (trace_probe_test_flag(&tf->tp, TP_FLAG_TRACE))
+ if (flags & TP_FLAG_TRACE)
fentry_trace_func(tf, entry_ip, fregs);
#ifdef CONFIG_PERF_EVENTS
- if (trace_probe_test_flag(&tf->tp, TP_FLAG_PROFILE))
+ if (flags & TP_FLAG_PROFILE)
ret = fentry_perf_func(tf, entry_ip, fregs);
#endif
return ret;
@@ -540,11 +541,12 @@ static void fexit_dispatcher(struct fprobe *fp, unsigned long entry_ip,
void *entry_data)
{
struct trace_fprobe *tf = container_of(fp, struct trace_fprobe, fp);
+ unsigned int flags = trace_probe_load_flag(&tf->tp);
- if (trace_probe_test_flag(&tf->tp, TP_FLAG_TRACE))
+ if (flags & TP_FLAG_TRACE)
fexit_trace_func(tf, entry_ip, ret_ip, fregs, entry_data);
#ifdef CONFIG_PERF_EVENTS
- if (trace_probe_test_flag(&tf->tp, TP_FLAG_PROFILE))
+ if (flags & TP_FLAG_PROFILE)
fexit_perf_func(tf, entry_ip, ret_ip, fregs, entry_data);
#endif
}
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index ccae62d4fb91..b1b793b8f191 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1813,14 +1813,15 @@ static int kprobe_register(struct trace_event_call *event,
static int kprobe_dispatcher(struct kprobe *kp, struct pt_regs *regs)
{
struct trace_kprobe *tk = container_of(kp, struct trace_kprobe, rp.kp);
+ unsigned int flags = trace_probe_load_flag(&tk->tp);
int ret = 0;
raw_cpu_inc(*tk->nhit);
- if (trace_probe_test_flag(&tk->tp, TP_FLAG_TRACE))
+ if (flags & TP_FLAG_TRACE)
kprobe_trace_func(tk, regs);
#ifdef CONFIG_PERF_EVENTS
- if (trace_probe_test_flag(&tk->tp, TP_FLAG_PROFILE))
+ if (flags & TP_FLAG_PROFILE)
ret = kprobe_perf_func(tk, regs);
#endif
return ret;
@@ -1832,6 +1833,7 @@ kretprobe_dispatcher(struct kretprobe_instance *ri, struct pt_regs *regs)
{
struct kretprobe *rp = get_kretprobe(ri);
struct trace_kprobe *tk;
+ unsigned int flags;
/*
* There is a small chance that get_kretprobe(ri) returns NULL when
@@ -1844,10 +1846,11 @@ kretprobe_dispatcher(struct kretprobe_instance *ri, struct pt_regs *regs)
tk = container_of(rp, struct trace_kprobe, rp);
raw_cpu_inc(*tk->nhit);
- if (trace_probe_test_flag(&tk->tp, TP_FLAG_TRACE))
+ flags = trace_probe_load_flag(&tk->tp);
+ if (flags & TP_FLAG_TRACE)
kretprobe_trace_func(tk, ri, regs);
#ifdef CONFIG_PERF_EVENTS
- if (trace_probe_test_flag(&tk->tp, TP_FLAG_PROFILE))
+ if (flags & TP_FLAG_PROFILE)
kretprobe_perf_func(tk, ri, regs);
#endif
return 0; /* We don't tweak kernel, so just return 0 */
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 842383fbc03b..08b5bda24da2 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -271,16 +271,21 @@ struct event_file_link {
struct list_head list;
};
+static inline unsigned int trace_probe_load_flag(struct trace_probe *tp)
+{
+ return smp_load_acquire(&tp->event->flags);
+}
+
static inline bool trace_probe_test_flag(struct trace_probe *tp,
unsigned int flag)
{
- return !!(tp->event->flags & flag);
+ return !!(trace_probe_load_flag(tp) & flag);
}
static inline void trace_probe_set_flag(struct trace_probe *tp,
unsigned int flag)
{
- tp->event->flags |= flag;
+ smp_store_release(&tp->event->flags, tp->event->flags | flag);
}
static inline void trace_probe_clear_flag(struct trace_probe *tp,
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 8b0bcc0d8f41..430d09c49462 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1547,6 +1547,7 @@ static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs,
struct trace_uprobe *tu;
struct uprobe_dispatch_data udd;
struct uprobe_cpu_buffer *ucb = NULL;
+ unsigned int flags;
int ret = 0;
tu = container_of(con, struct trace_uprobe, consumer);
@@ -1561,11 +1562,12 @@ static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs,
if (WARN_ON_ONCE(!uprobe_cpu_buffer))
return 0;
- if (trace_probe_test_flag(&tu->tp, TP_FLAG_TRACE))
+ flags = trace_probe_load_flag(&tu->tp);
+ if (flags & TP_FLAG_TRACE)
ret |= uprobe_trace_func(tu, regs, &ucb);
#ifdef CONFIG_PERF_EVENTS
- if (trace_probe_test_flag(&tu->tp, TP_FLAG_PROFILE))
+ if (flags & TP_FLAG_PROFILE)
ret |= uprobe_perf_func(tu, regs, &ucb);
#endif
uprobe_buffer_put(ucb);
@@ -1579,6 +1581,7 @@ static int uretprobe_dispatcher(struct uprobe_consumer *con,
struct trace_uprobe *tu;
struct uprobe_dispatch_data udd;
struct uprobe_cpu_buffer *ucb = NULL;
+ unsigned int flags;
tu = container_of(con, struct trace_uprobe, consumer);
@@ -1590,11 +1593,12 @@ static int uretprobe_dispatcher(struct uprobe_consumer *con,
if (WARN_ON_ONCE(!uprobe_cpu_buffer))
return 0;
- if (trace_probe_test_flag(&tu->tp, TP_FLAG_TRACE))
+ flags = trace_probe_load_flag(&tu->tp);
+ if (flags & TP_FLAG_TRACE)
uretprobe_trace_func(tu, func, regs, &ucb);
#ifdef CONFIG_PERF_EVENTS
- if (trace_probe_test_flag(&tu->tp, TP_FLAG_PROFILE))
+ if (flags & TP_FLAG_PROFILE)
uretprobe_perf_func(tu, func, regs, &ucb);
#endif
uprobe_buffer_put(ucb);
--
2.39.5
On Wed, Oct 01, 2025 at 03:20:25AM +0100, chenyuan_fl@163.com wrote:

> v1->v2: Fix race analysis (per Masami) - kprobe arms in class->reg().
> v2->v3: Adopt RELEASE/ACQUIRE semantics per Peter/John's suggestions,
>         aligning with Steven's clarification on barrier purposes.
> v3->v4: Introduce load_flag() (Masami) and optimize barrier usage in
>         checks/clear (Peter).

Stuff like this ought to go below the --- such that git-am and similar
tools throw it out.

> @@ -1813,14 +1813,15 @@ static int kprobe_register(struct trace_event_call *event,
>  static int kprobe_dispatcher(struct kprobe *kp, struct pt_regs *regs)
>  {
>  	struct trace_kprobe *tk = container_of(kp, struct trace_kprobe, rp.kp);
> +	unsigned int flags = trace_probe_load_flag(&tk->tp);
>  	int ret = 0;
>
>  	raw_cpu_inc(*tk->nhit);

[ ... ]

> @@ -1844,10 +1846,11 @@ kretprobe_dispatcher(struct kretprobe_instance *ri, struct pt_regs *regs)
>  	tk = container_of(rp, struct trace_kprobe, rp);
>  	raw_cpu_inc(*tk->nhit);
>
> -	if (trace_probe_test_flag(&tk->tp, TP_FLAG_TRACE))
> +	flags = trace_probe_load_flag(&tk->tp);
> +	if (flags & TP_FLAG_TRACE)

These two are inconsistent in that the first does load_flag before
inc(nhit) while the second does it after. I don't think it matters,
but I noticed the inconsistent style and had to tell.

Anyway, patch looks reasonable now. Rest is up to Steve, this is his
code.
On Wed, 1 Oct 2025 14:32:00 +0200 Peter Zijlstra <peterz@infradead.org> wrote:

> Anyway, patch looks reasonable now. Rest is up to Steve, this is his
> code.

Actually, it's Masami's ;-) And it's up to him.

-- Steve
On Mon, Sep 29, 2025 at 04:48:36AM -0400, Steven Rostedt wrote:
> On Mon, 29 Sep 2025 07:57:31 +0100 chenyuan_fl@163.com wrote:
>
> > There is a critical race condition in kprobe initialization that can
> > lead to a NULL pointer dereference and kernel crash.

[ ... quoted crash log, source listing, and race diagram trimmed ... ]

> > Problem:
> > 1. CPU0 executes (1) assigning tp_event->perf_events = list
> > 2. CPU0 executes (2) enabling kprobe functionality via class->reg()

This is kprobe_register() doing enable_trace_kprobe() ?

> > 3. CPU1 triggers and reaches kprobe_dispatcher
> > 4. CPU1 checks TP_FLAG_PROFILE - condition passes (step 2 completed)
> > 5. CPU1 calls kprobe_perf_func() and crashes at (3) because
> >    call->perf_events is still NULL
> >
> > The issue: the assignment in step 1 may not be visible to CPU1 due to
> > missing memory barriers before step 2 sets the TP_FLAG_PROFILE flag.
> >
> > Add smp_mb() barrier between perf_events assignment and enabling
> > profile functionality to ensure visibility ordering across CPUs.

Yeah, that cannot be right.

> > +	/* Ensure perf_events assignment is visible to all CPUs before enabling
> > +	 * profile functionality
> > +	 */
> > +	smp_mb();
>
> So from other discussions I had with John and Sebastian (both Cc'd),
> memory barriers are not for "making memory visible", but instead are
> for interactions between memory and two different CPUs, where both
> CPUs have memory barriers.

Correct, barriers have to be paired.

The above doesn't have enough clues for me to know what code is
affected, but if we're talking about kprobe_register(PERF_REG) :=
enable_trace_kprobe(,NULL), then it might be that
trace_probe_set_flag() should be an smp_store_release(), while
trace_probe_test_flag(PROFILE) in kprobe_dispatch() needs to be a
smp_load_acquire().

Without the acquire it might still be possible for the CPU to lift the
call->perf_event load up before the event->flags load, rendering your
wmb pointless.

The guarantee you're looking for is that if the flag is set, it sees a
fully formed event. This is done with RELEASE on publish and ACQUIRE
on access.