kernel: kprobes: fix cur_kprobe corruption during re-entrant kprobe_busy_begin() calls

[PATCH v3 0/1] kernel: kprobes: fix cur_kprobe corruption during re-entrant kprobe_busy_begin() calls

Posted by Khaja Hussain Shaik Khaji 1 month, 1 week ago

This patch fixes a kprobes failure observed due to lost current_kprobe
on arm64 during kretprobe entry handling under interrupt load.

v1 attempted to address this by simulating BTI instructions as NOPs, and
v2 attempted to address it by disabling preemption across the
out-of-line (XOL) execution window. Further analysis showed that this
hypothesis was incorrect: the failure is not caused by scheduling or
preemption during XOL.

The actual root cause is re-entrant invocation of kprobe_busy_begin()
from an active kprobe context. On arm64, IRQs are re-enabled before
invoking kprobe handlers, allowing an interrupt during kretprobe
entry_handler to trigger kprobe_flush_task(), which calls
kprobe_busy_begin/end and corrupts current_kprobe and kprobe_status.
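
The sequence described above can be replayed in a small user-space model.
This is only a sketch: the per-CPU variables are replaced by plain globals,
preemption handling is left out, and busy_begin(), busy_end(),
flush_task_from_softirq() and replay_interrupted_entry_handler() are
hypothetical stand-ins rather than kernel functions. It assumes the busy
helpers unconditionally overwrite the state on entry and clear it on exit.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the per-CPU kprobe state (plain globals here). */
struct kprobe { const char *name; };

enum { KPROBE_HIT_ACTIVE = 1 };

static struct kprobe kprobe_busy = { "kprobe_busy" };
static struct kprobe *current_kprobe;	/* per-CPU in the kernel */
static int kprobe_status;		/* kcb->kprobe_status in the kernel */

/* Model of the non-re-entrant helpers: begin overwrites, end clears. */
void busy_begin(void)
{
	current_kprobe = &kprobe_busy;
	kprobe_status = KPROBE_HIT_ACTIVE;
}

void busy_end(void)
{
	current_kprobe = NULL;
}

/* Stand-in for the softirq path, which brackets its work with begin/end. */
void flush_task_from_softirq(void)
{
	busy_begin();
	/* ... release kretprobe instances of the exiting task ... */
	busy_end();
}

/*
 * Replays the reported sequence: probe p is hit, its entry handler is
 * interrupted, and the interrupt runs the flush path. Returns the
 * current_kprobe value the handler sees once it resumes.
 */
struct kprobe *replay_interrupted_entry_handler(struct kprobe *p)
{
	assert(p != NULL);

	current_kprobe = p;		/* breakpoint handler marks p active */
	kprobe_status = KPROBE_HIT_ACTIVE;

	flush_task_from_softirq();	/* interrupt lands inside the handler */

	return current_kprobe;
}
```

In this model the handler resumes with a NULL current_kprobe, which is the
"lost current_kprobe" symptom.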

[ 2280.630526] Call trace:
[ 2280.633044]  dump_backtrace+0x104/0x14c
[ 2280.636985]  show_stack+0x20/0x30
[ 2280.640390]  dump_stack_lvl+0x58/0x74
[ 2280.644154]  dump_stack+0x20/0x30
[ 2280.647562]  kprobe_busy_begin+0xec/0xf0
[ 2280.651593]  kprobe_flush_task+0x2c/0x60
[ 2280.655624]  delayed_put_task_struct+0x2c/0x124
[ 2280.660282]  rcu_core+0x56c/0x984
[ 2280.663695]  rcu_core_si+0x18/0x28
[ 2280.667189]  handle_softirqs+0x160/0x30c
[ 2280.671220]  __do_softirq+0x1c/0x2c
[ 2280.674807]  ____do_softirq+0x18/0x28
[ 2280.678569]  call_on_irq_stack+0x48/0x88
[ 2280.682599]  do_softirq_own_stack+0x24/0x34
[ 2280.686900]  irq_exit_rcu+0x5c/0xbc
[ 2280.690489]  el1_interrupt+0x40/0x60
[ 2280.694167]  el1h_64_irq_handler+0x20/0x30
[ 2280.698372]  el1h_64_irq+0x64/0x68
[ 2280.701872]  _raw_spin_unlock_irq+0x14/0x54
[ 2280.706173]  dwc3_msm_notify_event+0x6e8/0xbe8
[ 2280.710743]  entry_dwc3_gadget_pullup+0x3c/0x6c
[ 2280.715393]  pre_handler_kretprobe+0x1cc/0x304
[ 2280.719956]  kprobe_breakpoint_handler+0x1b0/0x388
[ 2280.724878]  brk_handler+0x8c/0x128
[ 2280.728464]  do_debug_exception+0x94/0x120
[ 2280.732670]  el1_dbg+0x60/0x7c
[ 2280.735815]  el1h_64_sync_handler+0x48/0xb8
[ 2280.740114]  el1h_64_sync+0x64/0x68
[ 2280.743701]  dwc3_gadget_pullup+0x0/0x124
[ 2280.747827]  soft_connect_store+0xb4/0x15c
[ 2280.752031]  dev_attr_store+0x20/0x38
[ 2280.755798]  sysfs_kf_write+0x44/0x5c
[ 2280.759564]  kernfs_fop_write_iter+0xf4/0x198
[ 2280.764033]  vfs_write+0x1d0/0x2b0
[ 2280.767529]  ksys_write+0x80/0xf0
[ 2280.770940]  __arm64_sys_write+0x24/0x34
[ 2280.774974]  invoke_syscall+0x54/0x118
[ 2280.778822]  el0_svc_common+0xb4/0xe8
[ 2280.782587]  do_el0_svc+0x24/0x34
[ 2280.785999]  el0_svc+0x40/0xa4
[ 2280.789140]  el0t_64_sync_handler+0x8c/0x108
[ 2280.793526]  el0t_64_sync+0x198/0x19c

This v3 patch makes kprobe_busy_begin/end re-entrant safe by preserving
the active kprobe state using a per-CPU depth counter and saved state.
The detailed failure analysis and justification are included in the
commit message.
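
As a rough illustration of the depth-counter idea, here is a user-space
sketch. It is not the patch itself: busy_depth, saved_kprobe, saved_status
and replay_with_depth_counter() are hypothetical names, plain globals stand
in for per-CPU data, and preemption handling is omitted.

```c
#include <assert.h>
#include <stddef.h>

struct kprobe { const char *name; };

enum { KPROBE_HIT_ACTIVE = 1, KPROBE_HIT_SS = 2 };

static struct kprobe kprobe_busy = { "kprobe_busy" };
static struct kprobe *current_kprobe;	/* per-CPU in the kernel */
static int kprobe_status;		/* kcb->kprobe_status in the kernel */

/* Hypothetical bookkeeping; these names are not taken from the patch. */
static int busy_depth;
static struct kprobe *saved_kprobe;
static int saved_status;

void busy_begin(void)
{
	/* Only the outermost caller snapshots the interrupted state. */
	if (busy_depth++ == 0) {
		saved_kprobe = current_kprobe;
		saved_status = kprobe_status;
	}
	current_kprobe = &kprobe_busy;
	kprobe_status = KPROBE_HIT_ACTIVE;
}

void busy_end(void)
{
	assert(busy_depth > 0);

	/* Only the outermost caller restores; inner ends leave "busy" set. */
	if (--busy_depth == 0) {
		current_kprobe = saved_kprobe;
		kprobe_status = saved_status;
	}
}

/*
 * Replays an interrupted handler with two nested busy sections.
 * Returns 1 if the interrupted state survives, 0 otherwise.
 */
int replay_with_depth_counter(struct kprobe *p, int status)
{
	current_kprobe = p;
	kprobe_status = status;

	busy_begin();	/* interrupt enters the busy section */
	busy_begin();	/* a further nested caller, to exercise depth > 1 */
	busy_end();
	busy_end();

	return current_kprobe == p && kprobe_status == status &&
	       busy_depth == 0;
}
```

With no probe active, the saved pointer is NULL, so the outermost
busy_end() still leaves current_kprobe cleared as before.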

Changes since v2:
  - Dropped the scheduling/preemption-based approach.
  - Identified the re-entrant kprobe_busy_begin() root cause.
  - Fixed kprobe_busy_begin/end to preserve active kprobe state.
  - Link to v2: https://lore.kernel.org/all/20260217133855.3142192-2-khaja.khaji@oss.qualcomm.com/

Khaja Hussain Shaik Khaji (1):
  kernel: kprobes: fix cur_kprobe corruption during re-entrant
    kprobe_busy_begin() calls

 kernel/kprobes.c | 34 ++++++++++++++++++++++++++++++----
 1 file changed, 30 insertions(+), 4 deletions(-)

-- 
2.34.1

Re: [PATCH v3 0/1] kernel: kprobes: fix cur_kprobe corruption during re-entrant kprobe_busy_begin() calls

Posted by Mark Rutland 1 month, 1 week ago

On Mon, Mar 02, 2026 at 04:23:46PM +0530, Khaja Hussain Shaik Khaji wrote:
> This patch fixes a kprobes failure observed due to lost current_kprobe
> on arm64 during kretprobe entry handling under interrupt load.
> 
> v1 attempted to address this by simulating BTI instructions as NOPs and 
> v2 attempted to address this by disabling preemption across the
> out-of-line (XOL) execution window. Further analysis showed that this
> hypothesis was incorrect: the failure is not caused by scheduling or
> preemption during XOL.
> 
> The actual root cause is re-entrant invocation of kprobe_busy_begin()
> from an active kprobe context. On arm64, IRQs are re-enabled before
> invoking kprobe handlers, allowing an interrupt during kretprobe
> entry_handler to trigger kprobe_flush_task(), which calls
> kprobe_busy_begin/end and corrupts current_kprobe and kprobe_status.
> 
> [ 2280.630526] Call trace:
> [ 2280.633044]  dump_backtrace+0x104/0x14c
> [ 2280.636985]  show_stack+0x20/0x30
> [ 2280.640390]  dump_stack_lvl+0x58/0x74
> [ 2280.644154]  dump_stack+0x20/0x30
> [ 2280.647562]  kprobe_busy_begin+0xec/0xf0
> [ 2280.651593]  kprobe_flush_task+0x2c/0x60
> [ 2280.655624]  delayed_put_task_struct+0x2c/0x124
> [ 2280.660282]  rcu_core+0x56c/0x984
> [ 2280.663695]  rcu_core_si+0x18/0x28
> [ 2280.667189]  handle_softirqs+0x160/0x30c
> [ 2280.671220]  __do_softirq+0x1c/0x2c
> [ 2280.674807]  ____do_softirq+0x18/0x28
> [ 2280.678569]  call_on_irq_stack+0x48/0x88
> [ 2280.682599]  do_softirq_own_stack+0x24/0x34
> [ 2280.686900]  irq_exit_rcu+0x5c/0xbc
> [ 2280.690489]  el1_interrupt+0x40/0x60
> [ 2280.694167]  el1h_64_irq_handler+0x20/0x30
> [ 2280.698372]  el1h_64_irq+0x64/0x68
> [ 2280.701872]  _raw_spin_unlock_irq+0x14/0x54
> [ 2280.706173]  dwc3_msm_notify_event+0x6e8/0xbe8
> [ 2280.710743]  entry_dwc3_gadget_pullup+0x3c/0x6c
> [ 2280.715393]  pre_handler_kretprobe+0x1cc/0x304
> [ 2280.719956]  kprobe_breakpoint_handler+0x1b0/0x388
> [ 2280.724878]  brk_handler+0x8c/0x128
> [ 2280.728464]  do_debug_exception+0x94/0x120
> [ 2280.732670]  el1_dbg+0x60/0x7c

The el1_dbg() function was removed in commit:

  31575e11ecf7 ("arm64: debug: split brk64 exception entry")

... which was merged in v6.17.

Are you able to reproduce the issue with v6.17 or later?

Which specific kernel version did you see this with?

The arm64 entry code has changed substantially in recent months (fixing
a bunch of latent issues), and we need to know which specific version
you're looking at. It's possible that your issue has already been fixed.

Mark.

> [ 2280.735815]  el1h_64_sync_handler+0x48/0xb8
> [ 2280.740114]  el1h_64_sync+0x64/0x68
> [ 2280.743701]  dwc3_gadget_pullup+0x0/0x124
> [ 2280.747827]  soft_connect_store+0xb4/0x15c
> [ 2280.752031]  dev_attr_store+0x20/0x38
> [ 2280.755798]  sysfs_kf_write+0x44/0x5c
> [ 2280.759564]  kernfs_fop_write_iter+0xf4/0x198
> [ 2280.764033]  vfs_write+0x1d0/0x2b0
> [ 2280.767529]  ksys_write+0x80/0xf0
> [ 2280.770940]  __arm64_sys_write+0x24/0x34
> [ 2280.774974]  invoke_syscall+0x54/0x118
> [ 2280.778822]  el0_svc_common+0xb4/0xe8
> [ 2280.782587]  do_el0_svc+0x24/0x34
> [ 2280.785999]  el0_svc+0x40/0xa4
> [ 2280.789140]  el0t_64_sync_handler+0x8c/0x108
> [ 2280.793526]  el0t_64_sync+0x198/0x19c

Re: [PATCH v3 0/1] kernel: kprobes: fix cur_kprobe corruption during re-entrant kprobe_busy_begin() calls

Posted by Khaja Hussain Shaik Khaji 1 month, 1 week ago

On Mon, Mar 02, 2026 at 04:23:46PM +0530, Mark Rutland wrote:
> The el1_dbg() function was removed in commit:
>
>   31575e11ecf7 ("arm64: debug: split brk64 exception entry")
>
> ... which was merged in v6.17.
>
> Are you able to reproduce the issue with v6.17 or later?
>
> Which specific kernel version did you see this with?

The call trace was captured on v6.9-rc1.

I have not yet tested on v6.17 or later. I will test and report back.

That said, the fix is in kernel/kprobes.c and addresses a generic
re-entrancy issue in kprobe_busy_begin/end that is not specific to the
arm64 entry path. The race -- where kprobe_busy_begin() is called
re-entrantly from within an active kprobe context (e.g. via softirq
during kretprobe entry_handler) -- can occur on any architecture where
IRQs are re-enabled before invoking kprobe handlers.

I will verify whether the issue is still reproducible on v6.17+ and
report back.

Thanks,
Khaja

Re: [PATCH v3 0/1] kernel: kprobes: fix cur_kprobe corruption during re-entrant kprobe_busy_begin() calls

Posted by Mark Rutland 1 month, 1 week ago

On Mon, Mar 02, 2026 at 05:53:38PM +0530, Khaja Hussain Shaik Khaji wrote:
> On Mon, Mar 02, 2026 at 04:23:46PM +0530, Mark Rutland wrote:
> > The el1_dbg() function was removed in commit:
> >
> >   31575e11ecf7 ("arm64: debug: split brk64 exception entry")
> >
> > ... which was merged in v6.17.
> >
> > Are you able to reproduce the issue with v6.17 or later?
> >
> > Which specific kernel version did you see this with?
> 
> The call trace was captured on v6.9-rc1.

Why are you using an -rc1 release from almost two years ago?

> I have not yet tested on v6.17 or later. I will test and report back.
> 
> That said, the fix is in kernel/kprobes.c and addresses a generic
> re-entrancy issue in kprobe_busy_begin/end that is not specific to the
> arm64 entry path. The race -- where kprobe_busy_begin() is called
> re-entrantly from within an active kprobe context (e.g. via softirq
> during kretprobe entry_handler) -- can occur on any architecture where
> IRQs are re-enabled before invoking kprobe handlers.

AFAICT, re-enabling IRQs in that path would be a bug, and re-entrancy is
simply not expected. Please see my other reply on that front.

> I will verify whether the issue is still reproducible on v6.17+ and
> report back.

Thanks, that would be much appreciated, as would anything you can share
on the specifics of your kretprobe entry_handler.

Mark.