[PATCH bpf v2] bpf: cpumap: fix race in bq_flush_to_queue on PREEMPT_RT

Jiayuan Chen posted 1 patch 1 month, 2 weeks ago
kernel/bpf/cpumap.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)
[PATCH bpf v2] bpf: cpumap: fix race in bq_flush_to_queue on PREEMPT_RT
Posted by Jiayuan Chen 1 month, 2 weeks ago
From: Jiayuan Chen <jiayuan.chen@shopee.com>

On PREEMPT_RT kernels, the per-CPU xdp_bulk_queue (bq) can be accessed
concurrently by multiple preemptible tasks on the same CPU.

The original code assumes bq_enqueue() and __cpu_map_flush() run
atomically with respect to each other on the same CPU, relying on
local_bh_disable() to prevent preemption. However, on PREEMPT_RT,
local_bh_disable() only calls migrate_disable() and does not disable
preemption. spin_lock() also becomes a sleeping rt_mutex. Together,
this allows CFS scheduling to preempt a task during bq_flush_to_queue(),
enabling another task on the same CPU to enter bq_enqueue() and operate
on the same per-CPU bq concurrently.

This leads to several races:

1. Double __list_del_clearprev(): after spin_unlock() in
   bq_flush_to_queue(), a preempting task can call bq_enqueue() ->
   bq_flush_to_queue() on the same bq when bq->count reaches
   CPU_MAP_BULK_SIZE. Both tasks then call __list_del_clearprev()
   on the same bq->flush_node, the second call dereferences the
   prev pointer that was already set to NULL by the first.

2. bq->count and bq->q[] races: concurrent bq_enqueue() can corrupt
   the packet queue while bq_flush_to_queue() is processing it.

The race between task A (__cpu_map_flush -> bq_flush_to_queue) and
task B (bq_enqueue -> bq_flush_to_queue) on the same CPU:

  Task A (xdp_do_flush)          Task B (cpu_map_enqueue)
  ----------------------         ------------------------
  bq_flush_to_queue(bq)
    spin_lock(&q->producer_lock)
    /* flush bq->q[] to ptr_ring */
    bq->count = 0
    spin_unlock(&q->producer_lock)
                                   bq_enqueue(rcpu, xdpf)
    <-- CFS preempts Task A -->      bq->q[bq->count++] = xdpf
                                     /* ... more enqueues until full ... */
                                     bq_flush_to_queue(bq)
                                       spin_lock(&q->producer_lock)
                                       /* flush to ptr_ring */
                                       spin_unlock(&q->producer_lock)
                                       __list_del_clearprev(flush_node)
                                         /* sets flush_node.prev = NULL */
    <-- Task A resumes -->
    __list_del_clearprev(flush_node)
      flush_node.prev->next = ...
      /* prev is NULL -> kernel oops */

Fix this by adding a local_lock_t to xdp_bulk_queue and acquiring it
in bq_enqueue() and __cpu_map_flush(). On non-RT kernels, local_lock
maps to preempt_disable/enable with zero additional overhead. On
PREEMPT_RT, it provides a per-CPU sleeping lock that serializes
access to the bq. Use local_lock_nested_bh() since these paths already
run under local_bh_disable().

An alternative approach of snapshotting bq->count and bq->q[] before
releasing the producer_lock was considered, but it requires copying
the entire bq->q[] array on every flush, adding unnecessary overhead.

To reproduce, insert an mdelay(100) between spin_unlock() and
__list_del_clearprev() in bq_flush_to_queue(), then run reproducer
provided by syzkaller.

Panic:
===
 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0002) - not-present page
 PGD 0 P4D 0
 Oops: Oops: 0002 [#1] SMP PTI
 CPU: 0 UID: 0 PID: 377 Comm: a.out Not tainted 6.19.0+ #21 PREEMPT_RT
 RIP: 0010:bq_flush_to_queue+0x145/0x200
 Call Trace:
  <TASK>
  __cpu_map_flush+0x2c/0x70
  xdp_do_flush+0x64/0x1b0
  xdp_test_run_batch.constprop.0+0x4d4/0x6d0
  bpf_test_run_xdp_live+0x24b/0x3e0
  bpf_prog_test_run_xdp+0x4a1/0x6e0
  __sys_bpf+0x44a/0x2760
  __x64_sys_bpf+0x1a/0x30
  x64_sys_call+0x146c/0x26e0
  do_syscall_64+0xd5/0x5a0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  </TASK>

Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT")
Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69369331.a70a0220.38f243.009d.GAE@google.com/T/
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
v1 -> v2: https://lore.kernel.org/bpf/20260211064417.196401-1-jiayuan.chen@linux.dev/
- Use local_lock_nested_bh()/local_unlock_nested_bh() instead of
  local_lock()/local_unlock(), since these paths already run under
  local_bh_disable(). (Sebastian Andrzej Siewior)
- Replace "Caller must hold bq->bq_lock" comment with
  lockdep_assert_held() in bq_flush_to_queue(). (Sebastian Andrzej Siewior)
- Fix Fixes tag to 3253cb49cbad ("softirq: Allow to drop the
  softirq-BKL lock on PREEMPT_RT") which is the actual commit that
  makes the race possible. (Sebastian Andrzej Siewior)
---
 kernel/bpf/cpumap.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 04171fbc39cb..32b43cb9061b 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -29,6 +29,7 @@
 #include <linux/sched.h>
 #include <linux/workqueue.h>
 #include <linux/kthread.h>
+#include <linux/local_lock.h>
 #include <linux/completion.h>
 #include <trace/events/xdp.h>
 #include <linux/btf_ids.h>
@@ -52,6 +53,7 @@ struct xdp_bulk_queue {
 	struct list_head flush_node;
 	struct bpf_cpu_map_entry *obj;
 	unsigned int count;
+	local_lock_t bq_lock;
 };
 
 /* Struct for every remote "destination" CPU in map */
@@ -451,6 +453,7 @@ __cpu_map_entry_alloc(struct bpf_map *map, struct bpf_cpumap_val *value,
 	for_each_possible_cpu(i) {
 		bq = per_cpu_ptr(rcpu->bulkq, i);
 		bq->obj = rcpu;
+		local_lock_init(&bq->bq_lock);
 	}
 
 	/* Alloc queue */
@@ -722,6 +725,8 @@ static void bq_flush_to_queue(struct xdp_bulk_queue *bq)
 	struct ptr_ring *q;
 	int i;
 
+	lockdep_assert_held(&bq->bq_lock);
+
 	if (unlikely(!bq->count))
 		return;
 
@@ -749,11 +754,15 @@ static void bq_flush_to_queue(struct xdp_bulk_queue *bq)
 }
 
 /* Runs under RCU-read-side, plus in softirq under NAPI protection.
- * Thus, safe percpu variable access.
+ * Thus, safe percpu variable access. PREEMPT_RT relies on
+ * local_lock_nested_bh() to serialise access to the per-CPU bq.
  */
 static void bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
 {
-	struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq);
+	struct xdp_bulk_queue *bq;
+
+	local_lock_nested_bh(&rcpu->bulkq->bq_lock);
+	bq = this_cpu_ptr(rcpu->bulkq);
 
 	if (unlikely(bq->count == CPU_MAP_BULK_SIZE))
 		bq_flush_to_queue(bq);
@@ -774,6 +783,8 @@ static void bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
 
 		list_add(&bq->flush_node, flush_list);
 	}
+
+	local_unlock_nested_bh(&rcpu->bulkq->bq_lock);
 }
 
 int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf,
@@ -810,7 +821,9 @@ void __cpu_map_flush(struct list_head *flush_list)
 	struct xdp_bulk_queue *bq, *tmp;
 
 	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
+		local_lock_nested_bh(&bq->obj->bulkq->bq_lock);
 		bq_flush_to_queue(bq);
+		local_unlock_nested_bh(&bq->obj->bulkq->bq_lock);
 
 		/* If already running, costs spin_lock_irqsave + smb_mb */
 		wake_up_process(bq->obj->kthread);
-- 
2.43.0
Re: [PATCH bpf v2] bpf: cpumap: fix race in bq_flush_to_queue on PREEMPT_RT
Posted by Sebastian Andrzej Siewior 1 month, 2 weeks ago
On 2026-02-12 10:36:33 [+0800], Jiayuan Chen wrote:
> From: Jiayuan Chen <jiayuan.chen@shopee.com>
> 
…
> local_bh_disable() only calls migrate_disable() and does not disable
> preemption. spin_lock() also becomes a sleeping rt_mutex. Together,
There is no spin_lock() otherwise there would be no problem.

…
> Fix this by adding a local_lock_t to xdp_bulk_queue and acquiring it
> in bq_enqueue() and __cpu_map_flush(). On non-RT kernels, local_lock
> maps to preempt_disable/enable with zero additional overhead. On
> PREEMPT_RT, it provides a per-CPU sleeping lock that serializes
> access to the bq. Use local_lock_nested_bh() since these paths already
> run under local_bh_disable().

So you use local_lock_nested_bh() and not local_lock() but you mention
local_lock. The difference is that the former does not add any
preempt_disable() on !RT. At this point I am curious how much of this
was written by you and how much is auto generated. 

> An alternative approach of snapshotting bq->count and bq->q[] before
> releasing the producer_lock was considered, but it requires copying
> the entire bq->q[] array on every flush, adding unnecessary overhead.

But you still have list_head which is not protected.

> To reproduce, insert an mdelay(100) between spin_unlock() and
> __list_del_clearprev() in bq_flush_to_queue(), then run reproducer
> provided by syzkaller.
> 
> Panic:
> ===
>  BUG: kernel NULL pointer dereference, address: 0000000000000000
>  #PF: supervisor write access in kernel mode
>  #PF: error_code(0x0002) - not-present page
>  PGD 0 P4D 0
>  Oops: Oops: 0002 [#1] SMP PTI
>  CPU: 0 UID: 0 PID: 377 Comm: a.out Not tainted 6.19.0+ #21 PREEMPT_RT
>  RIP: 0010:bq_flush_to_queue+0x145/0x200
>  Call Trace:
>   <TASK>
>   __cpu_map_flush+0x2c/0x70
>   xdp_do_flush+0x64/0x1b0
>   xdp_test_run_batch.constprop.0+0x4d4/0x6d0
>   bpf_test_run_xdp_live+0x24b/0x3e0
>   bpf_prog_test_run_xdp+0x4a1/0x6e0
>   __sys_bpf+0x44a/0x2760
>   __x64_sys_bpf+0x1a/0x30
>   x64_sys_call+0x146c/0x26e0
>   do_syscall_64+0xd5/0x5a0
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
>   </TASK>

This could be omitted. It is obvious once you see it. I somehow missed
this alloc_percpu instance while looking for this kind of bugs.
Another one is hiding in devmap.c. Mind to take a look? I think I
skip this entire folder…

> Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT")
> Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/69369331.a70a0220.38f243.009d.GAE@google.com/T/
> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> ---
> v1 -> v2: https://lore.kernel.org/bpf/20260211064417.196401-1-jiayuan.chen@linux.dev/
> - Use local_lock_nested_bh()/local_unlock_nested_bh() instead of
>   local_lock()/local_unlock(), since these paths already run under
>   local_bh_disable(). (Sebastian Andrzej Siewior)
> - Replace "Caller must hold bq->bq_lock" comment with
>   lockdep_assert_held() in bq_flush_to_queue(). (Sebastian Andrzej Siewior)
> - Fix Fixes tag to 3253cb49cbad ("softirq: Allow to drop the
>   softirq-BKL lock on PREEMPT_RT") which is the actual commit that
>   makes the race possible. (Sebastian Andrzej Siewior)
> ---
>  kernel/bpf/cpumap.c | 17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)

The changes below look good.

Sebastian
Re: [PATCH bpf v2] bpf: cpumap: fix race in bq_flush_to_queue on PREEMPT_RT
Posted by Jiayuan Chen 1 month, 2 weeks ago
2026/2/12 22:33, "Sebastian Andrzej Siewior" <bigeasy@linutronix.de mailto:bigeasy@linutronix.de?to=%22Sebastian%20Andrzej%20Siewior%22%20%3Cbigeasy%40linutronix.de%3E > wrote:




Thanks for the review, Sebastian.


> There is no spin_lock() otherwise there would be no problem.


Right, will fix the description. The per-CPU bq has no lock at all,
which is the root cause...
> …
> 
> > 
> > Fix this by adding a local_lock_t to xdp_bulk_queue and acquiring it
> >  in bq_enqueue() and __cpu_map_flush(). On non-RT kernels, local_lock
> >  maps to preempt_disable/enable with zero additional overhead. On
> >  PREEMPT_RT, it provides a per-CPU sleeping lock that serializes
> >  access to the bq. Use local_lock_nested_bh() since these paths already
> >  run under local_bh_disable().
> > 
> So you use local_lock_nested_bh() and not local_lock() but you mention
> local_lock. The difference is that the former does not add any
> preempt_disable() on !RT. At this point I am curious how much of this
> was written by you and how much is auto generated. 


Sorry about the inconsistencies. The commit message became messy after
several rounds of editing across versions, and I failed to update all
the descriptions to match the final code.


> > 
> > An alternative approach of snapshotting bq->count and bq->q[] before
> >  releasing the producer_lock was considered, but it requires copying
> >  the entire bq->q[] array on every flush, adding unnecessary overhead.
> > 
> But you still have list_head which is not protected.

The snapshot alternative is incomplete as described. Will drop that paragraph.

> > 
> > To reproduce, insert an mdelay(100) between spin_unlock() and
> >  __list_del_clearprev() in bq_flush_to_queue(), then run reproducer
> >  provided by syzkaller.
> >  
> >  Panic:
> >  ===
> >  BUG: kernel NULL pointer dereference, address: 0000000000000000
> >  #PF: supervisor write access in kernel mode
> >  #PF: error_code(0x0002) - not-present page
> >  PGD 0 P4D 0
> >  Oops: Oops: 0002 [#1] SMP PTI
> >  CPU: 0 UID: 0 PID: 377 Comm: a.out Not tainted 6.19.0+ #21 PREEMPT_RT
> >  RIP: 0010:bq_flush_to_queue+0x145/0x200
> >  Call Trace:
> >  <TASK>
> >  __cpu_map_flush+0x2c/0x70
> >  xdp_do_flush+0x64/0x1b0
> >  xdp_test_run_batch.constprop.0+0x4d4/0x6d0
> >  bpf_test_run_xdp_live+0x24b/0x3e0
> >  bpf_prog_test_run_xdp+0x4a1/0x6e0
> >  __sys_bpf+0x44a/0x2760
> >  __x64_sys_bpf+0x1a/0x30
> >  x64_sys_call+0x146c/0x26e0
> >  do_syscall_64+0xd5/0x5a0
> >  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >  </TASK>
> > 
> This could be omitted. It is obvious once you see it. I somehow missed

Will remove the panic trace.

> this alloc_percpu instance while looking for this kind of bugs.
> Another one is hiding in devmap.c. Mind to take a look? I think I
> skip this entire folder…


I see the same pattern there - xdp_dev_bulk_queue is per-CPU with no preemption
protection in bq_enqueue() and __dev_flush().

> > 
> > Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT")
> >  Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com
> >  Closes: https://lore.kernel.org/all/69369331.a70a0220.38f243.009d.GAE@google.com/T/
> >  Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
> >  Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> >  ---
> >  v1 -> v2: https://lore.kernel.org/bpf/20260211064417.196401-1-jiayuan.chen@linux.dev/
> >  - Use local_lock_nested_bh()/local_unlock_nested_bh() instead of
> >  local_lock()/local_unlock(), since these paths already run under
> >  local_bh_disable(). (Sebastian Andrzej Siewior)
> >  - Replace "Caller must hold bq->bq_lock" comment with
> >  lockdep_assert_held() in bq_flush_to_queue(). (Sebastian Andrzej Siewior)
> >  - Fix Fixes tag to 3253cb49cbad ("softirq: Allow to drop the
> >  softirq-BKL lock on PREEMPT_RT") which is the actual commit that
> >  makes the race possible. (Sebastian Andrzej Siewior)
> >  ---
> >  kernel/bpf/cpumap.c | 17 +++++++++++++++--
> >  1 file changed, 15 insertions(+), 2 deletions(-)
> > 
> The changes below look good.

Thanks!

> Sebastian