[PATCH bpf v3] bpf: Fix RCU stall in bpf_fd_array_map_clear()

Sechang Lim posted 1 patch 1 day, 21 hours ago
kernel/bpf/arraymap.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
Posted by Sechang Lim 1 day, 21 hours ago
Add a missing cond_resched() to the loop in bpf_fd_array_map_clear().

For PROG_ARRAY maps with many entries, this loop calls
prog_array_map_poke_run() once per entry, which can be expensive.
Because the loop never yields, it can cause RCU stalls under load:

  rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  rcu: 	(detected by 0, t=6502 jiffies, g=729293, q=305 ncpus=1)
  rcu: All QSes seen, last rcu_preempt kthread activity 6502 (4295096514-4295090012), jiffies_till_next_fqs=1, root ->qsmask 0x0
  rcu: rcu_preempt kthread starved for 6502 jiffies! g729293 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
  rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
  rcu: RCU grace-period kthread stack dump:
  task:rcu_preempt     state:R  running task     stack:0     pid:15    tgid:15    ppid:2      task_flags:0x208040 flags:0x00004000
  Call Trace:
   <TASK>
   context_switch kernel/sched/core.c:5382 [inline]
   __schedule+0x697/0x1430 kernel/sched/core.c:6767
   __schedule_loop kernel/sched/core.c:6845 [inline]
   schedule+0x10a/0x3e0 kernel/sched/core.c:6860
   schedule_timeout+0x145/0x2c0 kernel/time/sleep_timeout.c:99
   rcu_gp_fqs_loop+0x255/0x1350 kernel/rcu/tree.c:2046
   rcu_gp_kthread+0x347/0x680 kernel/rcu/tree.c:2248
   kthread+0x465/0x880 kernel/kthread.c:464
   ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:153
   ret_from_fork_asm+0x19/0x30 arch/x86/entry/entry_64.S:245
   </TASK>
  rcu: Stack dump where RCU GP kthread last ran:
  CPU: 0 UID: 0 PID: 30932 Comm: kworker/0:2 Not tainted 6.14.0-13195-g967e8def1100 #2 PREEMPT(undef)
  Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
  Workqueue: events prog_array_map_clear_deferred
  RIP: 0010:write_comp_data+0x38/0x90 kernel/kcov.c:246
  Call Trace:
   <TASK>
   prog_array_map_poke_run+0x77/0x380 kernel/bpf/arraymap.c:1096
   __fd_array_map_delete_elem+0x197/0x310 kernel/bpf/arraymap.c:925
   bpf_fd_array_map_clear kernel/bpf/arraymap.c:1000 [inline]
   prog_array_map_clear_deferred+0x119/0x1b0 kernel/bpf/arraymap.c:1141
   process_one_work+0x898/0x19d0 kernel/workqueue.c:3238
   process_scheduled_works kernel/workqueue.c:3319 [inline]
   worker_thread+0x770/0x10b0 kernel/workqueue.c:3400
   kthread+0x465/0x880 kernel/kthread.c:464
   ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:153
   ret_from_fork_asm+0x19/0x30 arch/x86/entry/entry_64.S:245
   </TASK>

Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com>
Fixes: da765a2f5993 ("bpf: Add poke dependency tracking for prog array maps")
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
 kernel/bpf/arraymap.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 33de68c95..5e25e0353 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -1015,8 +1015,10 @@ static void bpf_fd_array_map_clear(struct bpf_map *map, bool need_defer)
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	int i;
 
-	for (i = 0; i < array->map.max_entries; i++)
+	for (i = 0; i < array->map.max_entries; i++) {
 		__fd_array_map_delete_elem(map, &i, need_defer);
+		cond_resched();
+	}
 }
 
 static void prog_array_map_seq_show_elem(struct bpf_map *map, void *key,
-- 
2.43.0
Re: [PATCH bpf v3] bpf: Fix RCU stall in bpf_fd_array_map_clear()
Posted by Leon Hwang 1 day, 18 hours ago
On 31/3/26 10:30, Sechang Lim wrote:
[...]
> Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com>
> Fixes: da765a2f5993 ("bpf: Add poke dependency tracking for prog array maps")
> Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
> ---

After looking at v2, I see no functional change from v2 to v3.

I think you should have pinged the v2 thread after a few days instead
of sending v3. If v2 is applied, the tag will be picked up anyway.

Besides, the changelog is missing here. It should look like:

v2 -> v3:
* ...
v2: [its lore link]

v1 -> v2:
* ...
v1: [its lore link]

Also, you should check sashiko's review [1].

[1]
https://sashiko.dev/#/patchset/20260331023056.484354-1-rhkrqnwk98%40gmail.com

>  kernel/bpf/arraymap.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index 33de68c95..5e25e0353 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -1015,8 +1015,10 @@ static void bpf_fd_array_map_clear(struct bpf_map *map, bool need_defer)
>  	struct bpf_array *array = container_of(map, struct bpf_array, map);
>  	int i;
>  
> -	for (i = 0; i < array->map.max_entries; i++)
> +	for (i = 0; i < array->map.max_entries; i++) {
>  		__fd_array_map_delete_elem(map, &i, need_defer);
> +		cond_resched();

Since bpf_fd_array_map_clear() is shared by prog_array,
perf_event_array, cgroup_array, and array_of_maps, and this patch only
aims to avoid RCU stalls for prog_array, does this cond_resched()
penalize perf_event_array, cgroup_array, and array_of_maps?

Thanks,
Leon

> +	}
>  }
>  
>  static void prog_array_map_seq_show_elem(struct bpf_map *map, void *key,
Re: [PATCH bpf v3] bpf: Fix RCU stall in bpf_fd_array_map_clear()
Posted by Sechang Lim 1 day, 16 hours ago
On 31/3/26 13:19, Leon Hwang wrote:
> After looking at v2, I see no functional change from v2 to v3.
>
> I think you should have pinged the v2 thread after a few days instead
> of sending v3. If v2 is applied, the tag will be picked up anyway.
>
> Besides, the changelog is missing here.

You're right, I should have just pinged v2 instead of sending v3.
The only change was fixing a CC typo (eddyz78 -> eddyz87), no
functional change. Apologies for the missing changelog as well.

> Since bpf_fd_array_map_clear() is shared by prog_array,
> perf_event_array, cgroup_array, and array_of_maps, and this patch only
> aims to avoid RCU stalls for prog_array, does this cond_resched()
> penalize perf_event_array, cgroup_array, and array_of_maps?

map_poke_run is only set in prog_array_map_ops, so the
expensive path (poke_mutex + map_poke_run) in
__fd_array_map_delete_elem() is exclusive to prog_array.
For perf_event_array, cgroup_array, and array_of_maps, each
iteration is just an xchg() plus map_fd_put_ptr(), which is cheap
enough that the added cond_resched() should rarely, if ever,
reschedule in practice.
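
To make the two paths concrete, here is a rough userspace model of
the clear loop. It is only a sketch and not kernel code: every model_*
name is hypothetical, model_cond_resched() merely imitates the "yield
only if a reschedule was requested" behaviour, and locking and
refcounting are left out entirely.

```c
/* Userspace model only; all model_* names are hypothetical stand-ins. */
#include <stdbool.h>
#include <stddef.h>

struct model_map;

struct model_ops {
	/* Set only for the prog_array-like map, NULL for the other types. */
	void (*poke_run)(struct model_map *map, unsigned int index);
};

struct model_map {
	const struct model_ops *ops;
	void **ptrs;
	unsigned int max_entries;
	unsigned long poke_calls;     /* times the expensive path ran */
	unsigned long resched_checks; /* times the yield point was reached */
	unsigned long yields;         /* times the model actually "yielded" */
	bool need_resched;            /* models a pending reschedule request */
};

/* Stand-in for the expensive per-entry poke update. */
static void model_poke_run(struct model_map *map, unsigned int index)
{
	(void)index;
	map->poke_calls++;
}

const struct model_ops model_prog_array_ops = { .poke_run = model_poke_run };
const struct model_ops model_plain_ops = { .poke_run = NULL };

/* Imitates cond_resched(): a cheap check that yields only on request. */
static void model_cond_resched(struct model_map *map)
{
	map->resched_checks++;
	if (map->need_resched) {
		map->need_resched = false;
		map->yields++;
	}
}

/* Clears one slot; takes the expensive path only if poke_run is set. */
static int model_delete_elem(struct model_map *map, unsigned int index)
{
	void *old;

	if (index >= map->max_entries)
		return -1;

	old = map->ptrs[index];
	map->ptrs[index] = NULL;
	if (map->ops->poke_run)
		map->ops->poke_run(map, index);

	return old ? 0 : -1;
}

/* Same loop shape as the patched function: delete, then offer to yield. */
void model_clear(struct model_map *map)
{
	unsigned int i;

	for (i = 0; i < map->max_entries; i++) {
		model_delete_elem(map, i);
		model_cond_resched(map);
	}
}
```

In this model, a map using model_plain_ops never reaches poke_run, so
each iteration costs one slot swap plus one flag check, while a map
using model_prog_array_ops pays the poke cost on every entry.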

Thanks,
Sechang