[PATCH] bpf: Convert lpm_trie::lock to 'raw_spinlock_t'
Posted by Kunwu Chan 1 year, 1 month ago
From: Kunwu Chan <chentao@kylinos.cn>

When PREEMPT_RT is enabled, 'spinlock_t' becomes preemptible,
and the bpf program already holds a raw_spinlock while running under an
interrupt handler, which results in an invalid lock acquire context.

[ BUG: Invalid wait context ]
6.12.0-rc5-next-20241031-syzkaller #0 Not tainted
-----------------------------
swapper/0/0 is trying to lock:
ffff8880261e7a00 (&trie->lock){....}-{3:3},
at: trie_delete_elem+0x96/0x6a0 kernel/bpf/lpm_trie.c:462
other info that might help us debug this:
context-{3:3}
5 locks held by swapper/0/0:
 #0: ffff888020bb75c8 (&vp_dev->lock){-...}-{3:3},
at: vp_vring_interrupt drivers/virtio/virtio_pci_common.c:80 [inline]
 #0: ffff888020bb75c8 (&vp_dev->lock){-...}-{3:3},
at: vp_interrupt+0x142/0x200 drivers/virtio/virtio_pci_common.c:113
 #1: ffff88814174a120 (&vb->stop_update_lock){-...}-{3:3},
at: spin_lock include/linux/spinlock.h:351 [inline]
 #1: ffff88814174a120 (&vb->stop_update_lock){-...}-{3:3},
at: stats_request+0x6f/0x230 drivers/virtio/virtio_balloon.c:438
 #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
 #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
 #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: __queue_work+0x199/0xf50 kernel/workqueue.c:2259
 #3: ffff8880b863dd18 (&pool->lock){-.-.}-{2:2},
at: __queue_work+0x759/0xf50
 #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
 #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
 #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: __bpf_trace_run kernel/trace/bpf_trace.c:2339 [inline]
 #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
at: bpf_trace_run1+0x1d6/0x520 kernel/trace/bpf_trace.c:2380
stack backtrace:
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted
6.12.0-rc5-next-20241031-syzkaller #0
Hardware name: Google Compute Engine/Google Compute Engine,
BIOS Google 09/13/2024
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
 print_lock_invalid_wait_context kernel/locking/lockdep.c:4826 [inline]
 check_wait_context kernel/locking/lockdep.c:4898 [inline]
 __lock_acquire+0x15a8/0x2100 kernel/locking/lockdep.c:5176
 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
 _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
 trie_delete_elem+0x96/0x6a0 kernel/bpf/lpm_trie.c:462
 bpf_prog_2c29ac5cdc6b1842+0x43/0x47
 bpf_dispatcher_nop_func include/linux/bpf.h:1290 [inline]
 __bpf_prog_run include/linux/filter.h:701 [inline]
 bpf_prog_run include/linux/filter.h:708 [inline]
 __bpf_trace_run kernel/trace/bpf_trace.c:2340 [inline]
 bpf_trace_run1+0x2ca/0x520 kernel/trace/bpf_trace.c:2380
 trace_workqueue_activate_work+0x186/0x1f0 include/trace/events/workqueue.h:59
 __queue_work+0xc7b/0xf50 kernel/workqueue.c:2338
 queue_work_on+0x1c2/0x380 kernel/workqueue.c:2390
 queue_work include/linux/workqueue.h:662 [inline]
 stats_request+0x1a3/0x230 drivers/virtio/virtio_balloon.c:441
 vring_interrupt+0x21d/0x380 drivers/virtio/virtio_ring.c:2595
 vp_vring_interrupt drivers/virtio/virtio_pci_common.c:82 [inline]
 vp_interrupt+0x192/0x200 drivers/virtio/virtio_pci_common.c:113
 __handle_irq_event_percpu+0x29a/0xa80 kernel/irq/handle.c:158
 handle_irq_event_percpu kernel/irq/handle.c:193 [inline]
 handle_irq_event+0x89/0x1f0 kernel/irq/handle.c:210
 handle_fasteoi_irq+0x48a/0xae0 kernel/irq/chip.c:720
 generic_handle_irq_desc include/linux/irqdesc.h:173 [inline]
 handle_irq arch/x86/kernel/irq.c:247 [inline]
 call_irq_handler arch/x86/kernel/irq.c:259 [inline]
 __common_interrupt+0x136/0x230 arch/x86/kernel/irq.c:285
 common_interrupt+0xb4/0xd0 arch/x86/kernel/irq.c:278
 </IRQ>

Reported-by: syzbot+b506de56cbbb63148c33@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/6723db4a.050a0220.35b515.0168.GAE@google.com/
Fixes: 66150d0dde03 ("bpf, lpm: Make locking RT friendly")
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
---
 kernel/bpf/lpm_trie.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 9b60eda0f727..373cdcfa0505 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -35,7 +35,7 @@ struct lpm_trie {
 	size_t				n_entries;
 	size_t				max_prefixlen;
 	size_t				data_size;
-	spinlock_t			lock;
+	raw_spinlock_t			lock;
 };
 
 /* This trie implements a longest prefix match algorithm that can be used to
@@ -330,7 +330,7 @@ static long trie_update_elem(struct bpf_map *map,
 	if (key->prefixlen > trie->max_prefixlen)
 		return -EINVAL;
 
-	spin_lock_irqsave(&trie->lock, irq_flags);
+	raw_spin_lock_irqsave(&trie->lock, irq_flags);
 
 	/* Allocate and fill a new node */
 
@@ -437,7 +437,7 @@ static long trie_update_elem(struct bpf_map *map,
 		kfree(im_node);
 	}
 
-	spin_unlock_irqrestore(&trie->lock, irq_flags);
+	raw_spin_unlock_irqrestore(&trie->lock, irq_flags);
 	kfree_rcu(free_node, rcu);
 
 	return ret;
@@ -459,7 +459,7 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
 	if (key->prefixlen > trie->max_prefixlen)
 		return -EINVAL;
 
-	spin_lock_irqsave(&trie->lock, irq_flags);
+	raw_spin_lock_irqsave(&trie->lock, irq_flags);
 
 	/* Walk the tree looking for an exact key/length match and keeping
 	 * track of the path we traverse.  We will need to know the node
@@ -535,7 +535,7 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
 	free_node = node;
 
 out:
-	spin_unlock_irqrestore(&trie->lock, irq_flags);
+	raw_spin_unlock_irqrestore(&trie->lock, irq_flags);
 	kfree_rcu(free_parent, rcu);
 	kfree_rcu(free_node, rcu);
 
@@ -581,7 +581,7 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
 			  offsetof(struct bpf_lpm_trie_key_u8, data);
 	trie->max_prefixlen = trie->data_size * 8;
 
-	spin_lock_init(&trie->lock);
+	raw_spin_lock_init(&trie->lock);
 
 	return &trie->map;
 }
-- 
2.34.1
Re: [PATCH] bpf: Convert lpm_trie::lock to 'raw_spinlock_t'
Posted by Thomas Gleixner 1 year, 1 month ago
On Fri, Nov 08 2024 at 14:32, Kunwu Chan wrote:
> When PREEMPT_RT is enabled, 'spinlock_t' becomes preemptible
> and bpf program has owned a raw_spinlock under a interrupt handler,
> which results in invalid lock acquire context.

This explanation is just wrong.

The problem has nothing to do with an interrupt handler. Interrupt
handlers on RT kernels are force threaded.

>  __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>  _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
>  trie_delete_elem+0x96/0x6a0 kernel/bpf/lpm_trie.c:462
>  bpf_prog_2c29ac5cdc6b1842+0x43/0x47
>  bpf_dispatcher_nop_func include/linux/bpf.h:1290 [inline]
>  __bpf_prog_run include/linux/filter.h:701 [inline]
>  bpf_prog_run include/linux/filter.h:708 [inline]
>  __bpf_trace_run kernel/trace/bpf_trace.c:2340 [inline]
>  bpf_trace_run1+0x2ca/0x520 kernel/trace/bpf_trace.c:2380
>  trace_workqueue_activate_work+0x186/0x1f0 include/trace/events/workqueue.h:59
>  __queue_work+0xc7b/0xf50 kernel/workqueue.c:2338

The problematic lock nesting is the work queue pool lock, which is a raw
spinlock.
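
Condensed, the backtrace above boils down to the following nesting
(this is only a restatement of the trace, no new information):

/*
 *   hardirq: vp_interrupt()
 *     __queue_work()                    <- takes pool->lock, a raw_spinlock_t {2:2}
 *       trace_workqueue_activate_work() <- runs the attached bpf prog
 *         trie_delete_elem()            <- takes trie->lock, a spinlock_t {3:3},
 *                                          a sleeping lock on PREEMPT_RT
 */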

> @@ -330,7 +330,7 @@ static long trie_update_elem(struct bpf_map *map,
>  	if (key->prefixlen > trie->max_prefixlen)
>  		return -EINVAL;
>  
> -	spin_lock_irqsave(&trie->lock, irq_flags);
> +	raw_spin_lock_irqsave(&trie->lock, irq_flags);
>  
>  	/* Allocate and fill a new node */

Making this a raw spinlock moves the problem from the BPF trie code into
the memory allocator. On RT the memory allocator cannot be invoked under
a raw spinlock.
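
Concretely, with the proposed change trie_update_elem() would do roughly
the following (a sketch only; lpm_trie_node_alloc() is the existing
kmalloc()-based node allocator in lpm_trie.c):

	raw_spin_lock_irqsave(&trie->lock, irq_flags);

	/* Allocate and fill a new node */
	new_node = lpm_trie_node_alloc(trie, value);
	/* ^ ends up in kmalloc(); on PREEMPT_RT the slab allocator may take
	 * a sleeping spinlock_t here, i.e. the same invalid nesting, only
	 * moved from the tracing path into the allocator. */
	...
	raw_spin_unlock_irqrestore(&trie->lock, irq_flags);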

Thanks,

        tglx
Re: [PATCH] bpf: Convert lpm_trie::lock to 'raw_spinlock_t'
Posted by Kunwu Chan 1 year, 1 month ago
Thanks all for the reply.

On 2024/11/12 23:08, Thomas Gleixner wrote:
> On Fri, Nov 08 2024 at 14:32, Kunwu Chan wrote:
>> When PREEMPT_RT is enabled, 'spinlock_t' becomes preemptible
>> and bpf program has owned a raw_spinlock under a interrupt handler,
>> which results in invalid lock acquire context.
> This explanation is just wrong.
>
> The problem has nothing to do with an interrupt handler. Interrupt
> handlers on RT kernels are force threaded.
>
>>   __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>>   _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
>>   trie_delete_elem+0x96/0x6a0 kernel/bpf/lpm_trie.c:462
>>   bpf_prog_2c29ac5cdc6b1842+0x43/0x47
>>   bpf_dispatcher_nop_func include/linux/bpf.h:1290 [inline]
>>   __bpf_prog_run include/linux/filter.h:701 [inline]
>>   bpf_prog_run include/linux/filter.h:708 [inline]
>>   __bpf_trace_run kernel/trace/bpf_trace.c:2340 [inline]
>>   bpf_trace_run1+0x2ca/0x520 kernel/trace/bpf_trace.c:2380
>>   trace_workqueue_activate_work+0x186/0x1f0 include/trace/events/workqueue.h:59
>>   __queue_work+0xc7b/0xf50 kernel/workqueue.c:2338
> The problematic lock nesting is the work queue pool lock, which is a raw
> spinlock.
>
>> @@ -330,7 +330,7 @@ static long trie_update_elem(struct bpf_map *map,
>>   	if (key->prefixlen > trie->max_prefixlen)
>>   		return -EINVAL;
>>   
>> -	spin_lock_irqsave(&trie->lock, irq_flags);
>> +	raw_spin_lock_irqsave(&trie->lock, irq_flags);
>>   
>>   	/* Allocate and fill a new node */
> Making this a raw spinlock moves the problem from the BPF trie code into
> the memory allocator. On RT the memory allocator cannot be invoked under
> a raw spinlock.
I'm a newbie in this field, but when I changed it to a raw spinlock,
the problem syzbot reported disappeared.
If we shouldn't change it like this, what should we do to deal with the
problem? If you have a better idea, please tell me.
> Thanks,
>
>          tglx
>
-- 
Thanks,
   Kunwu.Chan
Re: [PATCH] bpf: Convert lpm_trie::lock to 'raw_spinlock_t'
Posted by Thomas Gleixner 1 year, 1 month ago
On Thu, Nov 14 2024 at 10:43, Kunwu Chan wrote:
> On 2024/11/12 23:08, Thomas Gleixner wrote:
>>> @@ -330,7 +330,7 @@ static long trie_update_elem(struct bpf_map *map,
>>>   	if (key->prefixlen > trie->max_prefixlen)
>>>   		return -EINVAL;
>>>   
>>> -	spin_lock_irqsave(&trie->lock, irq_flags);
>>> +	raw_spin_lock_irqsave(&trie->lock, irq_flags);
>>>   
>>>   	/* Allocate and fill a new node */
>> Making this a raw spinlock moves the problem from the BPF trie code into
>> the memory allocator. On RT the memory allocator cannot be invoked under
>> a raw spinlock.
> I'am newbiee in this field. But actually when i change it to a raw 
> spinlock, the problem syzbot reported dispeared.

Yes, because the actual code path which is going to trigger this is not
reached. But it will be reached at some point.

IIRC, BPF has its own allocator which can be used everywhere.

Thanks,

        tglx
Re: [PATCH] bpf: Convert lpm_trie::lock to 'raw_spinlock_t'
Posted by Sebastian Andrzej Siewior 1 year, 1 month ago
On 2024-11-15 23:29:31 [+0100], Thomas Gleixner wrote:
> IIRC, BPF has it's own allocator which can be used everywhere.

Thomas Weißschuh made something. It appears to work. Need to take a
closer look.

> Thanks,
> 
>         tglx

Sebastian
Re: [PATCH] bpf: Convert lpm_trie::lock to 'raw_spinlock_t'
Posted by Alexei Starovoitov 1 year, 1 month ago
On Sat, Nov 16, 2024 at 1:21 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2024-11-15 23:29:31 [+0100], Thomas Gleixner wrote:
> > IIRC, BPF has it's own allocator which can be used everywhere.
>
> Thomas Weißschuh made something. It appears to work. Need to take a
> closer look.

Any more details?
bpf_mem_alloc is a stop gap.
As Vlastimil Babka suggested long ago:
https://lwn.net/Articles/974138/
"...next on the target list is the special allocator used by the BPF
subsystem. This allocator is intended to succeed in any calling
context, including in non-maskable interrupts (NMIs). BPF maintainer
Alexei Starovoitov is evidently in favor of this removal if SLUB is
able to handle the same use cases..."

Here is the first step:
https://lore.kernel.org/bpf/20241116014854.55141-1-alexei.starovoitov@gmail.com/
Re: [PATCH] bpf: Convert lpm_trie::lock to 'raw_spinlock_t'
Posted by Thomas Weißschuh 1 year, 1 month ago
On 2024-11-16 08:01:49-0800, Alexei Starovoitov wrote:
> On Sat, Nov 16, 2024 at 1:21 AM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
> >
> > On 2024-11-15 23:29:31 [+0100], Thomas Gleixner wrote:
> > > IIRC, BPF has it's own allocator which can be used everywhere.
> >
> > Thomas Weißschuh made something. It appears to work. Need to take a
> > closer look.
> 
> Any more details?
> bpf_mem_alloc is a stop gap.

It is indeed using bpf_mem_alloc.
It is a fairly straightforward conversion, using one cache for
intermediate and one for non-intermediate nodes.

I'll try to send it early next week.
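
To give an idea of the shape (a rough sketch only, not the actual patch;
it assumes the fixed-size bpf_mem_alloc cache API):

	struct lpm_trie {
		struct bpf_map		map;
		struct bpf_mem_alloc	ma;	/* leaf nodes (with value) */
		struct bpf_mem_alloc	im_ma;	/* intermediate nodes */
		...
	};

	/* trie_alloc(): one fixed-size cache per node kind */
	err = bpf_mem_alloc_init(&trie->ma, sizeof(struct lpm_trie_node) +
				 trie->data_size + attr->value_size, false);
	err = err ?: bpf_mem_alloc_init(&trie->im_ma,
					sizeof(struct lpm_trie_node) +
					trie->data_size, false);

	/* trie_update_elem(): replaces the kmalloc()-based allocations */
	new_node = bpf_mem_cache_alloc(&trie->ma);
	im_node  = bpf_mem_cache_alloc(&trie->im_ma);

	/* free paths: replaces kfree()/kfree_rcu() */
	bpf_mem_cache_free_rcu(&trie->ma, free_node);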

> As Vlastimil Babka suggested long ago:
> https://lwn.net/Articles/974138/
> "...next on the target list is the special allocator used by the BPF
> subsystem. This allocator is intended to succeed in any calling
> context, including in non-maskable interrupts (NMIs). BPF maintainer
> Alexei Starovoitov is evidently in favor of this removal if SLUB is
> able to handle the same use cases..."
> 
> Here is the first step:
> https://lore.kernel.org/bpf/20241116014854.55141-1-alexei.starovoitov@gmail.com/
Re: [PATCH] bpf: Convert lpm_trie::lock to 'raw_spinlock_t'
Posted by Alexei Starovoitov 1 year, 1 month ago
On Sat, Nov 16, 2024 at 8:15 AM Thomas Weißschuh <linux@weissschuh.net> wrote:
>
> On 2024-11-16 08:01:49-0800, Alexei Starovoitov wrote:
> > On Sat, Nov 16, 2024 at 1:21 AM Sebastian Andrzej Siewior
> > <bigeasy@linutronix.de> wrote:
> > >
> > > On 2024-11-15 23:29:31 [+0100], Thomas Gleixner wrote:
> > > > IIRC, BPF has it's own allocator which can be used everywhere.
> > >
> > > Thomas Weißschuh made something. It appears to work. Need to take a
> > > closer look.
> >
> > Any more details?
> > bpf_mem_alloc is a stop gap.
>
> It is indeed using bpf_mem_alloc.
> It is a fairly straightforward conversion, using one cache for
> intermediate and one for non-intermediate nodes.

Sounds like you're proposing to allocate two lpm-specific bpf_ma-s?
Just use bpf_global_ma.
More ma-s means more memory pinned in bpf-specific freelists.
That's the main reason to teach slub and page_alloc about bpf requirements.
All memory should be shared by all subsystems.
Custom memory pools / freelists, whether it's bpf, networking
or whatever else, are a pain point for somebody.
The kernel needs to be optimal for all use cases.
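
With bpf_global_ma that would look roughly like this (sketch; sizes are
passed per allocation, so no per-map caches are needed):

	node = bpf_mem_alloc(&bpf_global_ma, sizeof(*node) +
			     trie->data_size + trie->map.value_size);
	...
	bpf_mem_free_rcu(&bpf_global_ma, node);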

> I'll try to send it early next week.

Looking forward.

> > As Vlastimil Babka suggested long ago:
> > https://lwn.net/Articles/974138/
> > "...next on the target list is the special allocator used by the BPF
> > subsystem. This allocator is intended to succeed in any calling
> > context, including in non-maskable interrupts (NMIs). BPF maintainer
> > Alexei Starovoitov is evidently in favor of this removal if SLUB is
> > able to handle the same use cases..."
> >
> > Here is the first step:
> > https://lore.kernel.org/bpf/20241116014854.55141-1-alexei.starovoitov@gmail.com/
Re: [PATCH] bpf: Convert lpm_trie::lock to 'raw_spinlock_t'
Posted by Hou Tao 1 year, 1 month ago
Hi,

On 11/17/2024 12:42 AM, Alexei Starovoitov wrote:
> On Sat, Nov 16, 2024 at 8:15 AM Thomas Weißschuh <linux@weissschuh.net> wrote:
>> On 2024-11-16 08:01:49-0800, Alexei Starovoitov wrote:
>>> On Sat, Nov 16, 2024 at 1:21 AM Sebastian Andrzej Siewior
>>> <bigeasy@linutronix.de> wrote:
>>>> On 2024-11-15 23:29:31 [+0100], Thomas Gleixner wrote:
>>>>> IIRC, BPF has it's own allocator which can be used everywhere.
>>>> Thomas Weißschuh made something. It appears to work. Need to take a
>>>> closer look.
>>> Any more details?
>>> bpf_mem_alloc is a stop gap.
>> It is indeed using bpf_mem_alloc.
>> It is a fairly straightforward conversion, using one cache for
>> intermediate and one for non-intermediate nodes.
> Sounds like you're proposing to allocate two lpm specific bpf_ma-s ?
> Just use bpf_global_ma.
> More ma-s means more memory pinned in bpf specific freelists.
> That's the main reason to teach slub and page_alloc about bpf requirements.
> All memory should be shared by all subsystems.
> Custom memory pools / freelists, whether it's bpf, networking
> or whatever else, is a pain point for somebody.
> The kernel needs to be optimal for all use cases.

I have been working on it since last week [1] and have already written a
patch (a patch in a patch set) for it. In my patch, these two allocators
are merged if they are mergeable, and the merge is currently decided by
the return value of kmalloc_size_roundup(). I also considered using
bpf_global_ma instead, but the biggest problem with bpf_global_ma is
memory accounting: the allocated memory would be accounted to the root
memory cgroup instead of the user's memory cgroup.

[1]:
https://lore.kernel.org/bpf/e14d8882-4760-7c9c-0cfc-db04eda494ee@huaweicloud.com/
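
As an illustration of the merge decision (a guess at the shape from the
description above, not the actual patch):

	size_t leaf_size = sizeof(struct lpm_trie_node) + trie->data_size +
			   trie->map.value_size;
	size_t im_size   = sizeof(struct lpm_trie_node) + trie->data_size;

	/* If both node kinds round up to the same slab object size, a
	 * single bpf_mem_alloc cache can serve both of them. */
	if (kmalloc_size_roundup(leaf_size) == kmalloc_size_roundup(im_size))
		... /* use one shared cache */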
>
>> I'll try to send it early next week.
> Looking forward.
>
>>> As Vlastimil Babka suggested long ago:
>>> https://lwn.net/Articles/974138/
>>> "...next on the target list is the special allocator used by the BPF
>>> subsystem. This allocator is intended to succeed in any calling
>>> context, including in non-maskable interrupts (NMIs). BPF maintainer
>>> Alexei Starovoitov is evidently in favor of this removal if SLUB is
>>> able to handle the same use cases..."
>>>
>>> Here is the first step:
>>> https://lore.kernel.org/bpf/20241116014854.55141-1-alexei.starovoitov@gmail.com/
>
> .

Re: [PATCH] bpf: Convert lpm_trie::lock to 'raw_spinlock_t'
Posted by Alexei Starovoitov 1 year, 1 month ago
On Thu, Nov 7, 2024 at 10:32 PM Kunwu Chan <kunwu.chan@linux.dev> wrote:
>
> From: Kunwu Chan <chentao@kylinos.cn>
>
> When PREEMPT_RT is enabled, 'spinlock_t' becomes preemptible
> and bpf program has owned a raw_spinlock under a interrupt handler,
> which results in invalid lock acquire context.
>
> [ BUG: Invalid wait context ]
> 6.12.0-rc5-next-20241031-syzkaller #0 Not tainted
> -----------------------------
> swapper/0/0 is trying to lock:
> ffff8880261e7a00 (&trie->lock){....}-{3:3},
> at: trie_delete_elem+0x96/0x6a0 kernel/bpf/lpm_trie.c:462
> other info that might help us debug this:
> context-{3:3}
> 5 locks held by swapper/0/0:
>  #0: ffff888020bb75c8 (&vp_dev->lock){-...}-{3:3},
> at: vp_vring_interrupt drivers/virtio/virtio_pci_common.c:80 [inline]
>  #0: ffff888020bb75c8 (&vp_dev->lock){-...}-{3:3},
> at: vp_interrupt+0x142/0x200 drivers/virtio/virtio_pci_common.c:113
>  #1: ffff88814174a120 (&vb->stop_update_lock){-...}-{3:3},
> at: spin_lock include/linux/spinlock.h:351 [inline]
>  #1: ffff88814174a120 (&vb->stop_update_lock){-...}-{3:3},
> at: stats_request+0x6f/0x230 drivers/virtio/virtio_balloon.c:438
>  #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
> at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
>  #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
> at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
>  #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
> at: __queue_work+0x199/0xf50 kernel/workqueue.c:2259
>  #3: ffff8880b863dd18 (&pool->lock){-.-.}-{2:2},
> at: __queue_work+0x759/0xf50
>  #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
> at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
>  #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
> at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
>  #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
> at: __bpf_trace_run kernel/trace/bpf_trace.c:2339 [inline]
>  #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
> at: bpf_trace_run1+0x1d6/0x520 kernel/trace/bpf_trace.c:2380
> stack backtrace:
> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted
> 6.12.0-rc5-next-20241031-syzkaller #0
> Hardware name: Google Compute Engine/Google Compute Engine,
> BIOS Google 09/13/2024
> Call Trace:
>  <IRQ>
>  __dump_stack lib/dump_stack.c:94 [inline]
>  dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
>  print_lock_invalid_wait_context kernel/locking/lockdep.c:4826 [inline]
>  check_wait_context kernel/locking/lockdep.c:4898 [inline]
>  __lock_acquire+0x15a8/0x2100 kernel/locking/lockdep.c:5176
>  lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
>  __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>  _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
>  trie_delete_elem+0x96/0x6a0 kernel/bpf/lpm_trie.c:462

This trace is from a non-RT kernel where spin_lock == raw_spin_lock.

I don't think Hou's explanation earlier is correct.
https://lore.kernel.org/bpf/e14d8882-4760-7c9c-0cfc-db04eda494ee@huaweicloud.com/

>  bpf_prog_2c29ac5cdc6b1842+0x43/0x47
>  bpf_dispatcher_nop_func include/linux/bpf.h:1290 [inline]
>  __bpf_prog_run include/linux/filter.h:701 [inline]
>  bpf_prog_run include/linux/filter.h:708 [inline]
>  __bpf_trace_run kernel/trace/bpf_trace.c:2340 [inline]
>  bpf_trace_run1+0x2ca/0x520 kernel/trace/bpf_trace.c:2380
>  trace_workqueue_activate_work+0x186/0x1f0 include/trace/events/workqueue.h:59
>  __queue_work+0xc7b/0xf50 kernel/workqueue.c:2338
>  queue_work_on+0x1c2/0x380 kernel/workqueue.c:2390

here irqs are disabled, but raw_spin_lock in lpm should be fine.

>  queue_work include/linux/workqueue.h:662 [inline]
>  stats_request+0x1a3/0x230 drivers/virtio/virtio_balloon.c:441
>  vring_interrupt+0x21d/0x380 drivers/virtio/virtio_ring.c:2595
>  vp_vring_interrupt drivers/virtio/virtio_pci_common.c:82 [inline]
>  vp_interrupt+0x192/0x200 drivers/virtio/virtio_pci_common.c:113
>  __handle_irq_event_percpu+0x29a/0xa80 kernel/irq/handle.c:158
>  handle_irq_event_percpu kernel/irq/handle.c:193 [inline]
>  handle_irq_event+0x89/0x1f0 kernel/irq/handle.c:210
>  handle_fasteoi_irq+0x48a/0xae0 kernel/irq/chip.c:720
>  generic_handle_irq_desc include/linux/irqdesc.h:173 [inline]
>  handle_irq arch/x86/kernel/irq.c:247 [inline]
>  call_irq_handler arch/x86/kernel/irq.c:259 [inline]
>  __common_interrupt+0x136/0x230 arch/x86/kernel/irq.c:285
>  common_interrupt+0xb4/0xd0 arch/x86/kernel/irq.c:278
>  </IRQ>
>
> Reported-by: syzbot+b506de56cbbb63148c33@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/bpf/6723db4a.050a0220.35b515.0168.GAE@google.com/
> Fixes: 66150d0dde03 ("bpf, lpm: Make locking RT friendly")
> Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
> ---
>  kernel/bpf/lpm_trie.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index 9b60eda0f727..373cdcfa0505 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -35,7 +35,7 @@ struct lpm_trie {
>         size_t                          n_entries;
>         size_t                          max_prefixlen;
>         size_t                          data_size;
> -       spinlock_t                      lock;
> +       raw_spinlock_t                  lock;
>  };

We're certainly not going back.

pw-bot: cr
Re: [PATCH] bpf: Convert lpm_trie::lock to 'raw_spinlock_t'
Posted by Hou Tao 1 year, 1 month ago
Hi Alexei,

On 11/9/2024 4:22 AM, Alexei Starovoitov wrote:
> On Thu, Nov 7, 2024 at 10:32 PM Kunwu Chan <kunwu.chan@linux.dev> wrote:
>> From: Kunwu Chan <chentao@kylinos.cn>
>>
>> When PREEMPT_RT is enabled, 'spinlock_t' becomes preemptible
>> and bpf program has owned a raw_spinlock under a interrupt handler,
>> which results in invalid lock acquire context.
>>
>> [ BUG: Invalid wait context ]
>> 6.12.0-rc5-next-20241031-syzkaller #0 Not tainted
>> -----------------------------
>> swapper/0/0 is trying to lock:
>> ffff8880261e7a00 (&trie->lock){....}-{3:3},
>> at: trie_delete_elem+0x96/0x6a0 kernel/bpf/lpm_trie.c:462
>> other info that might help us debug this:
>> context-{3:3}
>> 5 locks held by swapper/0/0:
>>  #0: ffff888020bb75c8 (&vp_dev->lock){-...}-{3:3},
>> at: vp_vring_interrupt drivers/virtio/virtio_pci_common.c:80 [inline]
>>  #0: ffff888020bb75c8 (&vp_dev->lock){-...}-{3:3},
>> at: vp_interrupt+0x142/0x200 drivers/virtio/virtio_pci_common.c:113
>>  #1: ffff88814174a120 (&vb->stop_update_lock){-...}-{3:3},
>> at: spin_lock include/linux/spinlock.h:351 [inline]
>>  #1: ffff88814174a120 (&vb->stop_update_lock){-...}-{3:3},
>> at: stats_request+0x6f/0x230 drivers/virtio/virtio_balloon.c:438
>>  #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
>> at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
>>  #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
>> at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
>>  #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
>> at: __queue_work+0x199/0xf50 kernel/workqueue.c:2259
>>  #3: ffff8880b863dd18 (&pool->lock){-.-.}-{2:2},
>> at: __queue_work+0x759/0xf50
>>  #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
>> at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
>>  #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
>> at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
>>  #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
>> at: __bpf_trace_run kernel/trace/bpf_trace.c:2339 [inline]
>>  #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
>> at: bpf_trace_run1+0x1d6/0x520 kernel/trace/bpf_trace.c:2380
>> stack backtrace:
>> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted
>> 6.12.0-rc5-next-20241031-syzkaller #0
>> Hardware name: Google Compute Engine/Google Compute Engine,
>> BIOS Google 09/13/2024
>> Call Trace:
>>  <IRQ>
>>  __dump_stack lib/dump_stack.c:94 [inline]
>>  dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
>>  print_lock_invalid_wait_context kernel/locking/lockdep.c:4826 [inline]
>>  check_wait_context kernel/locking/lockdep.c:4898 [inline]
>>  __lock_acquire+0x15a8/0x2100 kernel/locking/lockdep.c:5176
>>  lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
>>  __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>>  _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
>>  trie_delete_elem+0x96/0x6a0 kernel/bpf/lpm_trie.c:462
> This trace is from non-RT kernel where spin_lock == raw_spin_lock.

Yes. However, I think the reason for the warning is that lockdep
considers the case possible under PREEMPT_RT, where it would violate
the locking rules [1].

[1]:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=560af5dc839eef08a273908f390cfefefb82aa04
>
> I don't think Hou's explanation earlier is correct.
> https://lore.kernel.org/bpf/e14d8882-4760-7c9c-0cfc-db04eda494ee@huaweicloud.com/

OK. Is the bpf mem allocator part OK for you?
>
>>  bpf_prog_2c29ac5cdc6b1842+0x43/0x47
>>  bpf_dispatcher_nop_func include/linux/bpf.h:1290 [inline]
>>  __bpf_prog_run include/linux/filter.h:701 [inline]
>>  bpf_prog_run include/linux/filter.h:708 [inline]
>>  __bpf_trace_run kernel/trace/bpf_trace.c:2340 [inline]
>>  bpf_trace_run1+0x2ca/0x520 kernel/trace/bpf_trace.c:2380
>>  trace_workqueue_activate_work+0x186/0x1f0 include/trace/events/workqueue.h:59
>>  __queue_work+0xc7b/0xf50 kernel/workqueue.c:2338
>>  queue_work_on+0x1c2/0x380 kernel/workqueue.c:2390
> here irqs are disabled, but raw_spin_lock in lpm should be fine.
>
>>  queue_work include/linux/workqueue.h:662 [inline]
>>  stats_request+0x1a3/0x230 drivers/virtio/virtio_balloon.c:441
>>  vring_interrupt+0x21d/0x380 drivers/virtio/virtio_ring.c:2595
>>  vp_vring_interrupt drivers/virtio/virtio_pci_common.c:82 [inline]
>>  vp_interrupt+0x192/0x200 drivers/virtio/virtio_pci_common.c:113
>>  __handle_irq_event_percpu+0x29a/0xa80 kernel/irq/handle.c:158
>>  handle_irq_event_percpu kernel/irq/handle.c:193 [inline]
>>  handle_irq_event+0x89/0x1f0 kernel/irq/handle.c:210
>>  handle_fasteoi_irq+0x48a/0xae0 kernel/irq/chip.c:720
>>  generic_handle_irq_desc include/linux/irqdesc.h:173 [inline]
>>  handle_irq arch/x86/kernel/irq.c:247 [inline]
>>  call_irq_handler arch/x86/kernel/irq.c:259 [inline]
>>  __common_interrupt+0x136/0x230 arch/x86/kernel/irq.c:285
>>  common_interrupt+0xb4/0xd0 arch/x86/kernel/irq.c:278
>>  </IRQ>
>>
>> Reported-by: syzbot+b506de56cbbb63148c33@syzkaller.appspotmail.com
>> Closes: https://lore.kernel.org/bpf/6723db4a.050a0220.35b515.0168.GAE@google.com/
>> Fixes: 66150d0dde03 ("bpf, lpm: Make locking RT friendly")
>> Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
>> ---
>>  kernel/bpf/lpm_trie.c | 12 ++++++------
>>  1 file changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>> index 9b60eda0f727..373cdcfa0505 100644
>> --- a/kernel/bpf/lpm_trie.c
>> +++ b/kernel/bpf/lpm_trie.c
>> @@ -35,7 +35,7 @@ struct lpm_trie {
>>         size_t                          n_entries;
>>         size_t                          max_prefixlen;
>>         size_t                          data_size;
>> -       spinlock_t                      lock;
>> +       raw_spinlock_t                  lock;
>>  };
> We're certainly not going back.

Only switching from spinlock_t to raw_spinlock_t is not enough: running
it under PREEMPT_RT after the change will still trigger a similar
lockdep warning. That is because kmalloc() may acquire a spinlock_t as
well. However, after changing kmalloc() and its variants to the bpf
memory allocator, I think the switch to raw_spinlock_t will be safe. I
have already written a draft patch set and will post it after polishing
and testing. WDYT?
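
The end state would then look roughly like this (sketch only, assuming
node allocation is moved to a per-trie bpf_mem_alloc cache; trie->ma is
such a hypothetical cache, not an existing field):

	raw_spin_lock_irqsave(&trie->lock, irq_flags);
	/* bpf_mem_cache_alloc() only uses per-CPU freelists and raw locks
	 * internally, so this nesting is fine even on PREEMPT_RT */
	new_node = bpf_mem_cache_alloc(&trie->ma);
	...
	raw_spin_unlock_irqrestore(&trie->lock, irq_flags);
	bpf_mem_cache_free_rcu(&trie->ma, free_node);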
>
> pw-bot: cr

Re: [PATCH] bpf: Convert lpm_trie::lock to 'raw_spinlock_t'
Posted by Alexei Starovoitov 1 year, 1 month ago
On Fri, Nov 8, 2024 at 6:46 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi Alexei,
>
> On 11/9/2024 4:22 AM, Alexei Starovoitov wrote:
> > On Thu, Nov 7, 2024 at 10:32 PM Kunwu Chan <kunwu.chan@linux.dev> wrote:
> >> From: Kunwu Chan <chentao@kylinos.cn>
> >>
> >> When PREEMPT_RT is enabled, 'spinlock_t' becomes preemptible
> >> and bpf program has owned a raw_spinlock under a interrupt handler,
> >> which results in invalid lock acquire context.
> >>
> >> [ BUG: Invalid wait context ]
> >> 6.12.0-rc5-next-20241031-syzkaller #0 Not tainted
> >> -----------------------------
> >> swapper/0/0 is trying to lock:
> >> ffff8880261e7a00 (&trie->lock){....}-{3:3},
> >> at: trie_delete_elem+0x96/0x6a0 kernel/bpf/lpm_trie.c:462
> >> other info that might help us debug this:
> >> context-{3:3}
> >> 5 locks held by swapper/0/0:
> >>  #0: ffff888020bb75c8 (&vp_dev->lock){-...}-{3:3},
> >> at: vp_vring_interrupt drivers/virtio/virtio_pci_common.c:80 [inline]
> >>  #0: ffff888020bb75c8 (&vp_dev->lock){-...}-{3:3},
> >> at: vp_interrupt+0x142/0x200 drivers/virtio/virtio_pci_common.c:113
> >>  #1: ffff88814174a120 (&vb->stop_update_lock){-...}-{3:3},
> >> at: spin_lock include/linux/spinlock.h:351 [inline]
> >>  #1: ffff88814174a120 (&vb->stop_update_lock){-...}-{3:3},
> >> at: stats_request+0x6f/0x230 drivers/virtio/virtio_balloon.c:438
> >>  #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
> >> at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
> >>  #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
> >> at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
> >>  #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
> >> at: __queue_work+0x199/0xf50 kernel/workqueue.c:2259
> >>  #3: ffff8880b863dd18 (&pool->lock){-.-.}-{2:2},
> >> at: __queue_work+0x759/0xf50
> >>  #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
> >> at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
> >>  #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
> >> at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
> >>  #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
> >> at: __bpf_trace_run kernel/trace/bpf_trace.c:2339 [inline]
> >>  #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
> >> at: bpf_trace_run1+0x1d6/0x520 kernel/trace/bpf_trace.c:2380
> >> stack backtrace:
> >> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted
> >> 6.12.0-rc5-next-20241031-syzkaller #0
> >> Hardware name: Google Compute Engine/Google Compute Engine,
> >> BIOS Google 09/13/2024
> >> Call Trace:
> >>  <IRQ>
> >>  __dump_stack lib/dump_stack.c:94 [inline]
> >>  dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
> >>  print_lock_invalid_wait_context kernel/locking/lockdep.c:4826 [inline]
> >>  check_wait_context kernel/locking/lockdep.c:4898 [inline]
> >>  __lock_acquire+0x15a8/0x2100 kernel/locking/lockdep.c:5176
> >>  lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
> >>  __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
> >>  _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
> >>  trie_delete_elem+0x96/0x6a0 kernel/bpf/lpm_trie.c:462
> > This trace is from non-RT kernel where spin_lock == raw_spin_lock.
>
> Yes. However, I think the reason for the warning is that lockdep
> considers the case is possible under PREEMPT_RT and it violates the rule
> of lock [1].
>
> [1]:
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=560af5dc839eef08a273908f390cfefefb82aa04
> >
> > I don't think Hou's explanation earlier is correct.
> > https://lore.kernel.org/bpf/e14d8882-4760-7c9c-0cfc-db04eda494ee@huaweicloud.com/
>
> OK. Is the bpf mem allocator part OK for you ?
> >
> >>  bpf_prog_2c29ac5cdc6b1842+0x43/0x47
> >>  bpf_dispatcher_nop_func include/linux/bpf.h:1290 [inline]
> >>  __bpf_prog_run include/linux/filter.h:701 [inline]
> >>  bpf_prog_run include/linux/filter.h:708 [inline]
> >>  __bpf_trace_run kernel/trace/bpf_trace.c:2340 [inline]
> >>  bpf_trace_run1+0x2ca/0x520 kernel/trace/bpf_trace.c:2380
> >>  trace_workqueue_activate_work+0x186/0x1f0 include/trace/events/workqueue.h:59
> >>  __queue_work+0xc7b/0xf50 kernel/workqueue.c:2338
> >>  queue_work_on+0x1c2/0x380 kernel/workqueue.c:2390
> > here irqs are disabled, but raw_spin_lock in lpm should be fine.
> >
> >>  queue_work include/linux/workqueue.h:662 [inline]
> >>  stats_request+0x1a3/0x230 drivers/virtio/virtio_balloon.c:441
> >>  vring_interrupt+0x21d/0x380 drivers/virtio/virtio_ring.c:2595
> >>  vp_vring_interrupt drivers/virtio/virtio_pci_common.c:82 [inline]
> >>  vp_interrupt+0x192/0x200 drivers/virtio/virtio_pci_common.c:113
> >>  __handle_irq_event_percpu+0x29a/0xa80 kernel/irq/handle.c:158
> >>  handle_irq_event_percpu kernel/irq/handle.c:193 [inline]
> >>  handle_irq_event+0x89/0x1f0 kernel/irq/handle.c:210
> >>  handle_fasteoi_irq+0x48a/0xae0 kernel/irq/chip.c:720
> >>  generic_handle_irq_desc include/linux/irqdesc.h:173 [inline]
> >>  handle_irq arch/x86/kernel/irq.c:247 [inline]
> >>  call_irq_handler arch/x86/kernel/irq.c:259 [inline]
> >>  __common_interrupt+0x136/0x230 arch/x86/kernel/irq.c:285
> >>  common_interrupt+0xb4/0xd0 arch/x86/kernel/irq.c:278
> >>  </IRQ>
> >>
> >> Reported-by: syzbot+b506de56cbbb63148c33@syzkaller.appspotmail.com
> >> Closes: https://lore.kernel.org/bpf/6723db4a.050a0220.35b515.0168.GAE@google.com/
> >> Fixes: 66150d0dde03 ("bpf, lpm: Make locking RT friendly")
> >> Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
> >> ---
> >>  kernel/bpf/lpm_trie.c | 12 ++++++------
> >>  1 file changed, 6 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> >> index 9b60eda0f727..373cdcfa0505 100644
> >> --- a/kernel/bpf/lpm_trie.c
> >> +++ b/kernel/bpf/lpm_trie.c
> >> @@ -35,7 +35,7 @@ struct lpm_trie {
> >>         size_t                          n_entries;
> >>         size_t                          max_prefixlen;
> >>         size_t                          data_size;
> >> -       spinlock_t                      lock;
> >> +       raw_spinlock_t                  lock;
> >>  };
> > We're certainly not going back.
>
> Only switching from spinlock_t to raw_spinlock_t is not enough, running
> it under PREEMPT_RT after the change will still trigger the similar
> lockdep warning. That is because kmalloc() may acquire a spinlock_t as
> well. However, after changing the kmalloc and its variants to bpf memory
> allocator, I think the switch to raw_spinlock_t will be safe. I have
> already written a draft patch set. Will post after after polishing and
> testing it. WDYT ?

Switching lpm to bpf_mem_alloc would address the issue.
Why do you want a switch to raw_spin_lock as well?
kfree_rcu() is already done outside of the lock.
Re: [PATCH] bpf: Convert lpm_trie::lock to 'raw_spinlock_t'
Posted by Hou Tao 1 year, 1 month ago
Hi,

On 11/10/2024 8:04 AM, Alexei Starovoitov wrote:
> On Fri, Nov 8, 2024 at 6:46 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi Alexei,
>>
>> On 11/9/2024 4:22 AM, Alexei Starovoitov wrote:
>>> On Thu, Nov 7, 2024 at 10:32 PM Kunwu Chan <kunwu.chan@linux.dev> wrote:
>>>> From: Kunwu Chan <chentao@kylinos.cn>
>>>>
>>>> When PREEMPT_RT is enabled, 'spinlock_t' becomes preemptible
>>>> and bpf program has owned a raw_spinlock under a interrupt handler,
>>>> which results in invalid lock acquire context.
>>>>
>>>> [ BUG: Invalid wait context ]
>>>> 6.12.0-rc5-next-20241031-syzkaller #0 Not tainted
>>>> -----------------------------
>>>> swapper/0/0 is trying to lock:
>>>> ffff8880261e7a00 (&trie->lock){....}-{3:3},
>>>> at: trie_delete_elem+0x96/0x6a0 kernel/bpf/lpm_trie.c:462
>>>> other info that might help us debug this:
>>>> context-{3:3}
>>>> 5 locks held by swapper/0/0:
>>>>  #0: ffff888020bb75c8 (&vp_dev->lock){-...}-{3:3},
>>>> at: vp_vring_interrupt drivers/virtio/virtio_pci_common.c:80 [inline]
>>>>  #0: ffff888020bb75c8 (&vp_dev->lock){-...}-{3:3},
>>>> at: vp_interrupt+0x142/0x200 drivers/virtio/virtio_pci_common.c:113
>>>>  #1: ffff88814174a120 (&vb->stop_update_lock){-...}-{3:3},
>>>> at: spin_lock include/linux/spinlock.h:351 [inline]
>>>>  #1: ffff88814174a120 (&vb->stop_update_lock){-...}-{3:3},
>>>> at: stats_request+0x6f/0x230 drivers/virtio/virtio_balloon.c:438
>>>>  #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
>>>> at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
>>>>  #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
>>>> at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
>>>>  #2: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
>>>> at: __queue_work+0x199/0xf50 kernel/workqueue.c:2259
>>>>  #3: ffff8880b863dd18 (&pool->lock){-.-.}-{2:2},
>>>> at: __queue_work+0x759/0xf50
>>>>  #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
>>>> at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
>>>>  #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
>>>> at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
>>>>  #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
>>>> at: __bpf_trace_run kernel/trace/bpf_trace.c:2339 [inline]
>>>>  #4: ffffffff8e939f20 (rcu_read_lock){....}-{1:3},
>>>> at: bpf_trace_run1+0x1d6/0x520 kernel/trace/bpf_trace.c:2380
>>>> stack backtrace:
>>>> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted
>>>> 6.12.0-rc5-next-20241031-syzkaller #0
>>>> Hardware name: Google Compute Engine/Google Compute Engine,
>>>> BIOS Google 09/13/2024
>>>> Call Trace:
>>>>  <IRQ>
>>>>  __dump_stack lib/dump_stack.c:94 [inline]
>>>>  dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
>>>>  print_lock_invalid_wait_context kernel/locking/lockdep.c:4826 [inline]
>>>>  check_wait_context kernel/locking/lockdep.c:4898 [inline]
>>>>  __lock_acquire+0x15a8/0x2100 kernel/locking/lockdep.c:5176
>>>>  lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
>>>>  __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>>>>  _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
>>>>  trie_delete_elem+0x96/0x6a0 kernel/bpf/lpm_trie.c:462
>>> This trace is from non-RT kernel where spin_lock == raw_spin_lock.
>> Yes. However, I think the reason for the warning is that lockdep
>> considers the case is possible under PREEMPT_RT and it violates the rule
>> of lock [1].
>>
>> [1]:
>> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=560af5dc839eef08a273908f390cfefefb82aa04
>>> I don't think Hou's explanation earlier is correct.
>>> https://lore.kernel.org/bpf/e14d8882-4760-7c9c-0cfc-db04eda494ee@huaweicloud.com/
>> OK. Is the bpf mem allocator part OK for you ?
>>>>  bpf_prog_2c29ac5cdc6b1842+0x43/0x47
>>>>  bpf_dispatcher_nop_func include/linux/bpf.h:1290 [inline]
>>>>  __bpf_prog_run include/linux/filter.h:701 [inline]
>>>>  bpf_prog_run include/linux/filter.h:708 [inline]
>>>>  __bpf_trace_run kernel/trace/bpf_trace.c:2340 [inline]
>>>>  bpf_trace_run1+0x2ca/0x520 kernel/trace/bpf_trace.c:2380
>>>>  trace_workqueue_activate_work+0x186/0x1f0 include/trace/events/workqueue.h:59
>>>>  __queue_work+0xc7b/0xf50 kernel/workqueue.c:2338
>>>>  queue_work_on+0x1c2/0x380 kernel/workqueue.c:2390
>>> here irqs are disabled, but raw_spin_lock in lpm should be fine.
>>>
>>>>  queue_work include/linux/workqueue.h:662 [inline]
>>>>  stats_request+0x1a3/0x230 drivers/virtio/virtio_balloon.c:441
>>>>  vring_interrupt+0x21d/0x380 drivers/virtio/virtio_ring.c:2595
>>>>  vp_vring_interrupt drivers/virtio/virtio_pci_common.c:82 [inline]
>>>>  vp_interrupt+0x192/0x200 drivers/virtio/virtio_pci_common.c:113
>>>>  __handle_irq_event_percpu+0x29a/0xa80 kernel/irq/handle.c:158
>>>>  handle_irq_event_percpu kernel/irq/handle.c:193 [inline]
>>>>  handle_irq_event+0x89/0x1f0 kernel/irq/handle.c:210
>>>>  handle_fasteoi_irq+0x48a/0xae0 kernel/irq/chip.c:720
>>>>  generic_handle_irq_desc include/linux/irqdesc.h:173 [inline]
>>>>  handle_irq arch/x86/kernel/irq.c:247 [inline]
>>>>  call_irq_handler arch/x86/kernel/irq.c:259 [inline]
>>>>  __common_interrupt+0x136/0x230 arch/x86/kernel/irq.c:285
>>>>  common_interrupt+0xb4/0xd0 arch/x86/kernel/irq.c:278
>>>>  </IRQ>
>>>>
>>>> Reported-by: syzbot+b506de56cbbb63148c33@syzkaller.appspotmail.com
>>>> Closes: https://lore.kernel.org/bpf/6723db4a.050a0220.35b515.0168.GAE@google.com/
>>>> Fixes: 66150d0dde03 ("bpf, lpm: Make locking RT friendly")
>>>> Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
>>>> ---
>>>>  kernel/bpf/lpm_trie.c | 12 ++++++------
>>>>  1 file changed, 6 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>>>> index 9b60eda0f727..373cdcfa0505 100644
>>>> --- a/kernel/bpf/lpm_trie.c
>>>> +++ b/kernel/bpf/lpm_trie.c
>>>> @@ -35,7 +35,7 @@ struct lpm_trie {
>>>>         size_t                          n_entries;
>>>>         size_t                          max_prefixlen;
>>>>         size_t                          data_size;
>>>> -       spinlock_t                      lock;
>>>> +       raw_spinlock_t                  lock;
>>>>  };
>>> We're certainly not going back.
>> Only switching from spinlock_t to raw_spinlock_t is not enough, running
>> it under PREEMPT_RT after the change will still trigger the similar
>> lockdep warning. That is because kmalloc() may acquire a spinlock_t as
>> well. However, after changing the kmalloc and its variants to bpf memory
>> allocator, I think the switch to raw_spinlock_t will be safe. I have
>> already written a draft patch set. Will post after after polishing and
>> testing it. WDYT ?
> Switching lpm to bpf_mem_alloc would address the issue.
> Why do you want a switch to raw_spin_lock as well?
> kfree_rcu() is already done outside of the lock.

After switching to raw_spinlock_t, the lpm trie could be used in
interrupt context even under PREEMPT_RT.
> .

Re: [PATCH] bpf: Convert lpm_trie::lock to 'raw_spinlock_t'
Posted by Sebastian Andrzej Siewior 1 year, 1 month ago
On 2024-11-10 10:08:00 [+0800], Hou Tao wrote:
> >> well. However, after changing the kmalloc and its variants to bpf memory
> >> allocator, I think the switch to raw_spinlock_t will be safe. I have
> >> already written a draft patch set. Will post after after polishing and
> >> testing it. WDYT ?
> > Switching lpm to bpf_mem_alloc would address the issue.
> > Why do you want a switch to raw_spin_lock as well?
> > kfree_rcu() is already done outside of the lock.
> 
> After switching to raw_spinlock_t, the lpm trie could be used under
> interrupt context even under PREEMPT_RT.

I would have to dig into why the lock was moved away from raw_spinlock_t,
why we need it back, and what has changed since then. I have some vague
memory that there was a test case which added plenty of items, and
cleaning them up created latency spikes.
Note that interrupts are threaded on PREEMPT_RT. Using it in "interrupt
context" would mean you need this in the primary handler/hardirq.

Sebastian