[v1] net: bridge: reduce multicast checks in fast path

[PATCH 0/9] net: bridge: reduce multicast checks in fast path

Posted by Linus Lüssing 5 months, 1 week ago

This patchset introduces new state variables to combine and reduce the
number of checks we would otherwise perform on every multicast packet
in fast/data path.
  
The second reason for introducing these new, internal multicast active
variables is to later propagate a safety mechanism which was introduced
in b00589af3b04 ("bridge: disable snooping if there is no querier") to
switchdev/DSA, too. That is to notify switchdev/DSA if multicast
snooping can safely be applied without potential packet loss.
    
Regards, Linus

---

# Changelog

Changelog to / follow-up of: [PATCH net-next 0/5] net: bridge: propagate safe mcast snooping to switchdev + DSA
-> https://lkml.org/lkml/2025/5/22/1413

* removed the switchdev/DSA changes for now
* splitting "[PATCH net-next 1/5] net: bridge: mcast: explicitly track active state"
  into:
  * net: bridge: mcast: track active state, IGMP/MLD querier appearance
  * net: bridge: mcast: track active state, foreign IGMP/MLD querier disappearance
  * net: bridge: mcast: track active state, IPv6 address availability
  * net: bridge: mcast: track active state, own MLD querier disappearance
  * net: bridge: mcast: use combined active state in fast/data path
  * net: bridge: mcast: track active state, bridge up/down

* rebased to current net-next/main:
  * from_timer() -> timer_container_of()

* net: bridge: mcast: export ip{4,6}_active state to netlink:
  * changing NLA_U8 to NLA_REJECT to make it read-only

* moved br_multicast_update_active() call from br_ip{4,6}_multicast_query_expired()
  (own querier timer callback) to br_ip{4,6}_multicast_querier_expired()
  (other querier timer callback)
  * even though both should have worked as br_multicast_querier_expired()
    would call br_multicast_start_querier()->...->br_multicast_query_expired(),
    even if the own querier is disabled, but let's use the more direct way

* simplified br_multicast_update_active():
  * no return value for now, don't track if the active state has changed,
    these aren't necessary (yet)
  * removed __br_multicast_update_active() variant as was used to force
    an inactive state in __br_multicast_stop(), instead using an
    netif_running(brmctx->br->dev) check in br_multicast_update_active()
  * replaced br_ip{4,6}_multicast_check_active() with simpler
    br_ip{4,6}_multicast_update_active() and
    br_ip{4,6}_multicast_querier_exists()
  * fixing build errors with CONFIG_IPV6 unset
* simplified br_multicast_toggle_enabled()
  * no return value for now
  * fixes "old used uninitialized" issue

* removed const from __br_multicast_querier_exists()'s "bool is_ipv6"
* replaced "struct ethhdr *eth" in br_multicast_{snooping,querier}_active()
  with direct ethernet protocol integer attributes
* added a few comments in br_multicast_update_active() calling functions

Re: [PATCH 0/9] net: bridge: reduce multicast checks in fast path

Posted by Jakub Kicinski 5 months, 1 week ago

On Fri, 29 Aug 2025 10:53:41 +0200 Linus Lüssing wrote:
> This patchset introduces new state variables to combine and reduce the
> number of checks we would otherwise perform on every multicast packet
> in fast/data path.
>   
> The second reason for introducing these new, internal multicast active
> variables is to later propagate a safety mechanism which was introduced
> in b00589af3b04 ("bridge: disable snooping if there is no querier") to
> switchdev/DSA, too. That is to notify switchdev/DSA if multicast
> snooping can safely be applied without potential packet loss.

Please leave the git-generated diff stat in the cover letter.
Please include tree designation in the subject, per:
https://www.kernel.org/doc/html/next/process/maintainer-netdev.html

I'll leave the real review to the experts but this series appears
to make kselftests unhappy:

[  106.423894] WARNING: CPU: 3 PID: 1121 at net/bridge/br_multicast.c:1388 __br_multicast_stop+0xa0/0xc0 [bridge]
[  106.424022] Modules linked in: sch_ingress 8021q act_mirred cls_matchall sch_red dummy bridge stp llc sch_tbf vrf veth
[  106.424144] CPU: 3 UID: 0 PID: 1121 Comm: ip Not tainted 6.17.0-rc2-virtme #1 PREEMPT(voluntary) 
[  106.424235] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  106.424301] RIP: 0010:__br_multicast_stop+0xa0/0xc0 [bridge]
[  106.424371] Code: 89 df e8 f3 fd ff ff 80 bb 2c 01 00 00 00 75 19 80 bb 0c 02 00 00 00 75 1d 48 8b 3b 5b 48 81 c7 4c 03 00 00 e9 b1 c2 0b f8 90 <0f> 0b 90 80 bb 0c 02 00 00 00 74 e3 90 0f 0b 90 48 8b 3b 5b 48 81
[  106.424544] RSP: 0018:ffffb78500283888 EFLAGS: 00010202
[  106.424586] RAX: 0000000000000000 RBX: ffff979907d81af0 RCX: 0000000000000000
[  106.424665] RDX: ffff979907d81c50 RSI: 0000000000000001 RDI: ffff979907d819c0
[  106.424753] RBP: ffff979907d819c0 R08: ffffffffb94f0180 R09: ffffb78500283b40
[  106.424836] R10: ffff97990342d250 R11: ffff979907d81000 R12: ffff979907d819c0
[  106.424913] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000001
[  106.424992] FS:  00007f7e9de78800(0000) GS:ffff979985967000(0000) knlGS:0000000000000000
[  106.425073] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  106.425144] CR2: 0000000000447b60 CR3: 0000000004d3b003 CR4: 0000000000772ef0
[  106.425226] PKRU: 55555554
[  106.425250] Call Trace:
[  106.425278]  <TASK>
[  106.425305]  br_multicast_toggle_vlan_snooping+0x1a9/0x1e0 [bridge]
[  106.425383]  br_boolopt_multi_toggle+0x54/0x90 [bridge]
[  106.425443]  br_changelink+0x4be/0x510 [bridge]
[  106.425510]  ? ns_capable+0x2d/0x60
[  106.425557]  rtnl_newlink+0x73f/0xbc0
[  106.425605]  ? rtnl_setlink+0x2c0/0x2c0
[  106.425633]  rtnetlink_rcv_msg+0x358/0x400
[  106.425676]  ? update_load_avg+0x6f/0x350
[  106.425726]  ? rtnl_calcit.isra.0+0x110/0x110
[  106.425778]  netlink_rcv_skb+0x57/0x100
[  106.425819]  netlink_unicast+0x252/0x380
[  106.425858]  ? __alloc_skb+0xdb/0x190
[  106.425904]  netlink_sendmsg+0x1be/0x3e0
[  106.425945]  ____sys_sendmsg+0x132/0x260
[  106.425984]  ? copy_msghdr_from_user+0x6c/0xa0
[  106.426039]  ___sys_sendmsg+0x87/0xd0
[  106.426083]  ? __handle_mm_fault+0xa41/0xe50
[  106.426144]  __sys_sendmsg+0x71/0xd0
[  106.426184]  do_syscall_64+0xa4/0x260
[  106.426224]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
[  106.426279] RIP: 0033:0x7f7e9e0451e7

Re: [PATCH 0/9] net: bridge: reduce multicast checks in fast path

Posted by Nikolay Aleksandrov 5 months, 1 week ago

On 8/29/25 18:47, Jakub Kicinski wrote:
> On Fri, 29 Aug 2025 10:53:41 +0200 Linus Lüssing wrote:
>> This patchset introduces new state variables to combine and reduce the
>> number of checks we would otherwise perform on every multicast packet
>> in fast/data path.
>>    
>> The second reason for introducing these new, internal multicast active
>> variables is to later propagate a safety mechanism which was introduced
>> in b00589af3b04 ("bridge: disable snooping if there is no querier") to
>> switchdev/DSA, too. That is to notify switchdev/DSA if multicast
>> snooping can safely be applied without potential packet loss.
> 
> Please leave the git-generated diff stat in the cover letter.
> Please include tree designation in the subject, per:
> https://www.kernel.org/doc/html/next/process/maintainer-netdev.html
> 
> I'll leave the real review to the experts but this series appears
> to make kselftests unhappy:
> 

just fyi my email wasn't working for 2 days and unfortunately I missed this set
I took a look now on patchwork, I do have comments but it's difficult to reply as
I don't have the emails and have to do it manually to each, so I'd rather wait
for v2.

a few notes for v2:
- please use READ/WRTE_ONCE() for variables that are used without locking
- please make locking symmetric, I saw that br_multicast_open() expects the lock to be already held, while
   __br_multicast_stop() takes it itself
- target net-next
- is the mcast lock really necessary, would atomic ops do for this tracking?
- can you provide the full view somewhere, how would this tracking be used? I fear
   there might still be races.
- please add more details exactly what we save on the fast-path, I know but it'd be nice
   to have it in the commit message as well, all commits just say "reduce checks, save cycles"
   but there are no details what we save

I will try to give more detailed comments in v2.

Thank you,
  Nik

Re: [PATCH 0/9] net: bridge: reduce multicast checks in fast path

Posted by Linus Lüssing 5 months, 1 week ago

Hi Nik, thanks for the suggestions and review again!

On Fri, Aug 29, 2025 at 07:23:24PM +0300, Nikolay Aleksandrov wrote:
> 
> a few notes for v2:
> - please use READ/WRTE_ONCE() for variables that are used without locking

Just to understand the issue, otherwise the data path would assume
an old inactive or active state for a prolonged time after toggling?
Or what would happen?

> - please make locking symmetric, I saw that br_multicast_open() expects the lock to be already held, while
>   __br_multicast_stop() takes it itself

I think that's what I tried before submitting. Initially wanted
to have the locking inside, but then it would deadlock on
br_multicast_toggle()->br_multicast_open()->... as this is the one
place where br_multicast_open() is called with the multicast spinlock
already held.

On the other hand, moving the spinlock out of / around
__br_multicast_stop() would lead to a sleeping-while-atomic bug
when calling timer_delete_sync() in there. And if I were to change
these to a timer_delete() I guess there is no way to do the sync
part after unlocking? There is no equivalent to something like the
flush_workqueue() / drain_workqueue() for workqueues, but for
simple timers instead, right?

I would also love to go for the first approach, taking the
spinlock inside of __br_multicast_open(). But not quite sure how
to best and safely change br_multicast_toggle() then.

> - is the mcast lock really necessary, would atomic ops do for this tracking?

Hm, not sure. We'd be checking multiple variables before toggling
the new brmctx->ip{4,6}_active. As the checked variables can
change from multiple contexts couldn't the following happen then?

Start: ip*_active = false, snooping_enabled = false,
       vlan_snooping_enabled = true, vlan{id:42}->snooping_enabled = false

Thread A)                     Thread B)
--------------------------------------------------------------------------
                              br_multicast_toggle(br, 1, ...)
			      -> loads vlan{id:42}->snooping_enabled: false
--------------------------------------------------------------------------
br_multicast_toggle_one_vlan(vlan{id:42}, true)
-> vlan->priv_flags: set per-vlan-mc-snooping to true
-> br_multicast_update_active()
   checks snooping_enabled: false
   -> keeping vlan's ip*_active at false
--------------------------------------------------------------------------
                              -> sets snooping_enabled: true
                              -> br_multicast_update_active()
			         -> checks vlan{id:42}->snooping_enabled:
				    false (from earlier load above)
                                 -> keeping vlan's ip*_active at false

Result: vlan's ip*_active is still false even though it should be
true now?

.o(Or were netlink and sysfs handlers somehow ensured to never run in
parallel?)

I'm not that versed with atomic's and explicit memory barriers,
could that prevent something like that even if multiple variables
are involved? Only used lockless atomic_t's for atomic_inc()/_dec() so far.
And otherwise used explicit locking.

> - can you provide the full view somewhere, how would this tracking be used? I fear
>   there might still be races.

My original switchdev integration plan so far was roughly still the same
as in the previous pull-request:

https://patchwork.kernel.org/project/netdevbpf/patch/20250522195952.29265-5-linus.luessing@c0d3.blue/

And using it in rtl83xx in OpenWrt like this:
https://github.com/openwrt/openwrt/pull/18780/commits/708bbc4b73bc90cd13839c613e6634aa5faeeace#diff-b5c9a9cc66e207d77fea5d063dca2cce20cf4ae3a28fc0a5d5eebd47a024d5c3

But haven't updated these yet onto this PR, will do that later.

Re: [PATCH 0/9] net: bridge: reduce multicast checks in fast path

Posted by Nikolay Aleksandrov 5 months, 1 week ago

On 9/2/25 23:16, Linus Lüssing wrote:
> Hi Nik, thanks for the suggestions and review again!
> 
> 
> On Fri, Aug 29, 2025 at 07:23:24PM +0300, Nikolay Aleksandrov wrote:
>>
>> a few notes for v2:
>> - please use READ/WRTE_ONCE() for variables that are used without locking
> 
> Just to understand the issue, otherwise the data path would assume
> an old inactive or active state for a prolonged time after toggling?
> Or what would happen?
>
> 

It's more about marking these as used without locking so KCSAN won't flag them and also
to clearly show people that intent.

>> - please make locking symmetric, I saw that br_multicast_open() expects the lock to be already held, while
>>    __br_multicast_stop() takes it itself
> 
> I think that's what I tried before submitting. Initially wanted
> to have the locking inside, but then it would deadlock on
> br_multicast_toggle()->br_multicast_open()->... as this is the one
> place where br_multicast_open() is called with the multicast spinlock
> already held.
> 
> On the other hand, moving the spinlock out of / around
> __br_multicast_stop() would lead to a sleeping-while-atomic bug
> when calling timer_delete_sync() in there. And if I were to change
> these to a timer_delete() I guess there is no way to do the sync
> part after unlocking? There is no equivalent to something like the
> flush_workqueue() / drain_workqueue() for workqueues, but for
> simple timers instead, right?
> 
> I would also love to go for the first approach, taking the
> spinlock inside of __br_multicast_open(). But not quite sure how
> to best and safely change br_multicast_toggle() then.
> 
> 

Well, this is not an easy one to solve, would require more thought and
changes to get the proper locking, but it certainly shouldn't be left
asymmetric - having one take the lock, and expecting that it's already taken
for the other, that is definitely unacceptable. Please spend more time on this
and think about how it can be changed. Right now I don't have the time to dig
in and make a proper proposal, but I'm happy to review and discuss proposed
solutions.

>> - is the mcast lock really necessary, would atomic ops do for this tracking?
> 
> Hm, not sure. We'd be checking multiple variables before toggling
> the new brmctx->ip{4,6}_active. As the checked variables can
> change from multiple contexts couldn't the following happen then?
> 
> 
> Start: ip*_active = false, snooping_enabled = false,
>         vlan_snooping_enabled = true, vlan{id:42}->snooping_enabled = false
> 
> Thread A)                     Thread B)
> --------------------------------------------------------------------------
>                                br_multicast_toggle(br, 1, ...)
> 			      -> loads vlan{id:42}->snooping_enabled: false
> --------------------------------------------------------------------------
> br_multicast_toggle_one_vlan(vlan{id:42}, true)
> -> vlan->priv_flags: set per-vlan-mc-snooping to true
> -> br_multicast_update_active()
>     checks snooping_enabled: false
>     -> keeping vlan's ip*_active at false
> --------------------------------------------------------------------------
>                                -> sets snooping_enabled: true
>                                -> br_multicast_update_active()
> 			         -> checks vlan{id:42}->snooping_enabled:
> 				    false (from earlier load above)
>                                   -> keeping vlan's ip*_active at false
> 
> Result: vlan's ip*_active is still false even though it should be
> true now?
> 
> .o(Or were netlink and sysfs handlers somehow ensured to never run in
> parallel?)
> 

They are, netlink and sysfs are supposed to take the same locks so they
cannot run in parallel changing options.

> 
> I'm not that versed with atomic's and explicit memory barriers,
> could that prevent something like that even if multiple variables
> are involved? Only used lockless atomic_t's for atomic_inc()/_dec() so far.
> And otherwise used explicit locking.
> 
> 
> 
>> - can you provide the full view somewhere, how would this tracking be used? I fear
>>    there might still be races.
> 
> My original switchdev integration plan so far was roughly still the same
> as in the previous pull-request:
> 
> https://patchwork.kernel.org/project/netdevbpf/patch/20250522195952.29265-5-linus.luessing@c0d3.blue/
> 
> And using it in rtl83xx in OpenWrt like this:
> https://github.com/openwrt/openwrt/pull/18780/commits/708bbc4b73bc90cd13839c613e6634aa5faeeace#diff-b5c9a9cc66e207d77fea5d063dca2cce20cf4ae3a28fc0a5d5eebd47a024d5c3
> 
> But haven't updated these yet onto this PR, will do that later.

Thanks.

Re: [PATCH 0/9] net: bridge: reduce multicast checks in fast path

Posted by Linus Lüssing 5 months, 1 week ago

On Tue, Sep 02, 2025 at 10:16:04PM +0200, Linus Lüssing wrote:
> On the other hand, moving the spinlock out of / around
> __br_multicast_stop() would lead to a sleeping-while-atomic bug
> when calling timer_delete_sync() in there. And if I were to change
> these to a timer_delete() I guess there is no way to do the sync
> part after unlocking? There is no equivalent to something like the
> flush_workqueue() / drain_workqueue() for workqueues, but for
> simple timers instead, right?

I'm wondering if it would be sufficient to use timer_del() on
.ndo_stop->br_dev_stop()->br_multicast_stop().

And use timer_del_sync() only on
.ndo_uninit->br_dev_uninit()-> br_multicast_dev_del()->
br_multicast_ctx_deinit() and
br_vlan_put_master()->br_multicast_ctx_deinit().

So basically only sync / wait for timer callbacks to finish if
their context is about to be free'd, on bridge or VLAN destruction?

Re: [PATCH 0/9] net: bridge: reduce multicast checks in fast path

Posted by Nikolay Aleksandrov 5 months, 1 week ago

On 9/3/25 02:00, Linus Lüssing wrote:
> On Tue, Sep 02, 2025 at 10:16:04PM +0200, Linus Lüssing wrote:
>> On the other hand, moving the spinlock out of / around
>> __br_multicast_stop() would lead to a sleeping-while-atomic bug
>> when calling timer_delete_sync() in there. And if I were to change
>> these to a timer_delete() I guess there is no way to do the sync
>> part after unlocking? There is no equivalent to something like the
>> flush_workqueue() / drain_workqueue() for workqueues, but for
>> simple timers instead, right?
> 
> I'm wondering if it would be sufficient to use timer_del() on
> .ndo_stop->br_dev_stop()->br_multicast_stop().
> 
> And use timer_del_sync() only on
> .ndo_uninit->br_dev_uninit()-> br_multicast_dev_del()->
> br_multicast_ctx_deinit() and
> br_vlan_put_master()->br_multicast_ctx_deinit().
> 
> 
> So basically only sync / wait for timer callbacks to finish if
> their context is about to be free'd, on bridge or VLAN destruction?

You should make sure the state is sane if the callbacks can be executed after
br_multicast_stop(). There are many timers that can affect different aspects,
it might be doable but needs careful inspection and code ordering.