This patchset introduces new state variables to combine and reduce the number of checks we would otherwise perform on every multicast packet in fast/data path. The second reason for introducing these new, internal multicast active variables is to later propagate a safety mechanism which was introduced in b00589af3b04 ("bridge: disable snooping if there is no querier") to switchdev/DSA, too. That is to notify switchdev/DSA if multicast snooping can safely be applied without potential packet loss. Regards, Linus --- # Changelog Changelog to / follow-up of: [PATCH net-next 0/5] net: bridge: propagate safe mcast snooping to switchdev + DSA -> https://lkml.org/lkml/2025/5/22/1413 * removed the switchdev/DSA changes for now * splitting "[PATCH net-next 1/5] net: bridge: mcast: explicitly track active state" into: * net: bridge: mcast: track active state, IGMP/MLD querier appearance * net: bridge: mcast: track active state, foreign IGMP/MLD querier disappearance * net: bridge: mcast: track active state, IPv6 address availability * net: bridge: mcast: track active state, own MLD querier disappearance * net: bridge: mcast: use combined active state in fast/data path * net: bridge: mcast: track active state, bridge up/down * rebased to current net-next/main: * from_timer() -> timer_container_of() * net: bridge: mcast: export ip{4,6}_active state to netlink: * changing NLA_U8 to NLA_REJECT to make it read-only * moved br_multicast_update_active() call from br_ip{4,6}_multicast_query_expired() (own querier timer callback) to br_ip{4,6}_multicast_querier_expired() (other querier timer callback) * even though both should have worked as br_multicast_querier_expired() would call br_multicast_start_querier()->...->br_multicast_query_expired(), even if the own querier is disabled, but let's use the more direct way * simplified br_multicast_update_active(): * no return value for now, don't track if the active state has changed, these aren't necessary (yet) * removed __br_multicast_update_active() variant as was used to force an inactive state in __br_multicast_stop(), instead using an netif_running(brmctx->br->dev) check in br_multicast_update_active() * replaced br_ip{4,6}_multicast_check_active() with simpler br_ip{4,6}_multicast_update_active() and br_ip{4,6}_multicast_querier_exists() * fixing build errors with CONFIG_IPV6 unset * simplified br_multicast_toggle_enabled() * no return value for now * fixes "old used uninitialized" issue * removed const from __br_multicast_querier_exists()'s "bool is_ipv6" * replaced "struct ethhdr *eth" in br_multicast_{snooping,querier}_active() with direct ethernet protocol integer attributes * added a few comments in br_multicast_update_active() calling functions
On Fri, 29 Aug 2025 10:53:41 +0200 Linus Lüssing wrote: > This patchset introduces new state variables to combine and reduce the > number of checks we would otherwise perform on every multicast packet > in fast/data path. > > The second reason for introducing these new, internal multicast active > variables is to later propagate a safety mechanism which was introduced > in b00589af3b04 ("bridge: disable snooping if there is no querier") to > switchdev/DSA, too. That is to notify switchdev/DSA if multicast > snooping can safely be applied without potential packet loss. Please leave the git-generated diff stat in the cover letter. Please include tree designation in the subject, per: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html I'll leave the real review to the experts but this series appears to make kselftests unhappy: [ 106.423894] WARNING: CPU: 3 PID: 1121 at net/bridge/br_multicast.c:1388 __br_multicast_stop+0xa0/0xc0 [bridge] [ 106.424022] Modules linked in: sch_ingress 8021q act_mirred cls_matchall sch_red dummy bridge stp llc sch_tbf vrf veth [ 106.424144] CPU: 3 UID: 0 PID: 1121 Comm: ip Not tainted 6.17.0-rc2-virtme #1 PREEMPT(voluntary) [ 106.424235] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 106.424301] RIP: 0010:__br_multicast_stop+0xa0/0xc0 [bridge] [ 106.424371] Code: 89 df e8 f3 fd ff ff 80 bb 2c 01 00 00 00 75 19 80 bb 0c 02 00 00 00 75 1d 48 8b 3b 5b 48 81 c7 4c 03 00 00 e9 b1 c2 0b f8 90 <0f> 0b 90 80 bb 0c 02 00 00 00 74 e3 90 0f 0b 90 48 8b 3b 5b 48 81 [ 106.424544] RSP: 0018:ffffb78500283888 EFLAGS: 00010202 [ 106.424586] RAX: 0000000000000000 RBX: ffff979907d81af0 RCX: 0000000000000000 [ 106.424665] RDX: ffff979907d81c50 RSI: 0000000000000001 RDI: ffff979907d819c0 [ 106.424753] RBP: ffff979907d819c0 R08: ffffffffb94f0180 R09: ffffb78500283b40 [ 106.424836] R10: ffff97990342d250 R11: ffff979907d81000 R12: ffff979907d819c0 [ 106.424913] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000001 [ 106.424992] FS: 00007f7e9de78800(0000) GS:ffff979985967000(0000) knlGS:0000000000000000 [ 106.425073] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 106.425144] CR2: 0000000000447b60 CR3: 0000000004d3b003 CR4: 0000000000772ef0 [ 106.425226] PKRU: 55555554 [ 106.425250] Call Trace: [ 106.425278] <TASK> [ 106.425305] br_multicast_toggle_vlan_snooping+0x1a9/0x1e0 [bridge] [ 106.425383] br_boolopt_multi_toggle+0x54/0x90 [bridge] [ 106.425443] br_changelink+0x4be/0x510 [bridge] [ 106.425510] ? ns_capable+0x2d/0x60 [ 106.425557] rtnl_newlink+0x73f/0xbc0 [ 106.425605] ? rtnl_setlink+0x2c0/0x2c0 [ 106.425633] rtnetlink_rcv_msg+0x358/0x400 [ 106.425676] ? update_load_avg+0x6f/0x350 [ 106.425726] ? rtnl_calcit.isra.0+0x110/0x110 [ 106.425778] netlink_rcv_skb+0x57/0x100 [ 106.425819] netlink_unicast+0x252/0x380 [ 106.425858] ? __alloc_skb+0xdb/0x190 [ 106.425904] netlink_sendmsg+0x1be/0x3e0 [ 106.425945] ____sys_sendmsg+0x132/0x260 [ 106.425984] ? copy_msghdr_from_user+0x6c/0xa0 [ 106.426039] ___sys_sendmsg+0x87/0xd0 [ 106.426083] ? __handle_mm_fault+0xa41/0xe50 [ 106.426144] __sys_sendmsg+0x71/0xd0 [ 106.426184] do_syscall_64+0xa4/0x260 [ 106.426224] entry_SYSCALL_64_after_hwframe+0x4b/0x53 [ 106.426279] RIP: 0033:0x7f7e9e0451e7
On 8/29/25 18:47, Jakub Kicinski wrote: > On Fri, 29 Aug 2025 10:53:41 +0200 Linus Lüssing wrote: >> This patchset introduces new state variables to combine and reduce the >> number of checks we would otherwise perform on every multicast packet >> in fast/data path. >> >> The second reason for introducing these new, internal multicast active >> variables is to later propagate a safety mechanism which was introduced >> in b00589af3b04 ("bridge: disable snooping if there is no querier") to >> switchdev/DSA, too. That is to notify switchdev/DSA if multicast >> snooping can safely be applied without potential packet loss. > > Please leave the git-generated diff stat in the cover letter. > Please include tree designation in the subject, per: > https://www.kernel.org/doc/html/next/process/maintainer-netdev.html > > I'll leave the real review to the experts but this series appears > to make kselftests unhappy: > just fyi my email wasn't working for 2 days and unfortunately I missed this set I took a look now on patchwork, I do have comments but it's difficult to reply as I don't have the emails and have to do it manually to each, so I'd rather wait for v2. a few notes for v2: - please use READ/WRTE_ONCE() for variables that are used without locking - please make locking symmetric, I saw that br_multicast_open() expects the lock to be already held, while __br_multicast_stop() takes it itself - target net-next - is the mcast lock really necessary, would atomic ops do for this tracking? - can you provide the full view somewhere, how would this tracking be used? I fear there might still be races. - please add more details exactly what we save on the fast-path, I know but it'd be nice to have it in the commit message as well, all commits just say "reduce checks, save cycles" but there are no details what we save I will try to give more detailed comments in v2. Thank you, Nik
Hi Nik, thanks for the suggestions and review again! On Fri, Aug 29, 2025 at 07:23:24PM +0300, Nikolay Aleksandrov wrote: > > a few notes for v2: > - please use READ/WRTE_ONCE() for variables that are used without locking Just to understand the issue, otherwise the data path would assume an old inactive or active state for a prolonged time after toggling? Or what would happen? > - please make locking symmetric, I saw that br_multicast_open() expects the lock to be already held, while > __br_multicast_stop() takes it itself I think that's what I tried before submitting. Initially wanted to have the locking inside, but then it would deadlock on br_multicast_toggle()->br_multicast_open()->... as this is the one place where br_multicast_open() is called with the multicast spinlock already held. On the other hand, moving the spinlock out of / around __br_multicast_stop() would lead to a sleeping-while-atomic bug when calling timer_delete_sync() in there. And if I were to change these to a timer_delete() I guess there is no way to do the sync part after unlocking? There is no equivalent to something like the flush_workqueue() / drain_workqueue() for workqueues, but for simple timers instead, right? I would also love to go for the first approach, taking the spinlock inside of __br_multicast_open(). But not quite sure how to best and safely change br_multicast_toggle() then. > - is the mcast lock really necessary, would atomic ops do for this tracking? Hm, not sure. We'd be checking multiple variables before toggling the new brmctx->ip{4,6}_active. As the checked variables can change from multiple contexts couldn't the following happen then? Start: ip*_active = false, snooping_enabled = false, vlan_snooping_enabled = true, vlan{id:42}->snooping_enabled = false Thread A) Thread B) -------------------------------------------------------------------------- br_multicast_toggle(br, 1, ...) -> loads vlan{id:42}->snooping_enabled: false -------------------------------------------------------------------------- br_multicast_toggle_one_vlan(vlan{id:42}, true) -> vlan->priv_flags: set per-vlan-mc-snooping to true -> br_multicast_update_active() checks snooping_enabled: false -> keeping vlan's ip*_active at false -------------------------------------------------------------------------- -> sets snooping_enabled: true -> br_multicast_update_active() -> checks vlan{id:42}->snooping_enabled: false (from earlier load above) -> keeping vlan's ip*_active at false Result: vlan's ip*_active is still false even though it should be true now? .o(Or were netlink and sysfs handlers somehow ensured to never run in parallel?) I'm not that versed with atomic's and explicit memory barriers, could that prevent something like that even if multiple variables are involved? Only used lockless atomic_t's for atomic_inc()/_dec() so far. And otherwise used explicit locking. > - can you provide the full view somewhere, how would this tracking be used? I fear > there might still be races. My original switchdev integration plan so far was roughly still the same as in the previous pull-request: https://patchwork.kernel.org/project/netdevbpf/patch/20250522195952.29265-5-linus.luessing@c0d3.blue/ And using it in rtl83xx in OpenWrt like this: https://github.com/openwrt/openwrt/pull/18780/commits/708bbc4b73bc90cd13839c613e6634aa5faeeace#diff-b5c9a9cc66e207d77fea5d063dca2cce20cf4ae3a28fc0a5d5eebd47a024d5c3 But haven't updated these yet onto this PR, will do that later.
On 9/2/25 23:16, Linus Lüssing wrote: > Hi Nik, thanks for the suggestions and review again! > > > On Fri, Aug 29, 2025 at 07:23:24PM +0300, Nikolay Aleksandrov wrote: >> >> a few notes for v2: >> - please use READ/WRTE_ONCE() for variables that are used without locking > > Just to understand the issue, otherwise the data path would assume > an old inactive or active state for a prolonged time after toggling? > Or what would happen? > > It's more about marking these as used without locking so KCSAN won't flag them and also to clearly show people that intent. >> - please make locking symmetric, I saw that br_multicast_open() expects the lock to be already held, while >> __br_multicast_stop() takes it itself > > I think that's what I tried before submitting. Initially wanted > to have the locking inside, but then it would deadlock on > br_multicast_toggle()->br_multicast_open()->... as this is the one > place where br_multicast_open() is called with the multicast spinlock > already held. > > On the other hand, moving the spinlock out of / around > __br_multicast_stop() would lead to a sleeping-while-atomic bug > when calling timer_delete_sync() in there. And if I were to change > these to a timer_delete() I guess there is no way to do the sync > part after unlocking? There is no equivalent to something like the > flush_workqueue() / drain_workqueue() for workqueues, but for > simple timers instead, right? > > I would also love to go for the first approach, taking the > spinlock inside of __br_multicast_open(). But not quite sure how > to best and safely change br_multicast_toggle() then. > > Well, this is not an easy one to solve, would require more thought and changes to get the proper locking, but it certainly shouldn't be left asymmetric - having one take the lock, and expecting that it's already taken for the other, that is definitely unacceptable. Please spend more time on this and think about how it can be changed. Right now I don't have the time to dig in and make a proper proposal, but I'm happy to review and discuss proposed solutions. >> - is the mcast lock really necessary, would atomic ops do for this tracking? > > Hm, not sure. We'd be checking multiple variables before toggling > the new brmctx->ip{4,6}_active. As the checked variables can > change from multiple contexts couldn't the following happen then? > > > Start: ip*_active = false, snooping_enabled = false, > vlan_snooping_enabled = true, vlan{id:42}->snooping_enabled = false > > Thread A) Thread B) > -------------------------------------------------------------------------- > br_multicast_toggle(br, 1, ...) > -> loads vlan{id:42}->snooping_enabled: false > -------------------------------------------------------------------------- > br_multicast_toggle_one_vlan(vlan{id:42}, true) > -> vlan->priv_flags: set per-vlan-mc-snooping to true > -> br_multicast_update_active() > checks snooping_enabled: false > -> keeping vlan's ip*_active at false > -------------------------------------------------------------------------- > -> sets snooping_enabled: true > -> br_multicast_update_active() > -> checks vlan{id:42}->snooping_enabled: > false (from earlier load above) > -> keeping vlan's ip*_active at false > > Result: vlan's ip*_active is still false even though it should be > true now? > > .o(Or were netlink and sysfs handlers somehow ensured to never run in > parallel?) > They are, netlink and sysfs are supposed to take the same locks so they cannot run in parallel changing options. > > I'm not that versed with atomic's and explicit memory barriers, > could that prevent something like that even if multiple variables > are involved? Only used lockless atomic_t's for atomic_inc()/_dec() so far. > And otherwise used explicit locking. > > > >> - can you provide the full view somewhere, how would this tracking be used? I fear >> there might still be races. > > My original switchdev integration plan so far was roughly still the same > as in the previous pull-request: > > https://patchwork.kernel.org/project/netdevbpf/patch/20250522195952.29265-5-linus.luessing@c0d3.blue/ > > And using it in rtl83xx in OpenWrt like this: > https://github.com/openwrt/openwrt/pull/18780/commits/708bbc4b73bc90cd13839c613e6634aa5faeeace#diff-b5c9a9cc66e207d77fea5d063dca2cce20cf4ae3a28fc0a5d5eebd47a024d5c3 > > But haven't updated these yet onto this PR, will do that later. Thanks.
On Tue, Sep 02, 2025 at 10:16:04PM +0200, Linus Lüssing wrote: > On the other hand, moving the spinlock out of / around > __br_multicast_stop() would lead to a sleeping-while-atomic bug > when calling timer_delete_sync() in there. And if I were to change > these to a timer_delete() I guess there is no way to do the sync > part after unlocking? There is no equivalent to something like the > flush_workqueue() / drain_workqueue() for workqueues, but for > simple timers instead, right? I'm wondering if it would be sufficient to use timer_del() on .ndo_stop->br_dev_stop()->br_multicast_stop(). And use timer_del_sync() only on .ndo_uninit->br_dev_uninit()-> br_multicast_dev_del()-> br_multicast_ctx_deinit() and br_vlan_put_master()->br_multicast_ctx_deinit(). So basically only sync / wait for timer callbacks to finish if their context is about to be free'd, on bridge or VLAN destruction?
On 9/3/25 02:00, Linus Lüssing wrote: > On Tue, Sep 02, 2025 at 10:16:04PM +0200, Linus Lüssing wrote: >> On the other hand, moving the spinlock out of / around >> __br_multicast_stop() would lead to a sleeping-while-atomic bug >> when calling timer_delete_sync() in there. And if I were to change >> these to a timer_delete() I guess there is no way to do the sync >> part after unlocking? There is no equivalent to something like the >> flush_workqueue() / drain_workqueue() for workqueues, but for >> simple timers instead, right? > > I'm wondering if it would be sufficient to use timer_del() on > .ndo_stop->br_dev_stop()->br_multicast_stop(). > > And use timer_del_sync() only on > .ndo_uninit->br_dev_uninit()-> br_multicast_dev_del()-> > br_multicast_ctx_deinit() and > br_vlan_put_master()->br_multicast_ctx_deinit(). > > > So basically only sync / wait for timer callbacks to finish if > their context is about to be free'd, on bridge or VLAN destruction? You should make sure the state is sane if the callbacks can be executed after br_multicast_stop(). There are many timers that can affect different aspects, it might be doable but needs careful inspection and code ordering.
© 2016 - 2025 Red Hat, Inc.