net: Split ndo_set_rx_mode into snapshot

[PATCH net-next v5 0/2] net: Split ndo_set_rx_mode into snapshot

Posted by I Viswanath 2 months, 2 weeks ago

This is an implementation of the idea provided by Jakub here

https://lore.kernel.org/netdev/20250923163727.5e97abdb@kernel.org/

ndo_set_rx_mode is problematic because it cannot sleep. 

To address this, this series proposes dividing the concept of setting
rx_mode into 2 stages: snapshot and deferred I/O. To achieve this, we
reinterpret set_rx_mode and add create a new ndo write_rx_mode as
explained below:

The new set_rx_mode will be responsible for customizing the rx_mode
snapshot which will be used by write_rx_mode to update the hardware

In brief, the new flow looks something like:

prepare_rx_mode():
    ndo_set_rx_mode();
    ready_snapshot();

write_rx_mode():
    commit_and_use_snapshot();
    ndo_write_rx_mode();

write_rx_mode() is called from a work item and doesn't hold the 
netif_addr_lock lock during ndo_write_rx_mode() making it sleepable
in that section.

This model should work correctly if the following conditions hold:

1. write_rx_mode should use the rx_mode set by the most recent
    call to prepare_rx_mode before its execution.

2. If a prepare_rx_mode call happens during execution of write_rx_mode,
    write_rx_mode should be rescheduled.

3. All calls to modify rx_mode should pass through the prepare_rx_mode +
    schedule write_rx_mode execution flow. netif_rx_mode_schedule_work 
    has been implemented in core for this.

1 and 2 are guaranteed because of the properties of work queues

Drivers need to ensure 3

To use this model, a driver needs to implement the
ndo_write_rx_mode callback, change the set_rx_mode callback
appropriately and replace all calls to modify rx mode with
netif_rx_mode_schedule_work

Signed-off-by: I Viswanath <viswanathiyyappan@gmail.com>
---
Questions I have:

1) Would there ever be a situation in which you will have to wait for I/O 
to complete in a call to set_rx_mode before proceeding further? That is, 
Does netif_rx_mode_schedule_work need the flush argument?

2) Does priv_ptr in netif_rx_mode_config make sense? For virtio_net, I 
can get the vi pointer by doing netdev_priv(dev) and am wondering 
if this would be a common thing

3) From a previous discussion: 
https://lore.kernel.org/netdev/417c677f-268a-4163-b07e-deea8f9b9b40@intel.com/

On Thu, 23 Oct 2025 at 05:16, Jacob Keller  wrote:
> Is there any mechanism to make this guarantee either implemented or at
> least verified by the core? If not that, what about some sort of way to
> lint driver code and make sure its correct?

I am not sure how to automate this but basically we need warnings to be 
generated when the the set_rx_mode implementations are called statically in 
code (From my understanding, usually in the open callback or the timeout function) 
but not when they are called through ops->set_rx_mode. 
Can Coccinelle do something like this?

v1:
Link: https://lore.kernel.org/netdev/20251020134857.5820-1-viswanathiyyappan@gmail.com/

v2:
- Exported set_and_schedule_rx_config as a symbol for use in modules
- Fixed incorrect cleanup for the case of rx_work alloc failing in alloc_netdev_mqs
- Removed the locked version (cp_set_rx_mode) and renamed __cp_set_rx_mode to cp_set_rx_mode
Link: https://lore.kernel.org/netdev/20251026175445.1519537-1-viswanathiyyappan@gmail.com/

v3:
- Added RFT tag
- Corrected mangled patch
Link: https://lore.kernel.org/netdev/20251028174222.1739954-1-viswanathiyyappan@gmail.com/

v4:
- Completely reworked the snapshot mechanism as per v3 comments
- Implemented the callback for virtio-net instead of 8139cp driver
- Removed RFC tag
Link: https://lore.kernel.org/netdev/20251118164333.24842-1-viswanathiyyappan@gmail.com/

v5:
- Fix broken code and titles
- Remove RFT tag

Here’s an enumeration of all the cases I have tested that should be exhaustive:

RX behaviour verification:

1) Dest is UC/MC addr X in UC/MC list:
	no mode: Recv
	allmulti: Recv
	promisc: Recv	

2) Dest is UC addr X not in UC list:
	no_mode: Drop
	allmulti: Drop
	promisc: Recv

3) Dest is MC addr X not in MC list:
	no_mode: Drop
	allmulti: Recv
	promisc: Recv

Packets injected from host using scapy on a TAP device as follows:
sendp(Ether(src=tap0_mac, dst=X) / IP() / UDP() / "test", iface="tap0")

And on the VM side, rx was checked via cat /proc/net/dev

Teardown path:

Relevant as they flush the work item. ens4 is the virtio-net interface.

virtnet_remove:
ip maddr add 01:00:5e:00:03:02 dev ens4; echo 1 > /sys/bus/pci/devices/0000:00:04.0/remove

virtnet_freeze_down:
ip maddr add 01:00:5e:00:03:02 dev ens4; echo mem > /sys/power/state

---

I Viswanath (2):
  net: refactor set_rx_mode into snapshot and deferred I/O
  virtio-net: Implement ndo_write_rx_mode callback

 drivers/net/virtio_net.c  |  58 +++++------
 include/linux/netdevice.h | 104 ++++++++++++++++++-
 net/core/dev.c            | 208 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 330 insertions(+), 40 deletions(-)

-- 
2.34.1

Re: [PATCH net-next v5 0/2] net: Split ndo_set_rx_mode into snapshot

Posted by Jakub Kicinski 2 months, 2 weeks ago

On Thu, 20 Nov 2025 19:43:52 +0530 I Viswanath wrote:
> Teardown path:
> 
> Relevant as they flush the work item. ens4 is the virtio-net interface.
> 
> virtnet_remove:
> ip maddr add 01:00:5e:00:03:02 dev ens4; echo 1 > /sys/bus/pci/devices/0000:00:04.0/remove
> 
> virtnet_freeze_down:
> ip maddr add 01:00:5e:00:03:02 dev ens4; echo mem > /sys/power/state

Running 

make -C tools/testing/selftests TARGETS="drivers/net/virtio_net" run_tests

[    1.967073] BUG: kernel NULL pointer dereference, address: 0000000000000018
[    1.967179] #PF: supervisor read access in kernel mode
[    1.967237] #PF: error_code(0x0000) - not-present page
[    1.967296] PGD 0 P4D 0 
[    1.967327] Oops: Oops: 0000 [#1] SMP
[    1.967372] CPU: 2 UID: 0 PID: 220 Comm: basic_features. Not tainted 6.18.0-rc5-virtme #1 PREEMPT(voluntary) 
[    1.967500] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[    1.967576] RIP: 0010:__flush_work+0x33/0x3a0
[    1.967651] Code: 41 55 41 54 55 53 48 83 ec 60 44 0f b6 25 0d ab 91 01 65 48 8b 05 2d ff 8d 01 48 89 44 24 58 31 c0 45 84 e4 0f 84 35 03 00 00 <48> 83 7f 18 00 48 89 fd 0f 84 30 03 00 00 41 89 f5 e8 07 24 07 00
[    1.967861] RSP: 0018:ffffab9bc0597cf0 EFLAGS: 00010202
[    1.967920] RAX: 0000000000000000 RBX: ffff9d08c2c549c0 RCX: ffffab9bc0597d28
[    1.968010] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[    1.968100] RBP: ffff9d08c1cd5000 R08: ffff9d08c1db7b70 R09: 0000000000000000
[    1.968189] R10: ffff9d08c1db7f80 R11: ffff9d08c152e480 R12: 0000000000000001
[    1.968281] R13: ffffffffbd9ffe00 R14: ffff9d08c193e140 R15: 0000000000000008
[    1.968371] FS:  00007fb66173b000(0000) GS:ffff9d0940ce9000(0000) knlGS:0000000000000000
[    1.968472] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.968546] CR2: 0000000000000018 CR3: 00000000045e5006 CR4: 0000000000772ef0
[    1.968640] PKRU: 55555554
[    1.968669] Call Trace:
[    1.968700]  <TASK>
[    1.968729]  ? kernfs_should_drain_open_files+0x2e/0x40
[    1.968796]  ? __rtnl_unlock+0x37/0x60
[    1.968849]  ? netdev_run_todo+0x63/0x550
[    1.968894]  ? kernfs_name_hash+0x12/0x80
[    1.968938]  virtnet_remove+0x65/0xb0
[    1.968984]  virtio_dev_remove+0x3c/0x80
[    1.969029]  device_release_driver_internal+0x193/0x200
[    1.969090]  unbind_store+0x9d/0xb0
[    1.969136]  kernfs_fop_write_iter+0x12b/0x1c0
[    1.969197]  vfs_write+0x33a/0x470
[    1.969242]  ksys_write+0x65/0xe0
[    1.969287]  do_syscall_64+0xa4/0xfd0
[    1.969333]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
[    1.969393] RIP: 0033:0x7fb66183b257
[    1.969434] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[    1.969640] RSP: 002b:00007fffca552fe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[    1.969729] RAX: ffffffffffffffda RBX: 00007fb661937780 RCX: 00007fb66183b257
[    1.969820] RDX: 0000000000000008 RSI: 0000558903e5a280 RDI: 0000000000000001
[    1.969911] RBP: 0000000000000008 R08: 0000000000000003 R09: 0000000000000077
[    1.970004] R10: 0000000000000063 R11: 0000000000000246 R12: 0000000000000008
[    1.970096] R13: 0000558903e5a280 R14: 0000000000000008 R15: 00007fb6619329c0
[    1.970189]  </TASK>
[    1.970218] Modules linked in:
[    1.970266] CR2: 0000000000000018
[    1.970311] ---[ end trace 0000000000000000 ]---
[    1.970372] RIP: 0010:__flush_work+0x33/0x3a0
[    1.970441] Code: 41 55 41 54 55 53 48 83 ec 60 44 0f b6 25 0d ab 91 01 65 48 8b 05 2d ff 8d 01 48 89 44 24 58 31 c0 45 84 e4 0f 84 35 03 00 00 <48> 83 7f 18 00 48 89 fd 0f 84 30 03 00 00 41 89 f5 e8 07 24 07 00
[    1.970656] RSP: 0018:ffffab9bc0597cf0 EFLAGS: 00010202
[    1.970717] RAX: 0000000000000000 RBX: ffff9d08c2c549c0 RCX: ffffab9bc0597d28
[    1.970806] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[    1.970897] RBP: ffff9d08c1cd5000 R08: ffff9d08c1db7b70 R09: 0000000000000000
[    1.970988] R10: ffff9d08c1db7f80 R11: ffff9d08c152e480 R12: 0000000000000001
[    1.971081] R13: ffffffffbd9ffe00 R14: ffff9d08c193e140 R15: 0000000000000008
[    1.971174] FS:  00007fb66173b000(0000) GS:ffff9d0940ce9000(0000) knlGS:0000000000000000
[    1.971264] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.971337] CR2: 0000000000000018 CR3: 00000000045e5006 CR4: 0000000000772ef0
[    1.971431] PKRU: 55555554
[    1.971460] note: basic_features.[220] exited with irqs disabled
-- 
pw-bot: cr

Re: [PATCH net-next v5 0/2] net: Split ndo_set_rx_mode into snapshot

Posted by I Viswanath 2 months, 2 weeks ago

On Thu, 20 Nov 2025 at 20:47, Jakub Kicinski <kuba@kernel.org> wrote:

> Running
>
> make -C tools/testing/selftests TARGETS="drivers/net/virtio_net" run_tests

This bug seems to be caused by a call to probe() followed by remove()
without ever calling
dev_open() as dev->rx_mode_ctx is allocated there. Modifying
netif_rx_mode_flush_work()
to call flush_work only when netif_running() is true, seems to fix
this specific bug.

However, I found the following deadlock while trying to reproduce that:

dev_close():
    rtnl_lock();
    cancel_work_sync(); // wait for netif_rx_mode_write_active to complete

netif_rx_mode_write_active(): // From work item

    rtnl_lock(); // Wait for the rtnl lock to be released

I can't find a good way to solve this without changing alloc logic to
be partly in
alloc_netdev_mqs since we need the work struct to be alive after
closing. Does this
look good if that's really the most reasonable solution:

struct netif_rx_mode_ctx *rx_mode_ctx;

struct netif_rx_mode_ctx {
    struct work_struct rx_mode_work;
    struct netif_rx_mode_active_ctx *active_ctx;
    int state;
}

struct netif_rx_mode_active_ctx {
        struct net_device               *dev;
        struct netif_rx_mode_config     *ready;
        struct netif_rx_mode_config     *pending;
}

rx_mode_ctx will be handled in alloc_netdev_mqs()/free_netdev() while active_ctx
will be handled in dev_open()/dev_close()

Never call flush_work/cancel_work_sync for this work in core
as that is a guaranteed deadlock because of how everything is serialized