Documentation/netlink/specs/netdev.yaml | 25 +++++ .../networking/net_cachelines/net_device.rst | 5 +- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 +- drivers/net/ethernet/intel/idpf/idpf_txrx.h | 2 +- drivers/net/ethernet/mellanox/mlx4/en_cq.c | 3 +- .../net/ethernet/mellanox/mlx5/core/en_main.c | 2 +- include/linux/netdevice.h | 38 +++++++- include/uapi/linux/netdev.h | 3 + net/core/dev.c | 95 ++++++++++++++++--- net/core/dev.h | 90 ++++++++++++++++++ net/core/net-sysfs.c | 4 +- net/core/netdev-genl-gen.c | 14 +++ net/core/netdev-genl-gen.h | 1 + net/core/netdev-genl.c | 57 +++++++++++ tools/include/uapi/linux/netdev.h | 3 + 15 files changed, 320 insertions(+), 25 deletions(-)
Greetings: Welcome to RFC v4. Very important and significant changes have been made since RFC v3 [1], please see the changelog below for details. A couple important call outs for this revision for reviewers: 1. idpf embeds a napi_struct in an internal data structure and includes an assertion on the size of napi_struct. The maintainers have stated that they think anyone touching napi_struct should update the assertion [2], so I've done this in patch 3. Even though the assertion has been updated, I've given the cacheline placement of napi_struct within idpf's internals no thought or consideration. Would appreciate other opinions on this; I think idpf should be fixed. It seems unreasonable to me that anyone changing the size of a struct in the core should need to think about cachelines in idpf. 2. This revision seems to work (see below for a full walk through). Is this the behavior we want? Am I missing some use case or some behavioral thing other folks need? 3. Re a previous point made by Stanislav regarding "taking over a NAPI ID" when the channel count changes: mlx5 seems to call napi_disable followed by netif_napi_del for the old queues and then calls napi_enable for the new ones. In this RFC, the NAPI ID generation is deferred to napi_enable. This means we won't end up with two of the same NAPI IDs added to the hash at the same time (I am pretty sure). Can we assume all drivers will napi_disable the old queues before napi_enable the new ones? If yes, we might not need to worry about a NAPI ID takeover function. 4. I made the decision to remove the WARN_ON_ONCE that (I think?) Jakub previously suggested in alloc_netdev_mqs (WARN_ON_ONCE(txqs != rxqs);) because this was triggering on every kernel boot with my mlx5 NIC. 5. I left the "maxqs = max(txqs, rxqs);" in alloc_netdev_mqs despite thinking this is a bit strange. I think it's strange that we might be short some number of NAPI configs, but it seems like most people are in favor of this approach, so I've left it. I'd appreciate thoughts from reviewers on the above items, if at all possible. Once those are addressed, modulo any feedback on the code/white space wrapping I still need to do, I think this is close to an official submission. Now, on to the implementation: This implementation allocates an array of "struct napi_config" in net_device and each NAPI instance is assigned an index into the config array. Per-NAPI settings like: - NAPI ID - gro_flush_timeout - defer_hard_irqs are persisted in napi_config and restored on napi_disable/napi_enable respectively. To help illustrate how this would end up working, I've added patches for 3 drivers, of which I have access to only 1: - mlx5 which is the basis of the examples below - mlx4 which has TX only NAPIs, just to highlight that case. I have only compile tested this patch; I don't have this hardware. - bnxt which I have only compiled tested. I don't have this hardware. NOTE: I only tested this on mlx5; I have no access to the other hardware for which I provided patches. Hopefully other folks can help test :) This iteration seems to persist NAPI IDs and settings even when resizing queues, see below, so I think maybe this is getting close to where we want to land in terms of functionality. Here's how it works when I test it on my system: # start with 2 queues $ ethtool -l eth4 | grep Combined | tail -1 Combined: 2 First, output the current NAPI settings: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 7}' [{'defer-hard-irqs': 0, 'gro-flush-timeout': 0, 'id': 345, 'ifindex': 7, 'irq': 527}, {'defer-hard-irqs': 0, 'gro-flush-timeout': 0, 'id': 344, 'ifindex': 7, 'irq': 327}] Now, set the global sysfs parameters: $ sudo bash -c 'echo 20000 >/sys/class/net/eth4/gro_flush_timeout' $ sudo bash -c 'echo 100 >/sys/class/net/eth4/napi_defer_hard_irqs' Output current NAPI settings again: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 7}' [{'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 345, 'ifindex': 7, 'irq': 527}, {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}] Now set NAPI ID 345, via its NAPI ID to specific values: $ sudo ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/netdev.yaml \ --do napi-set \ --json='{"id": 345, "defer-hard-irqs": 111, "gro-flush-timeout": 11111}' None Now output current NAPI settings again to ensure only NAPI ID 345 changed: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 7}' [{'defer-hard-irqs': 111, 'gro-flush-timeout': 11111, 'id': 345, 'ifindex': 7, 'irq': 527}, {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}] Now, increase gro-flush-timeout only: $ sudo ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/netdev.yaml \ --do napi-set --json='{"id": 345, "gro-flush-timeout": 44444}' None Now output the current NAPI settings once more: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 7}' [{'defer-hard-irqs': 111, 'gro-flush-timeout': 44444, 'id': 345, 'ifindex': 7, 'irq': 527}, {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}] Now set NAPI ID 345 to have gro_flush_timeout of 0: $ sudo ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/netdev.yaml \ --do napi-set --json='{"id": 345, "gro-flush-timeout": 0}' None Check that NAPI ID 345 has a value of 0: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 7}' [{'defer-hard-irqs': 111, 'gro-flush-timeout': 0, 'id': 345, 'ifindex': 7, 'irq': 527}, {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}] Change the queue count, ensuring that NAPI ID 345 retains its settings: $ sudo ethtool -L eth4 combined 4 Check that the new queues have the system wide settings but that NAPI ID 345 remains unchanged: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 7}' [{'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 347, 'ifindex': 7, 'irq': 529}, {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 346, 'ifindex': 7, 'irq': 528}, {'defer-hard-irqs': 111, 'gro-flush-timeout': 0, 'id': 345, 'ifindex': 7, 'irq': 527}, {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}] Now reduce the queue count below where NAPI ID 345 is indexed: $ sudo ethtool -L eth4 combined 1 Check the output: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 7}' [{'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}] Re-increase the queue count to ensure NAPI ID 345 is re-assigned the same values: $ sudo ethtool -L eth4 combined 2 $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 7}' [{'defer-hard-irqs': 111, 'gro-flush-timeout': 0, 'id': 345, 'ifindex': 7, 'irq': 527}, {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}] Create new queues to ensure the sysfs globals are used for the new NAPIs but that NAPI ID 345 is unchanged: $ sudo ethtool -L eth4 combined 8 $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 7}' [...] {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 346, 'ifindex': 7, 'irq': 528}, {'defer-hard-irqs': 111, 'gro-flush-timeout': 0, 'id': 345, 'ifindex': 7, 'irq': 527}, {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}] Last, but not least, let's try writing the sysfs parameters to ensure all NAPIs are rewritten: $ sudo bash -c 'echo 33333 >/sys/class/net/eth4/gro_flush_timeout' $ sudo bash -c 'echo 222 >/sys/class/net/eth4/napi_defer_hard_irqs' Check that worked: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 7}' [...] {'defer-hard-irqs': 222, 'gro-flush-timeout': 33333, 'id': 346, 'ifindex': 7, 'irq': 528}, {'defer-hard-irqs': 222, 'gro-flush-timeout': 33333, 'id': 345, 'ifindex': 7, 'irq': 527}, {'defer-hard-irqs': 222, 'gro-flush-timeout': 33333, 'id': 344, 'ifindex': 7, 'irq': 327}] Thanks, Joe [1]: https://lore.kernel.org/netdev/20240912100738.16567-1-jdamato@fastly.com/ [2]: https://lore.kernel.org/netdev/20240925180017.82891-1-jdamato@fastly.com/T/#m56b743bd16304a626848b14f90cecb661f464b74 rfcv4: - Updated commit messages of most patches - Renamed netif_napi_add_storage to netif_napi_add_config in patch 5 - Added a NULL check in netdev_set_defer_hard_irqs and netdev_set_gro_flush_timeout for netdev->napi_config in patch 5 - Removed the WARN_ON_ONCE suggested in an earlier revision in alloc_netdev_mqs from patch 5; it triggers every time on my mlx5 machine at boot and needlessly spams the log - Added a locking adjustment suggested by Stanislav to patch 6 to protect napi_id in patch 5 - Removed napi_hash_del from netif_napi_del in patch 5. netif_napi_del calls __netif_napi_del which itself calls napi_hash_del. The original code thus resulted in two napi_hash_del calls, which is incorrect. - Removed the napi_hash_add from netif_napi_add_weight in patch 5. NAPIs are added to the hash when napi_enable is called, instead. - Moved the napi_restore_config to the top of napi_enable in patch 5. - Simplified the logic in __netif_napi_del and removed napi_hash_del. NAPIs are removed in napi_disable. - Fixed merge conflicts in patch 6 so it applies cleanly rfcv3: - Renamed napi_storage to napi_config - Reordered patches - Added defer_hard_irqs and gro_flush_timeout to napi_struct - Attempt to save and restore settings on napi_disable/napi_enable - Removed weight as a parameter to netif_napi_add_storage - Updated driver patches to no longer pass in weight rfcv2: - Almost total rewrite from v1 Joe Damato (9): net: napi: Make napi_defer_hard_irqs per-NAPI netdev-genl: Dump napi_defer_hard_irqs net: napi: Make gro_flush_timeout per-NAPI netdev-genl: Dump gro_flush_timeout net: napi: Add napi_config netdev-genl: Support setting per-NAPI config values bnxt: Add support for persistent NAPI config mlx5: Add support for persistent NAPI config mlx4: Add support for persistent NAPI config to RX CQs Documentation/netlink/specs/netdev.yaml | 25 +++++ .../networking/net_cachelines/net_device.rst | 5 +- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 +- drivers/net/ethernet/intel/idpf/idpf_txrx.h | 2 +- drivers/net/ethernet/mellanox/mlx4/en_cq.c | 3 +- .../net/ethernet/mellanox/mlx5/core/en_main.c | 2 +- include/linux/netdevice.h | 38 +++++++- include/uapi/linux/netdev.h | 3 + net/core/dev.c | 95 ++++++++++++++++--- net/core/dev.h | 90 ++++++++++++++++++ net/core/net-sysfs.c | 4 +- net/core/netdev-genl-gen.c | 14 +++ net/core/netdev-genl-gen.h | 1 + net/core/netdev-genl.c | 57 +++++++++++ tools/include/uapi/linux/netdev.h | 3 + 15 files changed, 320 insertions(+), 25 deletions(-) -- 2.25.1
On 10/01, Joe Damato wrote: > Greetings: > > Welcome to RFC v4. > > Very important and significant changes have been made since RFC v3 [1], > please see the changelog below for details. > > A couple important call outs for this revision for reviewers: > > 1. idpf embeds a napi_struct in an internal data structure and > includes an assertion on the size of napi_struct. The maintainers > have stated that they think anyone touching napi_struct should update > the assertion [2], so I've done this in patch 3. > > Even though the assertion has been updated, I've given the > cacheline placement of napi_struct within idpf's internals no > thought or consideration. > > Would appreciate other opinions on this; I think idpf should be > fixed. It seems unreasonable to me that anyone changing the size of > a struct in the core should need to think about cachelines in idpf. [..] > 2. This revision seems to work (see below for a full walk through). Is > this the behavior we want? Am I missing some use case or some > behavioral thing other folks need? The walk through looks good! > 3. Re a previous point made by Stanislav regarding "taking over a NAPI > ID" when the channel count changes: mlx5 seems to call napi_disable > followed by netif_napi_del for the old queues and then calls > napi_enable for the new ones. In this RFC, the NAPI ID generation > is deferred to napi_enable. This means we won't end up with two of > the same NAPI IDs added to the hash at the same time (I am pretty > sure). [..] > Can we assume all drivers will napi_disable the old queues before > napi_enable the new ones? If yes, we might not need to worry about > a NAPI ID takeover function. With the explicit driver opt-in via netif_napi_add_config, this shouldn't matter? When somebody gets to converting the drivers that don't follow this common pattern they'll have to solve the takeover part :-)
On Thu, Oct 03, 2024 at 04:29:37PM -0700, Stanislav Fomichev wrote: > On 10/01, Joe Damato wrote: [...] > > 2. This revision seems to work (see below for a full walk through). Is > > this the behavior we want? Am I missing some use case or some > > behavioral thing other folks need? > > The walk through looks good! Thanks for taking a look. > > 3. Re a previous point made by Stanislav regarding "taking over a NAPI > > ID" when the channel count changes: mlx5 seems to call napi_disable > > followed by netif_napi_del for the old queues and then calls > > napi_enable for the new ones. In this RFC, the NAPI ID generation > > is deferred to napi_enable. This means we won't end up with two of > > the same NAPI IDs added to the hash at the same time (I am pretty > > sure). > > > [..] > > > Can we assume all drivers will napi_disable the old queues before > > napi_enable the new ones? If yes, we might not need to worry about > > a NAPI ID takeover function. > > With the explicit driver opt-in via netif_napi_add_config, this > shouldn't matter? When somebody gets to converting the drivers that > don't follow this common pattern they'll have to solve the takeover > part :-) That is true; that's a good point. I'll let the RFC hang out on the list for another day or two just to give Jakub time to catch up on his mails ;) but if you all agree... this might be ready to be resent as a PATCH instead of an RFC.
On Thu, Oct 03, 2024 at 04:53:13PM -0700, Joe Damato wrote: > On Thu, Oct 03, 2024 at 04:29:37PM -0700, Stanislav Fomichev wrote: > > On 10/01, Joe Damato wrote: > > [...] > > > > 2. This revision seems to work (see below for a full walk through). Is > > > this the behavior we want? Am I missing some use case or some > > > behavioral thing other folks need? > > > > The walk through looks good! > > Thanks for taking a look. > > > > 3. Re a previous point made by Stanislav regarding "taking over a NAPI > > > ID" when the channel count changes: mlx5 seems to call napi_disable > > > followed by netif_napi_del for the old queues and then calls > > > napi_enable for the new ones. In this RFC, the NAPI ID generation > > > is deferred to napi_enable. This means we won't end up with two of > > > the same NAPI IDs added to the hash at the same time (I am pretty > > > sure). > > > > > > [..] > > > > > Can we assume all drivers will napi_disable the old queues before > > > napi_enable the new ones? If yes, we might not need to worry about > > > a NAPI ID takeover function. > > > > With the explicit driver opt-in via netif_napi_add_config, this > > shouldn't matter? When somebody gets to converting the drivers that > > don't follow this common pattern they'll have to solve the takeover > > part :-) > > That is true; that's a good point. Actually, sorry, that isn't strictly true. NAPI ID generation is moved for everything to napi_enable; they just are (or are not) persisted depending on whether the driver opted in to add_config or not. So, the change does affect all drivers. NAPI IDs won't be generated and added to the hash until napi_enable and they will be removed from the hash in napi_disable... even if you didn't opt-in to having storage. Opt-ing in to storage via netif_napi_add_config just means that your NAPI IDs (and other settings) will be persistent. Sorry about my confusion when replying earlier.
On 10/03, Joe Damato wrote: > On Thu, Oct 03, 2024 at 04:53:13PM -0700, Joe Damato wrote: > > On Thu, Oct 03, 2024 at 04:29:37PM -0700, Stanislav Fomichev wrote: > > > On 10/01, Joe Damato wrote: > > > > [...] > > > > > > 2. This revision seems to work (see below for a full walk through). Is > > > > this the behavior we want? Am I missing some use case or some > > > > behavioral thing other folks need? > > > > > > The walk through looks good! > > > > Thanks for taking a look. > > > > > > 3. Re a previous point made by Stanislav regarding "taking over a NAPI > > > > ID" when the channel count changes: mlx5 seems to call napi_disable > > > > followed by netif_napi_del for the old queues and then calls > > > > napi_enable for the new ones. In this RFC, the NAPI ID generation > > > > is deferred to napi_enable. This means we won't end up with two of > > > > the same NAPI IDs added to the hash at the same time (I am pretty > > > > sure). > > > > > > > > > [..] > > > > > > > Can we assume all drivers will napi_disable the old queues before > > > > napi_enable the new ones? If yes, we might not need to worry about > > > > a NAPI ID takeover function. > > > > > > With the explicit driver opt-in via netif_napi_add_config, this > > > shouldn't matter? When somebody gets to converting the drivers that > > > don't follow this common pattern they'll have to solve the takeover > > > part :-) > > > > That is true; that's a good point. > > Actually, sorry, that isn't strictly true. NAPI ID generation is > moved for everything to napi_enable; they just are (or are not) > persisted depending on whether the driver opted in to add_config or > not. > > So, the change does affect all drivers. NAPI IDs won't be generated > and added to the hash until napi_enable and they will be removed > from the hash in napi_disable... even if you didn't opt-in to having > storage. > > Opt-ing in to storage via netif_napi_add_config just means that your > NAPI IDs (and other settings) will be persistent. > > Sorry about my confusion when replying earlier. AFAIA, all control operations (ethtool or similar ones via netlink), should grab rtnl lock. So as long as both enable/disable happen under rtnl (and in my mind they should), I don't think there is gonna be any user-visible side-effects of your change. But I might be wrong, let's see if others can come up with some corner cases..
© 2016 - 2024 Red Hat, Inc.