[RFC net-next v4 0/9] Add support for per-NAPI config via netlink
Posted by Joe Damato 1 week ago
Greetings:

Welcome to RFC v4.

Very important and significant changes have been made since RFC v3 [1],
please see the changelog below for details.

A couple important call outs for this revision for reviewers:

  1. idpf embeds a napi_struct in an internal data structure and
     includes an assertion on the size of napi_struct. The maintainers
     have stated that they think anyone touching napi_struct should update
     the assertion [2], so I've done this in patch 3. 

     Even though the assertion has been updated, I've given the
     cacheline placement of napi_struct within idpf's internals no
     thought or consideration.

     Would appreciate other opinions on this; I think idpf should be
     fixed. It seems unreasonable to me that anyone changing the size of
     a struct in the core should need to think about cachelines in idpf.

  2. This revision seems to work (see below for a full walk through). Is
     this the behavior we want? Am I missing some use case or some
     behavioral thing other folks need?

  3. Re a previous point made by Stanislav regarding "taking over a NAPI
     ID" when the channel count changes: mlx5 seems to call napi_disable
     followed by netif_napi_del for the old queues and then calls
     napi_enable for the new ones. In this RFC, the NAPI ID generation
     is deferred to napi_enable. This means we won't end up with two of
     the same NAPI IDs added to the hash at the same time (I am pretty
     sure).

     Can we assume all drivers will napi_disable the old queues before
     napi_enable the new ones? If yes, we might not need to worry about
     a NAPI ID takeover function.
 
  4. I removed the WARN_ON_ONCE(txqs != rxqs) that (I think?) Jakub
     previously suggested for alloc_netdev_mqs, because it triggered on
     every kernel boot with my mlx5 NIC.

  5. I left the "maxqs = max(txqs, rxqs);" in alloc_netdev_mqs despite
     thinking it is a bit strange: with drivers that have separate TX
     and RX NAPIs we might end up short some number of NAPI configs.
     Most people seem to be in favor of this approach, though, so I've
     left it. See the sketch just below this list.
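
For item 5, here is roughly the shape of the allocation I mean (a
sketch only, not the exact hunk from patch 5):

    /* alloc_netdev_mqs(): size the persistent config array by the
     * larger of the two queue counts. A driver with separate TX and RX
     * NAPIs can therefore end up with more NAPI instances than slots.
     */
    maxqs = max(txqs, rxqs);

    dev->napi_config = kvcalloc(maxqs, sizeof(struct napi_config),
                                GFP_KERNEL_ACCOUNT);
    if (!dev->napi_config)
            goto free_all;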

I'd appreciate thoughts from reviewers on the above items, if at all
possible. Once those are addressed (modulo any feedback on the code and
the whitespace wrapping I still need to do), I think this is close to an
official submission.

Now, on to the implementation:

This implementation allocates an array of "struct napi_config" in
net_device and each NAPI instance is assigned an index into the config
array.

Per-NAPI settings like:
  - NAPI ID
  - gro_flush_timeout
  - defer_hard_irqs

are persisted in napi_config: saved when napi_disable is called and
restored when napi_enable is called.
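
Concretely, the scheme looks roughly like this (a sketch to aid review;
names are approximate, napi_restore_config being the one mentioned in
the changelog, and the NAPI hash handling is omitted):

    /* One slot per queue index, allocated in net_device. */
    struct napi_config {
            u64 gro_flush_timeout;
            u32 defer_hard_irqs;
            unsigned int napi_id;
    };

    /* Called from napi_disable(): remember the live settings. */
    static void napi_save_config(struct napi_struct *n)
    {
            n->config->defer_hard_irqs = n->defer_hard_irqs;
            n->config->gro_flush_timeout = n->gro_flush_timeout;
            n->config->napi_id = n->napi_id;
    }

    /* Called from napi_enable(): restore them, reusing the saved NAPI
     * ID when there is one (otherwise a fresh ID is generated when the
     * NAPI is added to the hash).
     */
    static void napi_restore_config(struct napi_struct *n)
    {
            n->defer_hard_irqs = n->config->defer_hard_irqs;
            n->gro_flush_timeout = n->config->gro_flush_timeout;
            if (n->config->napi_id)
                    n->napi_id = n->config->napi_id;
    }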

To help illustrate how this would end up working, I've added patches for
3 drivers, of which I have access to only 1:
  - mlx5 which is the basis of the examples below
  - mlx4, which has TX-only NAPIs, just to highlight that case. I have
    only compile-tested this patch; I don't have this hardware.
  - bnxt, which I have also only compile-tested; I don't have this
    hardware either.

NOTE: I only tested this on mlx5; I have no access to the other hardware
for which I provided patches. Hopefully other folks can help test :)

This iteration seems to persist NAPI IDs and settings even when resizing
queues (see below), so I think it is getting close to where we want to
land in terms of functionality.

Here's how it works when I test it on my system:

# start with 2 queues

$ ethtool -l eth4 | grep Combined | tail -1
Combined:	2

First, output the current NAPI settings:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 0,
  'gro-flush-timeout': 0,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 0,
  'gro-flush-timeout': 0,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now, set the global sysfs parameters:

$ sudo bash -c 'echo 20000 >/sys/class/net/eth4/gro_flush_timeout'
$ sudo bash -c 'echo 100 >/sys/class/net/eth4/napi_defer_hard_irqs'

Output current NAPI settings again:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now set NAPI ID 345 to specific values, addressing it by its NAPI ID:

$ sudo ./tools/net/ynl/cli.py \
          --spec Documentation/netlink/specs/netdev.yaml \
          --do napi-set \
          --json='{"id": 345,
                   "defer-hard-irqs": 111,
                   "gro-flush-timeout": 11111}'
None
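
(For context, conceptually the napi-set handler resolves the NAPI by ID
under rtnl and updates the live instance; persistence across down/up
then comes from the save/restore described above. A rough sketch with
an illustrative function name, not the actual genl handler from patch
6:)

    static int napi_set_sketch(unsigned int napi_id, u32 defer_hard_irqs,
                               unsigned long gro_flush_timeout)
    {
            struct napi_struct *napi;
            int err = 0;

            rtnl_lock();
            napi = napi_by_id(napi_id);     /* existing lookup helper */
            if (napi) {
                    WRITE_ONCE(napi->defer_hard_irqs, defer_hard_irqs);
                    WRITE_ONCE(napi->gro_flush_timeout, gro_flush_timeout);
            } else {
                    err = -ENOENT;
            }
            rtnl_unlock();

            return err;
    }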

Now output current NAPI settings again to ensure only NAPI ID 345
changed:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'

[{'defer-hard-irqs': 111,
  'gro-flush-timeout': 11111,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now, increase gro-flush-timeout only:

$ sudo ./tools/net/ynl/cli.py \
       --spec Documentation/netlink/specs/netdev.yaml \
       --do napi-set --json='{"id": 345,
                              "gro-flush-timeout": 44444}'
None

Now output the current NAPI settings once more:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 111,
  'gro-flush-timeout': 44444,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now set NAPI ID 345 to have a gro_flush_timeout of 0:

$ sudo ./tools/net/ynl/cli.py \
       --spec Documentation/netlink/specs/netdev.yaml \
       --do napi-set --json='{"id": 345,
                              "gro-flush-timeout": 0}'
None

Check that NAPI ID 345 now reports a gro-flush-timeout of 0:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'

[{'defer-hard-irqs': 111,
  'gro-flush-timeout': 0,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Change the queue count, ensuring that NAPI ID 345 retains its settings:

$ sudo ethtool -L eth4 combined 4

Check that the new queues have the system-wide settings but that NAPI ID
345 remains unchanged:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'

[{'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 347,
  'ifindex': 7,
  'irq': 529},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 346,
  'ifindex': 7,
  'irq': 528},
 {'defer-hard-irqs': 111,
  'gro-flush-timeout': 0,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now reduce the queue count below where NAPI ID 345 is indexed:

$ sudo ethtool -L eth4 combined 1

Check the output:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Re-increase the queue count to ensure NAPI ID 345 is re-assigned the same
values:

$ sudo ethtool -L eth4 combined 2

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 111,
  'gro-flush-timeout': 0,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Create new queues to ensure the sysfs globals are used for the new NAPIs
but that NAPI ID 345 is unchanged:

$ sudo ethtool -L eth4 combined 8

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[...]
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 346,
  'ifindex': 7,
  'irq': 528},
 {'defer-hard-irqs': 111,
  'gro-flush-timeout': 0,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Last, but not least, let's write the sysfs parameters again to ensure
the new global values are applied to all NAPIs:

$ sudo bash -c 'echo 33333 >/sys/class/net/eth4/gro_flush_timeout'
$ sudo bash -c 'echo 222 >/sys/class/net/eth4/napi_defer_hard_irqs'

Check that worked:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'

[...]
 {'defer-hard-irqs': 222,
  'gro-flush-timeout': 33333,
  'id': 346,
  'ifindex': 7,
  'irq': 528},
 {'defer-hard-irqs': 222,
  'gro-flush-timeout': 33333,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 222,
  'gro-flush-timeout': 33333,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Thanks,
Joe

[1]: https://lore.kernel.org/netdev/20240912100738.16567-1-jdamato@fastly.com/
[2]: https://lore.kernel.org/netdev/20240925180017.82891-1-jdamato@fastly.com/T/#m56b743bd16304a626848b14f90cecb661f464b74 

rfcv4:
  - Updated commit messages of most patches
  - Renamed netif_napi_add_storage to netif_napi_add_config in patch 5
  - Added a NULL check in netdev_set_defer_hard_irqs and
    netdev_set_gro_flush_timeout for netdev->napi_config in patch 5
  - Removed from patch 5 the WARN_ON_ONCE in alloc_netdev_mqs that was
    suggested in an earlier revision; it triggers on every boot of my
    mlx5 machine and needlessly spams the log
  - Added a locking adjustment suggested by Stanislav to patch 6 to
    protect napi_id in patch 5
  - Removed napi_hash_del from netif_napi_del in patch 5. netif_napi_del
    calls __netif_napi_del, which itself calls napi_hash_del, so the
    original code resulted in two napi_hash_del calls, which is
    incorrect (see the sketch after this list).
  - Removed the napi_hash_add from netif_napi_add_weight in patch 5.
    NAPIs are added to the hash when napi_enable is called, instead.
  - Moved the napi_restore_config to the top of napi_enable in patch 5.
  - Simplified the logic in __netif_napi_del and removed its
    napi_hash_del call; NAPIs are now removed from the hash in
    napi_disable.
  - Fixed merge conflicts in patch 6 so it applies cleanly
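
For the napi_hash_del item above, the pre-rfcv4 call chain that caused
the double removal looked roughly like this (illustrative sketch, not
the exact code):

    void netif_napi_del(struct napi_struct *napi)
    {
            napi_hash_del(napi);    /* first removal, dropped in rfcv4  */
            __netif_napi_del(napi); /* ...which also calls napi_hash_del() */
            synchronize_net();
    }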

rfcv3:
  - Renamed napi_storage to napi_config
  - Reordered patches
  - Added defer_hard_irqs and gro_flush_timeout to napi_struct
  - Attempt to save and restore settings on napi_disable/napi_enable
  - Removed weight as a parameter to netif_napi_add_storage
  - Updated driver patches to no longer pass in weight

rfcv2:
  - Almost total rewrite from v1

Joe Damato (9):
  net: napi: Make napi_defer_hard_irqs per-NAPI
  netdev-genl: Dump napi_defer_hard_irqs
  net: napi: Make gro_flush_timeout per-NAPI
  netdev-genl: Dump gro_flush_timeout
  net: napi: Add napi_config
  netdev-genl: Support setting per-NAPI config values
  bnxt: Add support for persistent NAPI config
  mlx5: Add support for persistent NAPI config
  mlx4: Add support for persistent NAPI config to RX CQs

 Documentation/netlink/specs/netdev.yaml       | 25 +++++
 .../networking/net_cachelines/net_device.rst  |  5 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     |  3 +-
 drivers/net/ethernet/intel/idpf/idpf_txrx.h   |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_cq.c    |  3 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  2 +-
 include/linux/netdevice.h                     | 38 +++++++-
 include/uapi/linux/netdev.h                   |  3 +
 net/core/dev.c                                | 95 ++++++++++++++++---
 net/core/dev.h                                | 90 ++++++++++++++++++
 net/core/net-sysfs.c                          |  4 +-
 net/core/netdev-genl-gen.c                    | 14 +++
 net/core/netdev-genl-gen.h                    |  1 +
 net/core/netdev-genl.c                        | 57 +++++++++++
 tools/include/uapi/linux/netdev.h             |  3 +
 15 files changed, 320 insertions(+), 25 deletions(-)

-- 
2.25.1
Re: [RFC net-next v4 0/9] Add support for per-NAPI config via netlink
Posted by Stanislav Fomichev 5 days, 9 hours ago
On 10/01, Joe Damato wrote:
> Greetings:
> 
> Welcome to RFC v4.
> 
> Very important and significant changes have been made since RFC v3 [1],
> please see the changelog below for details.
> 
> A couple important call outs for this revision for reviewers:
> 
>   1. idpf embeds a napi_struct in an internal data structure and
>      includes an assertion on the size of napi_struct. The maintainers
>      have stated that they think anyone touching napi_struct should update
>      the assertion [2], so I've done this in patch 3. 
> 
>      Even though the assertion has been updated, I've given the
>      cacheline placement of napi_struct within idpf's internals no
>      thought or consideration.
> 
>      Would appreciate other opinions on this; I think idpf should be
>      fixed. It seems unreasonable to me that anyone changing the size of
>      a struct in the core should need to think about cachelines in idpf.

[..]

>   2. This revision seems to work (see below for a full walk through). Is
>      this the behavior we want? Am I missing some use case or some
>      behavioral thing other folks need?

The walk through looks good!


>   3. Re a previous point made by Stanislav regarding "taking over a NAPI
>      ID" when the channel count changes: mlx5 seems to call napi_disable
>      followed by netif_napi_del for the old queues and then calls
>      napi_enable for the new ones. In this RFC, the NAPI ID generation
>      is deferred to napi_enable. This means we won't end up with two of
>      the same NAPI IDs added to the hash at the same time (I am pretty
>      sure).


[..]

>      Can we assume all drivers will napi_disable the old queues before
>      napi_enable the new ones? If yes, we might not need to worry about
>      a NAPI ID takeover function.

With the explicit driver opt-in via netif_napi_add_config, this
shouldn't matter? When somebody gets to converting the drivers that
don't follow this common pattern they'll have to solve the takeover
part :-)
Re: [RFC net-next v4 0/9] Add support for per-NAPI config via netlink
Posted by Joe Damato 5 days, 9 hours ago
On Thu, Oct 03, 2024 at 04:29:37PM -0700, Stanislav Fomichev wrote:
> On 10/01, Joe Damato wrote:

[...]
 
> >   2. This revision seems to work (see below for a full walk through). Is
> >      this the behavior we want? Am I missing some use case or some
> >      behavioral thing other folks need?
> 
> The walk through looks good!

Thanks for taking a look.

> >   3. Re a previous point made by Stanislav regarding "taking over a NAPI
> >      ID" when the channel count changes: mlx5 seems to call napi_disable
> >      followed by netif_napi_del for the old queues and then calls
> >      napi_enable for the new ones. In this RFC, the NAPI ID generation
> >      is deferred to napi_enable. This means we won't end up with two of
> >      the same NAPI IDs added to the hash at the same time (I am pretty
> >      sure).
> 
> 
> [..]
> 
> >      Can we assume all drivers will napi_disable the old queues before
> >      napi_enable the new ones? If yes, we might not need to worry about
> >      a NAPI ID takeover function.
> 
> With the explicit driver opt-in via netif_napi_add_config, this
> shouldn't matter? When somebody gets to converting the drivers that
> don't follow this common pattern they'll have to solve the takeover
> part :-)

That is true; that's a good point. I'll let the RFC hang out on the
list for another day or two just to give Jakub time to catch up on
his mails ;) but if you all agree... this might be ready to be
resent as a PATCH instead of an RFC.
Re: [RFC net-next v4 0/9] Add support for per-NAPI config via netlink
Posted by Joe Damato 5 days, 6 hours ago
On Thu, Oct 03, 2024 at 04:53:13PM -0700, Joe Damato wrote:
> On Thu, Oct 03, 2024 at 04:29:37PM -0700, Stanislav Fomichev wrote:
> > On 10/01, Joe Damato wrote:
> 
> [...]
>  
> > >   2. This revision seems to work (see below for a full walk through). Is
> > >      this the behavior we want? Am I missing some use case or some
> > >      behavioral thing other folks need?
> > 
> > The walk through looks good!
> 
> Thanks for taking a look.
> 
> > >   3. Re a previous point made by Stanislav regarding "taking over a NAPI
> > >      ID" when the channel count changes: mlx5 seems to call napi_disable
> > >      followed by netif_napi_del for the old queues and then calls
> > >      napi_enable for the new ones. In this RFC, the NAPI ID generation
> > >      is deferred to napi_enable. This means we won't end up with two of
> > >      the same NAPI IDs added to the hash at the same time (I am pretty
> > >      sure).
> > 
> > 
> > [..]
> > 
> > >      Can we assume all drivers will napi_disable the old queues before
> > >      napi_enable the new ones? If yes, we might not need to worry about
> > >      a NAPI ID takeover function.
> > 
> > With the explicit driver opt-in via netif_napi_add_config, this
> > shouldn't matter? When somebody gets to converting the drivers that
> > don't follow this common pattern they'll have to solve the takeover
> > part :-)
> 
> That is true; that's a good point.

Actually, sorry, that isn't strictly true. NAPI ID generation is
moved for everything to napi_enable; they just are (or are not)
persisted depending on whether the driver opted in to add_config or
not.

So, the change does affect all drivers. NAPI IDs won't be generated
and added to the hash until napi_enable and they will be removed
from the hash in napi_disable... even if you didn't opt-in to having
storage.

Opt-ing in to storage via netif_napi_add_config just means that your
NAPI IDs (and other settings) will be persistent.
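
Roughly, the new timing for every driver is (pseudo-C sketch, with the
exact hashing helpers hand-waved; opting in only changes whether the
config slot exists):

    void napi_enable(struct napi_struct *n)
    {
            if (n->config)
                    napi_restore_config(n); /* reuse saved ID/settings */
            napi_hash_add(n);   /* added to the hash here; fresh ID if
                                 * none was saved */
            /* ... rest of enable ... */
    }

    void napi_disable(struct napi_struct *n)
    {
            /* ... rest of disable ... */
            napi_hash_del(n);               /* always dropped from hash */
            if (n->config)
                    napi_save_config(n);    /* persist ID/settings */
    }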

Sorry about my confusion when replying earlier.
Re: [RFC net-next v4 0/9] Add support for per-NAPI config via netlink
Posted by Stanislav Fomichev 4 days, 17 hours ago
On 10/03, Joe Damato wrote:
> On Thu, Oct 03, 2024 at 04:53:13PM -0700, Joe Damato wrote:
> > On Thu, Oct 03, 2024 at 04:29:37PM -0700, Stanislav Fomichev wrote:
> > > On 10/01, Joe Damato wrote:
> > 
> > [...]
> >  
> > > >   2. This revision seems to work (see below for a full walk through). Is
> > > >      this the behavior we want? Am I missing some use case or some
> > > >      behavioral thing other folks need?
> > > 
> > > The walk through looks good!
> > 
> > Thanks for taking a look.
> > 
> > > >   3. Re a previous point made by Stanislav regarding "taking over a NAPI
> > > >      ID" when the channel count changes: mlx5 seems to call napi_disable
> > > >      followed by netif_napi_del for the old queues and then calls
> > > >      napi_enable for the new ones. In this RFC, the NAPI ID generation
> > > >      is deferred to napi_enable. This means we won't end up with two of
> > > >      the same NAPI IDs added to the hash at the same time (I am pretty
> > > >      sure).
> > > 
> > > 
> > > [..]
> > > 
> > > >      Can we assume all drivers will napi_disable the old queues before
> > > >      napi_enable the new ones? If yes, we might not need to worry about
> > > >      a NAPI ID takeover function.
> > > 
> > > With the explicit driver opt-in via netif_napi_add_config, this
> > > shouldn't matter? When somebody gets to converting the drivers that
> > > don't follow this common pattern they'll have to solve the takeover
> > > part :-)
> > 
> > That is true; that's a good point.
> 
> Actually, sorry, that isn't strictly true. NAPI ID generation is
> moved for everything to napi_enable; they just are (or are not)
> persisted depending on whether the driver opted in to add_config or
> not.
> 
> So, the change does affect all drivers. NAPI IDs won't be generated
> and added to the hash until napi_enable and they will be removed
> from the hash in napi_disable... even if you didn't opt-in to having
> storage.
> 
> Opt-ing in to storage via netif_napi_add_config just means that your
> NAPI IDs (and other settings) will be persistent.
> 
> Sorry about my confusion when replying earlier.

AFAIA, all control operations (ethtool or similar ones via netlink)
should grab the rtnl lock. So as long as both enable/disable happen
under rtnl (and in my mind they should), I don't think there are going
to be any user-visible side effects from your change. But I might be
wrong; let's see if others can come up with some corner cases..