This series adds a devlink= kernel command line parameter for applying
selected devlink settings during device initialization.

Following a discussion with Jakub[1], I am sending this RFC to get the
conversation moving. I started from Jakub's example/request and extended
it to cover requirements from production systems and configurations that
customers use.

One important caveat is that the parsing logic in this RFC was written
with AI assistance. I am also not sure whether the resulting syntax and
parser are too complex for a kernel command line interface. This is part
of why I am sending it as an RFC: to understand what direction and level
of complexity would be acceptable to people.

The implementation is intended to support the following properties:

- A system may have multiple devlink devices that usually need the same
  configuration. For a configuration such as eswitch mode switchdev, a
  user should be able to specify multiple devices to which that
  configuration applies.

- There may be ordering dependencies between options. For example, in
  mlx5, flow_steering_mode should be set before moving to switchdev.
  With this in mind, defaults are applied per device in the left-to-right
  order in which they appear on the command line.

The intent is to let deployments set devlink defaults before normal
userspace orchestration runs, while still using devlink concepts and
driver callbacks rather than adding driver-specific module parameters.
A default is scoped to one or more devlink handles, for example:

  devlink=[pci/0000:08:00.0]:esw:mode:switchdev
  devlink=[pci/0000:08:00.0]:param:flow_steering_mode:smfs
  devlink=[pci/0000:08:00.0,pci/0000:08:00.1]:param:flow_steering_mode:hmfs,[pci/0000:08:00.0,pci/0000:08:00.1]:esw:mode:switchdev

The infrastructure stores parsed defaults per devlink handle and applies
them in command-line order when a matching devlink instance is ready.
Duplicate defaults for the same handle are rejected so the resulting
state is deterministic.
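
For readers who want to see the syntax above in executable form, here is a
minimal userspace sketch in Python. It is not the parser from the patches.
The function name parse_devlink_defaults, the three-field shape of a
setting, and the rule that two settings are duplicates when their command
and attribute match are all assumptions made for illustration.

```python
import re

# One entry looks like "[handle,handle,...]:setting"; entries are comma-separated.
ENTRY_RE = re.compile(r"\[([^\]]+)\]:([^,\[\]]+)")


def parse_devlink_defaults(value):
    """Map each devlink handle to its settings, kept in command-line order."""
    defaults = {}
    pos = 0
    while True:
        match = ENTRY_RE.match(value, pos)
        if match is None:
            raise ValueError(f"malformed entry at offset {pos}")
        fields = tuple(match.group(2).split(":"))
        # Assumed shapes: ("esw", "mode", MODE) or ("param", NAME, VALUE).
        if len(fields) != 3 or fields[0] not in ("esw", "param"):
            raise ValueError(f"unsupported setting {match.group(2)!r}")
        if fields[0] == "esw" and fields[1] != "mode":
            raise ValueError(f"unsupported eswitch attribute {fields[1]!r}")
        for handle in match.group(1).split(","):
            if not handle:
                raise ValueError("empty devlink handle")
            per_handle = defaults.setdefault(handle, [])
            # Assumption: settings collide when command and attribute match.
            if any(old[:2] == fields[:2] for old in per_handle):
                raise ValueError(f"duplicate default for {handle}")
            per_handle.append(fields)
        pos = match.end()
        if pos == len(value):
            return defaults
        if value[pos] != ",":
            raise ValueError(f"expected ',' at offset {pos}")
        pos += 1
```

Because each handle keeps a list, applying the settings for one device in
list order gives the left-to-right ordering described above, for example
flow_steering_mode before the switchdev transition.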

The first supported command is eswitch mode configuration. The second is
generic runtime devlink parameter setting. Parameter values are parsed
according to the registered devlink parameter type and are applied only
in runtime configuration mode.

mlx5 wires this into device initialization after the devlink instance is
registered and after mlx5 devlink operations and parameters are
available, so both eswitch mode defaults and runtime parameter defaults
can be applied to matching devlink devices.

Patch 1 adds the generic devlink boot-default parser, storage, duplicate
handling and devl_apply_defaults() API.

Patch 2 adds eswitch mode defaults and documents the devlink= syntax.

Patch 3 adds runtime devlink parameter defaults, including string to
devlink parameter value conversion.

Patch 4 calls devl_apply_defaults() from mlx5 device initialization.

[1] https://lore.kernel.org/all/20260502184153.4fd8d06f@kernel.org/

Mark Bloch (4):
  devlink: Add infrastructure for boot-time defaults
  devlink: Add eswitch mode boot default
  devlink: Add runtime parameter boot defaults
  net/mlx5: Apply devlink boot defaults during init

 .../admin-guide/kernel-parameters.txt         |  26 +
 .../networking/devlink/devlink-defaults.rst   | 115 ++++
 Documentation/networking/devlink/index.rst    |   1 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |   2 +
 include/net/devlink.h                         |   1 +
 net/devlink/core.c                            | 591 ++++++++++++++++++
 net/devlink/devl_internal.h                   |   3 +
 net/devlink/param.c                           |  70 +++
 8 files changed, 809 insertions(+)
 create mode 100644 Documentation/networking/devlink/devlink-defaults.rst

base-commit: 7e0cccae6b45b12eaf71fc3ab8eb133bb50b28ad
--
2.34.1

Wed, May 06, 2026 at 02:37:35PM +0200, mbloch@nvidia.com wrote:
>This series adds a devlink= kernel command line parameter for applying
>selected devlink settings during device initialization.
>
>Following a discussion with Jakub[1], I am sending this RFC to get the
>conversation moving. I started from Jakub's example/request and extended
>it to cover requirements from production systems and configurations that
>customers use.
>
>One important caveat is that the parsing logic in this RFC was written
>with AI assistance. I am also not sure whether the resulting syntax and
>parser are too complex for a kernel command line interface. This is part
>of why I am sending it as an RFC: to understand what direction and level
>of complexity would be acceptable to people.
>
>The implementation is intended to support the following properties:
>
>- A system may have multiple devlink devices that usually need the same
> configuration. For a configuration such as eswitch mode switchdev, a
> user should be able to specify multiple devices to which that
> configuration applies.
>
>- There may be ordering dependencies between options. For example, in
> mlx5, flow_steering_mode should be set before moving to switchdev.
> With this in mind, defaults are applied per device in the left-to-right
> order in which they appear on the command line.
>
>The intent is to let deployments set devlink defaults before normal
>userspace orchestration runs, while still using devlink concepts and

"defaults before normal userspace orchestration". I read it as config
before config, which eventually could be skipped.


>driver callbacks rather than adding driver-specific module parameters.
>A default is scoped to one or more devlink handles, for example:
>
> devlink=[pci/0000:08:00.0]:esw:mode:switchdev
> devlink=[pci/0000:08:00.0]:param:flow_steering_mode:smfs
> devlink=[pci/0000:08:00.0,pci/0000:08:00.1]:param:flow_steering_mode:hmfs,[pci/0000:08:00.0,pci/0000:08:00.1]:esw:mode:switchdev

I don't like this. What you do, you are basically introducing user
configuration tool on kernel cmdline.

The same you would achieve with a proper userspace tool/daemon.
I did try to come up with it and push it here:
https://github.com/systemd/systemd/pull/37393
That didn't get merged for unknown reason, but the idea is sound. You
provide configuration files for devlink object and systemd-devlinkd
will apply when they appear. Wouldn't this help your case?

[..]

On 06/05/2026 18:22, Jiri Pirko wrote:
> Wed, May 06, 2026 at 02:37:35PM +0200, mbloch@nvidia.com wrote:
>> This series adds a devlink= kernel command line parameter for applying
>> selected devlink settings during device initialization.
>>
>> Following a discussion with Jakub[1], I am sending this RFC to get the
>> conversation moving. I started from Jakub's example/request and extended
>> it to cover requirements from production systems and configurations that
>> customers use.
>>
>> One important caveat is that the parsing logic in this RFC was written
>> with AI assistance. I am also not sure whether the resulting syntax and
>> parser are too complex for a kernel command line interface. This is part
>> of why I am sending it as an RFC: to understand what direction and level
>> of complexity would be acceptable to people.
>>
>> The implementation is intended to support the following properties:
>>
>> - A system may have multiple devlink devices that usually need the same
>> configuration. For a configuration such as eswitch mode switchdev, a
>> user should be able to specify multiple devices to which that
>> configuration applies.
>>
>> - There may be ordering dependencies between options. For example, in
>> mlx5, flow_steering_mode should be set before moving to switchdev.
>> With this in mind, defaults are applied per device in the left-to-right
>> order in which they appear on the command line.
>>
>> The intent is to let deployments set devlink defaults before normal
>> userspace orchestration runs, while still using devlink concepts and
>
> "defaults before normal userspace orchestration". I read it as config
> before config, which eventually could be skipped.
>
>
>> driver callbacks rather than adding driver-specific module parameters.
>> A default is scoped to one or more devlink handles, for example:
>>
>> devlink=[pci/0000:08:00.0]:esw:mode:switchdev
>> devlink=[pci/0000:08:00.0]:param:flow_steering_mode:smfs
>> devlink=[pci/0000:08:00.0,pci/0000:08:00.1]:param:flow_steering_mode:hmfs,[pci/0000:08:00.0,pci/0000:08:00.1]:esw:mode:switchdev
>
> I don't like this. What you do, you are basically introducing user
> configuration tool on kernel cmdline.
>
> The same you would achieve with a proper userspace tool/daemon.
> I did try to come up with it and push it here:
> https://github.com/systemd/systemd/pull/37393
> That didn't get merged for unknown reason, but the idea is sound. You
> provide configuration files for devlink object and systemd-devlinkd
> will apply when they appear. Wouldn't this help your case?

I agree that systemd-devlinkd is the right shape for normal
devlink configuration, and it could probably replace the udev/devlink
plumbing we use today.

The case I am trying to cover is earlier than that.

On BlueField/ECPF/DPU systems, the host PF driver cannot always finish
probing independently of the ECPF side. When the ECPF is the eswitch
manager, the host PF is kept in initializing state until the ECPF eswitch
side is set up and mlx5 enables the external host PF HCA. That happens as
part of moving the ECPF to switchdev.

Today userspace observes the ECPF instance and then switches the
mode through devlink, usually via udev or similar plumbing. That still
leaves a window where the ECPF has probed, userspace has not applied the
mode yet, and the host PF is waiting. With many ECPFs this becomes visible
in host PF probe/boot time. A daemon reacting to the devlink object
appearing can make the userspace side cleaner, but it still runs after the
device has appeared and after userspace scheduling/uevent handling.

Long term, for these DPU deployments, we would like mlx5 to initialize
directly in switchdev. I am hesitant to make that unconditional because it
changes existing behavior and there is no early opt-out before probe. The
cmdline parameter was meant as an explicit opt-in middle step: ask the
driver to apply the same devlink operation during init, before this path
depends on userspace.

We previously tried to address this with an mlx5 module parameter. By
design, that was too coarse: it applied to all mlx5 devices handled by the
module. That makes it usable only for narrow DPU-only configurations. The
devlink-handle based cmdline syntax was intended to keep the opt-in scoped
to the specific devices that need this early switchdev transition.

Mark

>
> [..]

Wed, May 06, 2026 at 07:35:10PM +0200, mbloch@nvidia.com wrote:
>
>
>On 06/05/2026 18:22, Jiri Pirko wrote:
>> Wed, May 06, 2026 at 02:37:35PM +0200, mbloch@nvidia.com wrote:
>>> This series adds a devlink= kernel command line parameter for applying
>>> selected devlink settings during device initialization.
>>>
>>> Following a discussion with Jakub[1], I am sending this RFC to get the
>>> conversation moving. I started from Jakub's example/request and extended
>>> it to cover requirements from production systems and configurations that
>>> customers use.
>>>
>>> One important caveat is that the parsing logic in this RFC was written
>>> with AI assistance. I am also not sure whether the resulting syntax and
>>> parser are too complex for a kernel command line interface. This is part
>>> of why I am sending it as an RFC: to understand what direction and level
>>> of complexity would be acceptable to people.
>>>
>>> The implementation is intended to support the following properties:
>>>
>>> - A system may have multiple devlink devices that usually need the same
>>> configuration. For a configuration such as eswitch mode switchdev, a
>>> user should be able to specify multiple devices to which that
>>> configuration applies.
>>>
>>> - There may be ordering dependencies between options. For example, in
>>> mlx5, flow_steering_mode should be set before moving to switchdev.
>>> With this in mind, defaults are applied per device in the left-to-right
>>> order in which they appear on the command line.
>>>
>>> The intent is to let deployments set devlink defaults before normal
>>> userspace orchestration runs, while still using devlink concepts and
>>
>> "defaults before normal userspace orchestration". I read it as config
>> before config, which eventually could be skipped.
>>
>>
>>> driver callbacks rather than adding driver-specific module parameters.
>>> A default is scoped to one or more devlink handles, for example:
>>>
>>> devlink=[pci/0000:08:00.0]:esw:mode:switchdev
>>> devlink=[pci/0000:08:00.0]:param:flow_steering_mode:smfs
>>> devlink=[pci/0000:08:00.0,pci/0000:08:00.1]:param:flow_steering_mode:hmfs,[pci/0000:08:00.0,pci/0000:08:00.1]:esw:mode:switchdev
>>
>> I don't like this. What you do, you are basically introducing user
>> configuration tool on kernel cmdline.
>>
>> The same you would achieve with a proper userspace tool/daemon.
>> I did try to come up with it and push it here:
>> https://github.com/systemd/systemd/pull/37393
>> That didn't get merged for unknown reason, but the idea is sound. You
>> provide configuration files for devlink object and systemd-devlinkd
>> will apply when they appear. Wouldn't this help your case?
>
>I agree that systemd-devlinkd is the right shape for normal
>devlink configuration, and it could probably replace the udev/devlink
>plumbing we use today.
>
>The case I am trying to cover is earlier than that.
>
>On BlueField/ECPF/DPU systems, the host PF driver cannot always finish
>probing independently of the ECPF side. When the ECPF is the eswitch
>manager, the host PF is kept in initializing state until the ECPF eswitch
>side is set up and mlx5 enables the external host PF HCA. That happens as
>part of moving the ECPF to switchdev.
>
>Today userspace observes the ECPF instance and then switches the
>mode through devlink, usually via udev or similar plumbing. That still
>leaves a window where the ECPF has probed, userspace has not applied the
>mode yet, and the host PF is waiting. With many ECPFs this becomes visible
>in host PF probe/boot time. A daemon reacting to the devlink object
>appearing can make the userspace side cleaner, but it still runs after the
>device has appeared and after userspace scheduling/uevent handling.
>
>Long term, for these DPU deployments, we would like mlx5 to initialize
>directly in switchdev. I am hesitant to make that unconditional because it
>changes existing behavior and there is no early opt-out before probe. The
>cmdline parameter was meant as an explicit opt-in middle step: ask the
>driver to apply the same devlink operation during init, before this path
>depends on userspace.
>
>We previously tried to address this with an mlx5 module parameter. By
>design, that was too coarse: it applied to all mlx5 devices handled by the
>module. That makes it usable only for narrow DPU-only configurations. The
>devlink-handle based cmdline syntax was intended to keep the opt-in scoped
>to the specific devices that need this early switchdev transition.

The switchdev mode was introduced at roughly the time CX4 was out. What
stopped us from making it default for CX4+ ?

Introducing this horrible plumbing only because we were not able to
change the default sounds so absurd.

Can we write the default mode as a bit in ASIC NV memory perhaps? Simple
devlink cmode permanent param to write it, the driver can read this bit
during init to decide the init flow path?

On 07/05/2026 14:03, Jiri Pirko wrote:
> Wed, May 06, 2026 at 07:35:10PM +0200, mbloch@nvidia.com wrote:
>>
>>
>> On 06/05/2026 18:22, Jiri Pirko wrote:
>>> Wed, May 06, 2026 at 02:37:35PM +0200, mbloch@nvidia.com wrote:
>>>> This series adds a devlink= kernel command line parameter for applying
>>>> selected devlink settings during device initialization.
>>>>
>>>> Following a discussion with Jakub[1], I am sending this RFC to get the
>>>> conversation moving. I started from Jakub's example/request and extended
>>>> it to cover requirements from production systems and configurations that
>>>> customers use.
>>>>
>>>> One important caveat is that the parsing logic in this RFC was written
>>>> with AI assistance. I am also not sure whether the resulting syntax and
>>>> parser are too complex for a kernel command line interface. This is part
>>>> of why I am sending it as an RFC: to understand what direction and level
>>>> of complexity would be acceptable to people.
>>>>
>>>> The implementation is intended to support the following properties:
>>>>
>>>> - A system may have multiple devlink devices that usually need the same
>>>> configuration. For a configuration such as eswitch mode switchdev, a
>>>> user should be able to specify multiple devices to which that
>>>> configuration applies.
>>>>
>>>> - There may be ordering dependencies between options. For example, in
>>>> mlx5, flow_steering_mode should be set before moving to switchdev.
>>>> With this in mind, defaults are applied per device in the left-to-right
>>>> order in which they appear on the command line.
>>>>
>>>> The intent is to let deployments set devlink defaults before normal
>>>> userspace orchestration runs, while still using devlink concepts and
>>>
>>> "defaults before normal userspace orchestration". I read it as config
>>> before config, which eventually could be skipped.
>>>
>>>
>>>> driver callbacks rather than adding driver-specific module parameters.
>>>> A default is scoped to one or more devlink handles, for example:
>>>>
>>>> devlink=[pci/0000:08:00.0]:esw:mode:switchdev
>>>> devlink=[pci/0000:08:00.0]:param:flow_steering_mode:smfs
>>>> devlink=[pci/0000:08:00.0,pci/0000:08:00.1]:param:flow_steering_mode:hmfs,[pci/0000:08:00.0,pci/0000:08:00.1]:esw:mode:switchdev
>>>
>>> I don't like this. What you do, you are basically introducing user
>>> configuration tool on kernel cmdline.
>>>
>>> The same you would achieve with a proper userspace tool/daemon.
>>> I did try to come up with it and push it here:
>>> https://github.com/systemd/systemd/pull/37393
>>> That didn't get merged for unknown reason, but the idea is sound. You
>>> provide configuration files for devlink object and systemd-devlinkd
>>> will apply when they appear. Wouldn't this help your case?
>>
>> I agree that systemd-devlinkd is the right shape for normal
>> devlink configuration, and it could probably replace the udev/devlink
>> plumbing we use today.
>>
>> The case I am trying to cover is earlier than that.
>>
>> On BlueField/ECPF/DPU systems, the host PF driver cannot always finish
>> probing independently of the ECPF side. When the ECPF is the eswitch
>> manager, the host PF is kept in initializing state until the ECPF eswitch
>> side is set up and mlx5 enables the external host PF HCA. That happens as
>> part of moving the ECPF to switchdev.
>>
>> Today userspace observes the ECPF instance and then switches the
>> mode through devlink, usually via udev or similar plumbing. That still
>> leaves a window where the ECPF has probed, userspace has not applied the
>> mode yet, and the host PF is waiting. With many ECPFs this becomes visible
>> in host PF probe/boot time. A daemon reacting to the devlink object
>> appearing can make the userspace side cleaner, but it still runs after the
>> device has appeared and after userspace scheduling/uevent handling.
>>
>> Long term, for these DPU deployments, we would like mlx5 to initialize
>> directly in switchdev. I am hesitant to make that unconditional because it
>> changes existing behavior and there is no early opt-out before probe. The
>> cmdline parameter was meant as an explicit opt-in middle step: ask the
>> driver to apply the same devlink operation during init, before this path
>> depends on userspace.
>>
>> We previously tried to address this with an mlx5 module parameter. By
>> design, that was too coarse: it applied to all mlx5 devices handled by the
>> module. That makes it usable only for narrow DPU-only configurations. The
>> devlink-handle based cmdline syntax was intended to keep the opt-in scoped
>> to the specific devices that need this early switchdev transition.
>
> The switchdev mode was introduced at roughly the time CX4 was out. What
> stopped us from making it default for CX4+ ?
>
> Introducing this horrible plumbing only because we were not able to
> change the default sounds so absurd.
>
> Can we write the default mode as a bit in ASIC NV memory perhaps? Simple
> devlink cmode permanent param to write it, the driver can read this bit
> during init to decide the init flow path?

I don't think switchdev by default should mean CX4+ in general. If we get
there, I would expect it to be limited to the DPU/BlueField/ECPF case, where
the host PF probe path can depend on the ECPF reaching switchdev. Changing the
default for regular host NIC deployments feels like a much larger compatibility
change.

For the ASIC/NV bit: maybe technically possible, but it feels like the wrong
layer. This is boot/deployment policy, not a persistent hardware property, and
storing it in NV memory would make the state persist across kernels/hosts in a
surprising way.

I do agree the RFC probably went too far by making a generic devlink cmdline
configuration language. Maybe the smaller thing to discuss is only:

devlink=[pci/...]:esw:mode:{legacy|switchdev|switchdev_inactive}

No runtime params, no ordering between different operations, just early eswitch
mode for explicitly selected handles.

@Jakub, I know you wanted something more generic/extensible, but maybe the
generic case belongs in the devlinkd/systemd direction Jiri pointed at, while
the kernel cmdline handles only this early boot eswitch mode case.

Mark

Fri, May 08, 2026 at 07:59:04PM +0200, mbloch@nvidia.com wrote:
>
>
>On 07/05/2026 14:03, Jiri Pirko wrote:
>> Wed, May 06, 2026 at 07:35:10PM +0200, mbloch@nvidia.com wrote:
>>>
>>>
>>> On 06/05/2026 18:22, Jiri Pirko wrote:
>>>> Wed, May 06, 2026 at 02:37:35PM +0200, mbloch@nvidia.com wrote:
>>>>> This series adds a devlink= kernel command line parameter for applying
>>>>> selected devlink settings during device initialization.
>>>>>
>>>>> Following a discussion with Jakub[1], I am sending this RFC to get the
>>>>> conversation moving. I started from Jakub's example/request and extended
>>>>> it to cover requirements from production systems and configurations that
>>>>> customers use.
>>>>>
>>>>> One important caveat is that the parsing logic in this RFC was written
>>>>> with AI assistance. I am also not sure whether the resulting syntax and
>>>>> parser are too complex for a kernel command line interface. This is part
>>>>> of why I am sending it as an RFC: to understand what direction and level
>>>>> of complexity would be acceptable to people.
>>>>>
>>>>> The implementation is intended to support the following properties:
>>>>>
>>>>> - A system may have multiple devlink devices that usually need the same
>>>>> configuration. For a configuration such as eswitch mode switchdev, a
>>>>> user should be able to specify multiple devices to which that
>>>>> configuration applies.
>>>>>
>>>>> - There may be ordering dependencies between options. For example, in
>>>>> mlx5, flow_steering_mode should be set before moving to switchdev.
>>>>> With this in mind, defaults are applied per device in the left-to-right
>>>>> order in which they appear on the command line.
>>>>>
>>>>> The intent is to let deployments set devlink defaults before normal
>>>>> userspace orchestration runs, while still using devlink concepts and
>>>>
>>>> "defaults before normal userspace orchestration". I read it as config
>>>> before config, which eventually could be skipped.
>>>>
>>>>
>>>>> driver callbacks rather than adding driver-specific module parameters.
>>>>> A default is scoped to one or more devlink handles, for example:
>>>>>
>>>>> devlink=[pci/0000:08:00.0]:esw:mode:switchdev
>>>>> devlink=[pci/0000:08:00.0]:param:flow_steering_mode:smfs
>>>>> devlink=[pci/0000:08:00.0,pci/0000:08:00.1]:param:flow_steering_mode:hmfs,[pci/0000:08:00.0,pci/0000:08:00.1]:esw:mode:switchdev
>>>>
>>>> I don't like this. What you do, you are basically introducing user
>>>> configuration tool on kernel cmdline.
>>>>
>>>> The same you would achieve with a proper userspace tool/daemon.
>>>> I did try to come up with it and push it here:
>>>> https://github.com/systemd/systemd/pull/37393
>>>> That didn't get merged for unknown reason, but the idea is sound. You
>>>> provide configuration files for devlink object and systemd-devlinkd
>>>> will apply when they appear. Wouldn't this help your case?
>>>
>>> I agree that systemd-devlinkd is the right shape for normal
>>> devlink configuration, and it could probably replace the udev/devlink
>>> plumbing we use today.
>>>
>>> The case I am trying to cover is earlier than that.
>>>
>>> On BlueField/ECPF/DPU systems, the host PF driver cannot always finish
>>> probing independently of the ECPF side. When the ECPF is the eswitch
>>> manager, the host PF is kept in initializing state until the ECPF eswitch
>>> side is set up and mlx5 enables the external host PF HCA. That happens as
>>> part of moving the ECPF to switchdev.
>>>
>>> Today userspace observes the ECPF instance and then switches the
>>> mode through devlink, usually via udev or similar plumbing. That still
>>> leaves a window where the ECPF has probed, userspace has not applied the
>>> mode yet, and the host PF is waiting. With many ECPFs this becomes visible
>>> in host PF probe/boot time. A daemon reacting to the devlink object
>>> appearing can make the userspace side cleaner, but it still runs after the
>>> device has appeared and after userspace scheduling/uevent handling.
>>>
>>> Long term, for these DPU deployments, we would like mlx5 to initialize
>>> directly in switchdev. I am hesitant to make that unconditional because it
>>> changes existing behavior and there is no early opt-out before probe. The
>>> cmdline parameter was meant as an explicit opt-in middle step: ask the
>>> driver to apply the same devlink operation during init, before this path
>>> depends on userspace.
>>>
>>> We previously tried to address this with an mlx5 module parameter. By
>>> design, that was too coarse: it applied to all mlx5 devices handled by the
>>> module. That makes it usable only for narrow DPU-only configurations. The
>>> devlink-handle based cmdline syntax was intended to keep the opt-in scoped
>>> to the specific devices that need this early switchdev transition.
>>
>> The switchdev mode was introduced at roughly the time CX4 was out. What
>> stopped us from making it default for CX4+ ?
>>
>> Introducing this horrible plumbing only because we were not able to
>> change the default sounds so absurd.
>>
>> Can we write the default mode as a bit in ASIC NV memory perhaps? Simple
>> devlink cmode permanent param to write it, the driver can read this bit
>> during init to decide the init flow path?
>
>I don't think switchdev by default should mean CX4+ in general. If we get
>there, I would expect it to be limited to the DPU/BlueField/ECPF case, where
>the host PF probe path can depend on the ECPF reaching switchdev. Changing the
>default for regular host NIC deployments feels like a much larger compatibility
>change.
We can't travel through time, but if from CX5 onwards the default would
be switchdev, nobody would feel broken in terms of compatibility. That
is my point. Having "legacy" as default is simply wrong for newer NIC
generations. That is why it is called "legacy" and it should have been
rotten through and out since CX4 times.
>
>For the ASIC/NV bit: maybe technically possible, but it feels like the wrong
>layer. This is boot/deployment policy, not a persistent hardware property, and
>storing it in NV memory would make the state persist across kernels/hosts in a
>surprising way.
Well, as any other nv config, it persists across kernels/hosts. Think
about it as "unbreak-my-not-legacy-device" bit.
>
>I do agree the RFC probably went too far by making a generic devlink cmdline
>configuration language. Maybe the smaller thing to discuss is only:
>
>devlink=[pci/...]:esw:mode:{legacy|switchdev|switchdev_inactive}
>
>No runtime params, no ordering between different operations, just early eswitch
>mode for explicitly selected handles.
FWIW, I'm still against this.
>
>@Jakub, I know you wanted something more generic/extensible, but maybe the
>generic case belongs in the devlinkd/systemd direction Jiri pointed at, while
>the kernel cmdline handles only this early boot eswitch mode case.
>
>Mark
On Fri, 8 May 2026 20:07:44 +0200 Jiri Pirko wrote:
> >I don't think switchdev by default should mean CX4+ in general. If we get
> >there, I would expect it to be limited to the DPU/BlueField/ECPF case, where
> >the host PF probe path can depend on the ECPF reaching switchdev. Changing the
> >default for regular host NIC deployments feels like a much larger compatibility
> >change.
>
> We can't travel through time, but if from CX5 onwards the default would
> be switchdev, nobody would feel broken in terms of compatibility. That
> is my point. Having "legacy" as default is simply wrong for newer NIC
> generations. That is why it is called "legacy" and it should have been
> rotten through and out since CX4 times.
legacy vs switchdev only describes the eswitch configuration.
As a non-SR-IOV user I really don't want to see the extra representors
hanging around my systems, confusing all daemons. IIRC mlx5 had some
limitations around the uplink representor. Maybe that's the disconnect.
But for real, fully featured switchdev eswitches, having the
PHY and PF representors on boot, always, will not make sense.
IOW it's not a question of the generation of the card but of
the deployment type / use case.
> >For the ASIC/NV bit: maybe technically possible, but it feels like the wrong
> >layer. This is boot/deployment policy, not a persistent hardware property, and
> >storing it in NV memory would make the state persist across kernels/hosts in a
> >surprising way.
>
> Well, as any other nv config, it persists across kernels/hosts. Think
> about it as "unbreak-my-not-legacy-device" bit.
For most devices the switchdev mode does not change anything
substantial about the device. It's purely a kernel / driver config.
It changes what objects and default rules kernel / driver installs.
So I don't get why it would make sense to flash into the device
nvmem a Linux SW stack specific config.
> >I do agree the RFC probably went too far by making a generic devlink cmdline
> >configuration language. Maybe the smaller thing to discuss is only:
> >
> >devlink=[pci/...]:esw:mode:{legacy|switchdev|switchdev_inactive}
> >
> >No runtime params, no ordering between different operations, just early eswitch
> >mode for explicitly selected handles.
Yes, let's cut this down, AI went too far :) As I said we should just
document how we envision the format growing but for now we can literally
implement just the global "esw mode".
One note on the formatting, you mentioned:
devlink=[pci/0000:08:00.0,pci/0000:08:00.1]:param:flow_steering_mode:hmfs,[pci/0000:08:00.0,pci/0000:08:00.1]:esw:mode:switchdev
TBH when I used the square brackets I meant that the field is optional.
But I guess you used them like we use them for IPv6 addresses to
separate the : signs, makes sense.
Since AFAIU we only care about a global default, should we focus on
supporting:
devlink=*:esw:mode:switchdev
meaning all devices default to switchdev?
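
To make the reduced form concrete, here is a small Python sketch that
accepts only the eswitch mode setting, either for every device ("*") or
for an explicit bracketed handle list. It is an illustration, not code from
the series: parse_esw_default and esw_mode_for are hypothetical names, and
treating "*" as "match every handle" is an assumption about the semantics
asked about above. The three mode names are the ones listed earlier in the
thread.

```python
ESW_MODES = ("legacy", "switchdev", "switchdev_inactive")


def parse_esw_default(value):
    """Parse 'SELECTOR:esw:mode:MODE' into (handles, mode).

    handles is None for the '*' wildcard, else a frozenset of handle names.
    """
    # Handles such as pci/0000:08:00.0 contain ':' themselves, so split on
    # the fixed ':esw:mode:' marker instead of on single colons.
    selector, sep, mode = value.rpartition(":esw:mode:")
    if not sep or mode not in ESW_MODES:
        raise ValueError(f"unsupported setting {value!r}")
    if selector == "*":
        return None, mode
    if selector.startswith("[") and selector.endswith("]"):
        handles = selector[1:-1].split(",")
        if all(handles):
            return frozenset(handles), mode
    raise ValueError(f"bad device selector {selector!r}")


def esw_mode_for(handle, parsed):
    """Return the default mode for one devlink handle, or None if unset."""
    handles, mode = parsed
    if handles is None or handle in handles:
        return mode
    return None
```

With this shape a driver-side lookup needs only the device's own handle,
and there is no ordering between operations left to define.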
> FWIW, I'm still against this.
One more option, tho IDK if it actually is good enough for Mark,
would be to let user space "pause" devlink probing. So that the
systemd daemon can configure the device before it populates all
the netdev stuff. Basically make the devices probe into the reload_down
state, until user space configures them. IDK how much of the time
is spent building and tearing down the legacy mode on mlx5 but
the thinking is that we'd at least save that wasted effort.
Sat, May 09, 2026 at 02:52:13AM +0200, kuba@kernel.org wrote:
>On Fri, 8 May 2026 20:07:44 +0200 Jiri Pirko wrote:
>> >I don't think switchdev by default should mean CX4+ in general. If we get
>> >there, I would expect it to be limited to the DPU/BlueField/ECPF case, where
>> >the host PF probe path can depend on the ECPF reaching switchdev. Changing the
>> >default for regular host NIC deployments feels like a much larger compatibility
>> >change.
>>
>> We can't travel throught time, but if from CX5 onwards the default would
>> be switchdev, nobody would feel broken in terms of compatibility. That
>> is my point. Having "legacy" as default is simply wrong for never NIC
>> generations. That is why it is called "legacy" and it should have been
>> rotten through and out since CX4 times.
>
>legacy vs switchdev only describes the eswitch configuration.
>As a non-SR-IOV user I really don't want to see the extra representors
>hanging around my systems, confusing all daemons. IIRC mlx5 had some
>limitations around the uplink representor. Maybe that's the disconnect.
>But for a real, fully featured switchdev eswitches having the
>PHY and PF representors on boot, always, will not make sense.
As "a non-SR-IOV user", what extra representors you talk about? When you
have pfs only, you don't have anything extra. Just 1 netdev per-pf, one
devlink port per-pf. What's extra about it? When you don't have VFs/SFs.
Everything is the same:
c-220-136-220-218:~$ sudo devlink dev eswitch show pci/0000:08:00.0
pci/0000:08:00.0: mode switchdev inline-mode none encap-mode basic
c-220-136-220-218:~$ sudo devlink dev eswitch show pci/0000:08:00.1
pci/0000:08:00.1: mode legacy inline-mode none encap-mode basic
c-220-136-220-218:~$ devlink dev
pci/0000:08:00.0: index 0
nested_devlink:
auxiliary/mlx5_core.eth.0
devlink_index/1: index 1
nested_devlink:
pci/0000:08:00.0
pci/0000:08:00.1
auxiliary/mlx5_core.eth.0: index 2
pci/0000:08:00.1: index 3
nested_devlink:
auxiliary/mlx5_core.eth.1
auxiliary/mlx5_core.eth.1: index 4
c-220-136-220-218:~$ devlink port
auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false
c-220-136-220-218:~$ ip link
...
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether b8:e9:24:f2:b7:6c brd ff:ff:ff:ff:ff:ff
altname enp8s0f0np0
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether b8:e9:24:f2:b7:6d brd ff:ff:ff:ff:ff:ff
altname enp8s0f1np1
>
>IOW it's not a question of the generation of the card but of
>the deployment type / use case.
I don't think so, not in the case of mlx5. The difference is only when
you work with sr-iov, you either use legacy way (ip vf) or the new one.
Same usecase.
>
>> >For the ASIC/NV bit: maybe technically possible, but it feels like the wrong
>> >layer. This is boot/deployment policy, not a persistent hardware property, and
>> >storing it in NV memory would make the state persist across kernels/hosts in a
>> >surprising way.
>>
>> Well, as any other nv config, it persists across kernels/hosts. Think
>> about it as "unbreak-my-not-legacy-device" bit.
>
>For most devices the switchdev mode does not change anything
>substantial about the device. It's purely a kernel / driver config.
>It changes what objects and default rules kernel / driver installs.
>So I don't get why it would make sense to flash into the device
>nvmem a Linux SW stack specific config.
I look at it from the perspective that from some CX generation,
switchdev mode should be default. So that is a device-based decision.
I believe as such it can optionally be permanently configured (nv config)
on older device. Why not?
[...]
On Sat, 9 May 2026 09:01:23 +0200 Jiri Pirko wrote:
> Sat, May 09, 2026 at 02:52:13AM +0200, kuba@kernel.org wrote:
> >On Fri, 8 May 2026 20:07:44 +0200 Jiri Pirko wrote:
> >legacy vs switchdev only describes the eswitch configuration.
> >As a non-SR-IOV user I really don't want to see the extra representors
> >hanging around my systems, confusing all daemons. IIRC mlx5 had some
> >limitations around the uplink representor. Maybe that's the disconnect.
> >But for a real, fully featured switchdev eswitches having the
> >PHY and PF representors on boot, always, will not make sense.
>
> As "a non-SR-IOV user", what extra representors you talk about? When you
> have pfs only, you don't have anything extra. Just 1 netdev per-pf, one
> devlink port per-pf. What's extra about it? When you don't have VFs/SFs.
> Everyhing is the same:
Some devices have separate uplink ports and PF representors.
As I said, what you're proposing isn't going to work for all drivers.
> >> Well, as any other nv config, it persists across kernels/hosts.
> >> Think about it as "unbreak-my-not-legacy-device" bit.
> >
> >For most devices the switchdev mode does not change anything
> >substantial about the device. It's purely a kernel / driver config.
> >It changes what objects and default rules kernel / driver installs.
> >So I don't get why it would make sense to flash into the device
> >nvmem a Linux SW stack specific config.
>
> I look at it from the perspective that from some CX generation,
> switchdev mode should be default. So that is a device-based decision.
> I believe as such it can optionally be permanenty configured (nv config)
> on older device. Why not?
Feels a bit arbitrary and won't cover all cases. The question should be
why you are nacking a more reasonable solution. Keeping Linux config in
Linux params.
Sun, May 10, 2026 at 06:37:32PM +0200, kuba@kernel.org wrote:
>On Sat, 9 May 2026 09:01:23 +0200 Jiri Pirko wrote:
>> Sat, May 09, 2026 at 02:52:13AM +0200, kuba@kernel.org wrote:
>> >On Fri, 8 May 2026 20:07:44 +0200 Jiri Pirko wrote:
>> >legacy vs switchdev only describes the eswitch configuration.
>> >As a non-SR-IOV user I really don't want to see the extra representors
>> >hanging around my systems, confusing all daemons. IIRC mlx5 had some
>> >limitations around the uplink representor. Maybe that's the disconnect.
>> >But for a real, fully featured switchdev eswitches having the
>> >PHY and PF representors on boot, always, will not make sense.
>>
>> As "a non-SR-IOV user", what extra representors you talk about? When you
>> have pfs only, you don't have anything extra. Just 1 netdev per-pf, one
>> devlink port per-pf. What's extra about it? When you don't have VFs/SFs.
>> Everyhing is the same:
>
>Some devices have separate uplink ports and PF representors.
>As I said, what you're proposing isn't going to work for all drivers.
Well, the point is, mlx5 appears to be the one needing this, not other
drivers. What I'm trying to point at, mlx5 should not need this.
It makes things complicated, adding an ugly knob for no good reason.
Legacy/switchdev mode, in both, the non-sriov/eswitch user should not
see different behaviour. The mode is an eswitch attribute.
devlink dev eswitch set - sets devlink device eswitch attributes
mode { legacy | switchdev }
Set eswitch mode
legacy - Legacy SRIOV
switchdev - SRIOV switchdev offloads
Briefly looking over other drivers, looks like ice, bnxt, octeon, sfc,
there is no new entity created in case of switching to switchdev mode.
The only driver that creates separate pf entities seems to be nfp,
but the mode seems to be determined by the app being run (loaded
firmware).
Am I missing something?
>
>> >> Well, as any other nv config, it persists across kernels/hosts.
>> >> Think about it as "unbreak-my-not-legacy-device" bit.
>> >
>> >For most devices the switchdev mode does not change anything
>> >substantial about the device. It's purely a kernel / driver config.
>> >It changes what objects and default rules kernel / driver installs.
>> >So I don't get why it would make sense to flash into the device
>> >nvmem a Linux SW stack specific config.
>>
>> I look at it from the perspective that from some CX generation,
>> switchdev mode should be default. So that is a device-based decision.
>> I believe as such it can optionally be permanenty configured (nv config)
>> on older device. Why not?
>
>Feels a bit arbitrary and won't cover all cases. The question should be
What cases it does not cover? I don't follow.
>why you are nacking a more reasonable solution. Keeping Linux config in
>Linux params.
What's reasonable about adding basically a module option (kernel cmdline
is pretty much the same) for no reason?
On Mon, 11 May 2026 10:42:56 +0200 Jiri Pirko wrote:
> Sun, May 10, 2026 at 06:37:32PM +0200, kuba@kernel.org wrote:
> >On Sat, 9 May 2026 09:01:23 +0200 Jiri Pirko wrote:
> >> Sat, May 09, 2026 at 02:52:13AM +0200, kuba@kernel.org wrote:
> >> As "a non-SR-IOV user", what extra representors you talk about? When you
> >> have pfs only, you don't have anything extra. Just 1 netdev per-pf, one
> >> devlink port per-pf. What's extra about it? When you don't have VFs/SFs.
> >> Everyhing is the same:
> >
> >Some devices have separate uplink ports and PF representors.
> >As I said, what you're proposing isn't going to work for all drivers.
>
> Well, the point is, mlx5 appears to the the one needing this, not other
> drivers. What I'm trying to point at, mlx5 should not need this.
> It makes things compicated, adding a ugly knob for no good reason.
> Legacy/switchdev mode, in both, the non-sriov/eswitch user should not
> see different behaviour. The mode is an eswitch attribute.
>
> devlink dev eswitch set - sets devlink device eswitch attributes
> mode { legacy | switchdev }
> Set eswitch mode
>
> legacy - Legacy SRIOV
>
> switchdev - SRIOV switchdev offloads
>
>
> Briefly looking over other drivers, looks like ice, bnxt, octeon, sfc,
> there is no new entity created in case of switching to switchdev mode.
> The only driver that creates separate pf entities seems to be nfp,
> but the mode seems to be determined by the app being run (loaded
> firmware).
>
> Am I missing something?
Hm. Okay, I wasn't aware that mlx5 was the only driver that did
heavy-duty reinit for switching modes.
> >> I look at it from the perspective that from some CX generation,
> >> switchdev mode should be default. So that is a device-based decision.
> >> I believe as such it can optionally be permanenty configured (nv config)
> >> on older device. Why not?
> >
> >Feels a bit arbitrary and won't cover all cases. The question should be
>
> What cases it does not cover? I don't follow.
Other FW and HW versions. People are still using EOL devices (CX4/CX5),
IIUC the nvmem config path would require FW upgrade.
> >why you are nacking a more reasonable solution. Keeping Linux config in
> >Linux params.
>
> What's reasonable about adding basically a module option (kernel cmdline
> is pretty much the same) for no reason?
The initial patch as posted added this to a mlx5-specific module param.
If we need a module param IMO generic one is much better.
Doesn't matter if other drivers take no time to reinit into switchdev
mode, having to switch mlx5 with a module param and all the rest in
runtime is not the best user experience?
Tue, May 12, 2026 at 01:41:32AM +0200, kuba@kernel.org wrote:
>On Mon, 11 May 2026 10:42:56 +0200 Jiri Pirko wrote:
>> Sun, May 10, 2026 at 06:37:32PM +0200, kuba@kernel.org wrote:
>> >On Sat, 9 May 2026 09:01:23 +0200 Jiri Pirko wrote:
>> >> Sat, May 09, 2026 at 02:52:13AM +0200, kuba@kernel.org wrote:
>> >> As "a non-SR-IOV user", what extra representors you talk about? When you
>> >> have pfs only, you don't have anything extra. Just 1 netdev per-pf, one
>> >> devlink port per-pf. What's extra about it? When you don't have VFs/SFs.
>> >> Everyhing is the same:
>> >
>> >Some devices have separate uplink ports and PF representors.
>> >As I said, what you're proposing isn't going to work for all drivers.
>>
>> Well, the point is, mlx5 appears to the the one needing this, not other
>> drivers. What I'm trying to point at, mlx5 should not need this.
>> It makes things compicated, adding a ugly knob for no good reason.
>> Legacy/switchdev mode, in both, the non-sriov/eswitch user should not
>> see different behaviour. The mode is an eswitch attribute.
>>
>> devlink dev eswitch set - sets devlink device eswitch attributes
>> mode { legacy | switchdev }
>> Set eswitch mode
>>
>> legacy - Legacy SRIOV
>>
>> switchdev - SRIOV switchdev offloads
>>
>>
>> Briefly looking over other drivers, looks like ice, bnxt, octeon, sfc,
>> there is no new entity created in case of switching to switchdev mode.
>> The only driver that creates separate pf entities seems to be nfp,
>> but the mode seems to be determined by the app being run (loaded
>> firmware).
>>
>> Am I missing something?
>
>Hm. Okay, I wasn't aware that mlx5 was the only driver that did
>heavy-duty reinit for switching modes.
>
>> >> I look at it from the perspective that from some CX generation,
>> >> switchdev mode should be default. So that is a device-based decision.
>> >> I believe as such it can optionally be permanenty configured (nv config)
>> >> on older device. Why not?
>> >
>> >Feels a bit arbitrary and won't cover all cases. The question should be
>>
>> What cases it does not cover? I don't follow.
>
>Other FW and HW versions. People are still using EOL devices (CX4/CX5),
>IIUC the nvmem config path would require FW upgrade.
If user wants to have a new feature (a bit odd to call this feature,
but ok), he is obliged to upgrade FW. What's wrong about it?
But, even without nvconfig knob, what's stopping us from fixing the
behaviour (/bugs) and just make "switchdev" mode default in net-next for
all in mlx5 driver? Again, perhaps I'm missing something.
>
>> >why you are nacking a more reasonable solution. Keeping Linux config in
>> >Linux params.
>>
>> What's reasonable about adding basically a module option (kernel cmdline
>> is pretty much the same) for no reason?
>
>The initial patch as posted added this to a mlx5-specific module param.
>If we need a module param IMO generic one is much better.
>Doesn't matter if other drivers take no time to reinit into switchdev
>mode, having to switch mlx5 with a module param and all the rest in
>runtime is not the best user experience?
I still believe we don't need either, not module param, not cmdline
devlink option. We just need to fix bugs and have proper defaults. The
rest is shortcut.
On 09/05/2026 10:01, Jiri Pirko wrote:
> Sat, May 09, 2026 at 02:52:13AM +0200, kuba@kernel.org wrote:
>> On Fri, 8 May 2026 20:07:44 +0200 Jiri Pirko wrote:
>>>> I don't think switchdev by default should mean CX4+ in general. If we get
>>>> there, I would expect it to be limited to the DPU/BlueField/ECPF case, where
>>>> the host PF probe path can depend on the ECPF reaching switchdev. Changing the
>>>> default for regular host NIC deployments feels like a much larger compatibility
>>>> change.
>>>
>>> We can't travel throught time, but if from CX5 onwards the default would
>>> be switchdev, nobody would feel broken in terms of compatibility. That
>>> is my point. Having "legacy" as default is simply wrong for never NIC
>>> generations. That is why it is called "legacy" and it should have been
>>> rotten through and out since CX4 times.
>>
>> legacy vs switchdev only describes the eswitch configuration.
>> As a non-SR-IOV user I really don't want to see the extra representors
>> hanging around my systems, confusing all daemons. IIRC mlx5 had some
>> limitations around the uplink representor. Maybe that's the disconnect.
>> But for a real, fully featured switchdev eswitches having the
>> PHY and PF representors on boot, always, will not make sense.
>
> As "a non-SR-IOV user", what extra representors you talk about? When you
> have pfs only, you don't have anything extra. Just 1 netdev per-pf, one
> devlink port per-pf. What's extra about it? When you don't have VFs/SFs.
> Everyhing is the same:
The netdev list looking similar is a bit misleading. What matters here is
not only how many netdevs show up, but what that netdev actually is.
In legacy mode, a PF only user can just use the PF netdev as a regular NIC
and use RoCE on it directly.
In switchdev mode, even if there are no VFs or SFs yet, the PF is moved into
the switchdev model and the visible netdev is the uplink representor. That is
not the same thing from a user point of view. The uplink representor is not a
RoCE capable endpoint. So a user who used to boot the machine and use RoCE on
the PF now has to create a VF or SF, use that as the RoCE endpoint, and also
set up the switchdev forwarding path with tc, bridge or OVS so traffic from
that function actually reaches the wire.
That is why I don't think this is only a card generation question. It changes
the deployment model. It may be the right default for BlueField/ECPF style
systems, where the host is expected to sit behind a switchdev control plane,
but it is not a safe default for every regular host NIC setup.
>
> c-220-136-220-218:~$ sudo devlink dev eswitch show pci/0000:08:00.0
> pci/0000:08:00.0: mode switchdev inline-mode none encap-mode basic
> c-220-136-220-218:~$ sudo devlink dev eswitch show pci/0000:08:00.1
> pci/0000:08:00.1: mode legacy inline-mode none encap-mode basic
> c-220-136-220-218:~$ devlink dev
> pci/0000:08:00.0: index 0
> nested_devlink:
> auxiliary/mlx5_core.eth.0
> devlink_index/1: index 1
> nested_devlink:
> pci/0000:08:00.0
> pci/0000:08:00.1
> auxiliary/mlx5_core.eth.0: index 2
> pci/0000:08:00.1: index 3
> nested_devlink:
> auxiliary/mlx5_core.eth.1
> auxiliary/mlx5_core.eth.1: index 4
> c-220-136-220-218:~$ devlink port
> auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
> auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false
> c-220-136-220-218:~$ ip link
> ...
> 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
> link/ether b8:e9:24:f2:b7:6c brd ff:ff:ff:ff:ff:ff
> altname enp8s0f0np0
> 5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
> link/ether b8:e9:24:f2:b7:6d brd ff:ff:ff:ff:ff:ff
> altname enp8s0f1np1
>
>
>>
>> IOW it's not a question of the generation of the card but of
>> the deployment type / use case.
>
> I don't think so, not in the case of mlx5. The difference is only when
> you work with sr-iov, you either use legacy way (ip vf) or the new one.
> Same usecase.
>
>
>>
>>>> For the ASIC/NV bit: maybe technically possible, but it feels like the wrong
>>>> layer. This is boot/deployment policy, not a persistent hardware property, and
>>>> storing it in NV memory would make the state persist across kernels/hosts in a
>>>> surprising way.
>>>
>>> Well, as any other nv config, it persists across kernels/hosts. Think
>>> about it as "unbreak-my-not-legacy-device" bit.
>>
>> For most devices the switchdev mode does not change anything
>> substantial about the device. It's purely a kernel / driver config.
>> It changes what objects and default rules kernel / driver installs.
>> So I don't get why it would make sense to flash into the device
>> nvmem a Linux SW stack specific config.
>
> I look at it from the perspective that from some CX generation,
> switchdev mode should be default. So that is a device-based decision.
> I believe as such it can optionally be permanenty configured (nv config)
> on older device. Why not?
This is a deployment policy decision, not a permanent property of the card.
The same adapter can be used in a regular host/RDMA setup or in a
switchdev/offload setup. If we store this in NVM, that Linux switchdev policy
follows the device across hosts, kernels and use cases, and can surprise the
next deployment that just expects a normal NIC.
I'll send another RFC v2 with support limited to:
devlink=[...]:esw:mode:{ switchdev | switchdev_inactive | legacy }
and let's see where we land with that.
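As an illustration of how small the reduced syntax is, below is a minimal userspace sketch that maps the operation part of such a default (everything after the handle selector) to an eswitch mode. It is not the RFC v2 code: the enum, its value names and parse_esw_mode_default() are hypothetical, and rejecting anything other than the three listed modes is an assumption.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names, not the kernel's devlink uapi enum. */
enum esw_mode_default {
	ESW_MODE_INVALID = -1,
	ESW_MODE_LEGACY,
	ESW_MODE_SWITCHDEV,
	ESW_MODE_SWITCHDEV_INACTIVE,
};

/*
 * Parse the operation part of a default, e.g. "esw:mode:switchdev".
 * Anything other than "esw:mode:" followed by one of the three known
 * modes is rejected (assumption: no other operations are supported).
 */
static enum esw_mode_default parse_esw_mode_default(const char *op)
{
	static const char prefix[] = "esw:mode:";
	const char *mode;

	assert(op);

	if (strncmp(op, prefix, sizeof(prefix) - 1) != 0)
		return ESW_MODE_INVALID;

	mode = op + sizeof(prefix) - 1;
	if (strcmp(mode, "legacy") == 0)
		return ESW_MODE_LEGACY;
	if (strcmp(mode, "switchdev") == 0)
		return ESW_MODE_SWITCHDEV;
	if (strcmp(mode, "switchdev_inactive") == 0)
		return ESW_MODE_SWITCHDEV_INACTIVE;

	return ESW_MODE_INVALID;
}
```

Since there is a single operation and a fixed set of values, there is no ordering between operations to track, which is the main simplification compared to the first RFC.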
I still think a small kernel command line knob is the cleanest way to get to
"switchdev by default" without making the interface too broad. For more
complex boot-time configuration, I agree that a devlinkd or similar userspace
path is probably the better direction.
The "pause probing until userspace configures devlink" idea feels less clear
to me. It is not quite the simple boot policy knob, and not quite the full
userspace policy manager either. It would add a new probe state and require
early userspace orchestration before the device is fully materialized. At
least for now, I would prefer either the small cmdline option for the simple
global/default case, or a proper devlinkd-like solution for more complex
policy. Between those, I still prefer the cmdline option for this specific
early eswitch mode default.
Mark
>
> [...]
Sun, May 10, 2026 at 02:31:35PM +0200, mbloch@nvidia.com wrote:
>
>
>On 09/05/2026 10:01, Jiri Pirko wrote:
>> Sat, May 09, 2026 at 02:52:13AM +0200, kuba@kernel.org wrote:
>>> On Fri, 8 May 2026 20:07:44 +0200 Jiri Pirko wrote:
>>>>> I don't think switchdev by default should mean CX4+ in general. If we get
>>>>> there, I would expect it to be limited to the DPU/BlueField/ECPF case, where
>>>>> the host PF probe path can depend on the ECPF reaching switchdev. Changing the
>>>>> default for regular host NIC deployments feels like a much larger compatibility
>>>>> change.
>>>>
>>>> We can't travel throught time, but if from CX5 onwards the default would
>>>> be switchdev, nobody would feel broken in terms of compatibility. That
>>>> is my point. Having "legacy" as default is simply wrong for never NIC
>>>> generations. That is why it is called "legacy" and it should have been
>>>> rotten through and out since CX4 times.
>>>
>>> legacy vs switchdev only describes the eswitch configuration.
>>> As a non-SR-IOV user I really don't want to see the extra representors
>>> hanging around my systems, confusing all daemons. IIRC mlx5 had some
>>> limitations around the uplink representor. Maybe that's the disconnect.
>>> But for a real, fully featured switchdev eswitches having the
>>> PHY and PF representors on boot, always, will not make sense.
>>
>> As "a non-SR-IOV user", what extra representors you talk about? When you
>> have pfs only, you don't have anything extra. Just 1 netdev per-pf, one
>> devlink port per-pf. What's extra about it? When you don't have VFs/SFs.
>> Everyhing is the same:
>
>The netdev list looking similar is a bit misleading. What matters here is
>not only how many netdevs show up, but what that netdev actually is.
>
>In legacy mode, a PF only user can just use the PF netdev as a regular NIC
>and use ROCE on it directly.
I don't see why we have this limitation. Sounds more like a bug to me.
The netdev is still the same, capable of the same things no matter in
which mode you have it. RoCE should work on it in both modes.
>
>In switchdev mode, even if there are no VFs or SFs yet, the PF is moved into
>the switchdev model and the visible netdev is the uplink representor. That is
>not the same thing from a user point of view. The uplink representor is not a
>ROCE capable endpoint. So a user who used to boot the machine and use ROCE on
>the PF now has to create a VF or SF, use that as the roce endpoint, and also
>set up the switchdev forwarding path with tc, bridge or OVS so traffic from
>that function actually reaches the wire.
>
>That is why I don't think this is only a card generation question. It changes
>the deployment model. It may be the right default for BlueField/ECPF style
>systems, where the host is expected to sit behind a switchdev control plane,
>but it is not a safe default for every regular host NIC setup.
Yeah, the point is, not to change deployment model. The legacy/switchdev
should only change behaviour for sriov/eswitch usecase. The rest
(PF/uplink netdev and related objects) should stay the same.
>
>>
>> c-220-136-220-218:~$ sudo devlink dev eswitch show pci/0000:08:00.0
>> pci/0000:08:00.0: mode switchdev inline-mode none encap-mode basic
>> c-220-136-220-218:~$ sudo devlink dev eswitch show pci/0000:08:00.1
>> pci/0000:08:00.1: mode legacy inline-mode none encap-mode basic
>> c-220-136-220-218:~$ devlink dev
>> pci/0000:08:00.0: index 0
>> nested_devlink:
>> auxiliary/mlx5_core.eth.0
>> devlink_index/1: index 1
>> nested_devlink:
>> pci/0000:08:00.0
>> pci/0000:08:00.1
>> auxiliary/mlx5_core.eth.0: index 2
>> pci/0000:08:00.1: index 3
>> nested_devlink:
>> auxiliary/mlx5_core.eth.1
>> auxiliary/mlx5_core.eth.1: index 4
>> c-220-136-220-218:~$ devlink port
>> auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
>> auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false
>> c-220-136-220-218:~$ ip link
>> ...
>> 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>> link/ether b8:e9:24:f2:b7:6c brd ff:ff:ff:ff:ff:ff
>> altname enp8s0f0np0
>> 5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>> link/ether b8:e9:24:f2:b7:6d brd ff:ff:ff:ff:ff:ff
>> altname enp8s0f1np1
>>
>>
>>>
>>> IOW it's not a question of the generation of the card but of
>>> the deployment type / use case.
>>
>> I don't think so, not in the case of mlx5. The difference is only when
>> you work with sr-iov, you either use legacy way (ip vf) or the new one.
>> Same usecase.
>>
>>
>>>
>>>>> For the ASIC/NV bit: maybe technically possible, but it feels like the wrong
>>>>> layer. This is boot/deployment policy, not a persistent hardware property, and
>>>>> storing it in NV memory would make the state persist across kernels/hosts in a
>>>>> surprising way.
>>>>
>>>> Well, as any other nv config, it persists across kernels/hosts. Think
>>>> about it as "unbreak-my-not-legacy-device" bit.
>>>
>>> For most devices the switchdev mode does not change anything
>>> substantial about the device. It's purely a kernel / driver config.
>>> It changes what objects and default rules kernel / driver installs.
>>> So I don't get why it would make sense to flash into the device
>>> nvmem a Linux SW stack specific config.
>>
>> I look at it from the perspective that from some CX generation,
>> switchdev mode should be default. So that is a device-based decision.
>> I believe as such it can optionally be permanenty configured (nv config)
>> on older device. Why not?
>
>This is a deployment policy decision, not a permanent property of the card.
>The same adapter can be used in a regular host/RDMA setup or in a
>switchdev/offload setup. If we store this in NVM, that Linux switchdev policy
>follows the device across hosts, kernels and use cases, and can surprise the
>next deployment that just expects a normal NIC.
Yeah, from my perspective, there should be no surprise/behaviour_change
for non-sriov/eswitch user. Then switchdev can be default and everyone
is happy. Why complicate things?
>
>I'll send another RFC v2 with support limited to:
>devlink=[...]:esw:mode:{ switchdev | switchdev_inactive | legacy }
>and let's see where we land with that.
>
>I still think a small kernel command line knob is the cleanest way to get to
>"switchdev by default" without making the interface too broad. For more
>complex boot-time configuration, I agree that a devlinkd or similar userspace
>path is probably the better direction.
>
>The "pause probing until userspace configures devlink" idea feels less clear
>to me. It is not quite the simple boot policy knob, and not quite the full
>userspace policy manager either. It would add a new probe state and require
>early userspace orchestration before the device is fully materialized. At
>least for now, I would prefer either the small cmdline option for the simple
>global/default case, or a proper devlinkd-like solution for more complex
>policy. Between those, I still prefer the cmdline option for this specific
>early eswitch mode default.
>
>Mark
>
>>
>> [...]
>