Hi,
This series enables Socket Direct single netdev to operate in switchdev
mode with shared FDB. See detailed feature description by Shay below.
Regards,
Tariq
This series enables Socket Direct single netdev to operate in switchdev
mode with shared FDB. SD single netdev combines multiple PCI functions
behind a single netdev interface. To support switchdev offloads, these
functions must participate in virtual LAG (shared FDB).
Design
Rather than introducing a separate LAG instance for SD, this series
integrates SD secondary devices into the existing LAG structure
(priv.lag) created at probe time. Each lag_func entry carries a
group_id field that identifies its SD group membership (0 means not
part of any SD group). An xarray mark (XA_MARK_PORT) distinguishes
physical port entries from SD secondaries, enabling a single unified
iterator that filters by group:
- MLX5_LAG_FILTER_PORTS: iterate port-level entries only (existing
behavior, used by bonding, FW LAG commands, v2p_map)
- MLX5_LAG_FILTER_ALL: iterate all devices including SD secondaries
(used by MPESW shared FDB across all devices)
- specific group_id: iterate only devices in that SD group (used by
per-group SD shared FDB operations)
Existing callers use mlx5_ldev_for_each() which maps to
MLX5_LAG_FILTER_PORTS, preserving current behavior for non-SD
configurations.
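The filter semantics above can be sketched in plain userspace C. This is only an illustration: the series itself uses an xarray with XA_MARK_PORT, while the names below (demo_lag_func, DEMO_FILTER_*, demo_for_each) and the flat array are hypothetical stand-ins.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical filter values; the real constants live in lag.h. */
enum {
	DEMO_FILTER_PORTS = -1,	/* port-level entries only */
	DEMO_FILTER_ALL   = -2,	/* ports plus SD secondaries */
};

struct demo_lag_func {
	bool is_port;	/* models the XA_MARK_PORT mark */
	int group_id;	/* 0 means "not part of any SD group" */
};

/* Return true if entry @f should be visited under @filter. */
static bool demo_filter_match(const struct demo_lag_func *f, int filter)
{
	if (filter == DEMO_FILTER_ALL)
		return true;
	if (filter == DEMO_FILTER_PORTS)
		return f->is_port;
	/* Any positive value is treated as a specific SD group_id. */
	return filter > 0 && f->group_id == filter;
}

/* Find the next matching index at or after @start, or @n if none. */
static size_t demo_next(const struct demo_lag_func *tbl, size_t n,
			size_t start, int filter)
{
	size_t i;

	for (i = start; i < n; i++)
		if (demo_filter_match(&tbl[i], filter))
			break;
	return i;
}

/* Single iterator; the filter argument selects which entries are seen. */
#define demo_for_each(i, tbl, n, filter)			\
	for ((i) = demo_next(tbl, n, 0, filter); (i) < (n);	\
	     (i) = demo_next(tbl, n, (i) + 1, filter))

/* Count the entries visited under @filter. */
static size_t demo_count(const struct demo_lag_func *tbl, size_t n, int filter)
{
	size_t i, cnt = 0;

	demo_for_each(i, tbl, n, filter)
		cnt++;
	return cnt;
}
```

With two ports that are each the primary of an SD group and one secondary per group, the ports filter visits two entries, the all filter visits four, and a specific group_id visits the primary and secondary of that group.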
Lifecycle and ownership
The SD LAG lifecycle is tied to the SD group, not to bonding events:
1. At PCI probe, mlx5_lag_add_mdev() creates the LAG structure
(priv.lag) for each LAG-capable PF (e.g., SD primary devices).
2. During mlx5_sd_init(), after the SD group is fully formed (primary
and secondaries paired), sd_lag_init() registers the secondary
devices into the primary's existing priv.lag by calling
mlx5_ldev_add_mdev() with the SD group_id. The primary's lag_func
also gets its group_id set. No separate LAG instance is created.
3. After all devices in the SD group transition to switchdev,
mlx5_lag_shared_fdb_create() is invoked with the group_id to create
a software-only shared FDB scoped to that SD group. This sets
sd_fdb_active on all lag_func entries in the group. No FW LAG
commands are issued since SD devices share the same physical port.
4. If MPESW (multi-port eswitch) is enabled on top of SD groups, the
per-group SD shared FDB is torn down first, then MPESW shared FDB is
created spanning all devices (ports + SD secondaries) using
MLX5_LAG_FILTER_ALL. On MPESW disable, per-group SD shared FDB is
restored.
5. On SD teardown (mlx5_sd_cleanup or device unbind), sd_lag_cleanup()
removes secondaries from priv.lag and clears the primary's group_id.
The LAG structure itself is not destroyed.
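Steps 3 and 4 can be read as a small state model. The sketch below is a userspace illustration under that reading; struct demo_lag and the demo_* helpers are hypothetical and do not correspond to functions in the series.

```c
#include <stdbool.h>

#define DEMO_MAX_GROUPS 4

/* Hypothetical model of the state described in steps 3 and 4. */
struct demo_lag {
	bool group_in_switchdev[DEMO_MAX_GROUPS]; /* all members in switchdev */
	bool sd_fdb_active[DEMO_MAX_GROUPS];      /* per-group SD shared FDB */
	bool mpesw_fdb_active;                    /* MPESW shared FDB, all devices */
};

/* Step 3: create the per-group shared FDB once the group is in switchdev. */
static bool demo_sd_fdb_create(struct demo_lag *lag, int g)
{
	if (g < 0 || g >= DEMO_MAX_GROUPS || !lag->group_in_switchdev[g])
		return false;
	lag->sd_fdb_active[g] = true;
	return true;
}

/* Step 4, enable: tear down per-group FDBs first, then span all devices. */
static void demo_mpesw_enable(struct demo_lag *lag)
{
	for (int g = 0; g < DEMO_MAX_GROUPS; g++)
		lag->sd_fdb_active[g] = false;
	lag->mpesw_fdb_active = true;
}

/* Step 4, disable: drop the MPESW FDB, then restore per-group FDBs. */
static void demo_mpesw_disable(struct demo_lag *lag)
{
	lag->mpesw_fdb_active = false;
	for (int g = 0; g < DEMO_MAX_GROUPS; g++)
		if (lag->group_in_switchdev[g])
			lag->sd_fdb_active[g] = true;
}
```

The ordering is the point of the sketch: the per-group state is cleared before the MPESW state is set, and restored only after the MPESW state is cleared.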
The sd_fdb_active flag is set on all lag_func entries in a group (not
just the primary), so any device can detect the SD shared FDB state
during lag_disable_change teardown without needing to look up peer
entries.
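A minimal sketch of that choice, assuming a flat table of entries (struct demo_func and both helpers are hypothetical): the flag is written to every member of the group, so the teardown-side check reads only the caller's own entry.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device entry; the real lag_func has more fields. */
struct demo_func {
	int group_id;		/* 0 means "not part of any SD group" */
	bool sd_fdb_active;
};

/* Set or clear the flag on every entry that belongs to @group_id. */
static void demo_set_sd_fdb_active(struct demo_func *tbl, size_t n,
				   int group_id, bool active)
{
	if (group_id == 0)
		return;	/* group 0 is "no group", nothing to mark */

	for (size_t i = 0; i < n; i++)
		if (tbl[i].group_id == group_id)
			tbl[i].sd_fdb_active = active;
}

/* Teardown-side check: only the caller's own entry is consulted. */
static bool demo_needs_sd_fdb_teardown(const struct demo_func *self)
{
	return self->sd_fdb_active;
}
```

Had the flag been kept on the primary alone, a secondary would have to locate its primary before answering the same question.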
SD shared FDB is a pure software construct -- unlike regular LAG modes
(ROCE, SRIOV, MPESW), it does not issue FW create_lag/destroy_lag
commands. The software vport LAG for SD is implemented via eswitch
egress ACL bounce rules, managed by the IB layer through
mlx5_eth_lag_init(). The software LAG demux is implemented via
steering rules that use a new destination, VHCA_RX.
Patches
Infrastructure (patches 1, 5-6):
- Factor out shared FDB code into a dedicated file
- Extend lag_func with group_id and sd_fdb_active fields;
add XA_MARK_PORT and unified iterator with group_id filter
- Extend shared FDB API with group_id parameter
E-Switch preparation (patches 2-3):
- Align eswitch disable sequence ordering
- Move devcom init from TC to eswitch layer
SD group management (patches 4, 7-9):
- Replace peer count check with direct peer lookup
- Register SD secondaries in the existing LAG at SD init time
- Block RoCE and VF LAG for SD devices
- Block multipath LAG for SD devices
Switchdev integration (patch 10):
- Keep netdev resources local in switchdev mode
Steering (patches 11-12):
- Track peer flow slots with bitmap for selective peer flow deletion
- Enable TC flow steering for SD LAG
Enablement (patch 13):
- Verify unique vhca_id count for cross-VHCA RQT
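For the steering item above (patch 11), tracking peer flow slots with a bitmap might look like the userspace sketch below. It assumes one bit per peer slot in a 32-bit word; kernel code would more likely use DECLARE_BITMAP() with set_bit()/clear_bit(), and every name here is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_PEERS 32

/* Hypothetical flow object: bit i set means a peer flow exists in slot i. */
struct demo_flow {
	uint32_t peer_slots;
};

/* Record a peer flow in @slot; fails if out of range or already present. */
static bool demo_peer_flow_add(struct demo_flow *f, unsigned int slot)
{
	uint32_t bit;

	if (slot >= DEMO_MAX_PEERS)
		return false;
	bit = UINT32_C(1) << slot;
	if (f->peer_slots & bit)
		return false;
	f->peer_slots |= bit;
	return true;
}

/* Delete only the peer flow in @slot; other slots are left untouched. */
static bool demo_peer_flow_del(struct demo_flow *f, unsigned int slot)
{
	uint32_t bit;

	if (slot >= DEMO_MAX_PEERS)
		return false;
	bit = UINT32_C(1) << slot;
	if (!(f->peer_slots & bit))
		return false;
	f->peer_slots &= ~bit;
	return true;
}

/* True while at least one peer flow is still tracked. */
static bool demo_has_peer_flows(const struct demo_flow *f)
{
	return f->peer_slots != 0;
}
```

A per-slot bit lets the owner drop the flow for one departing peer without touching the others, which a plain counter cannot express.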
Shay Drory (13):
net/mlx5: LAG, factor out shared FDB code into dedicated file
net/mlx5: E-Switch, align disable sequence with switchdev-to-legacy
transition
net/mlx5: E-Switch, move devcom init from TC to eswitch layer
net/mlx5: LAG, replace peer count check with direct peer lookup
net/mlx5: LAG, prepare for SD device integration
net/mlx5: LAG, extend shared FDB API with group_id filter
net/mlx5: SD, introduce Socket Direct LAG
net/mlx5: LAG, block RoCE and VF LAG for SD devices
net/mlx5: LAG, block multipath LAG for SD devices
net/mlx5: SD, keep netdev resources on same PF in switchdev mode
net/mlx5e: TC, track peer flow slots with bitmap
net/mlx5e: TC, enable steering for SD LAG
net/mlx5e: Verify unique vhca_id count instead of range
.../net/ethernet/mellanox/mlx5/core/Makefile | 2 +-
.../net/ethernet/mellanox/mlx5/core/en/rqt.c | 27 +-
.../ethernet/mellanox/mlx5/core/en/tc_priv.h | 7 +
.../net/ethernet/mellanox/mlx5/core/en_tc.c | 83 ++--
.../net/ethernet/mellanox/mlx5/core/eswitch.h | 11 +-
.../mellanox/mlx5/core/eswitch_offloads.c | 26 ++
.../net/ethernet/mellanox/mlx5/core/lag/lag.c | 429 ++++++++++--------
.../net/ethernet/mellanox/mlx5/core/lag/lag.h | 100 +++-
.../net/ethernet/mellanox/mlx5/core/lag/mp.c | 4 +
.../ethernet/mellanox/mlx5/core/lag/mpesw.c | 28 +-
.../mellanox/mlx5/core/lag/shared_fdb.c | 233 ++++++++++
.../net/ethernet/mellanox/mlx5/core/lib/sd.c | 227 +++++++--
.../net/ethernet/mellanox/mlx5/core/lib/sd.h | 23 +
.../net/ethernet/mellanox/mlx5/core/main.c | 3 +-
14 files changed, 914 insertions(+), 289 deletions(-)
create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c
base-commit: aa064a614efcfa4c300609d1f01134e99a12ad10
--
2.44.0
On 5/27/2026 5:54 AM, Tariq Toukan wrote:
> Lifecycle and ownership
>
> The SD LAG lifecycle is tied to the SD group, not to bonding events:
[...]
> mlx5_eth_lag_init(). And the software LAG demux is implemented via
> steering rules that utilize new destination, VHCA_RX.
I appreciate the overall details on the lifecycle and ownership. That
made it easier to follow the patches and understand the changes.
> Enablement (patch 13):
> - Verify unique vhca_id count for cross-VHCA RQT
The patch 13 being the "enablement" is a bit confusing to me since I had
trouble understanding how the patch description is "enabling" the socket
direct stuff. But the description does say "part 1/2" so I am guessing
that's addressed in part 2?
On 28/05/2026 1:08, Jacob Keller wrote:
> On 5/27/2026 5:54 AM, Tariq Toukan wrote:
[...]
>> Enablement (patch 13):
>> - Verify unique vhca_id count for cross-VHCA RQT
>
> The patch 13 being the "enablement" is a bit confusing to me since I had
> trouble understanding how the patch description is "enabling" the socket
> direct stuff. But the description does say "part 1/2" so I am guessing
> that's addressed in part 2?
Thanks for the review.
The word "enablement" here in the cover letter is a bit confusing... :(
This commit prepares the RQT layer for SD-over-DPU, which will also be
enabled by the series.
In an SD-over-DPU configuration, a device's vhca_id ends up failing the
old range-based check.
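One way to picture the difference between a range-based check and a unique-count check is the userspace sketch below. The thread does not spell out the actual conditions used in en/rqt.c, so the limit semantics and all names here are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Range-style check (assumed form): every id must be below @limit. */
static bool demo_ids_in_range(const uint16_t *ids, size_t n, uint16_t limit)
{
	for (size_t i = 0; i < n; i++)
		if (ids[i] >= limit)
			return false;
	return true;
}

/* Count distinct values in @ids; O(n^2), fine for a handful of devices. */
static size_t demo_unique_count(const uint16_t *ids, size_t n)
{
	size_t unique = 0;

	for (size_t i = 0; i < n; i++) {
		bool seen = false;

		for (size_t j = 0; j < i; j++) {
			if (ids[j] == ids[i]) {
				seen = true;
				break;
			}
		}
		if (!seen)
			unique++;
	}
	return unique;
}

/* Count-style check (assumed form): at most @max_unique distinct ids. */
static bool demo_unique_count_ok(const uint16_t *ids, size_t n,
				 size_t max_unique)
{
	return demo_unique_count(ids, n) <= max_unique;
}
```

With ids {3, 260, 3, 260}, a range check with limit 4 rejects the set because 260 is out of range, while a unique-count check with a maximum of 2 accepts it. That mirrors the situation described above, where a device's vhca_id falls outside the old range even though the number of distinct ids is small.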
On 5/28/2026 2:18 AM, Shay Drori wrote:
> On 28/05/2026 1:08, Jacob Keller wrote:
>> The patch 13 being the "enablement" is a bit confusing to me since I had
>> trouble understanding how the patch description is "enabling" the socket
>> direct stuff. But the description does say "part 1/2" so I am guessing
>> that's addressed in part 2?
>
> Thanks for the review.
>
> The word "enablement" here in the cover letter is a bit confusing... :(
> This commit prepares the RQT layer for SD-over-DPU, which will also be
> enabled by the series.
> In an SD-over-DPU configuration, a device's vhca_id ends up failing the
> old range-based check.
That makes sense, and clarifies the misunderstanding for me. Thanks!
On Wed, 27 May 2026 15:54:14 +0300 Tariq Toukan wrote:
> This series enables Socket Direct single netdev to operate in switchdev
> mode with shared FDB. See detailed feature description by Shay below.
kdoc warning in here:
Warning: drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c:140 No description found for return value of 'mlx5_lag_shared_fdb_create'
--
pw-bot: cr
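For reference, kernel-doc expects a "Return:" section in the comment of a function that returns a value. The sketch below shows the shape on a hypothetical stub; the name, parameters, and return semantics are placeholders and are not taken from shared_fdb.c.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical state, only here so the stub compiles on its own. */
struct demo_ldev {
	int shared_fdb;
};

/**
 * demo_shared_fdb_create - create a shared FDB for one group (illustrative)
 * @ldev: LAG device state
 * @group_id: group to operate on; must be positive in this sketch
 *
 * Illustrative stub only; it does not mirror the real function body.
 *
 * Return: 0 on success, or a negative errno value on failure.
 */
static int demo_shared_fdb_create(struct demo_ldev *ldev, int group_id)
{
	if (!ldev || group_id <= 0)
		return -EINVAL;
	ldev->shared_fdb = 1;
	return 0;
}
```

The "Return:" line is what the warning reports as missing; the wording after it should describe whatever the real function actually returns.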