.../net/ethernet/mellanox/mlx5/core/eswitch.c | 25 ++- .../net/ethernet/mellanox/mlx5/core/eswitch.h | 15 +- .../mellanox/mlx5/core/eswitch_offloads.c | 204 ++++++++++++++---- .../net/ethernet/mellanox/mlx5/core/lag/lag.c | 46 ++-- .../net/ethernet/mellanox/mlx5/core/lag/lag.h | 1 + .../ethernet/mellanox/mlx5/core/lag/mpesw.c | 12 +- .../net/ethernet/mellanox/mlx5/core/main.c | 26 ++- .../ethernet/mellanox/mlx5/core/sf/devlink.c | 5 + include/linux/mlx5/driver.h | 1 + include/linux/mlx5/eswitch.h | 5 + 10 files changed, 267 insertions(+), 73 deletions(-)
Hi, See detailed description by Mark below [1]. Regards, Tariq [1] This series addresses three problems that have been present for years. First, there is no coordination between E-Switch reconfiguration and representor registration. The E-Switch can be mid-way through a mode change or VF count update while mlx5_ib walks in and registers or unregisters representors. Nothing stops them. The race window is small and there is no field report, but it is clearly wrong. A mutex is not the answer. The representor callbacks reach into RDMA, netdev, and LAG layers that already hold their own locks, making a new mutex in the E-Switch layer a deadlock waiting to happen. Second, the E-Switch work queue has a deadlock of its own. mlx5_eswitch_cleanup() drains the work queue while holding the devlink lock. Workers on that queue acquire devlink lock before checking whether their work is still relevant. They block. The cleanup path waits for them to finish. Deadlock. Third, loading mlx5_ib while the device is already in switchdev mode does not bring up the IB representors. This has been broken for years. mlx5_eswitch_register_vport_reps() only stores callbacks; nobody triggers the actual load after registration. For the work queue deadlock: introduce a generation counter in the top-level mlx5_eswitch struct (moved from mlx5_esw_functions, which only covered function-change events) and a generic dispatch helper mlx5_esw_add_work(). The worker esw_wq_handler() checks the counter before touching the devlink lock using devl_trylock() in a loop. Stale work exits immediately without ever contending. The counter is incremented at every E-Switch operation boundary: cleanup, disable, mode-set, enable, disable_sriov. For the registration race: a simple atomic block state guards all reconfiguration paths. mlx5_esw_reps_block()/mlx5_esw_reps_unblock() spin a cmpxchg between UNBLOCKED and BLOCKED. Every reconfiguration path (mode set, enable, disable, VF/SF add/del, LAG reload, and the register/unregister calls themselves) brackets its work with this guard. No new locks, no deadlock risk. For the missing IB representors: now that the work queue infrastructure is in place, mlx5_eswitch_register_vport_reps() queues a work item that acquires the devlink lock and loads all relevant representors. This is the change that actually fixes the long-standing bug. One thing worth calling out: the block guard is non-reentrant. A caller that tries to transition UNBLOCKED->BLOCKED while the E-Switch is already BLOCKED will spin forever. All call sites were audited: - mlx5_eswitch_enable/disable/disable_sriov hold BLOCKED only around low-level vport helpers that do not call register/unregister. - Inside mlx5_eswitch_unregister_vport_reps the unload callbacks run while BLOCKED is held. The one callback that calls unregister (mlx5_ib_vport_rep_unload in LAG shared-FDB mode) only does so on peer E-Switch instances, each with its own independent atomic. - mlx5_devlink_eswitch_mode_set acquires BLOCKED, then calls esw_offloads_start/stop -> esw_mode_change. esw_mode_change releases BLOCKED before calling rescan_drivers so that the probe/remove callbacks that trigger register/unregister see UNBLOCKED. esw_mode_change re-acquires before returning, and mode_set releases at the end. This is an explicit hand-off of the guard across the rescan window. - mlx5_eswitch_register_vport_reps holds BLOCKED only while storing callbacks and queuing the load work. The actual rep loading runs from the work queue after the guard is released. Patch 1 is cleanup. LAG and MPESW had the same representor reload sequence duplicated in several places and the copies had started to drift. This consolidates them into one helper. Patches 2-4 fix the work queue deadlock in three steps: first move the generation counter from mlx5_esw_functions to mlx5_eswitch; then introduce the generic esw_wq_handler/mlx5_esw_add_work dispatch infrastructure; then apply the actual fix by switching to devl_trylock and adding generation increments at all operation boundaries. Patch 5 adds the atomic block guard for representor registration, protecting all reconfiguration paths. Patch 6 moves the representor load triggered by mlx5_eswitch_register_vport_reps() onto the work queue. This is the patch that fixes IB representors not coming up when mlx5_ib is loaded while the device is already in switchdev mode. Patch 7 adds a driver profile that auto-enables switchdev at device init, for deployments that always operate in switchdev mode and want to avoid a manual devlink command after every probe. Mark Bloch (7): net/mlx5: Lag: refactor representor reload handling net/mlx5: E-Switch, move work queue generation counter net/mlx5: E-Switch, introduce generic work queue dispatch helper net/mlx5: E-Switch, fix deadlock between devlink lock and esw->wq net/mlx5: E-Switch, block representors during reconfiguration net/mlx5: E-switch, load reps via work queue after registration net/mlx5: Add profile to auto-enable switchdev mode at device init .../net/ethernet/mellanox/mlx5/core/eswitch.c | 25 ++- .../net/ethernet/mellanox/mlx5/core/eswitch.h | 15 +- .../mellanox/mlx5/core/eswitch_offloads.c | 204 ++++++++++++++---- .../net/ethernet/mellanox/mlx5/core/lag/lag.c | 46 ++-- .../net/ethernet/mellanox/mlx5/core/lag/lag.h | 1 + .../ethernet/mellanox/mlx5/core/lag/mpesw.c | 12 +- .../net/ethernet/mellanox/mlx5/core/main.c | 26 ++- .../ethernet/mellanox/mlx5/core/sf/devlink.c | 5 + include/linux/mlx5/driver.h | 1 + include/linux/mlx5/eswitch.h | 5 + 10 files changed, 267 insertions(+), 73 deletions(-) base-commit: 9700282a7ec721e285771d995ccfe33845e776dc -- 2.44.0
On 09/04/2026 14:55, Tariq Toukan wrote: > Hi, > > See detailed description by Mark below [1]. > Sashiko flagged a few issues in the series. The main ones are in patch 4 ([PATCH net-next 4/7] net/mlx5: E-Switch, fix deadlock between devlink lock and esw->wq). These are known, and I currently consider them acceptable trade-offs rather than bugs. That said, reviewers/maintainers may reasonably see this differently. Before sending a v2 focused on patch 4, I’d appreciate feedback on the overall approach and direction of the series, please see sashiko's comments here: https://sashiko.dev/#/patchset/20260409115550.156419-1-tariqt%40nvidia.com I've provided my commetns on each patch on the mailing list and included what was found by sashiko. the patch series can be broken into 3: [patches 2/3/4] – workqueue deadlock: During teardown we must ensure all pending work is completed. However, since the teardown path already holds the devlink lock, we cannot have work items blocking on that same lock without risking a deadlock. [patches 5/6] – reps block/unblock state The interaction between mlx5_core and mlx5_ib (load/unload), eswitch mode transitions, and auxiliary device handling makes this particularly tricky. Several conventional locking/synchronization approaches were explored, but they either reintroduced deadlocks or created even more complex issues. The current approach is admittedly not the cleanest, but it has proven to behave correctly in practice. [patch 7] – switchdev by default The long-term goal is to have switchdev as the default mode for DPU environments. Flipping that behavior globally in one step is risky. This patch provides a controlled, incremental way to move in that direction, allowing validation in real deployments before making it the default. Mark > Regards, > Tariq > > [1] > This series addresses three problems that have been present for years. > First, there is no coordination between E-Switch reconfiguration and > representor registration. The E-Switch can be mid-way through a mode > change or VF count update while mlx5_ib walks in and registers or > unregisters representors. Nothing stops them. The race window is small > and there is no field report, but it is clearly wrong. > > A mutex is not the answer. The representor callbacks reach into RDMA, > netdev, and LAG layers that already hold their own locks, making a > new mutex in the E-Switch layer a deadlock waiting to happen. > > Second, the E-Switch work queue has a deadlock of its own. > mlx5_eswitch_cleanup() drains the work queue while holding the devlink > lock. Workers on that queue acquire devlink lock before checking whether > their work is still relevant. They block. The cleanup path waits for > them to finish. Deadlock. > > Third, loading mlx5_ib while the device is already in switchdev mode > does not bring up the IB representors. This has been broken for years. > mlx5_eswitch_register_vport_reps() only stores callbacks; nobody > triggers the actual load after registration. > > For the work queue deadlock: introduce a generation counter in the > top-level mlx5_eswitch struct (moved from mlx5_esw_functions, > which only covered function-change events) and a generic dispatch helper > mlx5_esw_add_work(). The worker esw_wq_handler() checks the counter > before touching the devlink lock using devl_trylock() in a loop. Stale > work exits immediately without ever contending. The counter is > incremented at every E-Switch operation boundary: cleanup, disable, > mode-set, enable, disable_sriov. > > For the registration race: a simple atomic block state guards all > reconfiguration paths. mlx5_esw_reps_block()/mlx5_esw_reps_unblock() > spin a cmpxchg between UNBLOCKED and BLOCKED. Every reconfiguration > path (mode set, enable, disable, VF/SF add/del, LAG reload, and the > register/unregister calls themselves) brackets its work with this guard. > No new locks, no deadlock risk. > > For the missing IB representors: now that the work queue infrastructure > is in place, mlx5_eswitch_register_vport_reps() queues a work item that > acquires the devlink lock and loads all relevant representors. This is > the change that actually fixes the long-standing bug. > > One thing worth calling out: the block guard is non-reentrant. A caller > that tries to transition UNBLOCKED->BLOCKED while the E-Switch is already > BLOCKED will spin forever. All call sites were audited: > > - mlx5_eswitch_enable/disable/disable_sriov hold BLOCKED only around > low-level vport helpers that do not call register/unregister. > > - Inside mlx5_eswitch_unregister_vport_reps the unload callbacks run > while BLOCKED is held. The one callback that calls unregister > (mlx5_ib_vport_rep_unload in LAG shared-FDB mode) only does so on > peer E-Switch instances, each with its own independent atomic. > > - mlx5_devlink_eswitch_mode_set acquires BLOCKED, then calls > esw_offloads_start/stop -> esw_mode_change. esw_mode_change releases > BLOCKED before calling rescan_drivers so that the probe/remove > callbacks that trigger register/unregister see UNBLOCKED. > esw_mode_change re-acquires before returning, and mode_set releases > at the end. This is an explicit hand-off of the guard across the > rescan window. > > - mlx5_eswitch_register_vport_reps holds BLOCKED only while storing > callbacks and queuing the load work. The actual rep loading runs from > the work queue after the guard is released. > > Patch 1 is cleanup. LAG and MPESW had the same representor reload > sequence duplicated in several places and the copies had started to > drift. This consolidates them into one helper. > > Patches 2-4 fix the work queue deadlock in three steps: first move the > generation counter from mlx5_esw_functions to mlx5_eswitch; > then introduce the generic esw_wq_handler/mlx5_esw_add_work dispatch > infrastructure; then apply the actual fix by switching to devl_trylock > and adding generation increments at all operation boundaries. > > Patch 5 adds the atomic block guard for representor registration, > protecting all reconfiguration paths. > > Patch 6 moves the representor load triggered by > mlx5_eswitch_register_vport_reps() onto the work queue. This is the > patch that fixes IB representors not coming up when mlx5_ib is loaded > while the device is already in switchdev mode. > > Patch 7 adds a driver profile that auto-enables switchdev at device > init, for deployments that always operate in switchdev mode and want > to avoid a manual devlink command after every probe. > > Mark Bloch (7): > net/mlx5: Lag: refactor representor reload handling > net/mlx5: E-Switch, move work queue generation counter > net/mlx5: E-Switch, introduce generic work queue dispatch helper > net/mlx5: E-Switch, fix deadlock between devlink lock and esw->wq > net/mlx5: E-Switch, block representors during reconfiguration > net/mlx5: E-switch, load reps via work queue after registration > net/mlx5: Add profile to auto-enable switchdev mode at device init > > .../net/ethernet/mellanox/mlx5/core/eswitch.c | 25 ++- > .../net/ethernet/mellanox/mlx5/core/eswitch.h | 15 +- > .../mellanox/mlx5/core/eswitch_offloads.c | 204 ++++++++++++++---- > .../net/ethernet/mellanox/mlx5/core/lag/lag.c | 46 ++-- > .../net/ethernet/mellanox/mlx5/core/lag/lag.h | 1 + > .../ethernet/mellanox/mlx5/core/lag/mpesw.c | 12 +- > .../net/ethernet/mellanox/mlx5/core/main.c | 26 ++- > .../ethernet/mellanox/mlx5/core/sf/devlink.c | 5 + > include/linux/mlx5/driver.h | 1 + > include/linux/mlx5/eswitch.h | 5 + > 10 files changed, 267 insertions(+), 73 deletions(-) > > > base-commit: 9700282a7ec721e285771d995ccfe33845e776dc
© 2016 - 2026 Red Hat, Inc.