net/mlx5: Improve representor lifecycle and fix work queue deadlock

[PATCH net-next 0/7] net/mlx5: Improve representor lifecycle and fix work queue deadlock

Posted by Tariq Toukan 2 months ago

Hi,

See detailed description by Mark below [1].

Regards,
Tariq

[1]
This series addresses three problems that have been present for years.
First, there is no coordination between E-Switch reconfiguration and
representor registration. The E-Switch can be mid-way through a mode
change or VF count update while mlx5_ib walks in and registers or
unregisters representors. Nothing stops them. The race window is small
and there is no field report, but it is clearly wrong.

A mutex is not the answer. The representor callbacks reach into RDMA,
netdev, and LAG layers that already hold their own locks, making a
new mutex in the E-Switch layer a deadlock waiting to happen.

Second, the E-Switch work queue has a deadlock of its own.
mlx5_eswitch_cleanup() drains the work queue while holding the devlink
lock. Workers on that queue acquire devlink lock before checking whether
their work is still relevant. They block. The cleanup path waits for
them to finish. Deadlock.

Third, loading mlx5_ib while the device is already in switchdev mode
does not bring up the IB representors. This has been broken for years.
mlx5_eswitch_register_vport_reps() only stores callbacks; nobody
triggers the actual load after registration.

For the work queue deadlock: introduce a generation counter in the
top-level mlx5_eswitch struct (moved from mlx5_esw_functions,
which only covered function-change events) and a generic dispatch helper
mlx5_esw_add_work(). The worker esw_wq_handler() checks the counter
before touching the devlink lock using devl_trylock() in a loop. Stale
work exits immediately without ever contending. The counter is
incremented at every E-Switch operation boundary: cleanup, disable,
mode-set, enable, disable_sriov.

For the registration race: a simple atomic block state guards all
reconfiguration paths. mlx5_esw_reps_block()/mlx5_esw_reps_unblock()
spin a cmpxchg between UNBLOCKED and BLOCKED. Every reconfiguration
path (mode set, enable, disable, VF/SF add/del, LAG reload, and the
register/unregister calls themselves) brackets its work with this guard.
No new locks, no deadlock risk.

For the missing IB representors: now that the work queue infrastructure
is in place, mlx5_eswitch_register_vport_reps() queues a work item that
acquires the devlink lock and loads all relevant representors. This is
the change that actually fixes the long-standing bug.

One thing worth calling out: the block guard is non-reentrant. A caller
that tries to transition UNBLOCKED->BLOCKED while the E-Switch is already
BLOCKED will spin forever. All call sites were audited:

 - mlx5_eswitch_enable/disable/disable_sriov hold BLOCKED only around
   low-level vport helpers that do not call register/unregister.

 - Inside mlx5_eswitch_unregister_vport_reps the unload callbacks run
   while BLOCKED is held. The one callback that calls unregister
   (mlx5_ib_vport_rep_unload in LAG shared-FDB mode) only does so on
   peer E-Switch instances, each with its own independent atomic.

 - mlx5_devlink_eswitch_mode_set acquires BLOCKED, then calls
   esw_offloads_start/stop -> esw_mode_change. esw_mode_change releases
   BLOCKED before calling rescan_drivers so that the probe/remove
   callbacks that trigger register/unregister see UNBLOCKED.
   esw_mode_change re-acquires before returning, and mode_set releases
   at the end. This is an explicit hand-off of the guard across the
   rescan window.

 - mlx5_eswitch_register_vport_reps holds BLOCKED only while storing
   callbacks and queuing the load work. The actual rep loading runs from
   the work queue after the guard is released.

Patch 1 is cleanup. LAG and MPESW had the same representor reload
sequence duplicated in several places and the copies had started to
drift. This consolidates them into one helper.

Patches 2-4 fix the work queue deadlock in three steps: first move the
generation counter from mlx5_esw_functions to mlx5_eswitch;
then introduce the generic esw_wq_handler/mlx5_esw_add_work dispatch
infrastructure; then apply the actual fix by switching to devl_trylock
and adding generation increments at all operation boundaries.

Patch 5 adds the atomic block guard for representor registration,
protecting all reconfiguration paths.

Patch 6 moves the representor load triggered by
mlx5_eswitch_register_vport_reps() onto the work queue. This is the
patch that fixes IB representors not coming up when mlx5_ib is loaded
while the device is already in switchdev mode.

Patch 7 adds a driver profile that auto-enables switchdev at device
init, for deployments that always operate in switchdev mode and want
to avoid a manual devlink command after every probe.

Mark Bloch (7):
  net/mlx5: Lag: refactor representor reload handling
  net/mlx5: E-Switch, move work queue generation counter
  net/mlx5: E-Switch, introduce generic work queue dispatch helper
  net/mlx5: E-Switch, fix deadlock between devlink lock and esw->wq
  net/mlx5: E-Switch, block representors during reconfiguration
  net/mlx5: E-switch, load reps via work queue after registration
  net/mlx5: Add profile to auto-enable switchdev mode at device init

 .../net/ethernet/mellanox/mlx5/core/eswitch.c |  25 ++-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  15 +-
 .../mellanox/mlx5/core/eswitch_offloads.c     | 204 ++++++++++++++----
 .../net/ethernet/mellanox/mlx5/core/lag/lag.c |  46 ++--
 .../net/ethernet/mellanox/mlx5/core/lag/lag.h |   1 +
 .../ethernet/mellanox/mlx5/core/lag/mpesw.c   |  12 +-
 .../net/ethernet/mellanox/mlx5/core/main.c    |  26 ++-
 .../ethernet/mellanox/mlx5/core/sf/devlink.c  |   5 +
 include/linux/mlx5/driver.h                   |   1 +
 include/linux/mlx5/eswitch.h                  |   5 +
 10 files changed, 267 insertions(+), 73 deletions(-)


base-commit: 9700282a7ec721e285771d995ccfe33845e776dc
-- 
2.44.0

Re: [PATCH net-next 0/7] net/mlx5: Improve representor lifecycle and fix work queue deadlock

Posted by Mark Bloch 2 months ago


On 09/04/2026 14:55, Tariq Toukan wrote:
> Hi,
> 
> See detailed description by Mark below [1].
> 

Sashiko flagged a few issues in the series. The main ones are in patch 4
([PATCH net-next 4/7] net/mlx5: E-Switch, fix deadlock between devlink lock and esw->wq).
These are known, and I currently consider them acceptable trade-offs rather than bugs.

That said, reviewers/maintainers may reasonably see this differently.
Before sending a v2 focused on patch 4, I’d appreciate feedback on the overall
approach and direction of the series, please see sashiko's comments here:
https://sashiko.dev/#/patchset/20260409115550.156419-1-tariqt%40nvidia.com

I've provided my commetns on each patch on the mailing list and
included what was found by sashiko.

the patch series can be broken into 3:

[patches 2/3/4] – workqueue deadlock:

During teardown we must ensure all pending work is completed.
However, since the teardown path already holds the devlink lock,
we cannot have work items blocking on that same lock without
risking a deadlock.

[patches 5/6] – reps block/unblock state
The interaction between mlx5_core and mlx5_ib (load/unload),
eswitch mode transitions, and auxiliary device handling makes
this particularly tricky.

Several conventional locking/synchronization approaches were
explored, but they either reintroduced deadlocks or created
even more complex issues. The current approach is admittedly
not the cleanest, but it has proven to behave correctly in practice.

[patch 7] – switchdev by default
The long-term goal is to have switchdev as the default mode
for DPU environments. Flipping that behavior globally in one
step is risky.

This patch provides a controlled, incremental way to move in
that direction, allowing validation in real deployments before
making it the default.

Mark

> Regards,
> Tariq
> 
> [1]
> This series addresses three problems that have been present for years.
> First, there is no coordination between E-Switch reconfiguration and
> representor registration. The E-Switch can be mid-way through a mode
> change or VF count update while mlx5_ib walks in and registers or
> unregisters representors. Nothing stops them. The race window is small
> and there is no field report, but it is clearly wrong.
> 
> A mutex is not the answer. The representor callbacks reach into RDMA,
> netdev, and LAG layers that already hold their own locks, making a
> new mutex in the E-Switch layer a deadlock waiting to happen.
> 
> Second, the E-Switch work queue has a deadlock of its own.
> mlx5_eswitch_cleanup() drains the work queue while holding the devlink
> lock. Workers on that queue acquire devlink lock before checking whether
> their work is still relevant. They block. The cleanup path waits for
> them to finish. Deadlock.
> 
> Third, loading mlx5_ib while the device is already in switchdev mode
> does not bring up the IB representors. This has been broken for years.
> mlx5_eswitch_register_vport_reps() only stores callbacks; nobody
> triggers the actual load after registration.
> 
> For the work queue deadlock: introduce a generation counter in the
> top-level mlx5_eswitch struct (moved from mlx5_esw_functions,
> which only covered function-change events) and a generic dispatch helper
> mlx5_esw_add_work(). The worker esw_wq_handler() checks the counter
> before touching the devlink lock using devl_trylock() in a loop. Stale
> work exits immediately without ever contending. The counter is
> incremented at every E-Switch operation boundary: cleanup, disable,
> mode-set, enable, disable_sriov.
> 
> For the registration race: a simple atomic block state guards all
> reconfiguration paths. mlx5_esw_reps_block()/mlx5_esw_reps_unblock()
> spin a cmpxchg between UNBLOCKED and BLOCKED. Every reconfiguration
> path (mode set, enable, disable, VF/SF add/del, LAG reload, and the
> register/unregister calls themselves) brackets its work with this guard.
> No new locks, no deadlock risk.
> 
> For the missing IB representors: now that the work queue infrastructure
> is in place, mlx5_eswitch_register_vport_reps() queues a work item that
> acquires the devlink lock and loads all relevant representors. This is
> the change that actually fixes the long-standing bug.
> 
> One thing worth calling out: the block guard is non-reentrant. A caller
> that tries to transition UNBLOCKED->BLOCKED while the E-Switch is already
> BLOCKED will spin forever. All call sites were audited:
> 
>  - mlx5_eswitch_enable/disable/disable_sriov hold BLOCKED only around
>    low-level vport helpers that do not call register/unregister.
> 
>  - Inside mlx5_eswitch_unregister_vport_reps the unload callbacks run
>    while BLOCKED is held. The one callback that calls unregister
>    (mlx5_ib_vport_rep_unload in LAG shared-FDB mode) only does so on
>    peer E-Switch instances, each with its own independent atomic.
> 
>  - mlx5_devlink_eswitch_mode_set acquires BLOCKED, then calls
>    esw_offloads_start/stop -> esw_mode_change. esw_mode_change releases
>    BLOCKED before calling rescan_drivers so that the probe/remove
>    callbacks that trigger register/unregister see UNBLOCKED.
>    esw_mode_change re-acquires before returning, and mode_set releases
>    at the end. This is an explicit hand-off of the guard across the
>    rescan window.
> 
>  - mlx5_eswitch_register_vport_reps holds BLOCKED only while storing
>    callbacks and queuing the load work. The actual rep loading runs from
>    the work queue after the guard is released.
> 
> Patch 1 is cleanup. LAG and MPESW had the same representor reload
> sequence duplicated in several places and the copies had started to
> drift. This consolidates them into one helper.
> 
> Patches 2-4 fix the work queue deadlock in three steps: first move the
> generation counter from mlx5_esw_functions to mlx5_eswitch;
> then introduce the generic esw_wq_handler/mlx5_esw_add_work dispatch
> infrastructure; then apply the actual fix by switching to devl_trylock
> and adding generation increments at all operation boundaries.
> 
> Patch 5 adds the atomic block guard for representor registration,
> protecting all reconfiguration paths.
> 
> Patch 6 moves the representor load triggered by
> mlx5_eswitch_register_vport_reps() onto the work queue. This is the
> patch that fixes IB representors not coming up when mlx5_ib is loaded
> while the device is already in switchdev mode.
> 
> Patch 7 adds a driver profile that auto-enables switchdev at device
> init, for deployments that always operate in switchdev mode and want
> to avoid a manual devlink command after every probe.
> 
> Mark Bloch (7):
>   net/mlx5: Lag: refactor representor reload handling
>   net/mlx5: E-Switch, move work queue generation counter
>   net/mlx5: E-Switch, introduce generic work queue dispatch helper
>   net/mlx5: E-Switch, fix deadlock between devlink lock and esw->wq
>   net/mlx5: E-Switch, block representors during reconfiguration
>   net/mlx5: E-switch, load reps via work queue after registration
>   net/mlx5: Add profile to auto-enable switchdev mode at device init
> 
>  .../net/ethernet/mellanox/mlx5/core/eswitch.c |  25 ++-
>  .../net/ethernet/mellanox/mlx5/core/eswitch.h |  15 +-
>  .../mellanox/mlx5/core/eswitch_offloads.c     | 204 ++++++++++++++----
>  .../net/ethernet/mellanox/mlx5/core/lag/lag.c |  46 ++--
>  .../net/ethernet/mellanox/mlx5/core/lag/lag.h |   1 +
>  .../ethernet/mellanox/mlx5/core/lag/mpesw.c   |  12 +-
>  .../net/ethernet/mellanox/mlx5/core/main.c    |  26 ++-
>  .../ethernet/mellanox/mlx5/core/sf/devlink.c  |   5 +
>  include/linux/mlx5/driver.h                   |   1 +
>  include/linux/mlx5/eswitch.h                  |   5 +
>  10 files changed, 267 insertions(+), 73 deletions(-)
> 
> 
> base-commit: 9700282a7ec721e285771d995ccfe33845e776dc