[PATCH net V2 2/4] net/mlx5: Fix deadlock between devlink lock and esw->wq

Tariq Toukan posted 4 patches 1 week, 5 days ago
[PATCH net V2 2/4] net/mlx5: Fix deadlock between devlink lock and esw->wq
Posted by Tariq Toukan 1 week, 5 days ago
From: Cosmin Ratiu <cratiu@nvidia.com>

esw_functions_changed_event_handler -> esw_vfs_changed_event_handler is
called from the esw->work_queue and acquires the devlink lock.

Changing the esw mode is done via .eswitch_mode_set (acquires devlink
lock in the devlink_nl_pre_doit call) -> mlx5_devlink_eswitch_mode_set
-> mlx5_eswitch_disable_locked -> mlx5_eswitch_event_handler_unregister
-> flush_workqueue.

This creates a circular lock dependency which could lead to a real
deadlock, as the code flushing the workqueue is holding the devlink
lock, and the work handler being flushed could try to acquire it.

Fix that by adding a new bool field 'notifier_enabled' next to the event
handler scheduling the work, keeping it true while the notifier is
active, and using it to repeatedly try to acquire the devlink lock from
the work handler while true, with a slight delay to avoid busy looping.

This avoids the deadlock because the event handler will be removed
first (turning 'notifier_enabled' false), and the work handler will
eventually give up in acquiring the lock because the work is no longer
necessary.

Fixes: f1bc646c9a06 ("net/mlx5: Use devl_ API in mlx5_esw_offloads_devlink_port_register")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c    |  6 +++++-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h    |  1 +
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c   | 12 +++++++++++-
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 4b7a1ce7f406..fddc3b33222d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1066,14 +1066,18 @@ static void mlx5_eswitch_event_handler_register(struct mlx5_eswitch *esw)
 	if (esw->mode == MLX5_ESWITCH_OFFLOADS && mlx5_eswitch_is_funcs_handler(esw->dev)) {
 		MLX5_NB_INIT(&esw->esw_funcs.nb, mlx5_esw_funcs_changed_handler,
 			     ESW_FUNCTIONS_CHANGED);
+		esw->esw_funcs.notifier_enabled = true;
 		mlx5_eq_notifier_register(esw->dev, &esw->esw_funcs.nb);
 	}
 }
 
 static void mlx5_eswitch_event_handler_unregister(struct mlx5_eswitch *esw)
 {
-	if (esw->mode == MLX5_ESWITCH_OFFLOADS && mlx5_eswitch_is_funcs_handler(esw->dev))
+	if (esw->mode == MLX5_ESWITCH_OFFLOADS &&
+	    mlx5_eswitch_is_funcs_handler(esw->dev)) {
+		esw->esw_funcs.notifier_enabled = false;
 		mlx5_eq_notifier_unregister(esw->dev, &esw->esw_funcs.nb);
+	}
 
 	flush_workqueue(esw->work_queue);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index e7fe43799b23..af35312a8ced 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -338,6 +338,7 @@ struct mlx5_host_work {
 
 struct mlx5_esw_functions {
 	struct mlx5_nb		nb;
+	bool                    notifier_enabled;
 	bool			host_funcs_disabled;
 	u16			num_vfs;
 	u16			num_ec_vfs;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index ea94a727633f..0199bea2cb31 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -3597,7 +3597,17 @@ esw_vfs_changed_event_handler(struct mlx5_eswitch *esw, const u32 *out)
 		return;
 
 	devlink = priv_to_devlink(esw->dev);
-	devl_lock(devlink);
+	/* Repeatedly try to grab the lock with a delay while this work is
+	 * still relevant.
+	 * This allows a concurrent mlx5_eswitch_event_handler_unregister
+	 * (holding the devlink lock) to flush the wq without deadlocking.
+	 */
+	while (!devl_trylock(devlink)) {
+		if (!esw->esw_funcs.notifier_enabled)
+			return;
+		schedule_timeout_interruptible(msecs_to_jiffies(10));
+	}
+
 	/* Number of VFs can only change from "0 to x" or "x to 0". */
 	if (esw->esw_funcs.num_vfs > 0) {
 		mlx5_eswitch_unload_vf_vports(esw, esw->esw_funcs.num_vfs);
-- 
2.40.1
Re: [PATCH net V2 2/4] net/mlx5: Fix deadlock between devlink lock and esw->wq
Posted by Jakub Kicinski 1 week, 3 days ago
On Tue, 27 Jan 2026 10:52:39 +0200 Tariq Toukan wrote:
> esw_functions_changed_event_handler -> esw_vfs_changed_event_handler is
> called from the esw->work_queue and acquires the devlink lock.
> 
> Changing the esw mode is done via .eswitch_mode_set (acquires devlink
> lock in the devlink_nl_pre_doit call) -> mlx5_devlink_eswitch_mode_set
> -> mlx5_eswitch_disable_locked -> mlx5_eswitch_event_handler_unregister
> -> flush_workqueue.  

This is quite an ugly hack, is there no way to avoid the flush and let 
the work discover that what it was supposed to do is no longer needed?

>  	devlink = priv_to_devlink(esw->dev);
> -	devl_lock(devlink);
> +	/* Repeatedly try to grab the lock with a delay while this work is
> +	 * still relevant.
> +	 * This allows a concurrent mlx5_eswitch_event_handler_unregister
> +	 * (holding the devlink lock) to flush the wq without deadlocking.
> +	 */
> +	while (!devl_trylock(devlink)) {
> +		if (!esw->esw_funcs.notifier_enabled)

Technically READ_ONCE/WRITE_ONCE is required on this.

> +			return;
> +		schedule_timeout_interruptible(msecs_to_jiffies(10));

Why _interruptible(), you're not handling the return value.
If somehow this thread gets a signal pending we'll turn this
loop into a busy poll which doesn't seem ideal?


Ima take this patch out of the series and apply the rest.
Re: [PATCH net V2 2/4] net/mlx5: Fix deadlock between devlink lock and esw->wq
Posted by Cosmin Ratiu 1 week, 3 days ago
On Wed, 2026-01-28 at 20:56 -0800, Jakub Kicinski wrote:
> On Tue, 27 Jan 2026 10:52:39 +0200 Tariq Toukan wrote:
> > esw_functions_changed_event_handler ->
> > esw_vfs_changed_event_handler is
> > called from the esw->work_queue and acquires the devlink lock.
> > 
> > Changing the esw mode is done via .eswitch_mode_set (acquires
> > devlink
> > lock in the devlink_nl_pre_doit call) ->
> > mlx5_devlink_eswitch_mode_set
> > -> mlx5_eswitch_disable_locked ->
> > mlx5_eswitch_event_handler_unregister
> > -> flush_workqueue.  
> 
> This is quite an ugly hack, is there no way to avoid the flush and
> let 
> the work discover that what it was supposed to do is no longer
> needed?

Not possible, unfortunately. I stared at it for quite a while. The wq
is flushed because the esw is being unconfigured, which removes data
structs the work handler uses. Flushing the work is required, otherwise
we'll run into worse issues.

> 
> >  	devlink = priv_to_devlink(esw->dev);
> > -	devl_lock(devlink);
> > +	/* Repeatedly try to grab the lock with a delay while this
> > work is
> > +	 * still relevant.
> > +	 * This allows a concurrent
> > mlx5_eswitch_event_handler_unregister
> > +	 * (holding the devlink lock) to flush the wq without
> > deadlocking.
> > +	 */
> > +	while (!devl_trylock(devlink)) {
> > +		if (!esw->esw_funcs.notifier_enabled)
> 
> Technically READ_ONCE/WRITE_ONCE is required on this.

Will fix.


> > +			return;
> > +		schedule_timeout_interruptible(msecs_to_jiffies(10
> > ));
> 
> Why _interruptible(), you're not handling the return value.
> If somehow this thread gets a signal pending we'll turn this
> loop into a busy poll which doesn't seem ideal?


Didn't pay attention to this possibility. Sorry, will fix.

Cosmin.
Re: [PATCH net V2 2/4] net/mlx5: Fix deadlock between devlink lock and esw->wq
Posted by Jakub Kicinski 1 week, 2 days ago
On Thu, 29 Jan 2026 10:33:40 +0000 Cosmin Ratiu wrote:
> > This is quite an ugly hack, is there no way to avoid the flush and
> > let 
> > the work discover that what it was supposed to do is no longer
> > needed?  
> 
> Not possible, unfortunately. I stared at it for quite a while. The wq
> is flushed because the esw is being unconfigured, which removes data
> structs the work handler uses. Flushing the work is required, otherwise
> we'll run into worse issues.

And having a refount on (I presume) struct mlx5_esw_functions
so that work can hold a ref is not an option?
Are you planning to revisit this in -next?
Re: [PATCH net V2 2/4] net/mlx5: Fix deadlock between devlink lock and esw->wq
Posted by Cosmin Ratiu 6 days, 6 hours ago
On Thu, 2026-01-29 at 15:40 -0800, Jakub Kicinski wrote:
> On Thu, 29 Jan 2026 10:33:40 +0000 Cosmin Ratiu wrote:
> > > This is quite an ugly hack, is there no way to avoid the flush
> > > and
> > > let 
> > > the work discover that what it was supposed to do is no longer
> > > needed?  
> > 
> > Not possible, unfortunately. I stared at it for quite a while. The
> > wq
> > is flushed because the esw is being unconfigured, which removes
> > data
> > structs the work handler uses. Flushing the work is required,
> > otherwise
> > we'll run into worse issues.
> 
> And having a refount on (I presume) struct mlx5_esw_functions
> so that work can hold a ref is not an option?
> Are you planning to revisit this in -next?

Currently, mlx5_eswitch_disable_locked (with the devlink lock held)
waits for esw_vfs_changed_event_handler to finish.
The event handler needs to acquire the same lock and load/unload all
VFs, which touches the entire esw.
I don't currently see how to use reference counting on the esw to avoid
waiting for the handler.

But we can have a deeper look as part of an internal task to improve
this. For now, please accept the V3 fix (about-to-be-sent) with the
current approach because we couldn't find a better way.

Cosmin.
Re: [PATCH net V2 2/4] net/mlx5: Fix deadlock between devlink lock and esw->wq
Posted by Jakub Kicinski 5 days, 20 hours ago
On Mon, 2 Feb 2026 14:48:28 +0000 Cosmin Ratiu wrote:
> > And having a refount on (I presume) struct mlx5_esw_functions
> > so that work can hold a ref is not an option?
> > Are you planning to revisit this in -next?  
> 
> Currently, mlx5_eswitch_disable_locked (with the devlink lock held)
> waits for esw_vfs_changed_event_handler to finish.
> The event handler needs to acquire the same lock and load/unload all
> VFs, which touches the entire esw.
> I don't currently see how to use reference counting on the esw to avoid
> waiting for the handler.

struct my_thing_with_work {
	work;
	refcount;
	dead;
};

work() {
	lock()
	if (my_thing->dead)
		goto out;
	/* .. add code here .. */
out:
	unlock()
	my_thing_put(my_thing)
}

some_op() {
	// assuming lock() held
	if (!work_queued(my_thing->work)) {
		refcount_inc(my_thing->refcount);
		queue_work(my_thing->work)
	}
}

shutdown_op() {
	// assuming lock() held
	if (cancel_work())
		my_thing_put(my_thing)

	my_thing->dead = true;
	my_thing_put(my_thing)
}