[v1] slimbus: qcom-ngd-ctrl: Fix some race conditions and deadlocks

[PATCH 7/7] slimbus: qcom-ngd-ctrl: Avoid ABBA on tx_lock/ctrl->lock

Posted by Bjorn Andersson 1 month ago

During the SSR/PDR down notification the tx_lock is taken with the
intent to provide synchronization with active DMA transfers.

But during this period qcom_slim_ngd_down() is invoked, which ends up in
slim_report_absent(), which takes the slim_controller lock. In multiple
other codepaths these two locks are taken in the opposite order (i.e.
slim_controller then tx_lock).

The result is a lockdep splat, and a possible deadlock:

  rprocctl/449 is trying to acquire lock:
  ffff00009793e620 (&ctrl->lock){+.+.}-{4:4}, at: slim_report_absent (drivers/slimbus/core.c:322) slimbus

  but task is already holding lock:
  ffff00009793fb50 (&ctrl->tx_lock){+.+.}-{4:4}, at: qcom_slim_ngd_ssr_pdr_notify (drivers/slimbus/qcom-ngd-ctrl.c:1475) slim_qcom_ngd_ctrl

  which lock already depends on the new lock.

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&ctrl->tx_lock);
                                lock(&ctrl->lock);
                                lock(&ctrl->tx_lock);
   lock(&ctrl->lock);

The assumption is that the comment refers to the desire to not call
qcom_slim_ngd_exit_dma() while we have an ongoing DMA TX transaction.
But any such transaction is initiated and completed within a single
qcom_slim_ngd_xfer_msg().

Prior to calling qcom_slim_ngd_exit_dma() the slim_controller is torn
down, all child devices are notified that the slimbus is gone and the
child devices are removed.

Stop taking the tx_lock in qcom_slim_ngd_ssr_pdr_notify() to avoid the
deadlock.

Fixes: a899d324863a ("slimbus: qcom-ngd-ctrl: add Sub System Restart support")
Cc: stable@vger.kernel.org
Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com>
---
 drivers/slimbus/qcom-ngd-ctrl.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/slimbus/qcom-ngd-ctrl.c b/drivers/slimbus/qcom-ngd-ctrl.c
index 54a4c6ee1e71fe55794f09575979826d9aa5be9f..75d70de0909a8d17e2410d30f7811f32d5eebea3 100644
--- a/drivers/slimbus/qcom-ngd-ctrl.c
+++ b/drivers/slimbus/qcom-ngd-ctrl.c
@@ -1471,15 +1471,12 @@ static int qcom_slim_ngd_ssr_pdr_notify(struct qcom_slim_ngd_ctrl *ctrl,
 	switch (action) {
 	case QCOM_SSR_BEFORE_SHUTDOWN:
 	case SERVREG_SERVICE_STATE_DOWN:
-		/* Make sure the last dma xfer is finished */
-		mutex_lock(&ctrl->tx_lock);
 		if (ctrl->state != QCOM_SLIM_NGD_CTRL_DOWN) {
 			pm_runtime_get_noresume(ctrl->ctrl.dev);
 			ctrl->state = QCOM_SLIM_NGD_CTRL_DOWN;
 			qcom_slim_ngd_down(ctrl);
 			qcom_slim_ngd_exit_dma(ctrl);
 		}
-		mutex_unlock(&ctrl->tx_lock);
 		break;
 	case QCOM_SSR_AFTER_POWERUP:
 	case SERVREG_SERVICE_STATE_UP:

-- 
2.51.0
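
The ordering problem described in the commit message can be modelled in
userspace with a tiny lock-order tracker. This is only a sketch, loosely
in the spirit of what lockdep records, and not how lockdep is implemented.
Every name below (order_acquire(), xfer_path(), notify_path(), the
lock_class enum) is hypothetical and exists only for illustration. The
model is single-threaded: it checks recorded acquisition order, it does
not take real locks.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes, named after the two locks in the splat. */
enum lock_class { CTRL_LOCK, TX_LOCK, NR_LOCKS };

/* seen[a][b] is true once b has been acquired while a was held. */
static bool seen[NR_LOCKS][NR_LOCKS];
static bool held[NR_LOCKS];

static void order_reset(void)
{
	memset(seen, 0, sizeof(seen));
	memset(held, 0, sizeof(held));
}

/* Returns false if taking 'next' inverts a previously recorded order. */
static bool order_acquire(enum lock_class next)
{
	bool ok = true;

	for (int h = 0; h < NR_LOCKS; h++) {
		if (!held[h] || h == (int)next)
			continue;
		if (seen[next][h])
			ok = false;	/* next -> h was recorded earlier: ABBA */
		seen[h][next] = true;
	}
	held[next] = true;
	return ok;
}

static void order_release(enum lock_class lock)
{
	assert(held[lock]);	/* releasing a lock that is not held is a bug */
	held[lock] = false;
}

/* Models the other codepaths: slim_controller lock, then tx_lock. */
static bool xfer_path(void)
{
	bool ok = order_acquire(CTRL_LOCK);

	ok &= order_acquire(TX_LOCK);
	order_release(TX_LOCK);
	order_release(CTRL_LOCK);
	return ok;
}

/* Models the SSR/PDR down path; hold_tx = true is the pre-patch order. */
static bool notify_path(bool hold_tx)
{
	bool ok = true;

	if (hold_tx)
		ok &= order_acquire(TX_LOCK);
	ok &= order_acquire(CTRL_LOCK);	/* stands in for slim_report_absent() */
	order_release(CTRL_LOCK);
	if (hold_tx)
		order_release(TX_LOCK);
	return ok;
}
```

In this model, calling xfer_path() and then notify_path(true) reports an
inversion, while notify_path(false), which corresponds to the patched
notifier, does not.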

Re: [PATCH 7/7] slimbus: qcom-ngd-ctrl: Avoid ABBA on tx_lock/ctrl->lock

Posted by Dmitry Baryshkov 4 weeks, 1 day ago

On Mon, Mar 09, 2026 at 11:09:08PM -0500, Bjorn Andersson wrote:
> During the SSR/PDR down notification the tx_lock is taken with the
> intent to provide synchronization with active DMA transfers.
> 
> But during this period qcom_slim_ngd_down() is invoked, which ends up in
> slim_report_absent(), which takes the slim_controller lock. In multiple
> other codepaths these two locks are taken in the opposite order (i.e.
> slim_controller then tx_lock).
> 
> The result is a lockdep splat, and a possible deadlock:
> 
>   rprocctl/449 is trying to acquire lock:
>   ffff00009793e620 (&ctrl->lock){+.+.}-{4:4}, at: slim_report_absent (drivers/slimbus/core.c:322) slimbus
> 
>   but task is already holding lock:
>   ffff00009793fb50 (&ctrl->tx_lock){+.+.}-{4:4}, at: qcom_slim_ngd_ssr_pdr_notify (drivers/slimbus/qcom-ngd-ctrl.c:1475) slim_qcom_ngd_ctrl
> 
>   which lock already depends on the new lock.
> 
>   Possible unsafe locking scenario:
> 
>         CPU0                    CPU1
>         ----                    ----
>    lock(&ctrl->tx_lock);
>                                 lock(&ctrl->lock);
>                                 lock(&ctrl->tx_lock);
>    lock(&ctrl->lock);
> 
> The assumption is that the comment refers to the desire to not call
> qcom_slim_ngd_exit_dma() while we have an ongoing DMA TX transaction.
> But any such transaction is initiated and completed within a single
> qcom_slim_ngd_xfer_msg().
> 
> Prior to calling qcom_slim_ngd_exit_dma() the slim_controller is torn
> down, all child devices are notified that the slimbus is gone and the
> child devices are removed.
> 
> Stop taking the tx_lock in qcom_slim_ngd_ssr_pdr_notify() to avoid the
> deadlock.
> 
> Fixes: a899d324863a ("slimbus: qcom-ngd-ctrl: add Sub System Restart support")
> Cc: stable@vger.kernel.org
> Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com>
> ---
>  drivers/slimbus/qcom-ngd-ctrl.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/slimbus/qcom-ngd-ctrl.c b/drivers/slimbus/qcom-ngd-ctrl.c
> index 54a4c6ee1e71fe55794f09575979826d9aa5be9f..75d70de0909a8d17e2410d30f7811f32d5eebea3 100644
> --- a/drivers/slimbus/qcom-ngd-ctrl.c
> +++ b/drivers/slimbus/qcom-ngd-ctrl.c
> @@ -1471,15 +1471,12 @@ static int qcom_slim_ngd_ssr_pdr_notify(struct qcom_slim_ngd_ctrl *ctrl,
>  	switch (action) {
>  	case QCOM_SSR_BEFORE_SHUTDOWN:
>  	case SERVREG_SERVICE_STATE_DOWN:
> -		/* Make sure the last dma xfer is finished */
> -		mutex_lock(&ctrl->tx_lock);
>  		if (ctrl->state != QCOM_SLIM_NGD_CTRL_DOWN) {
>  			pm_runtime_get_noresume(ctrl->ctrl.dev);
>  			ctrl->state = QCOM_SLIM_NGD_CTRL_DOWN;

What will protect ctrl->state from the possible concurrent modification?

>  			qcom_slim_ngd_down(ctrl);
>  			qcom_slim_ngd_exit_dma(ctrl);
>  		}
> -		mutex_unlock(&ctrl->tx_lock);
>  		break;
>  	case QCOM_SSR_AFTER_POWERUP:
>  	case SERVREG_SERVICE_STATE_UP:
> 
> -- 
> 2.51.0
> 

-- 
With best wishes
Dmitry

Re: [PATCH 7/7] slimbus: qcom-ngd-ctrl: Avoid ABBA on tx_lock/ctrl->lock

Posted by Bjorn Andersson 1 week, 1 day ago

On Wed, Mar 11, 2026 at 03:37:10AM +0200, Dmitry Baryshkov wrote:
> On Mon, Mar 09, 2026 at 11:09:08PM -0500, Bjorn Andersson wrote:
> > During the SSR/PDR down notification the tx_lock is taken with the
> > intent to provide synchronization with active DMA transfers.
> > 
> > But during this period qcom_slim_ngd_down() is invoked, which ends up in
> > slim_report_absent(), which takes the slim_controller lock. In multiple
> > other codepaths these two locks are taken in the opposite order (i.e.
> > slim_controller then tx_lock).
> > 
> > The result is a lockdep splat, and a possible deadlock:
> > 
> >   rprocctl/449 is trying to acquire lock:
> >   ffff00009793e620 (&ctrl->lock){+.+.}-{4:4}, at: slim_report_absent (drivers/slimbus/core.c:322) slimbus
> > 
> >   but task is already holding lock:
> >   ffff00009793fb50 (&ctrl->tx_lock){+.+.}-{4:4}, at: qcom_slim_ngd_ssr_pdr_notify (drivers/slimbus/qcom-ngd-ctrl.c:1475) slim_qcom_ngd_ctrl
> > 
> >   which lock already depends on the new lock.
> > 
> >   Possible unsafe locking scenario:
> > 
> >         CPU0                    CPU1
> >         ----                    ----
> >    lock(&ctrl->tx_lock);
> >                                 lock(&ctrl->lock);
> >                                 lock(&ctrl->tx_lock);
> >    lock(&ctrl->lock);
> > 
> > The assumption is that the comment refers to the desire to not call
> > qcom_slim_ngd_exit_dma() while we have an ongoing DMA TX transaction.
> > But any such transaction is initiated and completed within a single
> > qcom_slim_ngd_xfer_msg().
> > 
> > Prior to calling qcom_slim_ngd_exit_dma() the slim_controller is torn
> > down, all child devices are notified that the slimbus is gone and the
> > child devices are removed.
> > 
> > Stop taking the tx_lock in qcom_slim_ngd_ssr_pdr_notify() to avoid the
> > deadlock.
> > 
> > Fixes: a899d324863a ("slimbus: qcom-ngd-ctrl: add Sub System Restart support")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com>
> > ---
> >  drivers/slimbus/qcom-ngd-ctrl.c | 3 ---
> >  1 file changed, 3 deletions(-)
> > 
> > diff --git a/drivers/slimbus/qcom-ngd-ctrl.c b/drivers/slimbus/qcom-ngd-ctrl.c
> > index 54a4c6ee1e71fe55794f09575979826d9aa5be9f..75d70de0909a8d17e2410d30f7811f32d5eebea3 100644
> > --- a/drivers/slimbus/qcom-ngd-ctrl.c
> > +++ b/drivers/slimbus/qcom-ngd-ctrl.c
> > @@ -1471,15 +1471,12 @@ static int qcom_slim_ngd_ssr_pdr_notify(struct qcom_slim_ngd_ctrl *ctrl,
> >  	switch (action) {
> >  	case QCOM_SSR_BEFORE_SHUTDOWN:
> >  	case SERVREG_SERVICE_STATE_DOWN:
> > -		/* Make sure the last dma xfer is finished */
> > -		mutex_lock(&ctrl->tx_lock);
> >  		if (ctrl->state != QCOM_SLIM_NGD_CTRL_DOWN) {
> >  			pm_runtime_get_noresume(ctrl->ctrl.dev);
> >  			ctrl->state = QCOM_SLIM_NGD_CTRL_DOWN;
> 
> What will protect ctrl->state from the possible concurrent modification?
> 

Nothing. qcom_slim_ngd_ssr_pdr_notify() might (at least) race with
qcom_slim_ngd_runtime_idle() and qcom_slim_ngd_runtime_suspend().

I think it would make sense to bring the ssr_lock out of
qcom_slim_ngd_up_worker() to ensure that qcom_slim_ngd_ssr_pdr_notify()
can't race with "itself" - but I believe that's still an incomplete fix
in relation to the PM runtime state.

More work will be needed here, beyond this series.

Regards,
Bjorn

> >  			qcom_slim_ngd_down(ctrl);
> >  			qcom_slim_ngd_exit_dma(ctrl);
> >  		}
> > -		mutex_unlock(&ctrl->tx_lock);
> >  		break;
> >  	case QCOM_SSR_AFTER_POWERUP:
> >  	case SERVREG_SERVICE_STATE_UP:
> > 
> > -- 
> > 2.51.0
> > 
> 
> -- 
> With best wishes
> Dmitry

Re: [PATCH 7/7] slimbus: qcom-ngd-ctrl: Avoid ABBA on tx_lock/ctrl->lock

Posted by Mukesh Ojha 4 weeks, 1 day ago

On Mon, Mar 09, 2026 at 11:09:08PM -0500, Bjorn Andersson wrote:
> During the SSR/PDR down notification the tx_lock is taken with the
> intent to provide synchronization with active DMA transfers.
> 
> But during this period qcom_slim_ngd_down() is invoked, which ends up in
> slim_report_absent(), which takes the slim_controller lock. In multiple
> other codepaths these two locks are taken in the opposite order (i.e.
> slim_controller then tx_lock).
> 
> The result is a lockdep splat, and a possible deadlock:
> 
>   rprocctl/449 is trying to acquire lock:
>   ffff00009793e620 (&ctrl->lock){+.+.}-{4:4}, at: slim_report_absent (drivers/slimbus/core.c:322) slimbus
> 
>   but task is already holding lock:
>   ffff00009793fb50 (&ctrl->tx_lock){+.+.}-{4:4}, at: qcom_slim_ngd_ssr_pdr_notify (drivers/slimbus/qcom-ngd-ctrl.c:1475) slim_qcom_ngd_ctrl
> 
>   which lock already depends on the new lock.
> 
>   Possible unsafe locking scenario:
> 
>         CPU0                    CPU1
>         ----                    ----
>    lock(&ctrl->tx_lock);
>                                 lock(&ctrl->lock);
>                                 lock(&ctrl->tx_lock);
>    lock(&ctrl->lock);
> 
> The assumption is that the comment refers to the desire to not call
> qcom_slim_ngd_exit_dma() while we have an ongoing DMA TX transaction.
> But any such transaction is initiated and completed within a single
> qcom_slim_ngd_xfer_msg().
> 
> Prior to calling qcom_slim_ngd_exit_dma() the slim_controller is torn
> down, all child devices are notified that the slimbus is gone and the
> child devices are removed.
> 
> Stop taking the tx_lock in qcom_slim_ngd_ssr_pdr_notify() to avoid the
> deadlock.
> 
> Fixes: a899d324863a ("slimbus: qcom-ngd-ctrl: add Sub System Restart support")
> Cc: stable@vger.kernel.org
> Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com>
> ---
>  drivers/slimbus/qcom-ngd-ctrl.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/slimbus/qcom-ngd-ctrl.c b/drivers/slimbus/qcom-ngd-ctrl.c
> index 54a4c6ee1e71fe55794f09575979826d9aa5be9f..75d70de0909a8d17e2410d30f7811f32d5eebea3 100644
> --- a/drivers/slimbus/qcom-ngd-ctrl.c
> +++ b/drivers/slimbus/qcom-ngd-ctrl.c
> @@ -1471,15 +1471,12 @@ static int qcom_slim_ngd_ssr_pdr_notify(struct qcom_slim_ngd_ctrl *ctrl,
>  	switch (action) {
>  	case QCOM_SSR_BEFORE_SHUTDOWN:
>  	case SERVREG_SERVICE_STATE_DOWN:
> -		/* Make sure the last dma xfer is finished */
> -		mutex_lock(&ctrl->tx_lock);
>  		if (ctrl->state != QCOM_SLIM_NGD_CTRL_DOWN) {
>  			pm_runtime_get_noresume(ctrl->ctrl.dev);
>  			ctrl->state = QCOM_SLIM_NGD_CTRL_DOWN;
>  			qcom_slim_ngd_down(ctrl);
>  			qcom_slim_ngd_exit_dma(ctrl);
>  		}
> -		mutex_unlock(&ctrl->tx_lock);


Would it not be safer to put this tx_lock around qcom_slim_ngd_exit_dma()?


>  		break;
>  	case QCOM_SSR_AFTER_POWERUP:
>  	case SERVREG_SERVICE_STATE_UP:
> 
> -- 
> 2.51.0
> 

-- 
-Mukesh Ojha

Re: [PATCH 7/7] slimbus: qcom-ngd-ctrl: Avoid ABBA on tx_lock/ctrl->lock

Posted by Bjorn Andersson 4 weeks, 1 day ago

On Tue, Mar 10, 2026 at 03:33:19PM +0530, Mukesh Ojha wrote:
> On Mon, Mar 09, 2026 at 11:09:08PM -0500, Bjorn Andersson wrote:
> > During the SSR/PDR down notification the tx_lock is taken with the
> > intent to provide synchronization with active DMA transfers.
> > 
> > But during this period qcom_slim_ngd_down() is invoked, which ends up in
> > slim_report_absent(), which takes the slim_controller lock. In multiple
> > other codepaths these two locks are taken in the opposite order (i.e.
> > slim_controller then tx_lock).
> > 
> > The result is a lockdep splat, and a possible deadlock:
> > 
> >   rprocctl/449 is trying to acquire lock:
> >   ffff00009793e620 (&ctrl->lock){+.+.}-{4:4}, at: slim_report_absent (drivers/slimbus/core.c:322) slimbus
> > 
> >   but task is already holding lock:
> >   ffff00009793fb50 (&ctrl->tx_lock){+.+.}-{4:4}, at: qcom_slim_ngd_ssr_pdr_notify (drivers/slimbus/qcom-ngd-ctrl.c:1475) slim_qcom_ngd_ctrl
> > 
> >   which lock already depends on the new lock.
> > 
> >   Possible unsafe locking scenario:
> > 
> >         CPU0                    CPU1
> >         ----                    ----
> >    lock(&ctrl->tx_lock);
> >                                 lock(&ctrl->lock);
> >                                 lock(&ctrl->tx_lock);
> >    lock(&ctrl->lock);
> > 
> > The assumption is that the comment refers to the desire to not call
> > qcom_slim_ngd_exit_dma() while we have an ongoing DMA TX transaction.
> > But any such transaction is initiated and completed within a single
> > qcom_slim_ngd_xfer_msg().
> > 
> > Prior to calling qcom_slim_ngd_exit_dma() the slim_controller is torn
> > down, all child devices are notified that the slimbus is gone and the
> > child devices are removed.
> > 
> > Stop taking the tx_lock in qcom_slim_ngd_ssr_pdr_notify() to avoid the
> > deadlock.
> > 
> > Fixes: a899d324863a ("slimbus: qcom-ngd-ctrl: add Sub System Restart support")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com>
> > ---
> >  drivers/slimbus/qcom-ngd-ctrl.c | 3 ---
> >  1 file changed, 3 deletions(-)
> > 
> > diff --git a/drivers/slimbus/qcom-ngd-ctrl.c b/drivers/slimbus/qcom-ngd-ctrl.c
> > index 54a4c6ee1e71fe55794f09575979826d9aa5be9f..75d70de0909a8d17e2410d30f7811f32d5eebea3 100644
> > --- a/drivers/slimbus/qcom-ngd-ctrl.c
> > +++ b/drivers/slimbus/qcom-ngd-ctrl.c
> > @@ -1471,15 +1471,12 @@ static int qcom_slim_ngd_ssr_pdr_notify(struct qcom_slim_ngd_ctrl *ctrl,
> >  	switch (action) {
> >  	case QCOM_SSR_BEFORE_SHUTDOWN:
> >  	case SERVREG_SERVICE_STATE_DOWN:
> > -		/* Make sure the last dma xfer is finished */
> > -		mutex_lock(&ctrl->tx_lock);
> >  		if (ctrl->state != QCOM_SLIM_NGD_CTRL_DOWN) {
> >  			pm_runtime_get_noresume(ctrl->ctrl.dev);
> >  			ctrl->state = QCOM_SLIM_NGD_CTRL_DOWN;
> >  			qcom_slim_ngd_down(ctrl);
> >  			qcom_slim_ngd_exit_dma(ctrl);
> >  		}
> > -		mutex_unlock(&ctrl->tx_lock);
> 
> 
> Would it not be safer to put this tx_lock around qcom_slim_ngd_exit_dma()?
> 

It would avoid the deadlock in question, so that's good.

But I don't think it's reasonable to guard against the case where
qcom_slim_ngd_xfer_msg() is running beyond qcom_slim_ngd_down().

qcom_slim_ngd_down() will tear down the world around the caller
of qcom_slim_ngd_xfer_msg(), so it's unlikely we're in a good place if
this happens.

One concrete example of this is that the wcd934x "ddata" will be
released by devres as qcom_slim_ngd_down() is cleaning up the children.


But to clarify, this is not something that is handled properly today -
more work is needed in this area.

Regards,
Bjorn

> 
> >  		break;
> >  	case QCOM_SSR_AFTER_POWERUP:
> >  	case SERVREG_SERVICE_STATE_UP:
> > 
> > -- 
> > 2.51.0
> > 
> 
> -- 
> -Mukesh Ojha

[PATCH 1/7] slimbus: qcom-ngd-ctrl: Fix up platform_driver registration
[PATCH 2/7] slimbus: qcom-ngd-ctrl: Fix probe error path ordering
[PATCH 3/7] slimbus: qcom-ngd-ctrl: Correct PDR and SSR cleanup ownership
[PATCH 4/7] slimbus: qcom-ngd-ctrl: Register callbacks after creating the ngd
[PATCH 5/7] slimbus: qcom-ngd-ctrl: Initialize controller resources in controller
[PATCH 6/7] slimbus: qcom-ngd-ctrl: Balance pm_runtime enablement for NGD
[PATCH 7/7] slimbus: qcom-ngd-ctrl: Avoid ABBA on tx_lock/ctrl->lock