drm/msm/a6xx: Fix GMU lockdep splat

[PATCH] drm/msm/a6xx: Fix GMU lockdep splat

Posted by Rob Clark 2 years, 6 months ago

From: Rob Clark <robdclark@chromium.org>

For normal GPU devfreq, we need to acquire the GMU lock while already
holding devfreq locks.  But in the teardown path, we were calling
dev_pm_domain_detach() while already holding the GMU lock, resulting in
this lockdep splat:

   ======================================================
   WARNING: possible circular locking dependency detected
   6.4.3-debug+ #3 Not tainted
   ------------------------------------------------------
   ring0/391 is trying to acquire lock:
   ffffff80a025c078 (&devfreq->lock){+.+.}-{3:3}, at: qos_notifier_call+0x30/0x74

   but task is already holding lock:
   ffffff809b8c1ce8 (&(c->notifiers)->rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x34/0x78

   which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:

   -> #4 (&(c->notifiers)->rwsem){++++}-{3:3}:
          down_write+0x58/0x74
          __blocking_notifier_chain_register+0x64/0x84
          blocking_notifier_chain_register+0x1c/0x28
          freq_qos_add_notifier+0x5c/0x7c
          dev_pm_qos_add_notifier+0xd4/0xf0
          devfreq_add_device+0x42c/0x560
          devm_devfreq_add_device+0x6c/0xb8
          msm_devfreq_init+0xa8/0x16c [msm]
          msm_gpu_init+0x368/0x54c [msm]
          adreno_gpu_init+0x248/0x2b0 [msm]
          a6xx_gpu_init+0x2d0/0x384 [msm]
          adreno_bind+0x264/0x2bc [msm]
          component_bind_all+0x124/0x1f4
          msm_drm_bind+0x2d0/0x5f4 [msm]
          try_to_bring_up_aggregate_device+0x88/0x1a4
          __component_add+0xd4/0x128
          component_add+0x1c/0x28
          dp_display_probe+0x37c/0x3c0 [msm]
          platform_probe+0x70/0xc0
          really_probe+0x148/0x280
          __driver_probe_device+0xfc/0x114
          driver_probe_device+0x44/0x100
          __device_attach_driver+0x64/0xdc
          bus_for_each_drv+0xb0/0xd8
          __device_attach+0xe4/0x140
          device_initial_probe+0x1c/0x28
          bus_probe_device+0x44/0xb0
          deferred_probe_work_func+0xb0/0xc8
          process_one_work+0x288/0x3d8
          worker_thread+0x1f0/0x260
          kthread+0xf0/0x100
          ret_from_fork+0x10/0x20

   -> #3 (dev_pm_qos_mtx){+.+.}-{3:3}:
          __mutex_lock+0xc8/0x388
          mutex_lock_nested+0x2c/0x38
          dev_pm_qos_remove_notifier+0x3c/0xc8
          genpd_remove_device+0x40/0x11c
          genpd_dev_pm_detach+0x88/0x130
          dev_pm_domain_detach+0x2c/0x3c
          a6xx_gmu_remove+0x44/0xdc [msm]
          a6xx_destroy+0x7c/0xa4 [msm]
          adreno_unbind+0x50/0x64 [msm]
          component_unbind+0x44/0x64
          component_unbind_all+0xb4/0xbc
          msm_drm_uninit.isra.0+0x124/0x17c [msm]
          msm_drm_bind+0x340/0x5f4 [msm]
          try_to_bring_up_aggregate_device+0x88/0x1a4
          __component_add+0xd4/0x128
          component_add+0x1c/0x28
          dp_display_probe+0x37c/0x3c0 [msm]
          platform_probe+0x70/0xc0
          really_probe+0x148/0x280
          __driver_probe_device+0xfc/0x114
          driver_probe_device+0x44/0x100
          __device_attach_driver+0x64/0xdc
          bus_for_each_drv+0xb0/0xd8
          __device_attach+0xe4/0x140
          device_initial_probe+0x1c/0x28
          bus_probe_device+0x44/0xb0
          deferred_probe_work_func+0xb0/0xc8
          process_one_work+0x288/0x3d8
          worker_thread+0x1f0/0x260
          kthread+0xf0/0x100
          ret_from_fork+0x10/0x20

   -> #2 (&a6xx_gpu->gmu.lock){+.+.}-{3:3}:
          __mutex_lock+0xc8/0x388
          mutex_lock_nested+0x2c/0x38
          a6xx_gpu_set_freq+0x38/0x64 [msm]
          msm_devfreq_target+0x170/0x18c [msm]
          devfreq_set_target+0x90/0x1e4
          devfreq_update_target+0xb4/0xf0
          update_devfreq+0x1c/0x28
          devfreq_monitor+0x3c/0x10c
          process_one_work+0x288/0x3d8
          worker_thread+0x1f0/0x260
          kthread+0xf0/0x100
          ret_from_fork+0x10/0x20

   -> #1 (&df->lock){+.+.}-{3:3}:
          __mutex_lock+0xc8/0x388
          mutex_lock_nested+0x2c/0x38
          msm_devfreq_get_dev_status+0x4c/0x104 [msm]
          devfreq_simple_ondemand_func+0x5c/0x128
          devfreq_update_target+0x68/0xf0
          update_devfreq+0x1c/0x28
          devfreq_monitor+0x3c/0x10c
          process_one_work+0x288/0x3d8
          worker_thread+0x1f0/0x260
          kthread+0xf0/0x100
          ret_from_fork+0x10/0x20

   -> #0 (&devfreq->lock){+.+.}-{3:3}:
          __lock_acquire+0xdf8/0x109c
          lock_acquire+0x234/0x284
          __mutex_lock+0xc8/0x388
          mutex_lock_nested+0x2c/0x38
          qos_notifier_call+0x30/0x74
          qos_min_notifier_call+0x1c/0x28
          notifier_call_chain+0xf4/0x114
          blocking_notifier_call_chain+0x4c/0x78
          pm_qos_update_target+0x184/0x190
          freq_qos_apply+0x4c/0x64
          apply_constraint+0xf8/0xfc
          __dev_pm_qos_update_request+0x138/0x164
          dev_pm_qos_update_request+0x44/0x68
          msm_devfreq_boost+0x40/0x70 [msm]
          msm_devfreq_active+0xc0/0xf0 [msm]
          msm_gpu_submit+0xc8/0x12c [msm]
          msm_job_run+0x88/0x128 [msm]
          drm_sched_main+0x240/0x324 [gpu_sched]
          kthread+0xf0/0x100
          ret_from_fork+0x10/0x20

   other info that might help us debug this:
   Chain exists of:
     &devfreq->lock --> dev_pm_qos_mtx --> &(c->notifiers)->rwsem
    Possible unsafe locking scenario:
          CPU0                    CPU1
          ----                    ----
     rlock(&(c->notifiers)->rwsem);
                                  lock(dev_pm_qos_mtx);
                                  lock(&(c->notifiers)->rwsem);
     lock(&devfreq->lock);

    *** DEADLOCK ***
   4 locks held by ring0/391:
    #0: ffffff809c811170 (&gpu->lock){+.+.}-{3:3}, at: msm_job_run+0x7c/0x128 [msm]
    #1: ffffff809c811208 (&gpu->active_lock){+.+.}-{3:3}, at: msm_gpu_submit+0xa8/0x12c [msm]
    #2: ffffffecbbb46600 (dev_pm_qos_mtx){+.+.}-{3:3}, at: dev_pm_qos_update_request+0x38/0x68
    #3: ffffff809b8c1ce8 (&(c->notifiers)->rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x34/0x78

   stack backtrace:
   CPU: 6 PID: 391 Comm: ring0 Not tainted 6.4.3debug+ #3
   Hardware name: Google Villager (rev1+) with LTE (DT)
   Call trace:
    dump_backtrace+0xb4/0xf0
    show_stack+0x20/0x30
    dump_stack_lvl+0x60/0x84
    dump_stack+0x18/0x24
    print_circular_bug+0x1cc/0x234
    check_noncircular+0x78/0xac
    __lock_acquire+0xdf8/0x109c
    lock_acquire+0x234/0x284
    __mutex_lock+0xc8/0x388
    mutex_lock_nested+0x2c/0x38
    qos_notifier_call+0x30/0x74
    qos_min_notifier_call+0x1c/0x28
    notifier_call_chain+0xf4/0x114
    blocking_notifier_call_chain+0x4c/0x78
    pm_qos_update_target+0x184/0x190
    freq_qos_apply+0x4c/0x64
    apply_constraint+0xf8/0xfc
    __dev_pm_qos_update_request+0x138/0x164
    dev_pm_qos_update_request+0x44/0x68
    msm_devfreq_boost+0x40/0x70 [msm]
    msm_devfreq_active+0xc0/0xf0 [msm]
    msm_gpu_submit+0xc8/0x12c [msm]
    msm_job_run+0x88/0x128 [msm]
    drm_sched_main+0x240/0x324 [gpu_sched]
    kthread+0xf0/0x100
    ret_from_fork+0x10/0x20

Fix this by only synchronizing access to gmu->initialized.

Fixes: 4cd15a3e8b36 ("drm/msm/a6xx: Make GPU destroy a bit safer")
Cc: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 11 ++++++++---
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c |  2 --
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
index bf7f855f4a34..3e0033666a2a 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
@@ -1441,8 +1441,15 @@ void a6xx_gmu_remove(struct a6xx_gpu *a6xx_gpu)
 	struct a6xx_gmu *gmu = &a6xx_gpu->gmu;
 	struct platform_device *pdev = to_platform_device(gmu->dev);
 
-	if (!gmu->initialized)
+	mutex_lock(&gmu->lock);
+	if (!gmu->initialized) {
+		mutex_unlock(&gmu->lock);
 		return;
+	}
+
+	gmu->initialized = false;
+
+	mutex_unlock(&gmu->lock);
 
 	pm_runtime_force_suspend(gmu->dev);
 
@@ -1472,8 +1479,6 @@ void a6xx_gmu_remove(struct a6xx_gpu *a6xx_gpu)
 
 	/* Drop reference taken in of_find_device_by_node */
 	put_device(gmu->dev);
-
-	gmu->initialized = false;
 }
 
 static int cxpd_notifier_cb(struct notifier_block *nb,
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 9be3260c8033..67dd2eeecf62 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -2091,9 +2091,7 @@ static void a6xx_destroy(struct msm_gpu *gpu)
 
 	a6xx_llc_slices_destroy(a6xx_gpu);
 
-	mutex_lock(&a6xx_gpu->gmu.lock);
 	a6xx_gmu_remove(a6xx_gpu);
-	mutex_unlock(&a6xx_gpu->gmu.lock);
 
 	adreno_gpu_cleanup(adreno_gpu);
 
-- 
2.41.0

Re: [PATCH] drm/msm/a6xx: Fix GMU lockdep splat

Posted by Doug Anderson 2 years, 6 months ago

Hi,

On Thu, Aug 3, 2023 at 10:34 AM Rob Clark <robdclark@gmail.com> wrote:
>
> From: Rob Clark <robdclark@chromium.org>
>
> For normal GPU devfreq, we need to acquire the GMU lock while already
> holding devfreq locks.  But in the teardown path, we were calling
> dev_pm_domain_detach() while already holding the GMU lock, resulting in
> this lockdep splat:
>
>    ======================================================
>    WARNING: possible circular locking dependency detected
>    6.4.3-debug+ #3 Not tainted
>    ------------------------------------------------------
>    ring0/391 is trying to acquire lock:
>    ffffff80a025c078 (&devfreq->lock){+.+.}-{3:3}, at: qos_notifier_call+0x30/0x74
>
>    but task is already holding lock:
>    ffffff809b8c1ce8 (&(c->notifiers)->rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x34/0x78
>
>    which lock already depends on the new lock.
>
>    the existing dependency chain (in reverse order) is:
>
>    -> #4 (&(c->notifiers)->rwsem){++++}-{3:3}:
>           down_write+0x58/0x74
>           __blocking_notifier_chain_register+0x64/0x84
>           blocking_notifier_chain_register+0x1c/0x28
>           freq_qos_add_notifier+0x5c/0x7c
>           dev_pm_qos_add_notifier+0xd4/0xf0
>           devfreq_add_device+0x42c/0x560
>           devm_devfreq_add_device+0x6c/0xb8
>           msm_devfreq_init+0xa8/0x16c [msm]
>           msm_gpu_init+0x368/0x54c [msm]
>           adreno_gpu_init+0x248/0x2b0 [msm]
>           a6xx_gpu_init+0x2d0/0x384 [msm]
>           adreno_bind+0x264/0x2bc [msm]
>           component_bind_all+0x124/0x1f4
>           msm_drm_bind+0x2d0/0x5f4 [msm]
>           try_to_bring_up_aggregate_device+0x88/0x1a4
>           __component_add+0xd4/0x128
>           component_add+0x1c/0x28
>           dp_display_probe+0x37c/0x3c0 [msm]
>           platform_probe+0x70/0xc0
>           really_probe+0x148/0x280
>           __driver_probe_device+0xfc/0x114
>           driver_probe_device+0x44/0x100
>           __device_attach_driver+0x64/0xdc
>           bus_for_each_drv+0xb0/0xd8
>           __device_attach+0xe4/0x140
>           device_initial_probe+0x1c/0x28
>           bus_probe_device+0x44/0xb0
>           deferred_probe_work_func+0xb0/0xc8
>           process_one_work+0x288/0x3d8
>           worker_thread+0x1f0/0x260
>           kthread+0xf0/0x100
>           ret_from_fork+0x10/0x20
>
>    -> #3 (dev_pm_qos_mtx){+.+.}-{3:3}:
>           __mutex_lock+0xc8/0x388
>           mutex_lock_nested+0x2c/0x38
>           dev_pm_qos_remove_notifier+0x3c/0xc8
>           genpd_remove_device+0x40/0x11c
>           genpd_dev_pm_detach+0x88/0x130
>           dev_pm_domain_detach+0x2c/0x3c
>           a6xx_gmu_remove+0x44/0xdc [msm]
>           a6xx_destroy+0x7c/0xa4 [msm]
>           adreno_unbind+0x50/0x64 [msm]
>           component_unbind+0x44/0x64
>           component_unbind_all+0xb4/0xbc
>           msm_drm_uninit.isra.0+0x124/0x17c [msm]
>           msm_drm_bind+0x340/0x5f4 [msm]
>           try_to_bring_up_aggregate_device+0x88/0x1a4
>           __component_add+0xd4/0x128
>           component_add+0x1c/0x28
>           dp_display_probe+0x37c/0x3c0 [msm]
>           platform_probe+0x70/0xc0
>           really_probe+0x148/0x280
>           __driver_probe_device+0xfc/0x114
>           driver_probe_device+0x44/0x100
>           __device_attach_driver+0x64/0xdc
>           bus_for_each_drv+0xb0/0xd8
>           __device_attach+0xe4/0x140
>           device_initial_probe+0x1c/0x28
>           bus_probe_device+0x44/0xb0
>           deferred_probe_work_func+0xb0/0xc8
>           process_one_work+0x288/0x3d8
>           worker_thread+0x1f0/0x260
>           kthread+0xf0/0x100
>           ret_from_fork+0x10/0x20
>
>    -> #2 (&a6xx_gpu->gmu.lock){+.+.}-{3:3}:
>           __mutex_lock+0xc8/0x388
>           mutex_lock_nested+0x2c/0x38
>           a6xx_gpu_set_freq+0x38/0x64 [msm]
>           msm_devfreq_target+0x170/0x18c [msm]
>           devfreq_set_target+0x90/0x1e4
>           devfreq_update_target+0xb4/0xf0
>           update_devfreq+0x1c/0x28
>           devfreq_monitor+0x3c/0x10c
>           process_one_work+0x288/0x3d8
>           worker_thread+0x1f0/0x260
>           kthread+0xf0/0x100
>           ret_from_fork+0x10/0x20
>
>    -> #1 (&df->lock){+.+.}-{3:3}:
>           __mutex_lock+0xc8/0x388
>           mutex_lock_nested+0x2c/0x38
>           msm_devfreq_get_dev_status+0x4c/0x104 [msm]
>           devfreq_simple_ondemand_func+0x5c/0x128
>           devfreq_update_target+0x68/0xf0
>           update_devfreq+0x1c/0x28
>           devfreq_monitor+0x3c/0x10c
>           process_one_work+0x288/0x3d8
>           worker_thread+0x1f0/0x260
>           kthread+0xf0/0x100
>           ret_from_fork+0x10/0x20
>
>    -> #0 (&devfreq->lock){+.+.}-{3:3}:
>           __lock_acquire+0xdf8/0x109c
>           lock_acquire+0x234/0x284
>           __mutex_lock+0xc8/0x388
>           mutex_lock_nested+0x2c/0x38
>           qos_notifier_call+0x30/0x74
>           qos_min_notifier_call+0x1c/0x28
>           notifier_call_chain+0xf4/0x114
>           blocking_notifier_call_chain+0x4c/0x78
>           pm_qos_update_target+0x184/0x190
>           freq_qos_apply+0x4c/0x64
>           apply_constraint+0xf8/0xfc
>           __dev_pm_qos_update_request+0x138/0x164
>           dev_pm_qos_update_request+0x44/0x68
>           msm_devfreq_boost+0x40/0x70 [msm]
>           msm_devfreq_active+0xc0/0xf0 [msm]
>           msm_gpu_submit+0xc8/0x12c [msm]
>           msm_job_run+0x88/0x128 [msm]
>           drm_sched_main+0x240/0x324 [gpu_sched]
>           kthread+0xf0/0x100
>           ret_from_fork+0x10/0x20
>
>    other info that might help us debug this:
>    Chain exists of:
>      &devfreq->lock --> dev_pm_qos_mtx --> &(c->notifiers)->rwsem
>     Possible unsafe locking scenario:
>           CPU0                    CPU1
>           ----                    ----
>      rlock(&(c->notifiers)->rwsem);
>                                   lock(dev_pm_qos_mtx);
>                                   lock(&(c->notifiers)->rwsem);
>      lock(&devfreq->lock);
>
>     *** DEADLOCK ***
>    4 locks held by ring0/391:
>     #0: ffffff809c811170 (&gpu->lock){+.+.}-{3:3}, at: msm_job_run+0x7c/0x128 [msm]
>     #1: ffffff809c811208 (&gpu->active_lock){+.+.}-{3:3}, at: msm_gpu_submit+0xa8/0x12c [msm]
>     #2: ffffffecbbb46600 (dev_pm_qos_mtx){+.+.}-{3:3}, at: dev_pm_qos_update_request+0x38/0x68
>     #3: ffffff809b8c1ce8 (&(c->notifiers)->rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x34/0x78
>
>    stack backtrace:
>    CPU: 6 PID: 391 Comm: ring0 Not tainted 6.4.3debug+ #3
>    Hardware name: Google Villager (rev1+) with LTE (DT)
>    Call trace:
>     dump_backtrace+0xb4/0xf0
>     show_stack+0x20/0x30
>     dump_stack_lvl+0x60/0x84
>     dump_stack+0x18/0x24
>     print_circular_bug+0x1cc/0x234
>     check_noncircular+0x78/0xac
>     __lock_acquire+0xdf8/0x109c
>     lock_acquire+0x234/0x284
>     __mutex_lock+0xc8/0x388
>     mutex_lock_nested+0x2c/0x38
>     qos_notifier_call+0x30/0x74
>     qos_min_notifier_call+0x1c/0x28
>     notifier_call_chain+0xf4/0x114
>     blocking_notifier_call_chain+0x4c/0x78
>     pm_qos_update_target+0x184/0x190
>     freq_qos_apply+0x4c/0x64
>     apply_constraint+0xf8/0xfc
>     __dev_pm_qos_update_request+0x138/0x164
>     dev_pm_qos_update_request+0x44/0x68
>     msm_devfreq_boost+0x40/0x70 [msm]
>     msm_devfreq_active+0xc0/0xf0 [msm]
>     msm_gpu_submit+0xc8/0x12c [msm]
>     msm_job_run+0x88/0x128 [msm]
>     drm_sched_main+0x240/0x324 [gpu_sched]
>     kthread+0xf0/0x100
>     ret_from_fork+0x10/0x20
>
> Fix this by only synchronizing access to gmu->initialized.
>
> Fixes: 4cd15a3e8b36 ("drm/msm/a6xx: Make GPU destroy a bit safer")
> Cc: Douglas Anderson <dianders@chromium.org>
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 11 ++++++++---
>  drivers/gpu/drm/msm/adreno/a6xx_gpu.c |  2 --
>  2 files changed, 8 insertions(+), 5 deletions(-)

While this is tangled enough that I can't necessarily think through
every aspect of it, this seems to me like it should do the trick.
Thanks for the fix!

Reviewed-by: Douglas Anderson <dianders@chromium.org>

-Doug