[PATCH v2] drm/msm: Switch ordering of runpm put vs devfreq_idle

Rob Clark posted 1 patch 3 years, 10 months ago
drivers/gpu/drm/msm/msm_gpu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
From: Rob Clark <robdclark@chromium.org>

I've seen a few crashes like:

    CPU: 0 PID: 216 Comm: A618-worker Tainted: G        W         5.4.196 #7
    Hardware name: Google Wormdingler rev1+ INX panel board (DT)
    pstate: 20c00009 (nzCv daif +PAN +UAO)
    pc : msm_readl+0x14/0x34
    lr : a6xx_gpu_busy+0x40/0x80
    sp : ffffffc011b93ad0
    x29: ffffffc011b93ad0 x28: ffffffe77cba3000
    x27: 0000000000000001 x26: ffffffe77bb4c4ac
    x25: ffffffa2f227dfa0 x24: ffffffa2f22aab28
    x23: 0000000000000000 x22: ffffffa2f22bf020
    x21: ffffffa2f22bf000 x20: ffffffc011b93b10
    x19: ffffffc011bd4110 x18: 000000000000000e
    x17: 0000000000000004 x16: 000000000000000c
    x15: 000001be3a969450 x14: 0000000000000400
    x13: 00000000000101d6 x12: 0000000034155555
    x11: 0000000000000001 x10: 0000000000000000
    x9 : 0000000100000000 x8 : ffffffc011bd4000
    x7 : 0000000000000000 x6 : 0000000000000007
    x5 : ffffffc01d8b38f0 x4 : 0000000000000000
    x3 : 00000000ffffffff x2 : 0000000000000002
    x1 : 0000000000000000 x0 : ffffffc011bd4110
    Call trace:
     msm_readl+0x14/0x34
     a6xx_gpu_busy+0x40/0x80
     msm_devfreq_get_dev_status+0x70/0x1d0
     devfreq_simple_ondemand_func+0x34/0x100
     update_devfreq+0x50/0xe8
     qos_notifier_call+0x2c/0x64
     qos_max_notifier_call+0x1c/0x2c
     notifier_call_chain+0x58/0x98
     __blocking_notifier_call_chain+0x74/0x84
     blocking_notifier_call_chain+0x38/0x48
     pm_qos_update_target+0xf8/0x19c
     freq_qos_apply+0x54/0x6c
     apply_constraint+0x60/0x104
     __dev_pm_qos_update_request+0xb4/0x184
     dev_pm_qos_update_request+0x38/0x58
     msm_devfreq_idle_work+0x34/0x40
     kthread_worker_fn+0x144/0x1c8
     kthread+0x140/0x284
     ret_from_fork+0x10/0x18
    Code: f9000bf3 910003fd aa0003f3 d503201f (b9400260)
    ---[ end trace f6309767a42d0831 ]---

Which smells a lot like touching hw after power collapse.  This seems
a bit like a race/timing issue elsewhere, as pm_runtime_get_if_in_use()
in a6xx_gpu_busy() should have kept us from touching hw if it wasn't
powered.

But, we've seen cases where the idle_work scheduled by
msm_devfreq_idle() ends up racing with the resume path.  Which, again,
shouldn't be a problem other than unnecessary freq changes.

v2. Only move the runpm _put_autosuspend, and not the _mark_last_busy()

Fixes: 9bc95570175a ("drm/msm: Devfreq tuning")
Signed-off-by: Rob Clark <robdclark@chromium.org>
Link: https://lore.kernel.org/r/20210927152928.831245-1-robdclark@gmail.com
---
 drivers/gpu/drm/msm/msm_gpu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index eb8a6663f309..244511f85044 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -672,7 +672,6 @@ static void retire_submit(struct msm_gpu *gpu, struct msm_ringbuffer *ring,
 	msm_submit_retire(submit);
 
 	pm_runtime_mark_last_busy(&gpu->pdev->dev);
-	pm_runtime_put_autosuspend(&gpu->pdev->dev);
 
 	spin_lock_irqsave(&ring->submit_lock, flags);
 	list_del(&submit->node);
@@ -686,6 +685,8 @@ static void retire_submit(struct msm_gpu *gpu, struct msm_ringbuffer *ring,
 		msm_devfreq_idle(gpu);
 	mutex_unlock(&gpu->active_lock);
 
+	pm_runtime_put_autosuspend(&gpu->pdev->dev);
+
 	msm_gem_submit_put(submit);
 }
 
-- 
2.36.1
Re: [PATCH v2] drm/msm: Switch ordering of runpm put vs devfreq_idle
Posted by Doug Anderson 3 years, 10 months ago
Hi,

On Wed, Jun 8, 2022 at 9:13 AM Rob Clark <robdclark@gmail.com> wrote:
>
> From: Rob Clark <robdclark@chromium.org>
>
> I've seen a few crashes like:
>
>     [register dump and backtrace snipped; full crash log above]
>
> Which smells a lot like touching hw after power collapse.  This seems
> a bit like a race/timing issue elsewhere, as pm_runtime_get_if_in_use()
> in a6xx_gpu_busy() should have kept us from touching hw if it wasn't
> powered.

I dunno if we want to change the commit message, since I think my patch
[1] addresses the above problem?

[1] https://lore.kernel.org/r/20220609094716.v2.1.Ie846c5352bc307ee4248d7cab998ab3016b85d06@changeid


> But, we've seen cases where the idle_work scheduled by
> msm_devfreq_idle() ends up racing with the resume path.  Which, again,
> shouldn't be a problem other than unnecessary freq changes.
>
> v2. Only move the runpm _put_autosuspend, and not the _mark_last_busy()
>
> Fixes: 9bc95570175a ("drm/msm: Devfreq tuning")
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> Link: https://lore.kernel.org/r/20210927152928.831245-1-robdclark@gmail.com
> ---
>  drivers/gpu/drm/msm/msm_gpu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

In any case, your patch fixes the potential WARN_ON and seems like the
right thing to do, so:

Reviewed-by: Douglas Anderson <dianders@chromium.org>
Re: [PATCH v2] drm/msm: Switch ordering of runpm put vs devfreq_idle
Posted by Akhil P Oommen 3 years, 10 months ago
On 6/8/2022 9:43 PM, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
>
> [crash log and patch body snipped; full text quoted above]
>

Reviewed-by: Akhil P Oommen <quic_akhilpo@quicinc.com>


-Akhil.