[PATCH] drm/amdgpu: Mark contexts guilty for any reset type

André Almeida posted 1 patch 2 years, 7 months ago
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 8 +++++++-
2 files changed, 7 insertions(+), 4 deletions(-)
[PATCH] drm/amdgpu: Mark contexts guilty for any reset type
Posted by André Almeida 2 years, 7 months ago
When a DRM job timeout, the GPU is probably hang and amdgpu have some
ways to deal with that, ranging from soft recoveries to full device
reset. Anyway, when userspace ask the kernel the state of the context
(via AMDGPU_CTX_OP_QUERY_STATE), the kernel reports that the device was
reset, regardless if a full reset happened or not.

However, amdgpu only marks a context guilty in the ASIC reset path. This
makes the userspace report incomplete, given that on soft recovery path
the guilty context is not told that it's the guilty one.

Fix this by marking the context guilty for every type of reset when a
job timeouts.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 8 +++++++-
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ac78caa7cba8..ea169d1689e2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4771,9 +4771,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 
 	amdgpu_fence_driver_isr_toggle(adev, false);
 
-	if (job && job->vm)
-		drm_sched_increase_karma(&job->base);
-
 	r = amdgpu_reset_prepare_hwcontext(adev, reset_context);
 	/* If reset handler not implemented, continue; otherwise return */
 	if (r == -ENOSYS)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index c3d9d75143f4..097ed8f06865 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -51,6 +51,13 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 	memset(&ti, 0, sizeof(struct amdgpu_task_info));
 	adev->job_hang = true;
 
+	amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
+
+	if (job && job->vm) {
+		DRM_INFO("marking %s context as guilty", ti.process_name);
+		drm_sched_increase_karma(&job->base);
+	}
+
 	if (amdgpu_gpu_recovery &&
 	    amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
 		DRM_ERROR("ring %s timeout, but soft recovered\n",
@@ -58,7 +65,6 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 		goto exit;
 	}
 
-	amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
 	DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
 		  job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
 		  ring->fence_drv.sync_seq);
-- 
2.40.0

Re: [PATCH] drm/amdgpu: Mark contexts guilty for any reset type
Posted by Christian König 2 years, 7 months ago
Am 24.04.23 um 03:43 schrieb André Almeida:
> When a DRM job timeout, the GPU is probably hang and amdgpu have some
> ways to deal with that, ranging from soft recoveries to full device
> reset. Anyway, when userspace ask the kernel the state of the context
> (via AMDGPU_CTX_OP_QUERY_STATE), the kernel reports that the device was
> reset, regardless if a full reset happened or not.
>
> However, amdgpu only marks a context guilty in the ASIC reset path. This
> makes the userspace report incomplete, given that on soft recovery path
> the guilty context is not told that it's the guilty one.
>
> Fix this by marking the context guilty for every type of reset when a
> job timeouts.

The guilty handling is pretty much broken by design and only works 
because we go through multiple hops of validating the entity after the 
job has already been pushed to the hw.

I think we should probably just remove that completely and use an 
approach where we check the in flight submissions in the query state 
IOCTL. See my other patch on the mailing list regarding that.

Additional to that I currently didn't considered soft-recovered 
submissions as fatal and continue accepting submissions from that 
context, but already wanted to talk with Marek about that behavior.

Regards,
Christian.

>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 8 +++++++-
>   2 files changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index ac78caa7cba8..ea169d1689e2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4771,9 +4771,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>   
>   	amdgpu_fence_driver_isr_toggle(adev, false);
>   
> -	if (job && job->vm)
> -		drm_sched_increase_karma(&job->base);
> -
>   	r = amdgpu_reset_prepare_hwcontext(adev, reset_context);
>   	/* If reset handler not implemented, continue; otherwise return */
>   	if (r == -ENOSYS)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index c3d9d75143f4..097ed8f06865 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -51,6 +51,13 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>   	memset(&ti, 0, sizeof(struct amdgpu_task_info));
>   	adev->job_hang = true;
>   
> +	amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
> +
> +	if (job && job->vm) {
> +		DRM_INFO("marking %s context as guilty", ti.process_name);
> +		drm_sched_increase_karma(&job->base);
> +	}
> +
>   	if (amdgpu_gpu_recovery &&
>   	    amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
>   		DRM_ERROR("ring %s timeout, but soft recovered\n",
> @@ -58,7 +65,6 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>   		goto exit;
>   	}
>   
> -	amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
>   	DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
>   		  job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
>   		  ring->fence_drv.sync_seq);

Re: [PATCH] drm/amdgpu: Mark contexts guilty for any reset type
Posted by André Almeida 2 years, 7 months ago
Hi Christian, thank you for your comments.

Em 24/04/2023 04:03, Christian König escreveu:
> Am 24.04.23 um 03:43 schrieb André Almeida:
>> When a DRM job timeout, the GPU is probably hang and amdgpu have some
>> ways to deal with that, ranging from soft recoveries to full device
>> reset. Anyway, when userspace ask the kernel the state of the context
>> (via AMDGPU_CTX_OP_QUERY_STATE), the kernel reports that the device was
>> reset, regardless if a full reset happened or not.
>>
>> However, amdgpu only marks a context guilty in the ASIC reset path. This
>> makes the userspace report incomplete, given that on soft recovery path
>> the guilty context is not told that it's the guilty one.
>>
>> Fix this by marking the context guilty for every type of reset when a
>> job timeouts.
> 
> The guilty handling is pretty much broken by design and only works 
> because we go through multiple hops of validating the entity after the 
> job has already been pushed to the hw.

I see, thanks.

> 
> I think we should probably just remove that completely and use an 
> approach where we check the in flight submissions in the query state 
> IOCTL.

Like the DRM_IOCTL_I915_GET_RESET_STATS approach?

 > See my other patch on the mailing list regarding that.

Which one, the "[PATCH 1/8] drm/scheduler: properly forward fence 
errors" series?

> 
> Additional to that I currently didn't considered soft-recovered 
> submissions as fatal and continue accepting submissions from that 
> context, but already wanted to talk with Marek about that behavior.
> 

Interesting. I will try to test and validate this approach to see if the 
contexts keep working as expected on soft-resets.

> Regards,
> Christian.
Re: [PATCH] drm/amdgpu: Mark contexts guilty for any reset type
Posted by Michel Dänzer 2 years, 7 months ago
On 4/24/23 15:26, André Almeida wrote:
>>
>> Additional to that I currently didn't considered soft-recovered submissions as fatal and continue accepting submissions from that context, but already wanted to talk with Marek about that behavior.
>>
> 
> Interesting. I will try to test and validate this approach to see if the contexts keep working as expected on soft-resets.

FWIW, on this Thinkpad E595 with a Picasso APU, I've hit soft-resets (usually either in Firefox or gnome-shell) a number of times, and usually continued using the GNOME session for a few days without any issues.

(Interestingly, Firefox reacts to the soft-resets by falling back to software rendering, even when it's not guilty itself)


-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer

Re: [PATCH] drm/amdgpu: Mark contexts guilty for any reset type
Posted by kernel test robot 2 years, 7 months ago
Hi André,

kernel test robot noticed the following build warnings:

[auto build test WARNING on drm-misc/drm-misc-next]
[also build test WARNING on drm/drm-next drm-exynos/exynos-drm-next drm-intel/for-linux-next drm-intel/for-linux-next-fixes drm-tip/drm-tip linus/master v6.3 next-20230421]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Andr-Almeida/drm-amdgpu-Mark-contexts-guilty-for-any-reset-type/20230424-094534
base:   git://anongit.freedesktop.org/drm/drm-misc drm-misc-next
patch link:    https://lore.kernel.org/r/20230424014324.218531-1-andrealmeid%40igalia.com
patch subject: [PATCH] drm/amdgpu: Mark contexts guilty for any reset type
config: s390-allyesconfig (https://download.01.org/0day-ci/archive/20230424/202304241259.Qq0Dmlud-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/ea7b1d78b677fdcf5f4776e63de611a2681cd5fb
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Andr-Almeida/drm-amdgpu-Mark-contexts-guilty-for-any-reset-type/20230424-094534
        git checkout ea7b1d78b677fdcf5f4776e63de611a2681cd5fb
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash drivers/gpu/drm/amd/amdgpu/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304241259.Qq0Dmlud-lkp@intel.com/

All warnings (new ones prefixed by >>):

   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c: In function 'amdgpu_device_pre_asic_reset':
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:4738:28: warning: variable 'job' set but not used [-Wunused-but-set-variable]
    4738 |         struct amdgpu_job *job = NULL;
         |                            ^~~


vim +/job +4738 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

5740682e66cef5 Monk Liu          2017-10-25  4733  
e3c1b0712fdb03 shaoyunl          2021-02-16  4734  int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
04442bf70debb1 Lijo Lazar        2021-03-16  4735  				 struct amdgpu_reset_context *reset_context)
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4736  {
5c1e6fa49e8d8d Huang Rui         2021-12-16  4737  	int i, r = 0;
04442bf70debb1 Lijo Lazar        2021-03-16 @4738  	struct amdgpu_job *job = NULL;
04442bf70debb1 Lijo Lazar        2021-03-16  4739  	bool need_full_reset =
04442bf70debb1 Lijo Lazar        2021-03-16  4740  		test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
04442bf70debb1 Lijo Lazar        2021-03-16  4741  
04442bf70debb1 Lijo Lazar        2021-03-16  4742  	if (reset_context->reset_req_dev == adev)
04442bf70debb1 Lijo Lazar        2021-03-16  4743  		job = reset_context->job;
711826656bebb0 Monk Liu          2017-12-25  4744  
b602ca5f31fe69 Tiecheng Zhou     2020-08-19  4745  	if (amdgpu_sriov_vf(adev)) {
b602ca5f31fe69 Tiecheng Zhou     2020-08-19  4746  		/* stop the data exchange thread */
b602ca5f31fe69 Tiecheng Zhou     2020-08-19  4747  		amdgpu_virt_fini_data_exchange(adev);
b602ca5f31fe69 Tiecheng Zhou     2020-08-19  4748  	}
b602ca5f31fe69 Tiecheng Zhou     2020-08-19  4749  
9e225fb9e636b3 Andrey Grodzovsky 2022-06-18  4750  	amdgpu_fence_driver_isr_toggle(adev, true);
9e225fb9e636b3 Andrey Grodzovsky 2022-06-18  4751  
711826656bebb0 Monk Liu          2017-12-25  4752  	/* block all schedulers and reset given job's ring */
0875dc9e80eb3b Chunming Zhou     2016-06-12  4753  	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
0875dc9e80eb3b Chunming Zhou     2016-06-12  4754  		struct amdgpu_ring *ring = adev->rings[i];
0875dc9e80eb3b Chunming Zhou     2016-06-12  4755  
51687759be93fb Chunming Zhou     2017-04-24  4756  		if (!ring || !ring->sched.thread)
0875dc9e80eb3b Chunming Zhou     2016-06-12  4757  			continue;
5740682e66cef5 Monk Liu          2017-10-25  4758  
c530b02f39850a Jack Zhang        2021-05-12  4759  		/*clear job fence from fence drv to avoid force_completion
c530b02f39850a Jack Zhang        2021-05-12  4760  		 *leave NULL and vm flush fence in fence drv */
5c1e6fa49e8d8d Huang Rui         2021-12-16  4761  		amdgpu_fence_driver_clear_job_fences(ring);
c530b02f39850a Jack Zhang        2021-05-12  4762  
2200edac745a65 Chunming Zhou     2016-06-30  4763  		/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
2f9d4084cac96a Monk Liu          2017-10-16  4764  		amdgpu_fence_driver_force_completion(ring);
2f9d4084cac96a Monk Liu          2017-10-16  4765  	}
d38ceaf99ed015 Alex Deucher      2015-04-20  4766  
9e225fb9e636b3 Andrey Grodzovsky 2022-06-18  4767  	amdgpu_fence_driver_isr_toggle(adev, false);
9e225fb9e636b3 Andrey Grodzovsky 2022-06-18  4768  
04442bf70debb1 Lijo Lazar        2021-03-16  4769  	r = amdgpu_reset_prepare_hwcontext(adev, reset_context);
404b277bbe4945 Lijo Lazar        2021-03-26  4770  	/* If reset handler not implemented, continue; otherwise return */
404b277bbe4945 Lijo Lazar        2021-03-26  4771  	if (r == -ENOSYS)
404b277bbe4945 Lijo Lazar        2021-03-26  4772  		r = 0;
404b277bbe4945 Lijo Lazar        2021-03-26  4773  	else
04442bf70debb1 Lijo Lazar        2021-03-16  4774  		return r;
04442bf70debb1 Lijo Lazar        2021-03-16  4775  
1d721ed679db18 Andrey Grodzovsky 2019-04-18  4776  	/* Don't suspend on bare metal if we are not going to HW reset the ASIC */
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4777  	if (!amdgpu_sriov_vf(adev)) {
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4778  
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4779  		if (!need_full_reset)
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4780  			need_full_reset = amdgpu_device_ip_need_full_reset(adev);
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4781  
360cd08196cabc Likun Gao         2022-12-21  4782  		if (!need_full_reset && amdgpu_gpu_recovery &&
360cd08196cabc Likun Gao         2022-12-21  4783  		    amdgpu_device_ip_check_soft_reset(adev)) {
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4784  			amdgpu_device_ip_pre_soft_reset(adev);
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4785  			r = amdgpu_device_ip_soft_reset(adev);
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4786  			amdgpu_device_ip_post_soft_reset(adev);
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4787  			if (r || amdgpu_device_ip_check_soft_reset(adev)) {
aac891685da661 Dennis Li         2020-08-20  4788  				dev_info(adev->dev, "soft reset failed, will fallback to full reset!\n");
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4789  				need_full_reset = true;
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4790  			}
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4791  		}
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4792  
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4793  		if (need_full_reset)
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4794  			r = amdgpu_device_ip_suspend(adev);
04442bf70debb1 Lijo Lazar        2021-03-16  4795  		if (need_full_reset)
04442bf70debb1 Lijo Lazar        2021-03-16  4796  			set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
04442bf70debb1 Lijo Lazar        2021-03-16  4797  		else
04442bf70debb1 Lijo Lazar        2021-03-16  4798  			clear_bit(AMDGPU_NEED_FULL_RESET,
04442bf70debb1 Lijo Lazar        2021-03-16  4799  				  &reset_context->flags);
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4800  	}
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4801  
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4802  	return r;
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4803  }
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4804  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests