[RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

André Almeida posted 1 patch 2 years, 7 months ago
drivers/gpu/drm/amd/amdgpu/amdgpu.h      |  3 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c  |  3 ++-
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  3 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c  |  7 ++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  1 +
drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c   | 29 ++++++++++++++++++++++++
include/uapi/drm/amdgpu_drm.h            |  7 ++++++
7 files changed, 52 insertions(+), 1 deletion(-)
[RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl
Posted by André Almeida 2 years, 7 months ago
Currently UMD hasn't much information on what went wrong during a GPU reset. To
help with that, this patch proposes a new IOCTL that can be used to query
information about the resources that caused the hang.

The goal of this RFC is to gather feedback about this interface. The mesa part
can be found at https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22785

The current implementation is racy, meaning that if two resets happens (even on
different rings), the app will get the last reset information available, rather
than the one that is looking for. Maybe this can be fixed with a ring_id
parameter to query the information for a specific ring, but this also requires
an interface to tell the UMD which ring caused it.

I know that devcoredump is also used for this kind of information, but I believe
that using an IOCTL is better for interfacing Mesa + Linux rather than parsing
a file that its contents are subjected to be changed.

André Almeida (1):
  drm/amdgpu: Add interface to dump guilty IB on GPU hang

 drivers/gpu/drm/amd/amdgpu/amdgpu.h      |  3 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c  |  3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  3 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c  |  7 ++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  1 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c   | 29 ++++++++++++++++++++++++
 include/uapi/drm/amdgpu_drm.h            |  7 ++++++
 7 files changed, 52 insertions(+), 1 deletion(-)

-- 
2.40.1

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl
Posted by Christian König 2 years, 7 months ago
Well first of all don't expose the VMID to userspace.

The UMD doesn't know (and shouldn't know) which VMID is used for a 
submission since this is dynamically assigned and can change at any time.

For debugging there is an interface to use an reserved VMID for your 
debugged process which allows to associate logs, tracepoints and hw 
dumps with the stuff executed by this specific process.

Then we already have a feedback mechanism in the form of the error 
number in the fence. What we still need is an IOCTL to query that.

Regarding how far processing inside the IB was when the issue was 
detected, intermediate debug fences are much more reliable than asking 
the kernel for that.

Regards,
Christian.

Am 01.05.23 um 20:57 schrieb André Almeida:
> Currently UMD hasn't much information on what went wrong during a GPU reset. To
> help with that, this patch proposes a new IOCTL that can be used to query
> information about the resources that caused the hang.
>
> The goal of this RFC is to gather feedback about this interface. The mesa part
> can be found at https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22785
>
> The current implementation is racy, meaning that if two resets happens (even on
> different rings), the app will get the last reset information available, rather
> than the one that is looking for. Maybe this can be fixed with a ring_id
> parameter to query the information for a specific ring, but this also requires
> an interface to tell the UMD which ring caused it.
>
> I know that devcoredump is also used for this kind of information, but I believe
> that using an IOCTL is better for interfacing Mesa + Linux rather than parsing
> a file that its contents are subjected to be changed.
>
> André Almeida (1):
>    drm/amdgpu: Add interface to dump guilty IB on GPU hang
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h      |  3 +++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c  |  3 ++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  3 +++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c  |  7 ++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  1 +
>   drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c   | 29 ++++++++++++++++++++++++
>   include/uapi/drm/amdgpu_drm.h            |  7 ++++++
>   7 files changed, 52 insertions(+), 1 deletion(-)
>

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl
Posted by Alex Deucher 2 years, 7 months ago
On Mon, May 1, 2023 at 2:58 PM André Almeida <andrealmeid@igalia.com> wrote:
>
> Currently UMD hasn't much information on what went wrong during a GPU reset. To
> help with that, this patch proposes a new IOCTL that can be used to query
> information about the resources that caused the hang.

If we went with the IOCTL, we'd want to limit this to the guilty process.

>
> The goal of this RFC is to gather feedback about this interface. The mesa part
> can be found at https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22785
>
> The current implementation is racy, meaning that if two resets happens (even on
> different rings), the app will get the last reset information available, rather
> than the one that is looking for. Maybe this can be fixed with a ring_id
> parameter to query the information for a specific ring, but this also requires
> an interface to tell the UMD which ring caused it.

I think you'd want engine type or something like that so mesa knows
how to interpret the IB info.  You could store the most recent info in
the fd priv for the guilty app.  E.g., see what I did for tracking GPU
page fault into:
https://gitlab.freedesktop.org/agd5f/linux/-/commits/gpu_fault_info_ioctl

>
> I know that devcoredump is also used for this kind of information, but I believe
> that using an IOCTL is better for interfacing Mesa + Linux rather than parsing
> a file that its contents are subjected to be changed.

Can you elaborate a bit on that?  Isn't the whole point of devcoredump
to store this sort of information?

Alex


>
> André Almeida (1):
>   drm/amdgpu: Add interface to dump guilty IB on GPU hang
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h      |  3 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c  |  3 ++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  3 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c  |  7 ++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  1 +
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c   | 29 ++++++++++++++++++++++++
>  include/uapi/drm/amdgpu_drm.h            |  7 ++++++
>  7 files changed, 52 insertions(+), 1 deletion(-)
>
> --
> 2.40.1
>
Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl
Posted by André Almeida 2 years, 7 months ago
Em 01/05/2023 16:24, Alex Deucher escreveu:
> On Mon, May 1, 2023 at 2:58 PM André Almeida <andrealmeid@igalia.com> wrote:
>>
>> I know that devcoredump is also used for this kind of information, but I believe
>> that using an IOCTL is better for interfacing Mesa + Linux rather than parsing
>> a file that its contents are subjected to be changed.
> 
> Can you elaborate a bit on that?  Isn't the whole point of devcoredump
> to store this sort of information?
> 

I think that devcoredump is something that you could use to submit to a 
bug report as it is, and then people can read/parse as they want, not as 
an interface to be read by Mesa... I'm not sure that it's something that 
I would call an API. But I might be wrong, if you know something that 
uses that as an API please share.

Anyway, relying on that for Mesa would mean that we would need to ensure 
stability for the file content and format, making it less flexible to 
modify in the future and probe to bugs, while the IOCTL is well defined 
and extensible. Maybe the dump from Mesa + devcoredump could be 
complementary information to a bug report.
Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl
Posted by Christian König 2 years, 7 months ago
Am 02.05.23 um 03:26 schrieb André Almeida:
> Em 01/05/2023 16:24, Alex Deucher escreveu:
>> On Mon, May 1, 2023 at 2:58 PM André Almeida <andrealmeid@igalia.com> 
>> wrote:
>>>
>>> I know that devcoredump is also used for this kind of information, 
>>> but I believe
>>> that using an IOCTL is better for interfacing Mesa + Linux rather 
>>> than parsing
>>> a file that its contents are subjected to be changed.
>>
>> Can you elaborate a bit on that?  Isn't the whole point of devcoredump
>> to store this sort of information?
>>
>
> I think that devcoredump is something that you could use to submit to 
> a bug report as it is, and then people can read/parse as they want, 
> not as an interface to be read by Mesa... I'm not sure that it's 
> something that I would call an API. But I might be wrong, if you know 
> something that uses that as an API please share.
>
> Anyway, relying on that for Mesa would mean that we would need to 
> ensure stability for the file content and format, making it less 
> flexible to modify in the future and probe to bugs, while the IOCTL is 
> well defined and extensible. Maybe the dump from Mesa + devcoredump 
> could be complementary information to a bug report.

Neither using an IOCTL nor devcoredump is a good approach for this since 
the values read from the hw register are completely unreliable. They 
could not be available because of GFXOFF or they could be overwritten or 
not even updated by the CP in the first place because of a hang etc....

If you want to track progress inside an IB what you do instead is to 
insert intermediate fence write commands into the IB. E.g. something 
like write value X to location Y when this executes.

This way you can not only track how far the IB processed, but also in 
which stages of processing we where when the hang occurred. E.g. End of 
Pipe, End of Shaders, specific shader stages etc...

Regards,
Christian.