[PATCH] drm/amdgpu: add ring buffer information in devcoredump

Sunil Khatri posted 1 patch 1 year, 11 months ago
There is a newer version of this series
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
[PATCH] drm/amdgpu: add ring buffer information in devcoredump
Posted by Sunil Khatri 1 year, 11 months ago
Add relevant ringbuffer information such as
rptr, wptr, ring name, ring size and also
the ring contents for each ring on a gpu reset.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 6d059f853adc..1992760039da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -215,6 +215,27 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
 			   fault_info->status);
 	}
 
+	drm_printf(&p, "Ring buffer information\n");
+	for (int i = 0; i < coredump->adev->num_rings; i++) {
+		int j = 0;
+		struct amdgpu_ring *ring = coredump->adev->rings[i];
+
+		drm_printf(&p, "ring name: %s\n", ring->name);
+		drm_printf(&p, "Rptr: 0x%llx Wptr: 0x%llx\n",
+			   amdgpu_ring_get_rptr(ring) & ring->buf_mask,
+			   amdgpu_ring_get_wptr(ring) & ring->buf_mask);
+		drm_printf(&p, "Ring size in dwords: %d\n",
+			   ring->ring_size / 4);
+		drm_printf(&p, "Ring contents\n");
+		drm_printf(&p, "Offset \t Value\n");
+
+		while (j < ring->ring_size) {
+			drm_printf(&p, "0x%x \t 0x%x\n", j, ring->ring[j/4]);
+			j += 4;
+		}
+		drm_printf(&p, "Ring dumped\n");
+	}
+
 	if (coredump->reset_vram_lost)
 		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
 	if (coredump->adev->reset_info.num_regs) {
-- 
2.34.1
Re: [PATCH] drm/amdgpu: add ring buffer information in devcoredump
Posted by Christian König 1 year, 11 months ago

Am 11.03.24 um 13:22 schrieb Sunil Khatri:
> Add relevant ringbuffer information such as
> rptr, wptr, ring name, ring size and also
> the ring contents for each ring on a gpu reset.
>
> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 21 +++++++++++++++++++++
>   1 file changed, 21 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> index 6d059f853adc..1992760039da 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> @@ -215,6 +215,27 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
>   			   fault_info->status);
>   	}
>   
> +	drm_printf(&p, "Ring buffer information\n");
> +	for (int i = 0; i < coredump->adev->num_rings; i++) {
> +		int j = 0;
> +		struct amdgpu_ring *ring = coredump->adev->rings[i];
> +
> +		drm_printf(&p, "ring name: %s\n", ring->name);
> +		drm_printf(&p, "Rptr: 0x%llx Wptr: 0x%llx\n",
> +			   amdgpu_ring_get_rptr(ring) & ring->buf_mask,
> +			   amdgpu_ring_get_wptr(ring) & ring->buf_mask);

Don't apply the mask here. We do have some use cases where the rptr and 
wptr are outside the ring buffer.

> +		drm_printf(&p, "Ring size in dwords: %d\n",
> +			   ring->ring_size / 4);

Rather print the mask as additional value here.

> +		drm_printf(&p, "Ring contents\n");
> +		drm_printf(&p, "Offset \t Value\n");
> +
> +		while (j < ring->ring_size) {
> +			drm_printf(&p, "0x%x \t 0x%x\n", j, ring->ring[j/4]);
> +			j += 4;
> +		}

> +		drm_printf(&p, "Ring dumped\n");

That seems superfluous.

Regards,
Christian.

> +	}
> +
>   	if (coredump->reset_vram_lost)
>   		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>   	if (coredump->adev->reset_info.num_regs) {
Re: [PATCH] drm/amdgpu: add ring buffer information in devcoredump
Posted by Khatri, Sunil 1 year, 11 months ago
On 3/11/2024 7:29 PM, Christian König wrote:
>
>
> Am 11.03.24 um 13:22 schrieb Sunil Khatri:
>> Add relevant ringbuffer information such as
>> rptr, wptr, ring name, ring size and also
>> the ring contents for each ring on a gpu reset.
>>
>> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 21 +++++++++++++++++++++
>>   1 file changed, 21 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> index 6d059f853adc..1992760039da 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> @@ -215,6 +215,27 @@ amdgpu_devcoredump_read(char *buffer, loff_t 
>> offset, size_t count,
>>                  fault_info->status);
>>       }
>>   +    drm_printf(&p, "Ring buffer information\n");
>> +    for (int i = 0; i < coredump->adev->num_rings; i++) {
>> +        int j = 0;
>> +        struct amdgpu_ring *ring = coredump->adev->rings[i];
>> +
>> +        drm_printf(&p, "ring name: %s\n", ring->name);
>> +        drm_printf(&p, "Rptr: 0x%llx Wptr: 0x%llx\n",
>> +               amdgpu_ring_get_rptr(ring) & ring->buf_mask,
>> +               amdgpu_ring_get_wptr(ring) & ring->buf_mask);
>
> Don't apply the mask here. We do have some use cases where the rptr 
> and wptr are outside the ring buffer.
Sure i will remove the mask.
>
>> +        drm_printf(&p, "Ring size in dwords: %d\n",
>> +               ring->ring_size / 4);
>
> Rather print the mask as additional value here.
Does that help adding the mask value ?
>
>> +        drm_printf(&p, "Ring contents\n");
>> +        drm_printf(&p, "Offset \t Value\n");
>> +
>> +        while (j < ring->ring_size) {
>> +            drm_printf(&p, "0x%x \t 0x%x\n", j, ring->ring[j/4]);
>> +            j += 4;
>> +        }
>
>> +        drm_printf(&p, "Ring dumped\n");
>
> That seems superfluous.

Noted


Regards
Sunil

>
> Regards,
> Christian.
>
>> +    }
>> +
>>       if (coredump->reset_vram_lost)
>>           drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>>       if (coredump->adev->reset_info.num_regs) {
>
Re: [PATCH] drm/amdgpu: add ring buffer information in devcoredump
Posted by Christian König 1 year, 11 months ago
Am 11.03.24 um 15:48 schrieb Khatri, Sunil:
>
> On 3/11/2024 7:29 PM, Christian König wrote:
>>
>>
>> Am 11.03.24 um 13:22 schrieb Sunil Khatri:
>>> Add relevant ringbuffer information such as
>>> rptr, wptr, ring name, ring size and also
>>> the ring contents for each ring on a gpu reset.
>>>
>>> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 21 +++++++++++++++++++++
>>>   1 file changed, 21 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> index 6d059f853adc..1992760039da 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> @@ -215,6 +215,27 @@ amdgpu_devcoredump_read(char *buffer, loff_t 
>>> offset, size_t count,
>>>                  fault_info->status);
>>>       }
>>>   +    drm_printf(&p, "Ring buffer information\n");
>>> +    for (int i = 0; i < coredump->adev->num_rings; i++) {
>>> +        int j = 0;
>>> +        struct amdgpu_ring *ring = coredump->adev->rings[i];
>>> +
>>> +        drm_printf(&p, "ring name: %s\n", ring->name);
>>> +        drm_printf(&p, "Rptr: 0x%llx Wptr: 0x%llx\n",
>>> +               amdgpu_ring_get_rptr(ring) & ring->buf_mask,
>>> +               amdgpu_ring_get_wptr(ring) & ring->buf_mask);
>>
>> Don't apply the mask here. We do have some use cases where the rptr 
>> and wptr are outside the ring buffer.
> Sure i will remove the mask.
>>
>>> +        drm_printf(&p, "Ring size in dwords: %d\n",
>>> +               ring->ring_size / 4);
>>
>> Rather print the mask as additional value here.
> Does that help adding the mask value ?

I think it should help as a reminder that rptr & wptr needs to be masked 
to become valid indexes.

Some hw generations have really crude workarounds where we have to 
allocate an extra page after the ring buffer because the hw is buddy and 
sometimes tries to read command from there as well.

So when we see a hang with some rptr and wptr values which don't fit 
into the mask we will know that the hw has another issue in that direction.

Regards,
Christian.

>>
>>> +        drm_printf(&p, "Ring contents\n");
>>> +        drm_printf(&p, "Offset \t Value\n");
>>> +
>>> +        while (j < ring->ring_size) {
>>> +            drm_printf(&p, "0x%x \t 0x%x\n", j, ring->ring[j/4]);
>>> +            j += 4;
>>> +        }
>>
>>> +        drm_printf(&p, "Ring dumped\n");
>>
>> That seems superfluous.
>
> Noted
>
>
> Regards
> Sunil
>
>>
>> Regards,
>> Christian.
>>
>>> +    }
>>> +
>>>       if (coredump->reset_vram_lost)
>>>           drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>>>       if (coredump->adev->reset_info.num_regs) {
>>