[PATCH] Revert "drm/sched: Use parent fence instead of finished"

Arvind Yadav posted 1 patch 1 year, 4 months ago
Posted by Arvind Yadav 1 year, 4 months ago
This reverts commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86.

    This is causing instability on Linus' desktop, and system hangs were
    observed when running MesaGL benchmarks or VK CTS runs.

    netconsole got me the following oops:
    [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
    [ 1234.778782] #PF: supervisor read access in kernel mode
    [ 1234.778787] #PF: error_code(0x0000) - not-present page
    [ 1234.778791] PGD 0 P4D 0
    [ 1234.778798] Oops: 0000 [#1] PREEMPT SMP NOPTI
    [ 1234.778803] CPU: 7 PID: 805 Comm: systemd-journal Not tainted 6.0.0+ #2
    [ 1234.778809] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 5603 07/28/2020
    [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
    [ 1234.778828] Code: aa 0f 1d ce e9 57 ff ff ff 48 89 d7 e8 9d 8f 3f ce e9 4a ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53 48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 48 8b 85 80 01 00 00 f0
    [ 1234.778834] RSP: 0000:ffffabe680380de0 EFLAGS: 00010087
    [ 1234.778839] RAX: ffffffffc04e9230 RBX: 0000000000000000 RCX: 0000000000000018
    [ 1234.778897] RDX: 00000ba278e8977a RSI: ffff953fb288b460 RDI: 0000000000000000
    [ 1234.778901] RBP: ffff953fb288b598 R08: 00000000000000e0 R09: ffff953fbd98b808
    [ 1234.778905] R10: 0000000000000000 R11: ffffabe680380ff8 R12: ffffabe680380e00
    [ 1234.778908] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff953fbd9ec458
    [ 1234.778912] FS:  00007f35e7008580(0000) GS:ffff95428ebc0000(0000) knlGS:0000000000000000
    [ 1234.778916] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1234.778919] CR2: 0000000000000088 CR3: 000000010147c000 CR4: 00000000003506e0
    [ 1234.778924] Call Trace:
    [ 1234.778981]  <IRQ>
    [ 1234.778989]  dma_fence_signal_timestamp_locked+0x6a/0xe0
    [ 1234.778999]  dma_fence_signal+0x2c/0x50
    [ 1234.779005]  amdgpu_fence_process+0xc8/0x140 [amdgpu]
    [ 1234.779234]  sdma_v3_0_process_trap_irq+0x70/0x80 [amdgpu]
    [ 1234.779395]  amdgpu_irq_dispatch+0xa9/0x1d0 [amdgpu]
    [ 1234.779609]  amdgpu_ih_process+0x80/0x100 [amdgpu]
    [ 1234.779783]  amdgpu_irq_handler+0x1f/0x60 [amdgpu]
    [ 1234.779940]  __handle_irq_event_percpu+0x46/0x190
    [ 1234.779946]  handle_irq_event+0x34/0x70
    [ 1234.779949]  handle_edge_irq+0x9f/0x240
    [ 1234.779954]  __common_interrupt+0x66/0x100
    [ 1234.779960]  common_interrupt+0xa0/0xc0
    [ 1234.779965]  </IRQ>
    [ 1234.779968]  <TASK>
    [ 1234.779971]  asm_common_interrupt+0x22/0x40
    [ 1234.779976] RIP: 0010:finish_mkwrite_fault+0x22/0x110
    [ 1234.779981] Code: 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 53 48 8b 07 f6 40 50 08 0f 84 eb 00 00 00 48 8b 45 30 48 8b 18 <48> 89 df e8 66 bd ff ff 48 85 c0 74 0d 48 89 c2 83 e2 01 48 83 ea
    [ 1234.779985] RSP: 0000:ffffabe680bcfd78 EFLAGS: 00000202

    Revert it for now and figure it out later.

Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 820c0c5544e1..ea7bfa99d6c9 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -790,7 +790,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
 	job = list_first_entry_or_null(&sched->pending_list,
 				       struct drm_sched_job, list);
 
-	if (job && dma_fence_is_signaled(job->s_fence->parent)) {
+	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
 		/* remove job from pending_list */
 		list_del_init(&job->list);
 
@@ -802,7 +802,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
 
 		if (next) {
 			next->s_fence->scheduled.timestamp =
-				job->s_fence->parent->timestamp;
+				job->s_fence->finished.timestamp;
 			/* start TO timer for next job */
 			drm_sched_start_timeout(sched);
 		}
-- 
2.25.1
Re: [PATCH] Revert "drm/sched: Use parent fence instead of finished"
Posted by Rob Clark 1 year ago
On Fri, Dec 2, 2022 at 9:24 AM Arvind Yadav <Arvind.Yadav@amd.com> wrote:
>
> This reverts commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86.
>
>     This is causing instability on Linus' desktop, and Observed System
>     hung  when running MesaGL benchmark or VK CTS runs.
>
>     netconsole got me the following oops:
>     [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
>     [ 1234.778782] #PF: supervisor read access in kernel mode
>     [ 1234.778787] #PF: error_code(0x0000) - not-present page
>     [ 1234.778791] PGD 0 P4D 0
>     [ 1234.778798] Oops: 0000 [#1] PREEMPT SMP NOPTI
>     [ 1234.778803] CPU: 7 PID: 805 Comm: systemd-journal Not tainted 6.0.0+ #2
>     [ 1234.778809] Hardware name: System manufacturer System Product
>     Name/PRIME X370-PRO, BIOS 5603 07/28/2020
>     [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
>     [ 1234.778828] Code: aa 0f 1d ce e9 57 ff ff ff 48 89 d7 e8 9d 8f 3f
>     ce e9 4a ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53
>     48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 48 8b 85 80 01 00
>     00 f0
>     [ 1234.778834] RSP: 0000:ffffabe680380de0 EFLAGS: 00010087
>     [ 1234.778839] RAX: ffffffffc04e9230 RBX: 0000000000000000 RCX: 0000000000000018
>     [ 1234.778897] RDX: 00000ba278e8977a RSI: ffff953fb288b460 RDI: 0000000000000000
>     [ 1234.778901] RBP: ffff953fb288b598 R08: 00000000000000e0 R09: ffff953fbd98b808
>     [ 1234.778905] R10: 0000000000000000 R11: ffffabe680380ff8 R12: ffffabe680380e00
>     [ 1234.778908] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff953fbd9ec458
>     [ 1234.778912] FS:  00007f35e7008580(0000) GS:ffff95428ebc0000(0000)
>     knlGS:0000000000000000
>     [ 1234.778916] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>     [ 1234.778919] CR2: 0000000000000088 CR3: 000000010147c000 CR4: 00000000003506e0
>     [ 1234.778924] Call Trace:
>     [ 1234.778981]  <IRQ>
>     [ 1234.778989]  dma_fence_signal_timestamp_locked+0x6a/0xe0
>     [ 1234.778999]  dma_fence_signal+0x2c/0x50
>     [ 1234.779005]  amdgpu_fence_process+0xc8/0x140 [amdgpu]
>     [ 1234.779234]  sdma_v3_0_process_trap_irq+0x70/0x80 [amdgpu]
>     [ 1234.779395]  amdgpu_irq_dispatch+0xa9/0x1d0 [amdgpu]
>     [ 1234.779609]  amdgpu_ih_process+0x80/0x100 [amdgpu]
>     [ 1234.779783]  amdgpu_irq_handler+0x1f/0x60 [amdgpu]
>     [ 1234.779940]  __handle_irq_event_percpu+0x46/0x190
>     [ 1234.779946]  handle_irq_event+0x34/0x70
>     [ 1234.779949]  handle_edge_irq+0x9f/0x240
>     [ 1234.779954]  __common_interrupt+0x66/0x100
>     [ 1234.779960]  common_interrupt+0xa0/0xc0
>     [ 1234.779965]  </IRQ>
>     [ 1234.779968]  <TASK>
>     [ 1234.779971]  asm_common_interrupt+0x22/0x40
>     [ 1234.779976] RIP: 0010:finish_mkwrite_fault+0x22/0x110
>     [ 1234.779981] Code: 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 41
>     54 55 48 89 fd 53 48 8b 07 f6 40 50 08 0f 84 eb 00 00 00 48 8b 45 30
>     48 8b 18 <48> 89 df e8 66 bd ff ff 48 85 c0 74 0d 48 89 c2 83 e2 01 48
>     83 ea
>     [ 1234.779985] RSP: 0000:ffffabe680bcfd78 EFLAGS: 00000202
>
>     Revert it for now and figure it out later.

Just fwiw, the issue here is a race between sched_main observing that
the hw fence is signaled and doing job_cleanup, and the driver retiring
the job.  I don't think there is a sane way to use the parent fence
without hitting this race condition, so the "figure it out later" is
"don't do that" ;-)

BR,
-R

> Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 820c0c5544e1..ea7bfa99d6c9 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -790,7 +790,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>         job = list_first_entry_or_null(&sched->pending_list,
>                                        struct drm_sched_job, list);
>
> -       if (job && dma_fence_is_signaled(job->s_fence->parent)) {
> +       if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
>                 /* remove job from pending_list */
>                 list_del_init(&job->list);
>
> @@ -802,7 +802,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>
>                 if (next) {
>                         next->s_fence->scheduled.timestamp =
> -                               job->s_fence->parent->timestamp;
> +                               job->s_fence->finished.timestamp;
>                         /* start TO timer for next job */
>                         drm_sched_start_timeout(sched);
>                 }
> --
> 2.25.1
>
Re: [PATCH] Revert "drm/sched: Use parent fence instead of finished"
Posted by Christian König 1 year, 4 months ago
On 02.12.22 at 18:23, Arvind Yadav wrote:
> This reverts commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86.
>
>      This is causing instability on Linus' desktop, and Observed System
>      hung  when running MesaGL benchmark or VK CTS runs.
>
>      netconsole got me the following oops:
>      [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
>      [ 1234.778782] #PF: supervisor read access in kernel mode
>      [ 1234.778787] #PF: error_code(0x0000) - not-present page
>      [ 1234.778791] PGD 0 P4D 0
>      [ 1234.778798] Oops: 0000 [#1] PREEMPT SMP NOPTI
>      [ 1234.778803] CPU: 7 PID: 805 Comm: systemd-journal Not tainted 6.0.0+ #2
>      [ 1234.778809] Hardware name: System manufacturer System Product
>      Name/PRIME X370-PRO, BIOS 5603 07/28/2020
>      [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
>      [ 1234.778828] Code: aa 0f 1d ce e9 57 ff ff ff 48 89 d7 e8 9d 8f 3f
>      ce e9 4a ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53
>      48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 48 8b 85 80 01 00
>      00 f0
>      [ 1234.778834] RSP: 0000:ffffabe680380de0 EFLAGS: 00010087
>      [ 1234.778839] RAX: ffffffffc04e9230 RBX: 0000000000000000 RCX: 0000000000000018
>      [ 1234.778897] RDX: 00000ba278e8977a RSI: ffff953fb288b460 RDI: 0000000000000000
>      [ 1234.778901] RBP: ffff953fb288b598 R08: 00000000000000e0 R09: ffff953fbd98b808
>      [ 1234.778905] R10: 0000000000000000 R11: ffffabe680380ff8 R12: ffffabe680380e00
>      [ 1234.778908] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff953fbd9ec458
>      [ 1234.778912] FS:  00007f35e7008580(0000) GS:ffff95428ebc0000(0000)
>      knlGS:0000000000000000
>      [ 1234.778916] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>      [ 1234.778919] CR2: 0000000000000088 CR3: 000000010147c000 CR4: 00000000003506e0
>      [ 1234.778924] Call Trace:
>      [ 1234.778981]  <IRQ>
>      [ 1234.778989]  dma_fence_signal_timestamp_locked+0x6a/0xe0
>      [ 1234.778999]  dma_fence_signal+0x2c/0x50
>      [ 1234.779005]  amdgpu_fence_process+0xc8/0x140 [amdgpu]
>      [ 1234.779234]  sdma_v3_0_process_trap_irq+0x70/0x80 [amdgpu]
>      [ 1234.779395]  amdgpu_irq_dispatch+0xa9/0x1d0 [amdgpu]
>      [ 1234.779609]  amdgpu_ih_process+0x80/0x100 [amdgpu]
>      [ 1234.779783]  amdgpu_irq_handler+0x1f/0x60 [amdgpu]
>      [ 1234.779940]  __handle_irq_event_percpu+0x46/0x190
>      [ 1234.779946]  handle_irq_event+0x34/0x70
>      [ 1234.779949]  handle_edge_irq+0x9f/0x240
>      [ 1234.779954]  __common_interrupt+0x66/0x100
>      [ 1234.779960]  common_interrupt+0xa0/0xc0
>      [ 1234.779965]  </IRQ>
>      [ 1234.779968]  <TASK>
>      [ 1234.779971]  asm_common_interrupt+0x22/0x40
>      [ 1234.779976] RIP: 0010:finish_mkwrite_fault+0x22/0x110
>      [ 1234.779981] Code: 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 41
>      54 55 48 89 fd 53 48 8b 07 f6 40 50 08 0f 84 eb 00 00 00 48 8b 45 30
>      48 8b 18 <48> 89 df e8 66 bd ff ff 48 85 c0 74 0d 48 89 c2 83 e2 01 48
>      83 ea
>      [ 1234.779985] RSP: 0000:ffffabe680bcfd78 EFLAGS: 00000202
>
>      Revert it for now and figure it out later.
>
> Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 820c0c5544e1..ea7bfa99d6c9 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -790,7 +790,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>   	job = list_first_entry_or_null(&sched->pending_list,
>   				       struct drm_sched_job, list);
>   
> -	if (job && dma_fence_is_signaled(job->s_fence->parent)) {
> +	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
>   		/* remove job from pending_list */
>   		list_del_init(&job->list);
>   
> @@ -802,7 +802,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>   
>   		if (next) {
>   			next->s_fence->scheduled.timestamp =
> -				job->s_fence->parent->timestamp;
> +				job->s_fence->finished.timestamp;
>   			/* start TO timer for next job */
>   			drm_sched_start_timeout(sched);
>   		}