[PATCH AUTOSEL 7.0] drm/amdgpu: remove deadlocks from amdgpu_userq_pre_reset

Sasha Levin posted 1 patch 4 days, 11 hours ago
drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 26 +++++++++++------------
1 file changed, 12 insertions(+), 14 deletions(-)
[PATCH AUTOSEL 7.0] drm/amdgpu: remove deadlocks from amdgpu_userq_pre_reset
Posted by Sasha Levin 4 days, 11 hours ago
From: Christian König <christian.koenig@amd.com>

[ Upstream commit d5971c5c34303a00bf841a902ca00a703602c500 ]

The purpose of a GPU reset is to make sure that fence can be signaled
again and the signal and resume workers can make progress again.

So waiting for the resume worker or any fence in the GPU reset path is
just utterly nonsense.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit fcd5f065eab46993af43442fd77ee8d9eb9c5bdf)
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Decision Summary
This should be backported for stable trees that contain
`amdgpu_userq_pre_reset()`. It removes two blocking waits from the GPU
reset path: `cancel_delayed_work_sync(&uqm->resume_work)` and
`amdgpu_userq_wait_for_last_fence()`, where the latter waits on
`dma_fence_wait_timeout(..., MAX_SCHEDULE_TIMEOUT)`. The reset path is
precisely what is supposed to let those fences/workers make progress, so
waiting there can deadlock GPU recovery.

The patch is one-function, one-file, no new API, and `git apply --check`
succeeds against this `7.0.y` checkout. Main concern: the review thread
confirms broader queue-state locking concerns remain, but those are pre-
existing and the accepted patch was explicitly scoped to removing the
obvious deadlock.

## Phase Walkthrough
Phase 1:
- Record 1.1: subsystem `drm/amdgpu`; action `remove`; intent is to
  remove deadlocks from `amdgpu_userq_pre_reset()`.
- Record 1.2: tags present: `Signed-off-by: Christian König`, `Reviewed-
  by: Prike Liang`, `Signed-off-by: Alex Deucher`; no `Fixes:`,
  `Reported-by:`, `Tested-by:`, `Acked-by:`, `Link:`, or `Cc: stable`.
- Record 1.3: commit body says GPU reset exists to allow fences and
  signal/resume workers to progress; waiting for resume worker or fences
  inside reset is wrong. Failure mode: reset can hang/deadlock.
- Record 1.4: not hidden; it is explicitly a deadlock fix.

Phase 2:
- Record 2.1: one file, `drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c`,
  `+12/-14`; only `amdgpu_userq_pre_reset()` changed; single-file
  surgical fix.
- Record 2.2: before, reset path canceled `resume_work` synchronously
  and waited for `last_fence` before unmapping mapped queues. After, it
  skips non-mapped queues and directly unmaps/marks hung/force-completes
  mapped queues.
- Record 2.3: bug category is deadlock/hang. Mechanism: reset path
  waited on work/fences that may require reset progress to complete.
- Record 2.4: fix is minimal. Regression risk is low-to-medium because
  queue-state locking is still called out by the TODO and review
  discussion, but no new API or broad behavior is added.

Phase 3:
- Record 3.1: `git blame` shows the reset function came from
  `290f46cf57265` (`drm/amdgpu: Implement user queue reset
  functionality`); current textual lines were later touched by
  `473f12f820956` and `dc21e39fd20c7` helper/signature cleanups. `git
  describe --contains 290f46cf...` points to `v6.19-rc1`, so older
  stable trees before v6.19 likely do not contain this exact code.
- Record 3.2: no `Fixes:` tag, so no Fixes target to follow.
- Record 3.3: recent file history shows active userq work and related
  reset/fence changes; this patch applies standalone to current `7.0.y`.
- Record 3.4: `MAINTAINERS` lists Christian König and Alex Deucher as
  AMDGPU DRM maintainers; author/committer are subsystem maintainers.
- Record 3.5: no compile dependency found; patch is part of an 11-patch
  series but applies cleanly by itself.

Phase 4:
- Record 4.1: `b4 dig` failed because the commit object is absent
  locally. Fallback found the original patch at
  `lists.freedesktop.org/archives/amd-gfx/2026-April/143199.html`;
  committed version matches the patch content with `Reviewed-by` added.
- Record 4.2: `b4 -w` failed; lore thread shows Alex Deucher, Prike
  Liang, Sunil Khatri, and `amd-gfx` were involved.
- Record 4.3: no external bug report or syzbot link found.
- Record 4.4: series context exists (`[PATCH 02/11]`), with later
  userq/fence/reset cleanups, but no hard prerequisite was identified
  for this specific deadlock removal.
- Record 4.5: web search did not find stable-list discussion for this
  exact patch.

Phase 5:
- Record 5.1: modified function: `amdgpu_userq_pre_reset()`.
- Record 5.2: callers: local search shows only
  `amdgpu_device_halt_activities()`, called from
  `amdgpu_device_gpu_recover()` and PCI error handling.
- Record 5.3: affected callees include `userq_funcs->unmap()` and
  `amdgpu_userq_fence_driver_force_completion()`; removed callees
  include `cancel_delayed_work_sync()` and
  `amdgpu_userq_wait_for_last_fence()`.
- Record 5.4: reachable through AMDGPU recovery paths, including job
  timeout/GPU recovery paths; exact unprivileged trigger policy was not
  fully verified.
- Record 5.5: similar wait/cancel patterns remain in non-reset paths
  such as destroy/suspend/eviction, but the reset-path instance is the
  one fixed here.

Phase 6:
- Record 6.1: buggy code exists in this `7.0.y` tree and appears
  introduced for `v6.19-rc1`; likely relevant to v6.19+ stable trees
  containing userq reset support.
- Record 6.2: `git apply --check` succeeds against this checkout, so
  current `7.0.y` backport difficulty is clean.
- Record 6.3: no exact related stable replacement fix found locally or
  in web stable search.

Phase 7:
- Record 7.1: subsystem is AMDGPU DRM driver; criticality is IMPORTANT,
  driver-specific but for widely used GPU hardware.
- Record 7.2: subsystem/file is active; recent history contains many
  userq reset/fence fixes and refactors.

Phase 8:
- Record 8.1: affected users are AMDGPU users with user queues enabled
  who hit GPU reset/recovery.
- Record 8.2: trigger is GPU reset while user queues have pending resume
  work or unsignaled fences; commonality depends on userq usage and GPU
  hangs.
- Record 8.3: failure mode is GPU reset deadlock/hang; severity HIGH to
  CRITICAL because recovery can stall.
- Record 8.4: benefit is high for affected systems; risk is low-to-
  medium due small scope but reset-path concurrency sensitivity. Ratio
  favors backport.

Phase 9:
- Record 9.1 FOR: fixes deadlock, reset-path hang, small one-function
  diff, maintainer-authored, reviewed, applies cleanly. AGAINST: no
  reported user bug, no stable tag, part of a larger userq series,
  residual queue-state locking TODO. UNRESOLVED: no runtime
  reproduction/build, exact applicability to every active stable branch
  not verified.
- Record 9.2 stable rules: obviously correct enough by code and
  maintainer review: yes; fixes real bug: yes, deadlock; important
  issue: yes, recovery hang; small/contained: yes; no new API/features:
  yes; can apply to current stable tree: yes.
- Record 9.3 exception category: none.
- Record 9.4 decision: backport to stable trees containing this userq
  reset code.

## Verification
- [Phase 1] Parsed GitHub patch and lore message: confirmed subject,
  body, tags, and lack of `Fixes`/`Cc stable`.
- [Phase 2] Compared diff: confirmed one-file `+12/-14` change removing
  `cancel_delayed_work_sync()` and `amdgpu_userq_wait_for_last_fence()`.
- [Phase 3] `git blame`: confirmed reset code origin/refactor history
  around `amdgpu_userq_pre_reset()`.
- [Phase 3] `git describe --contains`: confirmed introduction is in the
  v6.19 era.
- [Phase 3] `MAINTAINERS`: confirmed Christian König and Alex Deucher
  maintain AMDGPU DRM.
- [Phase 4] `b4 dig -c/-a/-w`: failed because commit object is not in
  this checkout.
- [Phase 4] WebFetch GitHub patch and freedesktop lore: confirmed
  original patch, review discussion, and `Reviewed-by: Prike Liang`.
- [Phase 5] `rg` and file reads: confirmed caller path through
  `amdgpu_device_halt_activities()` and `amdgpu_device_gpu_recover()`.
- [Phase 6] `git apply --check`: confirmed clean application to current
  `7.0.y`.
- UNVERIFIED: no kernel build or runtime reproduction was performed;
  exact applicability to all non-`7.0.y` stable trees was not fully
  checked.

**YES**

 drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 26 +++++++++++------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
index 0a1b93259887a..6f49ba3bd0394 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
@@ -1583,23 +1583,21 @@ void amdgpu_userq_pre_reset(struct amdgpu_device *adev)
 {
 	const struct amdgpu_userq_funcs *userq_funcs;
 	struct amdgpu_usermode_queue *queue;
-	struct amdgpu_userq_mgr *uqm;
 	unsigned long queue_id;
 
+	/* TODO: We probably need a new lock for the queue state */
 	xa_for_each(&adev->userq_doorbell_xa, queue_id, queue) {
-		uqm = queue->userq_mgr;
-		cancel_delayed_work_sync(&uqm->resume_work);
-		if (queue->state == AMDGPU_USERQ_STATE_MAPPED) {
-			amdgpu_userq_wait_for_last_fence(queue);
-			userq_funcs = adev->userq_funcs[queue->queue_type];
-			userq_funcs->unmap(queue);
-			/* just mark all queues as hung at this point.
-			 * if unmap succeeds, we could map again
-			 * in amdgpu_userq_post_reset() if vram is not lost
-			 */
-			queue->state = AMDGPU_USERQ_STATE_HUNG;
-			amdgpu_userq_fence_driver_force_completion(queue);
-		}
+		if (queue->state != AMDGPU_USERQ_STATE_MAPPED)
+			continue;
+
+		userq_funcs = adev->userq_funcs[queue->queue_type];
+		userq_funcs->unmap(queue);
+		/* just mark all queues as hung at this point.
+		 * if unmap succeeds, we could map again
+		 * in amdgpu_userq_post_reset() if vram is not lost
+		 */
+		queue->state = AMDGPU_USERQ_STATE_HUNG;
+		amdgpu_userq_fence_driver_force_completion(queue);
 	}
 }
 
-- 
2.53.0