From: Philipp Reisner
To: dri-devel@lists.freedesktop.org
Cc: linux-kernel@vger.kernel.org, Christian König, Nirmoy Das, Simona Vetter, Philipp Reisner
Subject: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume
Date: Tue, 7 Jan 2025 15:02:40 +0100
Message-ID: <20250107140240.325899-1-philipp.reisner@linbit.com>
X-Mailer: git-send-email 2.47.1

The following OOPS plagues me on about every 10th suspend and resume:

[160640.791304] BUG: kernel NULL pointer dereference, address: 0000000000000008
[160640.791309] #PF: supervisor read access in kernel mode
[160640.791311] #PF: error_code(0x0000) - not-present page
[160640.791313] PGD 0 P4D 0
[160640.791316] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[160640.791320] CPU: 12 UID: 1001 PID: 648526 Comm: kscreenloc:cs0 Tainted: G OE 6.11.7-300.fc41.x86_64 #1
[160640.791324] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[160640.791325] Hardware name: Micro-Star International Co., Ltd. MS-7A38/B450M PRO-VDH MAX (MS-7A38), BIOS B.B0 02/03/2021
[160640.791327] RIP: 0010:drm_sched_job_arm+0x23/0x60 [gpu_sched]
[160640.791337] Code: 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 55 53 48 8b 6f 60 48 85 ed 74 3f 48 89 fb 48 89 ef e8 31 39 00 00 48 8b 45 10 <48> 8b 50 08 48 89 53 18 8b 45 24 89 43 5c b8 01 00 00 00 f0 48 0f
[160640.791340] RSP: 0018:ffffb2ef5e6cb9b8 EFLAGS: 00010206
[160640.791342] RAX: 0000000000000000 RBX: ffff9d804cc62800 RCX: ffff9d784020f0d0
[160640.791344] RDX: 0000000000000000 RSI: ffff9d784d3b9cd0 RDI: ffff9d784020f638
[160640.791345] RBP: ffff9d784020f610 R08: ffff9d78414e4268 R09: 2072656c75646568
[160640.791346] R10: 686373205d6d7264 R11: 632072656c756465 R12: 0000000000000000
[160640.791347] R13: 0000000000000001 R14: ffffb2ef5e6cba38 R15: 0000000000000000
[160640.791349] FS: 00007f8f30aca6c0(0000) GS:ffff9d873ec00000(0000) knlGS:0000000000000000
[160640.791351] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[160640.791352] CR2: 0000000000000008 CR3: 000000069de82000 CR4: 0000000000350ef0
[160640.791354] Call Trace:
[160640.791357] <TASK>
[160640.791360] ? __die_body.cold+0x19/0x27
[160640.791367] ? page_fault_oops+0x15a/0x2f0
[160640.791372] ? exc_page_fault+0x7e/0x180
[160640.791376] ? asm_exc_page_fault+0x26/0x30
[160640.791380] ? drm_sched_job_arm+0x23/0x60 [gpu_sched]
[160640.791384] ? drm_sched_job_arm+0x1f/0x60 [gpu_sched]
[160640.791390] amdgpu_cs_ioctl+0x170c/0x1e40 [amdgpu]
[160640.792011] ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
[160640.792341] drm_ioctl_kernel+0xb0/0x100
[160640.792346] drm_ioctl+0x28b/0x540
[160640.792349] ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
[160640.792673] amdgpu_drm_ioctl+0x4e/0x90 [amdgpu]
[160640.792994] __x64_sys_ioctl+0x94/0xd0
[160640.792999] do_syscall_64+0x82/0x160
[160640.793006] ? __count_memcg_events+0x75/0x130
[160640.793009] ? count_memcg_events.constprop.0+0x1a/0x30
[160640.793014] ? handle_mm_fault+0x21b/0x330
[160640.793016] ? do_user_addr_fault+0x55a/0x7b0
[160640.793020] ? exc_page_fault+0x7e/0x180
[160640.793023] entry_SYSCALL_64_after_hwframe+0x76/0x7e

The OOPS happens because the rq member of entity is NULL in
drm_sched_job_arm() after the call to drm_sched_entity_select_rq().

In drm_sched_entity_select_rq(), the code considers that
drm_sched_pick_best() might return a NULL value. When it does, the code
assigns NULL to entity->rq even if it had a non-NULL value before.

drm_sched_job_arm() does not deal with entities having a rq of NULL.

Fix this by leaving the entity on the engine it was on instead of
assigning NULL to its run queue member.

Link: https://retrace.fedoraproject.org/faf/reports/1038619/
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3746
Signed-off-by: Philipp Reisner
---
 drivers/gpu/drm/scheduler/sched_entity.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index a75eede8bf8d..495bc087588b 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -557,10 +557,12 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
 
 	spin_lock(&entity->rq_lock);
 	sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
-	rq = sched ? sched->sched_rq[entity->priority] : NULL;
-	if (rq != entity->rq) {
-		drm_sched_rq_remove_entity(entity->rq, entity);
-		entity->rq = rq;
+	if (sched) {
+		rq = sched->sched_rq[entity->priority];
+		if (rq != entity->rq) {
+			drm_sched_rq_remove_entity(entity->rq, entity);
+			entity->rq = rq;
+		}
 	}
 	spin_unlock(&entity->rq_lock);
 
-- 
2.47.1
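
For readers without the kernel tree at hand, the difference between the old and new selection logic can be modelled in userspace. This is only a rough sketch: every mock_* type, both select_rq_* functions and rq_survives_null_pick() are hypothetical stand-ins written for illustration, not the real gpu_sched definitions. Locking and the removal of the entity from its previous run queue are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the scheduler structures. */
struct mock_rq { int id; };
struct mock_sched { struct mock_rq *sched_rq[1]; };
struct mock_entity {
	struct mock_rq *rq;        /* run queue the entity currently sits on */
	struct mock_sched *picked; /* what the mocked pick_best will return */
};

/* Mock of the pick-best step: may return NULL, as described in the patch. */
struct mock_sched *mock_pick_best(const struct mock_entity *e)
{
	return e->picked;
}

/* Pre-patch logic: a NULL pick overwrites a valid entity->rq with NULL. */
void select_rq_old(struct mock_entity *e)
{
	struct mock_sched *sched = mock_pick_best(e);
	struct mock_rq *rq = sched ? sched->sched_rq[0] : NULL;

	if (rq != e->rq)
		e->rq = rq;
}

/* Post-patch logic: a NULL pick leaves the entity where it was. */
void select_rq_new(struct mock_entity *e)
{
	struct mock_sched *sched = mock_pick_best(e);

	if (sched) {
		struct mock_rq *rq = sched->sched_rq[0];

		if (rq != e->rq)
			e->rq = rq;
	}
}

/* Returns 1 if the entity still has a run queue after a NULL pick, else 0. */
int rq_survives_null_pick(int use_fix)
{
	struct mock_rq rq = { .id = 7 };
	struct mock_entity e = { .rq = &rq, .picked = NULL };

	assert(e.rq != NULL); /* the entity starts out on a run queue */
	if (use_fix)
		select_rq_new(&e);
	else
		select_rq_old(&e);
	return e.rq != NULL;
}
```

In this model, rq_survives_null_pick(0) reports that the old logic loses the run queue, which is the state drm_sched_job_arm() later dereferences, while rq_survives_null_pick(1) reports that the patched logic keeps it.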