From nobody Sun Feb 8 15:01:32 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8FC2FC04A94 for ; Thu, 13 Jul 2023 21:33:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231909AbjGMVd3 (ORCPT ); Thu, 13 Jul 2023 17:33:29 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40086 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231324AbjGMVdX (ORCPT ); Thu, 13 Jul 2023 17:33:23 -0400 Received: from fanzine2.igalia.com (fanzine2.igalia.com [213.97.179.56]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6139E136 for ; Thu, 13 Jul 2023 14:33:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=igalia.com; s=20170329; h=Content-Transfer-Encoding:Content-Type:MIME-Version:References: In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=KtDjAFCYmByWLVDPHSoFeYlAFDkrSIvmPO75H9eHTbY=; b=E+bnIYFMRos4aBzHcV43A+FBuL 4jGpLItV6oQ3lb/0sdDi/qiRgiSPKhclAVDGTRYlDwtzV2/iw+7qPLpS5J/BnaqZEf+7hYmxwJ7AT YpclinzcTa2Up0o7uxyByaC8joJHO6oiP2kkWOblFszr/VXK0jhT8PK/ytveUXjP93caemcOLYlaN N0I0heMl5ngmQkEdTermhXwmBEaerHCdqYyDdZ4DO2pWflrWWJLkl7xqxLCbVsE0eBIn6XC2QPV4o 5fPG2ZxifPT2j4k9kEyv8EDOLUSuWEoJ28Dltw+krT3SWVTDj0SoVfJA4p4FU4kmDFP1rMeI2ewey IYw9SvrQ==; Received: from [187.74.70.209] (helo=steammachine.lan) by fanzine2.igalia.com with esmtpsa (Cipher TLS1.3:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_256_GCM:256) (Exim) id 1qK3w5-00EDEa-Vj; Thu, 13 Jul 2023 23:33:18 +0200 From: =?UTF-8?q?Andr=C3=A9=20Almeida?= To: dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org Cc: kernel-dev@igalia.com, alexander.deucher@amd.com, christian.koenig@amd.com, pierre-eric.pelloux-prayer@amd.com, =?UTF-8?q?=27Marek=20Ol=C5=A1=C3=A1k=27?= , Samuel Pitoiset , Bas Nieuwenhuizen , =?UTF-8?q?Timur=20Krist=C3=B3f?= , michel.daenzer@mailbox.org, =?UTF-8?q?Andr=C3=A9=20Almeida?= Subject: [PATCH v2 1/6] drm/amdgpu: Create a module param to disable soft recovery Date: Thu, 13 Jul 2023 18:32:37 -0300 Message-ID: <20230713213242.680944-2-andrealmeid@igalia.com> X-Mailer: git-send-email 2.41.0 In-Reply-To: <20230713213242.680944-1-andrealmeid@igalia.com> References: <20230713213242.680944-1-andrealmeid@igalia.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Create a module parameter to disable soft recoveries on amdgpu, making every recovery go through the device reset path. This option makes easier to force device resets for testing and debugging purposes. Signed-off-by: Andr=C3=A9 Almeida --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 9 +++++++++ drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 6 +++++- 3 files changed, 15 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdg= pu/amdgpu.h index a84bd4a0c421..dbe062a087c5 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -189,6 +189,7 @@ extern uint amdgpu_force_long_training; extern int amdgpu_lbpw; extern int amdgpu_compute_multipipe; extern int amdgpu_gpu_recovery; +extern bool amdgpu_soft_recovery; extern int amdgpu_emu_mode; extern uint amdgpu_smu_memory_pool_size; extern int amdgpu_smu_pptable_id; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/= amdgpu/amdgpu_drv.c index 3b711babd4e2..7c69f3169aa6 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -163,6 +163,7 @@ uint amdgpu_force_long_training; int amdgpu_lbpw =3D -1; int amdgpu_compute_multipipe =3D -1; int amdgpu_gpu_recovery =3D -1; /* auto */ +bool amdgpu_soft_recovery =3D true; int amdgpu_emu_mode; uint amdgpu_smu_memory_pool_size; int amdgpu_smu_pptable_id =3D -1; @@ -540,6 +541,14 @@ module_param_named(compute_multipipe, amdgpu_compute_m= ultipipe, int, 0444); MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 =3D enab= le, 0 =3D disable, -1 =3D auto)"); module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444); =20 +/** + * DOC: gpu_soft_recovery (bool) + * Set true to allow the driver to try soft recoveries if a job get stuck.= Set + * to false to always force a GPU reset during recovery. + */ +MODULE_PARM_DESC(gpu_soft_recovery, "Enable GPU soft recovery mechanism (d= efault: true)"); +module_param_named(gpu_soft_recovery, amdgpu_soft_recovery, bool, 0644); + /** * DOC: emu_mode (int) * Set value 1 to enable emulation mode. This is only needed when running = on an emulator. The default is 0 (disabled). diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd= /amdgpu/amdgpu_ring.c index 80d6e132e409..40678d9fb17e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c @@ -434,8 +434,12 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *rin= g, unsigned int vmid, struct dma_fence *fence) { unsigned long flags; + ktime_t deadline; =20 - ktime_t deadline =3D ktime_add_us(ktime_get(), 10000); + if (!amdgpu_soft_recovery) + return false; + + deadline =3D ktime_add_us(ktime_get(), 10000); =20 if (amdgpu_sriov_vf(ring->adev) || !ring->funcs->soft_recovery || !fence) return false; --=20 2.41.0 From nobody Sun Feb 8 15:01:32 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6CC0BC0015E for ; Thu, 13 Jul 2023 21:33:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232238AbjGMVdf (ORCPT ); Thu, 13 Jul 2023 17:33:35 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40092 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231644AbjGMVdY (ORCPT ); Thu, 13 Jul 2023 17:33:24 -0400 Received: from fanzine2.igalia.com (fanzine2.igalia.com [213.97.179.56]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 90E022D5F for ; Thu, 13 Jul 2023 14:33:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=igalia.com; s=20170329; h=Content-Transfer-Encoding:Content-Type:MIME-Version:References: In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=haPtXaNAbjvN1gN1Qc0VL4/srlGmwa1LiUX0uWr0Rds=; b=fhZEZ2ePHPD4F7ZyhdzLTbQBCf rZf9p38ZMoyLzNjguHX2WxksXK87S9PdlLTBmb/09UUfMsmrJobTVCSPpRaHTUVeXOrglzEUvwiU1 felj1TAmGjwM3GSnZnYSe4Y2LmXYgUl79uIuVf2iqD39jbE1V4T9adEhJxReB7AN5R6xjpVQpMeQw 7qy8bZgfTfAbHcFGmxfsmowYhIoyeeCWUO2pt1DsDpNtfFTJHGJZePWyyGJPVxcr1h/MaWUxxT4MP uy+YJu/infhFCmMdKL2Pn3ibV1KPnwmqZTin024o1EppE5z4I+C40CPZUgHO92MHheEkkHMpALdtb ddkuMeXQ==; Received: from [187.74.70.209] (helo=steammachine.lan) by fanzine2.igalia.com with esmtpsa (Cipher TLS1.3:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_256_GCM:256) (Exim) id 1qK3w9-00EDEa-7m; Thu, 13 Jul 2023 23:33:21 +0200 From: =?UTF-8?q?Andr=C3=A9=20Almeida?= To: dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org Cc: kernel-dev@igalia.com, alexander.deucher@amd.com, christian.koenig@amd.com, pierre-eric.pelloux-prayer@amd.com, =?UTF-8?q?=27Marek=20Ol=C5=A1=C3=A1k=27?= , Samuel Pitoiset , Bas Nieuwenhuizen , =?UTF-8?q?Timur=20Krist=C3=B3f?= , michel.daenzer@mailbox.org, =?UTF-8?q?Andr=C3=A9=20Almeida?= Subject: [PATCH v2 2/6] drm/amdgpu: Allocate coredump memory in a nonblocking way Date: Thu, 13 Jul 2023 18:32:38 -0300 Message-ID: <20230713213242.680944-3-andrealmeid@igalia.com> X-Mailer: git-send-email 2.41.0 In-Reply-To: <20230713213242.680944-1-andrealmeid@igalia.com> References: <20230713213242.680944-1-andrealmeid@igalia.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org During a GPU reset, a normal memory reclaim could block to reclaim memory. Giving that coredump is a best effort mechanism, it shouldn't disturb the reset path. Change its memory allocation flag to a nonblocking one. Signed-off-by: Andr=C3=A9 Almeida Reviewed-by: Christian K=C3=B6nig --- v2: New patch drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/a= md/amdgpu/amdgpu_device.c index e25f085ee886..a824f844a984 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -5011,7 +5011,7 @@ static void amdgpu_reset_capture_coredumpm(struct amd= gpu_device *adev) struct drm_device *dev =3D adev_to_drm(adev); =20 ktime_get_ts64(&adev->reset_time); - dev_coredumpm(dev->dev, THIS_MODULE, adev, 0, GFP_KERNEL, + dev_coredumpm(dev->dev, THIS_MODULE, adev, 0, GFP_NOWAIT, amdgpu_devcoredump_read, amdgpu_devcoredump_free); } #endif --=20 2.41.0 From nobody Sun Feb 8 15:01:32 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id BBA1BC001B0 for ; Thu, 13 Jul 2023 21:33:39 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232310AbjGMVdi (ORCPT ); Thu, 13 Jul 2023 17:33:38 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40092 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231425AbjGMVd3 (ORCPT ); Thu, 13 Jul 2023 17:33:29 -0400 Received: from fanzine2.igalia.com (fanzine2.igalia.com [213.97.179.56]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E044F2D64 for ; Thu, 13 Jul 2023 14:33:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=igalia.com; s=20170329; h=Content-Transfer-Encoding:Content-Type:MIME-Version:References: In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=4gdfWwU6Nsn9XjCnUyawzQF5Eo6/GW8UMrvx6qcU0Ps=; b=bt4d+YnqQ5u7EUOqy85ww199HO RrVGgy8coEuVAcTtFtf5bCl5WyhSWN0YmAaib9UJvexG1huW4JHismvidkiM4kYdGjDzZvHrfVULF 0+anX54VBikL9euMU7G6G1wMdpEx72qQzJi4I/QyrOfB453JLv3VgN7SSRHsC8rA5iGQmeYnSnhMC xE1Pnwq7weV6WorS/P5O/VZhDBHmzPXbJEtokxoOq5bFTTJKxSad0qf8J8kR4C/PKwfdWS7VF4tXV GXfcizrg4LC6LmBjmYUQM1/4BBIWToiuOL3ERLk+cgvJ7WKuSw3SlcJITkoesdroE85dbxb6Dz9Fx irN2vToA==; Received: from [187.74.70.209] (helo=steammachine.lan) by fanzine2.igalia.com with esmtpsa (Cipher TLS1.3:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_256_GCM:256) (Exim) id 1qK3wC-00EDEa-H9; Thu, 13 Jul 2023 23:33:24 +0200 From: =?UTF-8?q?Andr=C3=A9=20Almeida?= To: dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org Cc: kernel-dev@igalia.com, alexander.deucher@amd.com, christian.koenig@amd.com, pierre-eric.pelloux-prayer@amd.com, =?UTF-8?q?=27Marek=20Ol=C5=A1=C3=A1k=27?= , Samuel Pitoiset , Bas Nieuwenhuizen , =?UTF-8?q?Timur=20Krist=C3=B3f?= , michel.daenzer@mailbox.org, =?UTF-8?q?Andr=C3=A9=20Almeida?= Subject: [PATCH v2 3/6] drm/amdgpu: Rework coredump to use memory dynamically Date: Thu, 13 Jul 2023 18:32:39 -0300 Message-ID: <20230713213242.680944-4-andrealmeid@igalia.com> X-Mailer: git-send-email 2.41.0 In-Reply-To: <20230713213242.680944-1-andrealmeid@igalia.com> References: <20230713213242.680944-1-andrealmeid@igalia.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Instead of storing coredump information inside amdgpu_device struct, move if to a proper separated struct and allocate it dynamically. This will make it easier to further expand the logged information. Signed-off-by: Andr=C3=A9 Almeida --- v2: Replace GFP_KERNEL with GPF_NOWAIT drivers/gpu/drm/amd/amdgpu/amdgpu.h | 14 +++-- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 65 ++++++++++++++-------- 2 files changed, 51 insertions(+), 28 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdg= pu/amdgpu.h index dbe062a087c5..e1cc83a89d46 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1068,11 +1068,6 @@ struct amdgpu_device { uint32_t *reset_dump_reg_list; uint32_t *reset_dump_reg_value; int num_regs; -#ifdef CONFIG_DEV_COREDUMP - struct amdgpu_task_info reset_task_info; - bool reset_vram_lost; - struct timespec64 reset_time; -#endif =20 bool scpm_enabled; uint32_t scpm_status; @@ -1085,6 +1080,15 @@ struct amdgpu_device { uint32_t aid_mask; }; =20 +#ifdef CONFIG_DEV_COREDUMP +struct amdgpu_coredump_info { + struct amdgpu_device *adev; + struct amdgpu_task_info reset_task_info; + struct timespec64 reset_time; + bool reset_vram_lost; +}; +#endif + static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev) { return container_of(ddev, struct amdgpu_device, ddev); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/a= md/amdgpu/amdgpu_device.c index a824f844a984..e80670420586 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -4963,12 +4963,17 @@ static int amdgpu_reset_reg_dumps(struct amdgpu_dev= ice *adev) return 0; } =20 -#ifdef CONFIG_DEV_COREDUMP +#ifndef CONFIG_DEV_COREDUMP +static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost, + struct amdgpu_reset_context *reset_context) +{ +} +#else static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, void *data, size_t datalen) { struct drm_printer p; - struct amdgpu_device *adev =3D data; + struct amdgpu_coredump_info *coredump =3D data; struct drm_print_iterator iter; int i; =20 @@ -4982,21 +4987,21 @@ static ssize_t amdgpu_devcoredump_read(char *buffer= , loff_t offset, drm_printf(&p, "**** AMDGPU Device Coredump ****\n"); drm_printf(&p, "kernel: " UTS_RELEASE "\n"); drm_printf(&p, "module: " KBUILD_MODNAME "\n"); - drm_printf(&p, "time: %lld.%09ld\n", adev->reset_time.tv_sec, adev->reset= _time.tv_nsec); - if (adev->reset_task_info.pid) + drm_printf(&p, "time: %lld.%09ld\n", coredump->reset_time.tv_sec, coredum= p->reset_time.tv_nsec); + if (coredump->reset_task_info.pid) drm_printf(&p, "process_name: %s PID: %d\n", - adev->reset_task_info.process_name, - adev->reset_task_info.pid); + coredump->reset_task_info.process_name, + coredump->reset_task_info.pid); =20 - if (adev->reset_vram_lost) + if (coredump->reset_vram_lost) drm_printf(&p, "VRAM is lost due to GPU reset!\n"); - if (adev->num_regs) { + if (coredump->adev->num_regs) { drm_printf(&p, "AMDGPU register dumps:\nOffset: Value:\n"); =20 - for (i =3D 0; i < adev->num_regs; i++) + for (i =3D 0; i < coredump->adev->num_regs; i++) drm_printf(&p, "0x%08x: 0x%08x\n", - adev->reset_dump_reg_list[i], - adev->reset_dump_reg_value[i]); + coredump->adev->reset_dump_reg_list[i], + coredump->adev->reset_dump_reg_value[i]); } =20 return count - iter.remain; @@ -5004,14 +5009,34 @@ static ssize_t amdgpu_devcoredump_read(char *buffer= , loff_t offset, =20 static void amdgpu_devcoredump_free(void *data) { + kfree(data); } =20 -static void amdgpu_reset_capture_coredumpm(struct amdgpu_device *adev) +static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost, + struct amdgpu_reset_context *reset_context) { + struct amdgpu_coredump_info *coredump; struct drm_device *dev =3D adev_to_drm(adev); =20 - ktime_get_ts64(&adev->reset_time); - dev_coredumpm(dev->dev, THIS_MODULE, adev, 0, GFP_NOWAIT, + coredump =3D kmalloc(sizeof(*coredump), GFP_NOWAIT); + + if (!coredump) { + DRM_ERROR("%s: failed to allocate memory for coredump\n", __func__); + return; + } + + memset(coredump, 0, sizeof(*coredump)); + + coredump->reset_vram_lost =3D vram_lost; + + if (reset_context->job && reset_context->job->vm) + coredump->reset_task_info =3D reset_context->job->vm->task_info; + + coredump->adev =3D adev; + + ktime_get_ts64(&coredump->reset_time); + + dev_coredumpm(dev->dev, THIS_MODULE, coredump, 0, GFP_NOWAIT, amdgpu_devcoredump_read, amdgpu_devcoredump_free); } #endif @@ -5119,15 +5144,9 @@ int amdgpu_do_asic_reset(struct list_head *device_li= st_handle, goto out; =20 vram_lost =3D amdgpu_device_check_vram_lost(tmp_adev); -#ifdef CONFIG_DEV_COREDUMP - tmp_adev->reset_vram_lost =3D vram_lost; - memset(&tmp_adev->reset_task_info, 0, - sizeof(tmp_adev->reset_task_info)); - if (reset_context->job && reset_context->job->vm) - tmp_adev->reset_task_info =3D - reset_context->job->vm->task_info; - amdgpu_reset_capture_coredumpm(tmp_adev); -#endif + + amdgpu_coredump(tmp_adev, vram_lost, reset_context); + if (vram_lost) { DRM_INFO("VRAM is lost due to GPU reset!\n"); amdgpu_inc_vram_lost(tmp_adev); --=20 2.41.0 From nobody Sun Feb 8 15:01:32 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 29464C0015E for ; Thu, 13 Jul 2023 21:33:44 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229905AbjGMVdm (ORCPT ); Thu, 13 Jul 2023 17:33:42 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40126 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232089AbjGMVdb (ORCPT ); Thu, 13 Jul 2023 17:33:31 -0400 Received: from fanzine2.igalia.com (fanzine2.igalia.com [213.97.179.56]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2CA8C2D60 for ; Thu, 13 Jul 2023 14:33:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=igalia.com; s=20170329; h=Content-Transfer-Encoding:Content-Type:MIME-Version:References: In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=Z8+yHUBLE2x/uwIO5VHayp0mcRdlKPk55oOFlYjjXeQ=; b=UfYOM+5kJXM9Evp9TRM9IXUy6f kkyhp0JWYUGqYYVuAYJqEyeTwPVwTTGAf9u7xvCanubbmyzcMMekutTEYvL/cJgS5CzHcNLX2rt0S MookQemT4QHCBBeMrptMCymBJqCNr/cu0nY6L3EsSu3GFqlXuBd3v/7wr28pgc2kHyhj20n1kUQyo yybxaA5aq58BYDjuScRc1N3qSt580ll7Gipcc26k0yb6GwUpg447aUHp0W/Gw8gm0jqQHA2NksmVr sPTmYmCdyvX9XRwH4G9UaFHGlKHz+VegD+f5tk1BSG6y0cSQ8x/sBefbIwYvtH6RL9KbYIaGX5Ihu Qi8ABpGA==; Received: from [187.74.70.209] (helo=steammachine.lan) by fanzine2.igalia.com with esmtpsa (Cipher TLS1.3:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_256_GCM:256) (Exim) id 1qK3wF-00EDEa-Qs; Thu, 13 Jul 2023 23:33:28 +0200 From: =?UTF-8?q?Andr=C3=A9=20Almeida?= To: dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org Cc: kernel-dev@igalia.com, alexander.deucher@amd.com, christian.koenig@amd.com, pierre-eric.pelloux-prayer@amd.com, =?UTF-8?q?=27Marek=20Ol=C5=A1=C3=A1k=27?= , Samuel Pitoiset , Bas Nieuwenhuizen , =?UTF-8?q?Timur=20Krist=C3=B3f?= , michel.daenzer@mailbox.org, =?UTF-8?q?Andr=C3=A9=20Almeida?= Subject: [PATCH v2 4/6] drm/amdgpu: Limit info in coredump for kernel threads Date: Thu, 13 Jul 2023 18:32:40 -0300 Message-ID: <20230713213242.680944-5-andrealmeid@igalia.com> X-Mailer: git-send-email 2.41.0 In-Reply-To: <20230713213242.680944-1-andrealmeid@igalia.com> References: <20230713213242.680944-1-andrealmeid@igalia.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org If a kernel thread caused the reset, the information available to be logged will be limited, so return early in the dump function. Signed-off-by: Andr=C3=A9 Almeida --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/a= md/amdgpu/amdgpu_device.c index e80670420586..07546781b8b8 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -4988,10 +4988,14 @@ static ssize_t amdgpu_devcoredump_read(char *buffer= , loff_t offset, drm_printf(&p, "kernel: " UTS_RELEASE "\n"); drm_printf(&p, "module: " KBUILD_MODNAME "\n"); drm_printf(&p, "time: %lld.%09ld\n", coredump->reset_time.tv_sec, coredum= p->reset_time.tv_nsec); - if (coredump->reset_task_info.pid) + if (coredump->reset_task_info.pid) { drm_printf(&p, "process_name: %s PID: %d\n", coredump->reset_task_info.process_name, coredump->reset_task_info.pid); + } else { + drm_printf(&p, "GPU reset caused by a kernel thread\n"); + return count - iter.remain; + } =20 if (coredump->reset_vram_lost) drm_printf(&p, "VRAM is lost due to GPU reset!\n"); --=20 2.41.0 From nobody Sun Feb 8 15:01:32 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C59CBC0015E for ; Thu, 13 Jul 2023 21:33:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232829AbjGMVdt (ORCPT ); Thu, 13 Jul 2023 17:33:49 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40298 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232405AbjGMVdj (ORCPT ); Thu, 13 Jul 2023 17:33:39 -0400 Received: from fanzine2.igalia.com (fanzine2.igalia.com [213.97.179.56]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 67D292D6B for ; Thu, 13 Jul 2023 14:33:33 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=igalia.com; s=20170329; h=Content-Transfer-Encoding:Content-Type:MIME-Version:References: In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=bkcPn/8IkEMwm5HtzX5ohbd8o26KLnwK0sNSpD7QR84=; b=TPEOi7m6dCmKrdH3CCyj5aoAt9 yXi5I1eqvV0R1HkWU3nv3YCqtkZ4J4yk3H0nx7In5LVOJpoRqwHd4St0rj0sGUoL0nBBDinekyrC9 IWqUAIwyw/ASO0dN1oTRV/ZfUqLzldQIIIMhm2rYSVwFw42BQSwHph9kSPR5LTOwk5DnNi5EYkrML 6KcwnZ4caTNNHvzcdZjMM/Dj6oex0kL7v9ZpvDuvkZGkXa6Gx8zXxQ7pe927VgDKXQbQymkKcZSl3 8akCi+jg4bCTpeQSDCQ7Ab9KZOJfC88SkVhnMcaBaS4uM23PSXK8JGNnipQNmfS23xYOPcXTL96LF Bg276vQA==; Received: from [187.74.70.209] (helo=steammachine.lan) by fanzine2.igalia.com with esmtpsa (Cipher TLS1.3:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_256_GCM:256) (Exim) id 1qK3wJ-00EDEa-2R; Thu, 13 Jul 2023 23:33:31 +0200 From: =?UTF-8?q?Andr=C3=A9=20Almeida?= To: dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org Cc: kernel-dev@igalia.com, alexander.deucher@amd.com, christian.koenig@amd.com, pierre-eric.pelloux-prayer@amd.com, =?UTF-8?q?=27Marek=20Ol=C5=A1=C3=A1k=27?= , Samuel Pitoiset , Bas Nieuwenhuizen , =?UTF-8?q?Timur=20Krist=C3=B3f?= , michel.daenzer@mailbox.org, =?UTF-8?q?Andr=C3=A9=20Almeida?= Subject: [PATCH v2 5/6] drm/amdgpu: Log IBs and ring name at coredump Date: Thu, 13 Jul 2023 18:32:41 -0300 Message-ID: <20230713213242.680944-6-andrealmeid@igalia.com> X-Mailer: git-send-email 2.41.0 In-Reply-To: <20230713213242.680944-1-andrealmeid@igalia.com> References: <20230713213242.680944-1-andrealmeid@igalia.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Log the IB addresses used by the hung job along with the stuck ring name. Note that due to nested IBs, the one that caused the reset itself may be in not listed address. Signed-off-by: Andr=C3=A9 Almeida --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 3 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 31 +++++++++++++++++++++- 2 files changed, 33 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdg= pu/amdgpu.h index e1cc83a89d46..cfeaf93934fd 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1086,6 +1086,9 @@ struct amdgpu_coredump_info { struct amdgpu_task_info reset_task_info; struct timespec64 reset_time; bool reset_vram_lost; + u64 *ibs; + u32 num_ibs; + char ring_name[16]; }; #endif =20 diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/a= md/amdgpu/amdgpu_device.c index 07546781b8b8..431ccc3d7857 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -5008,12 +5008,24 @@ static ssize_t amdgpu_devcoredump_read(char *buffer= , loff_t offset, coredump->adev->reset_dump_reg_value[i]); } =20 + if (coredump->num_ibs) { + drm_printf(&p, "IBs:\n"); + for (i =3D 0; i < coredump->num_ibs; i++) + drm_printf(&p, "\t[%d] 0x%llx\n", i, coredump->ibs[i]); + } + + if (coredump->ring_name[0] !=3D '\0') + drm_printf(&p, "ring name: %s\n", coredump->ring_name); + return count - iter.remain; } =20 static void amdgpu_devcoredump_free(void *data) { - kfree(data); + struct amdgpu_coredump_info *coredump =3D data; + + kfree(coredump->ibs); + kfree(coredump); } =20 static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost, @@ -5021,6 +5033,8 @@ static void amdgpu_coredump(struct amdgpu_device *ade= v, bool vram_lost, { struct amdgpu_coredump_info *coredump; struct drm_device *dev =3D adev_to_drm(adev); + struct amdgpu_job *job =3D reset_context->job; + int i; =20 coredump =3D kmalloc(sizeof(*coredump), GFP_NOWAIT); =20 @@ -5038,6 +5052,21 @@ static void amdgpu_coredump(struct amdgpu_device *ad= ev, bool vram_lost, =20 coredump->adev =3D adev; =20 + if (job && job->num_ibs) { + struct amdgpu_ring *ring =3D to_amdgpu_ring(job->base.sched); + u32 num_ibs =3D job->num_ibs; + + coredump->ibs =3D kmalloc_array(num_ibs, sizeof(coredump->ibs), GFP_NOWA= IT); + if (coredump->ibs) + coredump->num_ibs =3D num_ibs; + + for (i =3D 0; i < coredump->num_ibs; i++) + coredump->ibs[i] =3D job->ibs[i].gpu_addr; + + if (ring) + strncpy(coredump->ring_name, ring->name, 16); + } + ktime_get_ts64(&coredump->reset_time); =20 dev_coredumpm(dev->dev, THIS_MODULE, coredump, 0, GFP_NOWAIT, --=20 2.41.0 From nobody Sun Feb 8 15:01:32 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2906DC0015E for ; Thu, 13 Jul 2023 21:33:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232940AbjGMVdw (ORCPT ); Thu, 13 Jul 2023 17:33:52 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40468 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231911AbjGMVdo (ORCPT ); Thu, 13 Jul 2023 17:33:44 -0400 Received: from fanzine2.igalia.com (fanzine2.igalia.com [213.97.179.56]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BDAD430CA for ; Thu, 13 Jul 2023 14:33:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=igalia.com; s=20170329; h=Content-Transfer-Encoding:Content-Type:MIME-Version:References: In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=wefOQXZQwSDcp2ccRjthMbBFASOk2LxGNeKu9meJsro=; b=JWFiob9ERIYHygWAocTFwnHZf1 +uazyGuftVZPQH5KKgq25teyGk168GsBKhcFBFqUi/obWJuO2LA37ryauNp4qC2eIlxd84PN/izne e6QqDRYAmmluEC6QASXQQt7jIrN2RSaFR6/IRd6V3RRjTTIa+4H5rBQrZPzEyUwvsrnXpR5NupxRd z24lAv18gtid3d33S7jgeUr4A1nlfoCsjQ+2zA9qXf6rZy4XCmj9kAoxhhdv0leDCa9NWHtnPZ8CO rsBnk0D1ke2t1Ow9GTFjoUPwe8JZoUoDdv3nQcm+qL6+GN6feKodxezlRUoti2Z3iMGwhrNXa0N+c K2eyVsdA==; Received: from [187.74.70.209] (helo=steammachine.lan) by fanzine2.igalia.com with esmtpsa (Cipher TLS1.3:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_256_GCM:256) (Exim) id 1qK3wM-00EDEa-C8; Thu, 13 Jul 2023 23:33:34 +0200 From: =?UTF-8?q?Andr=C3=A9=20Almeida?= To: dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org Cc: kernel-dev@igalia.com, alexander.deucher@amd.com, christian.koenig@amd.com, pierre-eric.pelloux-prayer@amd.com, =?UTF-8?q?=27Marek=20Ol=C5=A1=C3=A1k=27?= , Samuel Pitoiset , Bas Nieuwenhuizen , =?UTF-8?q?Timur=20Krist=C3=B3f?= , michel.daenzer@mailbox.org, =?UTF-8?q?Andr=C3=A9=20Almeida?= Subject: [PATCH v2 6/6] drm/amdgpu: Create version number for coredumps Date: Thu, 13 Jul 2023 18:32:42 -0300 Message-ID: <20230713213242.680944-7-andrealmeid@igalia.com> X-Mailer: git-send-email 2.41.0 In-Reply-To: <20230713213242.680944-1-andrealmeid@igalia.com> References: <20230713213242.680944-1-andrealmeid@igalia.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Even if there's nothing currently parsing amdgpu's coredump files, if we eventually have such tools they will be glad to find a version field to properly read the file. Create a version number to be displayed on top of coredump file, to be incremented when the file format or content get changed. Signed-off-by: Andr=C3=A9 Almeida --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 3 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 + 2 files changed, 4 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdg= pu/amdgpu.h index cfeaf93934fd..905574acf3a0 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1081,6 +1081,9 @@ struct amdgpu_device { }; =20 #ifdef CONFIG_DEV_COREDUMP + +#define AMDGPU_COREDUMP_VERSION "1" + struct amdgpu_coredump_info { struct amdgpu_device *adev; struct amdgpu_task_info reset_task_info; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/a= md/amdgpu/amdgpu_device.c index 431ccc3d7857..c83ea7aa135a 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -4985,6 +4985,7 @@ static ssize_t amdgpu_devcoredump_read(char *buffer, = loff_t offset, p =3D drm_coredump_printer(&iter); =20 drm_printf(&p, "**** AMDGPU Device Coredump ****\n"); + drm_printf(&p, "version: " AMDGPU_COREDUMP_VERSION "\n"); drm_printf(&p, "kernel: " UTS_RELEASE "\n"); drm_printf(&p, "module: " KBUILD_MODNAME "\n"); drm_printf(&p, "time: %lld.%09ld\n", coredump->reset_time.tv_sec, coredum= p->reset_time.tv_nsec); --=20 2.41.0