From nobody Sun Feb 8 11:06:40 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 54BB4C0015E for ; Tue, 11 Jul 2023 21:35:55 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231531AbjGKVfv (ORCPT ); Tue, 11 Jul 2023 17:35:51 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38796 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230437AbjGKVfr (ORCPT ); Tue, 11 Jul 2023 17:35:47 -0400 Received: from fanzine2.igalia.com (fanzine2.igalia.com [213.97.179.56]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6BAB1E69 for ; Tue, 11 Jul 2023 14:35:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=igalia.com; s=20170329; h=Content-Transfer-Encoding:Content-Type:MIME-Version:References: In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=KtDjAFCYmByWLVDPHSoFeYlAFDkrSIvmPO75H9eHTbY=; b=VLIc1Y3O+024uwIdNbCglhXD23 kWJ5GKiw3jqPD27tDBblfuikBdP4D2HexBRXCb0+ZEaC19Ap4iP04G+ML7tEJ3Wq0Ddr7ZrAwCpro hdHljfSIqxvZiYIMv6dJNjFlrq91xFholDn12QBde4khFM7Qz2TF9XZavxWPMDE7n7dMA8QxCMCX1 eHjWKrt6WN7zLnilNC6SnvfdA1YfI3YdXhNpx7rfwigJWtqLsl+mGRa+Klw9pbRwx+vORBmVTl0/t pRnBV0tXE2QyUShZYfY/Duu5dO3A917qXjblluu6VTh8lQjNeBNpY0HtWA99tzVnpk8DN+Nm1jRAA phQL2nPw==; Received: from [187.74.70.209] (helo=steammachine.lan) by fanzine2.igalia.com with esmtpsa (Cipher TLS1.3:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_256_GCM:256) (Exim) id 1qJL1K-00Cl0M-UT; Tue, 11 Jul 2023 23:35:43 +0200 From: =?UTF-8?q?Andr=C3=A9=20Almeida?= To: dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org, 
linux-kernel@vger.kernel.org Cc: kernel-dev@igalia.com, alexander.deucher@amd.com, christian.koenig@amd.com, pierre-eric.pelloux-prayer@amd.com, =?UTF-8?q?=27Marek=20Ol=C5=A1=C3=A1k=27?= , Samuel Pitoiset , Bas Nieuwenhuizen , =?UTF-8?q?Timur=20Krist=C3=B3f?= , michel.daenzer@mailbox.org, =?UTF-8?q?Andr=C3=A9=20Almeida?= Subject: [PATCH 1/6] drm/amdgpu: Create a module param to disable soft recovery Date: Tue, 11 Jul 2023 18:34:56 -0300 Message-ID: <20230711213501.526237-2-andrealmeid@igalia.com> X-Mailer: git-send-email 2.41.0 In-Reply-To: <20230711213501.526237-1-andrealmeid@igalia.com> References: <20230711213501.526237-1-andrealmeid@igalia.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Create a module parameter to disable soft recoveries on amdgpu, making every recovery go through the device reset path. This option makes it easier to force device resets for testing and debugging purposes. 
Signed-off-by: Andr=C3=A9 Almeida --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 9 +++++++++ drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 6 +++++- 3 files changed, 15 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdg= pu/amdgpu.h index a84bd4a0c421..dbe062a087c5 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -189,6 +189,7 @@ extern uint amdgpu_force_long_training; extern int amdgpu_lbpw; extern int amdgpu_compute_multipipe; extern int amdgpu_gpu_recovery; +extern bool amdgpu_soft_recovery; extern int amdgpu_emu_mode; extern uint amdgpu_smu_memory_pool_size; extern int amdgpu_smu_pptable_id; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/= amdgpu/amdgpu_drv.c index 3b711babd4e2..7c69f3169aa6 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -163,6 +163,7 @@ uint amdgpu_force_long_training; int amdgpu_lbpw =3D -1; int amdgpu_compute_multipipe =3D -1; int amdgpu_gpu_recovery =3D -1; /* auto */ +bool amdgpu_soft_recovery =3D true; int amdgpu_emu_mode; uint amdgpu_smu_memory_pool_size; int amdgpu_smu_pptable_id =3D -1; @@ -540,6 +541,14 @@ module_param_named(compute_multipipe, amdgpu_compute_m= ultipipe, int, 0444); MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 =3D enab= le, 0 =3D disable, -1 =3D auto)"); module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444); =20 +/** + * DOC: gpu_soft_recovery (bool) + * Set true to allow the driver to try soft recoveries if a job gets stuck.= Set + * to false to always force a GPU reset during recovery. + */ +MODULE_PARM_DESC(gpu_soft_recovery, "Enable GPU soft recovery mechanism (d= efault: true)"); +module_param_named(gpu_soft_recovery, amdgpu_soft_recovery, bool, 0644); + /** * DOC: emu_mode (int) * Set value 1 to enable emulation mode. 
This is only needed when running = on an emulator. The default is 0 (disabled). diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd= /amdgpu/amdgpu_ring.c index 80d6e132e409..40678d9fb17e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c @@ -434,8 +434,12 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *rin= g, unsigned int vmid, struct dma_fence *fence) { unsigned long flags; + ktime_t deadline; =20 - ktime_t deadline =3D ktime_add_us(ktime_get(), 10000); + if (!amdgpu_soft_recovery) + return false; + + deadline =3D ktime_add_us(ktime_get(), 10000); =20 if (amdgpu_sriov_vf(ring->adev) || !ring->funcs->soft_recovery || !fence) return false; --=20 2.41.0 From nobody Sun Feb 8 11:06:40 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 63598EB64DD for ; Tue, 11 Jul 2023 21:35:58 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231175AbjGKVf5 (ORCPT ); Tue, 11 Jul 2023 17:35:57 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38814 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231416AbjGKVft (ORCPT ); Tue, 11 Jul 2023 17:35:49 -0400 Received: from fanzine2.igalia.com (fanzine2.igalia.com [213.97.179.56]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8D338127 for ; Tue, 11 Jul 2023 14:35:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=igalia.com; s=20170329; h=Content-Transfer-Encoding:Content-Type:MIME-Version:References: In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: 
List-Post:List-Owner:List-Archive; bh=Ddrjm27aI2mzONVUTJ28Yp+tdR9+PE1LPwDohpg9Dzo=; b=pTL/XLcfjoADLmvXjercNTTKr1 f1hgqEGH9Jk6v+MtlrnR8mnSSM7DK3Z6BzYZNGmq80ZLBxGaYKNVChiD5sIOO8enRyAtXEBX5r9Td KmT3AEEudOVjBR/roOGGSf4EVN5SYcAELVWHxFNUyvrTML5H0g85t/UBkkCtvSaR9UKCj61X7ciBX 3Bg+NIrva9TfY0d2M7dOWjFk/wuPKzJ4Zhq9Ft3d/QNm8c/gBBoRQ829jgR7MKxqPeKNmoI/2ta7t OFGjohehJYr6tx/pfpBG0aEzSiKTs9jorXaqMeZ/QOUQEFkndPXG+N0a6IdtbXuZKN5QrrGLy/gLj /zTQaLLA==; Received: from [187.74.70.209] (helo=steammachine.lan) by fanzine2.igalia.com with esmtpsa (Cipher TLS1.3:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_256_GCM:256) (Exim) id 1qJL1O-00Cl0M-7n; Tue, 11 Jul 2023 23:35:46 +0200 From: =?UTF-8?q?Andr=C3=A9=20Almeida?= To: dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org Cc: kernel-dev@igalia.com, alexander.deucher@amd.com, christian.koenig@amd.com, pierre-eric.pelloux-prayer@amd.com, =?UTF-8?q?=27Marek=20Ol=C5=A1=C3=A1k=27?= , Samuel Pitoiset , Bas Nieuwenhuizen , =?UTF-8?q?Timur=20Krist=C3=B3f?= , michel.daenzer@mailbox.org, =?UTF-8?q?Andr=C3=A9=20Almeida?= Subject: [PATCH 2/6] drm/amdgpu: Mark contexts guilty for causing soft recoveries Date: Tue, 11 Jul 2023 18:34:57 -0300 Message-ID: <20230711213501.526237-3-andrealmeid@igalia.com> X-Mailer: git-send-email 2.41.0 In-Reply-To: <20230711213501.526237-1-andrealmeid@igalia.com> References: <20230711213501.526237-1-andrealmeid@igalia.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org If a DRM fence is set to -ENODATA, that means that this context was a cause of a soft reset, but is never marked as guilty. Flag it as guilty and log to user that this context won't accept more submissions. 
Signed-off-by: Andr=C3=A9 Almeida --- drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/= amdgpu/amdgpu_ctx.c index 0dc9c655c4fb..fe8e47d063da 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c @@ -459,6 +459,12 @@ int amdgpu_ctx_get_entity(struct amdgpu_ctx *ctx, u32 = hw_ip, u32 instance, ctx_entity =3D &ctx->entities[hw_ip][ring]->entity; r =3D drm_sched_entity_error(ctx_entity); if (r) { + if (r =3D=3D -ENODATA) { + DRM_ERROR("%s (%d) context caused a reset, " + "marking it guilty and refusing new submissions.\n", + current->comm, current->pid); + atomic_set(&ctx->guilty, 1); + } DRM_DEBUG("error entity %p\n", ctx_entity); return r; } --=20 2.41.0 From nobody Sun Feb 8 11:06:40 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E21DBC0015E for ; Tue, 11 Jul 2023 21:36:00 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229937AbjGKVf7 (ORCPT ); Tue, 11 Jul 2023 17:35:59 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38838 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231533AbjGKVfx (ORCPT ); Tue, 11 Jul 2023 17:35:53 -0400 Received: from fanzine2.igalia.com (fanzine2.igalia.com [213.97.179.56]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DBDF5E69 for ; Tue, 11 Jul 2023 14:35:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=igalia.com; s=20170329; h=Content-Transfer-Encoding:Content-Type:MIME-Version:References: In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc 
:Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=KTRbZwUZgZnlGc0WMOlsXvvWVPDAjN3SirsI8t8QzuM=; b=LFtKd7KQNb4Fif9xGzTjkrwSAi lfL8gI2EaPuNVIs6NFCmsMRqH0V/NuNdlvlUF4hdZfU7CuC6zIvOKXh3M5DQwyHgKiBTsijeK0nUb yzpxrSB7wWWOtsB1O0lu2mkIZyX6Oy5LbBXb5soTTi6MOshDZrv9QJPiahvi03mThXHRF998PwxFc Sy0ES6jvMweiKXBx2ALG9adsHdvTH1KoJ9DDIzKxSDZa/MUBWtWo+YTz+GLOJxqXnIZB6W9D0lvjs mDKvG1Eat2DrOcZwBFH+rOLKIHMXsD4JWBDCMF2gqeOIC3SY/PKO5g0UAE783NOssKaJAXpoWawTQ OJZsyAdA==; Received: from [187.74.70.209] (helo=steammachine.lan) by fanzine2.igalia.com with esmtpsa (Cipher TLS1.3:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_256_GCM:256) (Exim) id 1qJL1R-00Cl0M-IK; Tue, 11 Jul 2023 23:35:49 +0200 From: =?UTF-8?q?Andr=C3=A9=20Almeida?= To: dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org Cc: kernel-dev@igalia.com, alexander.deucher@amd.com, christian.koenig@amd.com, pierre-eric.pelloux-prayer@amd.com, =?UTF-8?q?=27Marek=20Ol=C5=A1=C3=A1k=27?= , Samuel Pitoiset , Bas Nieuwenhuizen , =?UTF-8?q?Timur=20Krist=C3=B3f?= , michel.daenzer@mailbox.org, =?UTF-8?q?Andr=C3=A9=20Almeida?= Subject: [PATCH 3/6] drm/amdgpu: Rework coredump to use memory dynamically Date: Tue, 11 Jul 2023 18:34:58 -0300 Message-ID: <20230711213501.526237-4-andrealmeid@igalia.com> X-Mailer: git-send-email 2.41.0 In-Reply-To: <20230711213501.526237-1-andrealmeid@igalia.com> References: <20230711213501.526237-1-andrealmeid@igalia.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Instead of storing coredump information inside amdgpu_device struct, move it to a proper separate struct and allocate it dynamically. This will make it easier to further expand the logged information. 
Signed-off-by: Andr=C3=A9 Almeida --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 14 +++-- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 65 ++++++++++++++-------- 2 files changed, 51 insertions(+), 28 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdg= pu/amdgpu.h index dbe062a087c5..e1cc83a89d46 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1068,11 +1068,6 @@ struct amdgpu_device { uint32_t *reset_dump_reg_list; uint32_t *reset_dump_reg_value; int num_regs; -#ifdef CONFIG_DEV_COREDUMP - struct amdgpu_task_info reset_task_info; - bool reset_vram_lost; - struct timespec64 reset_time; -#endif =20 bool scpm_enabled; uint32_t scpm_status; @@ -1085,6 +1080,15 @@ struct amdgpu_device { uint32_t aid_mask; }; =20 +#ifdef CONFIG_DEV_COREDUMP +struct amdgpu_coredump_info { + struct amdgpu_device *adev; + struct amdgpu_task_info reset_task_info; + struct timespec64 reset_time; + bool reset_vram_lost; +}; +#endif + static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev) { return container_of(ddev, struct amdgpu_device, ddev); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/a= md/amdgpu/amdgpu_device.c index e25f085ee886..23b9784e9787 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -4963,12 +4963,17 @@ static int amdgpu_reset_reg_dumps(struct amdgpu_dev= ice *adev) return 0; } =20 -#ifdef CONFIG_DEV_COREDUMP +#ifndef CONFIG_DEV_COREDUMP +static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost, + struct amdgpu_reset_context *reset_context) +{ +} +#else static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, void *data, size_t datalen) { struct drm_printer p; - struct amdgpu_device *adev =3D data; + struct amdgpu_coredump_info *coredump =3D data; struct drm_print_iterator iter; int i; =20 @@ -4982,21 +4987,21 @@ static ssize_t amdgpu_devcoredump_read(char 
*buffer= , loff_t offset, drm_printf(&p, "**** AMDGPU Device Coredump ****\n"); drm_printf(&p, "kernel: " UTS_RELEASE "\n"); drm_printf(&p, "module: " KBUILD_MODNAME "\n"); - drm_printf(&p, "time: %lld.%09ld\n", adev->reset_time.tv_sec, adev->reset= _time.tv_nsec); - if (adev->reset_task_info.pid) + drm_printf(&p, "time: %lld.%09ld\n", coredump->reset_time.tv_sec, coredum= p->reset_time.tv_nsec); + if (coredump->reset_task_info.pid) drm_printf(&p, "process_name: %s PID: %d\n", - adev->reset_task_info.process_name, - adev->reset_task_info.pid); + coredump->reset_task_info.process_name, + coredump->reset_task_info.pid); =20 - if (adev->reset_vram_lost) + if (coredump->reset_vram_lost) drm_printf(&p, "VRAM is lost due to GPU reset!\n"); - if (adev->num_regs) { + if (coredump->adev->num_regs) { drm_printf(&p, "AMDGPU register dumps:\nOffset: Value:\n"); =20 - for (i =3D 0; i < adev->num_regs; i++) + for (i =3D 0; i < coredump->adev->num_regs; i++) drm_printf(&p, "0x%08x: 0x%08x\n", - adev->reset_dump_reg_list[i], - adev->reset_dump_reg_value[i]); + coredump->adev->reset_dump_reg_list[i], + coredump->adev->reset_dump_reg_value[i]); } =20 return count - iter.remain; @@ -5004,14 +5009,34 @@ static ssize_t amdgpu_devcoredump_read(char *buffer= , loff_t offset, =20 static void amdgpu_devcoredump_free(void *data) { + kfree(data); } =20 -static void amdgpu_reset_capture_coredumpm(struct amdgpu_device *adev) +static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost, + struct amdgpu_reset_context *reset_context) { + struct amdgpu_coredump_info *coredump; struct drm_device *dev =3D adev_to_drm(adev); =20 - ktime_get_ts64(&adev->reset_time); - dev_coredumpm(dev->dev, THIS_MODULE, adev, 0, GFP_KERNEL, + coredump =3D kmalloc(sizeof(*coredump), GFP_KERNEL); + + if (!coredump) { + DRM_ERROR("%s: failed to allocate memory for coredump\n", __func__); + return; + } + + memset(coredump, 0, sizeof(*coredump)); + + coredump->reset_vram_lost =3D vram_lost; + + if 
(reset_context->job && reset_context->job->vm) + coredump->reset_task_info =3D reset_context->job->vm->task_info; + + coredump->adev =3D adev; + + ktime_get_ts64(&coredump->reset_time); + + dev_coredumpm(dev->dev, THIS_MODULE, coredump, 0, GFP_KERNEL, amdgpu_devcoredump_read, amdgpu_devcoredump_free); } #endif @@ -5119,15 +5144,9 @@ int amdgpu_do_asic_reset(struct list_head *device_li= st_handle, goto out; =20 vram_lost =3D amdgpu_device_check_vram_lost(tmp_adev); -#ifdef CONFIG_DEV_COREDUMP - tmp_adev->reset_vram_lost =3D vram_lost; - memset(&tmp_adev->reset_task_info, 0, - sizeof(tmp_adev->reset_task_info)); - if (reset_context->job && reset_context->job->vm) - tmp_adev->reset_task_info =3D - reset_context->job->vm->task_info; - amdgpu_reset_capture_coredumpm(tmp_adev); -#endif + + amdgpu_coredump(tmp_adev, vram_lost, reset_context); + if (vram_lost) { DRM_INFO("VRAM is lost due to GPU reset!\n"); amdgpu_inc_vram_lost(tmp_adev); --=20 2.41.0 From nobody Sun Feb 8 11:06:40 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 71087C0015E for ; Tue, 11 Jul 2023 21:36:04 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231600AbjGKVgD (ORCPT ); Tue, 11 Jul 2023 17:36:03 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38966 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230078AbjGKVf6 (ORCPT ); Tue, 11 Jul 2023 17:35:58 -0400 Received: from fanzine2.igalia.com (fanzine2.igalia.com [213.97.179.56]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 46F40171E for ; Tue, 11 Jul 2023 14:35:55 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=igalia.com; s=20170329; h=Content-Transfer-Encoding:Content-Type:MIME-Version:References: 
In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=WH7Cm3VveDbv6D+GxRpXLNfTbvl+zR6aYn4to7mM5e0=; b=o6ZCvMVtkCfTreopWJ0LzaVhuw BpyVOSfP1bMdVfWrbafOBb2fOdFbsHfTlv+klJvaVKoNt6j3iQ/0q/U8Og+M9LC5pEtMURUGNXGS5 fci8b6lrQ20/F99S8foYZPyG7KGKd97Vm2vwFlvIS9P3x7TmIqXSz5qApGZSfoSxF/ybGn1g27XgN /GgLePM+/Pov1YFOnw5OzewKz2tl6pvAUqSs2SslC5GucnwcfaPCBXtBqXd+PSDBYKQH/tNr2AbsH ACHoxrtQT1Ejq5c4nTQIJ5w2Jx86RrCBUSKlWkC90wYL/0RBPnxrGi5l0G9RqT32rEBh1acgWAyQi SIpq0c1g==; Received: from [187.74.70.209] (helo=steammachine.lan) by fanzine2.igalia.com with esmtpsa (Cipher TLS1.3:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_256_GCM:256) (Exim) id 1qJL1U-00Cl0M-Ph; Tue, 11 Jul 2023 23:35:53 +0200 From: =?UTF-8?q?Andr=C3=A9=20Almeida?= To: dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org Cc: kernel-dev@igalia.com, alexander.deucher@amd.com, christian.koenig@amd.com, pierre-eric.pelloux-prayer@amd.com, =?UTF-8?q?=27Marek=20Ol=C5=A1=C3=A1k=27?= , Samuel Pitoiset , Bas Nieuwenhuizen , =?UTF-8?q?Timur=20Krist=C3=B3f?= , michel.daenzer@mailbox.org, =?UTF-8?q?Andr=C3=A9=20Almeida?= Subject: [PATCH 4/6] drm/amdgpu: Limit info in coredump for kernel threads Date: Tue, 11 Jul 2023 18:34:59 -0300 Message-ID: <20230711213501.526237-5-andrealmeid@igalia.com> X-Mailer: git-send-email 2.41.0 In-Reply-To: <20230711213501.526237-1-andrealmeid@igalia.com> References: <20230711213501.526237-1-andrealmeid@igalia.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org If a kernel thread caused the reset, the information available to be logged will be limited, so return early in the dump function. 
Signed-off-by: Andr=C3=A9 Almeida --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/a= md/amdgpu/amdgpu_device.c index 23b9784e9787..7449aead1e13 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -4988,10 +4988,14 @@ static ssize_t amdgpu_devcoredump_read(char *buffer= , loff_t offset, drm_printf(&p, "kernel: " UTS_RELEASE "\n"); drm_printf(&p, "module: " KBUILD_MODNAME "\n"); drm_printf(&p, "time: %lld.%09ld\n", coredump->reset_time.tv_sec, coredum= p->reset_time.tv_nsec); - if (coredump->reset_task_info.pid) + if (coredump->reset_task_info.pid) { drm_printf(&p, "process_name: %s PID: %d\n", coredump->reset_task_info.process_name, coredump->reset_task_info.pid); + } else { + drm_printf(&p, "GPU reset caused by a kernel thread\n"); + return count - iter.remain; + } =20 if (coredump->reset_vram_lost) drm_printf(&p, "VRAM is lost due to GPU reset!\n"); --=20 2.41.0 From nobody Sun Feb 8 11:06:40 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B3694EB64DC for ; Tue, 11 Jul 2023 21:36:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231613AbjGKVgL (ORCPT ); Tue, 11 Jul 2023 17:36:11 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39134 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231549AbjGKVgD (ORCPT ); Tue, 11 Jul 2023 17:36:03 -0400 Received: from fanzine2.igalia.com (fanzine2.igalia.com [213.97.179.56]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7CF9C1987 for ; Tue, 11 Jul 2023 14:35:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=igalia.com; 
s=20170329; h=Content-Transfer-Encoding:Content-Type:MIME-Version:References: In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=WzTNgc8vMEis87Hj/iirVJV1cpCA51tiCMdAX2SlFNM=; b=SEXnvZaJfsA0QKs6gClIojECCK bvxgfKyKHkDq9dKFqCt7+1ZJ+W8nSv4DQPbeUDnyA2c/FtLmUVRh4/IKdGhR2zXSnjp8h7F/Lm9YL r2AywIcLF8Vw3PstpFkDK0LLf90vP3tSzHZqEfa01ZXVYNwZt8KIMt5TcpHEfWjVHxwdL3DMVNSzb 6+6OSWuU5V8lG9KiodvyL0P+kRoBqx7nfgSsFQnrqDcpNIPzomIViwcQe6efExSRhneh4lui4kKMI /f3u4B0J/r1W9XG98j2AQ++gIdORNs1DnwGuNJ49VdLvzr9joTeGHAEv8LwopTzTBNF0+0Yxi9Tcd lH2DB2/w==; Received: from [187.74.70.209] (helo=steammachine.lan) by fanzine2.igalia.com with esmtpsa (Cipher TLS1.3:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_256_GCM:256) (Exim) id 1qJL1Y-00Cl0M-2x; Tue, 11 Jul 2023 23:35:56 +0200 From: =?UTF-8?q?Andr=C3=A9=20Almeida?= To: dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org Cc: kernel-dev@igalia.com, alexander.deucher@amd.com, christian.koenig@amd.com, pierre-eric.pelloux-prayer@amd.com, =?UTF-8?q?=27Marek=20Ol=C5=A1=C3=A1k=27?= , Samuel Pitoiset , Bas Nieuwenhuizen , =?UTF-8?q?Timur=20Krist=C3=B3f?= , michel.daenzer@mailbox.org, =?UTF-8?q?Andr=C3=A9=20Almeida?= Subject: [PATCH 5/6] drm/amdgpu: Log IBs and ring name at coredump Date: Tue, 11 Jul 2023 18:35:00 -0300 Message-ID: <20230711213501.526237-6-andrealmeid@igalia.com> X-Mailer: git-send-email 2.41.0 In-Reply-To: <20230711213501.526237-1-andrealmeid@igalia.com> References: <20230711213501.526237-1-andrealmeid@igalia.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Log the IB addresses used by the hung job along with the stuck ring name. 
Note that due to nested IBs, the one that caused the reset itself may be at an address that is not listed. Signed-off-by: Andr=C3=A9 Almeida --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 3 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 31 +++++++++++++++++++++- 2 files changed, 33 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdg= pu/amdgpu.h index e1cc83a89d46..cfeaf93934fd 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1086,6 +1086,9 @@ struct amdgpu_coredump_info { struct amdgpu_task_info reset_task_info; struct timespec64 reset_time; bool reset_vram_lost; + u64 *ibs; + u32 num_ibs; + char ring_name[16]; }; #endif =20 diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/a= md/amdgpu/amdgpu_device.c index 7449aead1e13..38d03ca7a9fc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -5008,12 +5008,24 @@ static ssize_t amdgpu_devcoredump_read(char *buffer= , loff_t offset, coredump->adev->reset_dump_reg_value[i]); } =20 + if (coredump->num_ibs) { + drm_printf(&p, "IBs:\n"); + for (i =3D 0; i < coredump->num_ibs; i++) + drm_printf(&p, "\t[%d] 0x%llx\n", i, coredump->ibs[i]); + } + + if (coredump->ring_name[0] !=3D '\0') + drm_printf(&p, "ring name: %s\n", coredump->ring_name); + return count - iter.remain; } =20 static void amdgpu_devcoredump_free(void *data) { - kfree(data); + struct amdgpu_coredump_info *coredump =3D data; + + kfree(coredump->ibs); + kfree(coredump); } =20 static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost, @@ -5021,6 +5033,8 @@ static void amdgpu_coredump(struct amdgpu_device *ade= v, bool vram_lost, { struct amdgpu_coredump_info *coredump; struct drm_device *dev =3D adev_to_drm(adev); + struct amdgpu_job *job =3D reset_context->job; + int i; =20 coredump =3D kmalloc(sizeof(*coredump), GFP_KERNEL); =20 @@ -5038,6 +5052,21 @@ static void amdgpu_coredump(struct 
amdgpu_device *ad= ev, bool vram_lost, =20 coredump->adev =3D adev; =20 + if (job && job->num_ibs) { + struct amdgpu_ring *ring =3D to_amdgpu_ring(job->base.sched); + u32 num_ibs =3D job->num_ibs; + + coredump->ibs =3D kmalloc_array(num_ibs, sizeof(*coredump->ibs), GFP_KERN= EL); + if (coredump->ibs) + coredump->num_ibs =3D num_ibs; + + for (i =3D 0; i < coredump->num_ibs; i++) + coredump->ibs[i] =3D job->ibs[i].gpu_addr; + + if (ring) + strncpy(coredump->ring_name, ring->name, 16); + } + ktime_get_ts64(&coredump->reset_time); =20 dev_coredumpm(dev->dev, THIS_MODULE, coredump, 0, GFP_KERNEL, --=20 2.41.0 From nobody Sun Feb 8 11:06:40 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 86D45EB64DC for ; Tue, 11 Jul 2023 21:36:16 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231615AbjGKVgP (ORCPT ); Tue, 11 Jul 2023 17:36:15 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39248 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229742AbjGKVgH (ORCPT ); Tue, 11 Jul 2023 17:36:07 -0400 Received: from fanzine2.igalia.com (fanzine2.igalia.com [213.97.179.56]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4C9131BC1 for ; Tue, 11 Jul 2023 14:36:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=igalia.com; s=20170329; h=Content-Transfer-Encoding:Content-Type:MIME-Version:References: In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=ejZZEKDMhXw4j4IEg03PtvfMno8w3phoXuomv3p/hio=; b=QndCUJKuerndBPoYHtUjBNEcos 
GQ7EXlnjZy4EXgWbIjj8nfcxW3mC1bRxp1kYNq6sIxSNSWYdnCtGIXuxHDqNKY6UIKRGwNubxO01s ptcPNA3QnDDmaLn/DWoD+xqEIXBBPl+6UECjVZh2xYqSH96BnkdySGfaNC62C3eMEbAXynN8Pt3h2 Se+CyKHmMFfaStfnYJhcxTkMWa8ZcfVL3D6V8+PmpDTrpi8t07INoGsaKBp3HsKyd39lyqhruGjKN 3oZDedRAohu3tn8QbB021EvmxQjnaHfz5hWyy7Ziski5TQPpZZeom/tO0UjESD4VKBr+yKcud6QbF H35d/Uhw==; Received: from [187.74.70.209] (helo=steammachine.lan) by fanzine2.igalia.com with esmtpsa (Cipher TLS1.3:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_256_GCM:256) (Exim) id 1qJL1b-00Cl0M-DU; Tue, 11 Jul 2023 23:35:59 +0200 From: =?UTF-8?q?Andr=C3=A9=20Almeida?= To: dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org Cc: kernel-dev@igalia.com, alexander.deucher@amd.com, christian.koenig@amd.com, pierre-eric.pelloux-prayer@amd.com, =?UTF-8?q?=27Marek=20Ol=C5=A1=C3=A1k=27?= , Samuel Pitoiset , Bas Nieuwenhuizen , =?UTF-8?q?Timur=20Krist=C3=B3f?= , michel.daenzer@mailbox.org, =?UTF-8?q?Andr=C3=A9=20Almeida?= Subject: [PATCH 6/6] drm/amdgpu: Create version number for coredumps Date: Tue, 11 Jul 2023 18:35:01 -0300 Message-ID: <20230711213501.526237-7-andrealmeid@igalia.com> X-Mailer: git-send-email 2.41.0 In-Reply-To: <20230711213501.526237-1-andrealmeid@igalia.com> References: <20230711213501.526237-1-andrealmeid@igalia.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Even if there's nothing currently parsing amdgpu's coredump files, if we eventually have such tools they will be glad to find a version field to properly read the file. Create a version number to be displayed at the top of the coredump file, to be incremented whenever the file format or content changes. 
Signed-off-by: Andr=C3=A9 Almeida --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 3 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 + 2 files changed, 4 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdg= pu/amdgpu.h index cfeaf93934fd..905574acf3a0 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1081,6 +1081,9 @@ struct amdgpu_device { }; =20 #ifdef CONFIG_DEV_COREDUMP + +#define AMDGPU_COREDUMP_VERSION "1" + struct amdgpu_coredump_info { struct amdgpu_device *adev; struct amdgpu_task_info reset_task_info; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/a= md/amdgpu/amdgpu_device.c index 38d03ca7a9fc..7b448e189717 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -4985,6 +4985,7 @@ static ssize_t amdgpu_devcoredump_read(char *buffer, = loff_t offset, p =3D drm_coredump_printer(&iter); =20 drm_printf(&p, "**** AMDGPU Device Coredump ****\n"); + drm_printf(&p, "version: " AMDGPU_COREDUMP_VERSION "\n"); drm_printf(&p, "kernel: " UTS_RELEASE "\n"); drm_printf(&p, "module: " KBUILD_MODNAME "\n"); drm_printf(&p, "time: %lld.%09ld\n", coredump->reset_time.tv_sec, coredum= p->reset_time.tv_nsec); --=20 2.41.0
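Since patch 6/6 adds the version field for the benefit of future parsing tools, here is a rough userspace sketch of how such a tool might read it. This is not part of the series: the function name amdgpu_coredump_text_version() is hypothetical, and the sketch only assumes the "version: N" line format printed by amdgpu_devcoredump_read() above. Dumps produced before this series carry no version line, so the helper reports -1 for them.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical userspace helper (an assumption, not part of this series):
 * scan an amdgpu devcoredump text for a line starting with "version: "
 * and return its numeric value, or -1 if the field is absent or malformed.
 */
long amdgpu_coredump_text_version(const char *text)
{
	static const char key[] = "version: ";
	const size_t key_len = sizeof(key) - 1;
	const char *line = text;

	assert(text != NULL);

	while (*line != '\0') {
		if (strncmp(line, key, key_len) == 0) {
			const char *val = line + key_len;
			char *end;
			long v;

			/* strtol() skips whitespace, so insist on a digit here. */
			if (!isdigit((unsigned char)*val))
				return -1;

			v = strtol(val, &end, 10);

			/* The number must run up to the end of the line. */
			if (*end != '\n' && *end != '\0')
				return -1;
			return v;
		}

		/* Advance to the start of the next line, if there is one. */
		line = strchr(line, '\n');
		if (line == NULL)
			break;
		line++;
	}

	return -1;
}
```

A tool could call this on the contents read from the devcoredump data file and refuse, or fall back to a legacy path, when the result is -1 or newer than the formats it knows about.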