From: Ruidong Tian
To: alexander.deucher@amd.com, christian.koenig@amd.com, Xinhui.Pan@amd.com,
    airlied@gmail.com, simona@ffwll.ch, xueshuai@linux.alibaba.com
Cc: amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
    linux-kernel@vger.kernel.org, tianruidong@linux.alibaba.com
Subject: [RESEND PATCH] drm/amdgpu: add tracepoint while dump mca bank
Date: Wed, 8 Jan 2025 10:34:00 +0800
Message-Id: <20250108023400.35081-1-tianruidong@linux.alibaba.com>
X-Mailer: git-send-email 2.33.1

RAS errors are typically exposed to user-space programs using
tracepoints, allowing tools like rasdaemon to decode and post-process
them. AMDGPU might also follow this, offering the following benefits:

1. It can proactively notify users of RAS events, eliminating the need
   to monitor /dev/kmsg.
2. It allows for further post-processing similar to AMD SMCA[1].

[1]: https://github.com/mchehab/rasdaemon/commit/932118

Signed-off-by: Ruidong Tian
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c   |  3 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 31 +++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
index 3ca03b5e0f91..9daa95365457 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
@@ -23,6 +23,7 @@
 #include "amdgpu_ras.h"
 #include "amdgpu.h"
 #include "amdgpu_mca.h"
+#include "amdgpu_trace.h"
 
 #include "umc/umc_6_7_0_offset.h"
 #include "umc/umc_6_7_0_sh_mask.h"
@@ -287,6 +288,8 @@ static void amdgpu_mca_smu_mca_bank_dump(struct amdgpu_device *adev, int idx, st
 		      idx, entry->regs[MCA_REG_IDX_IPID]);
 	RAS_EVENT_LOG(adev, event_id, HW_ERR "aca entry[%02d].SYND=0x%016llx\n",
 		      idx, entry->regs[MCA_REG_IDX_SYND]);
+
+	trace_amdgpu_mca_bank_dumps(event_id, idx, entry);
 }
 
 static int amdgpu_mca_smu_get_valid_mca_count(struct amdgpu_device *adev, enum amdgpu_mca_error_type type, uint32_t *count)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
index 383fce40d4dd..a0ba79394099 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
@@ -554,6 +554,37 @@ TRACE_EVENT(amdgpu_reset_reg_dumps,
 		      __entry->value)
 );
 
+TRACE_EVENT(amdgpu_mca_bank_dumps,
+	    TP_PROTO(uint64_t event_id, int idx, struct mca_bank_entry *e),
+	    TP_ARGS(event_id, idx, e),
+	    TP_STRUCT__entry(
+			     __field(uint64_t, event_id)
+			     __field(int, idx)
+			     __field(uint64_t, status)
+			     __field(uint64_t, addr)
+			     __field(uint64_t, misc0)
+			     __field(uint64_t, ipid)
+			     __field(uint64_t, synd)
+			     ),
+	    TP_fast_assign(
+			   __entry->event_id = event_id;
+			   __entry->idx = idx;
+			   __entry->status = e->regs[MCA_REG_IDX_STATUS];
+			   __entry->addr = e->regs[MCA_REG_IDX_ADDR];
+			   __entry->misc0 = e->regs[MCA_REG_IDX_MISC0];
+			   __entry->ipid = e->regs[MCA_REG_IDX_IPID];
+			   __entry->synd = e->regs[MCA_REG_IDX_SYND];
+			   ),
+	    TP_printk("amdgpu mca bank dump: event_id: %lld, idx: %d, STATUS: %016llx, ADDR: %016llx, MISC0: %016llx, IPID: %016llx, SYND: %016llx",
+		      __entry->event_id,
+		      __entry->idx,
+		      __entry->status,
+		      __entry->addr,
+		      __entry->misc0,
+		      __entry->ipid,
+		      __entry->synd)
+);
+
 #undef AMDGPU_JOB_GET_TIMELINE_NAME
 #endif
 
-- 
2.33.1
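
For reference, a minimal user-space sketch of how a consumer might decode
the text form of this event, assuming the line layout matches the
TP_printk() format string in this patch. The struct mca_bank_dump type and
the parse_mca_bank_dump() helper are hypothetical names used only for
illustration; they are not part of the patch or of rasdaemon. event_id is
read as a signed value because the format string prints it with %lld.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical user-space mirror of the tracepoint fields. */
struct mca_bank_dump {
	int64_t event_id;	/* printed with %lld by TP_printk() */
	int idx;
	uint64_t status;
	uint64_t addr;
	uint64_t misc0;
	uint64_t ipid;
	uint64_t synd;
};

/*
 * Parse one formatted event line (for example, a line read from the
 * trace buffer). Any prefix before the "amdgpu mca bank dump:" tag is
 * skipped. Returns 0 on success, -1 if the tag is missing or fewer
 * than seven fields could be converted.
 */
int parse_mca_bank_dump(const char *line, struct mca_bank_dump *out)
{
	static const char tag[] = "amdgpu mca bank dump:";
	const char *p;
	int n;

	assert(line && out);

	p = strstr(line, tag);
	if (!p)
		return -1;

	/* The register values are printed as bare hex without a 0x prefix. */
	n = sscanf(p + sizeof(tag) - 1,
		   " event_id: %" SCNd64 ", idx: %d, STATUS: %" SCNx64
		   ", ADDR: %" SCNx64 ", MISC0: %" SCNx64
		   ", IPID: %" SCNx64 ", SYND: %" SCNx64,
		   &out->event_id, &out->idx, &out->status, &out->addr,
		   &out->misc0, &out->ipid, &out->synd);

	return n == 7 ? 0 : -1;
}
```

A caller would feed each line it reads to parse_mca_bank_dump() and act on
the decoded registers when it returns 0. A tool such as rasdaemon would
more likely consume the binary event fields than the printed text, so this
is only a quick way to check the output by hand.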