From nobody Sun Feb 8 14:22:02 2026 Received: from frasgout.his.huawei.com (frasgout.his.huawei.com [185.176.79.56]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DE4F61F9413; Fri, 7 Feb 2025 14:30:57 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=185.176.79.56 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738938660; cv=none; b=X02XbtWdymSPLh1kV5XvDRzZmc5ZPisk/rWzuy9OqKkcB927dUFgdqrEHpGM2jpUBs1kcaTiUl/jPMnGklBPYqO3X0YELvMA2NHOhAMDi3C8sbeqdoocuL9bTjzFBJIXyM+zhoXLi8p5XpouQASfviQvGCm+vlFDeJ7z9H8gjMM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738938660; c=relaxed/simple; bh=vNVC3Evx3GympB8CVtdKud8Q4t14jImdj81D6vl44tA=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=QhbQfnIWd9/4aDVzYPu10kAV6PZ3+NS61KBvz9FRBwe+K2+wmXoZyPHnQS8hXCIxQiFZPzcFSjEqiRD7r9pdhdhw7JYDduqKpyyCsHKrYPMK5sZmpWCr8F22EagdA4tW+WH1gJNHzSpsR7Y4fUJp6GSj2Tiwt4+TzcrAd3Zs7hA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=185.176.79.56 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.18.186.231]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4YqGZp0YMHz6HJZd; Fri, 7 Feb 2025 22:29:54 +0800 (CST) Received: from frapeml500007.china.huawei.com (unknown [7.182.85.172]) by mail.maildlp.com (Postfix) with ESMTPS id 5DCC8140A70; Fri, 7 Feb 2025 22:30:55 +0800 (CST) Received: from P_UKIT01-A7bmah.china.huawei.com (10.126.173.5) by frapeml500007.china.huawei.com (7.182.85.172) with Microsoft SMTP Server (version=TLS1_2, 
cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.1.2507.39; Fri, 7 Feb 2025 15:30:54 +0100 From: To: , , , , , , , , , , CC: , , , , Subject: [PATCH 1/4] rasdaemon: cxl: Add support for memory sparing operation Date: Fri, 7 Feb 2025 14:30:22 +0000 Message-ID: <20250207143028.1865-2-shiju.jose@huawei.com> X-Mailer: git-send-email 2.43.0.windows.1 In-Reply-To: <20250207143028.1865-1-shiju.jose@huawei.com> References: <20250207143028.1865-1-shiju.jose@huawei.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: lhrpeml100002.china.huawei.com (7.191.160.241) To frapeml500007.china.huawei.com (7.182.85.172) Content-Type: text/plain; charset="utf-8" From: Shiju Jose CXL spec 3.1, Section 8.2.9.2.1, Table 8-43, "Common Event Record Format", defines the Event Record Flags 'Maintenance Needed' and 'Maintenance Operation Subclass Valid Flag'. When these flags are set, they signal that the memory device requires maintenance. When the device sets the maintenance operation class and maintenance operation subclass for memory sparing, the CXL DRAM handler sets the memory sparing attributes via the EDAC memory repair sysfs interface, initiating the sparing operation in the CXL memory device. Add support for the memory sparing operation and enable it for the CXL DRAM event if auto repair is on. Auto memory repair is disabled by default. Signed-off-by: Shiju Jose --- misc/rasdaemon.env | 4 + ras-cxl-handler.c | 287 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 291 insertions(+) diff --git a/misc/rasdaemon.env b/misc/rasdaemon.env index 963aaa0..b3fdba7 100644 --- a/misc/rasdaemon.env +++ b/misc/rasdaemon.env @@ -88,3 +88,7 @@ TRIGGER_DIR=3D # MC_UE_TRIGGER=3Dmc_event_trigger MC_CE_TRIGGER=3D MC_UE_TRIGGER=3D + +# CXL memory auto repair control +# Whether to enable CXL memory auto repair (yes|no). 
+CXL_AUTO_REPAIR_ENABLE=3D"no" diff --git a/ras-cxl-handler.c b/ras-cxl-handler.c index cb95fa6..3311949 100644 --- a/ras-cxl-handler.c +++ b/ras-cxl-handler.c @@ -4,7 +4,9 @@ * Copyright (c) Huawei Technologies Co., Ltd. 2023. All rights reserved. */ =20 +#include #include +#include #include #include #include @@ -722,6 +724,140 @@ static int handle_ras_cxl_common_hdr(struct trace_seq= *s, return 0; } =20 +/* memory repair */ +/* + * Common Event Record Format + * CXL 3.1 section 8.2.9.2.1; Table 8-43 + */ +#define CXL_MAINT_CLASS_SPARING 0x02 +#define CXL_MAINT_SUBCLASS_CACHE_SPARING 0x00 +#define CXL_MAINT_SUBCLASS_ROW_SPARING 0x01 +#define CXL_MAINT_SUBCLASS_BANK_SPARING 0x02 +#define CXL_MAINT_SUBCLASS_RANK_SPARING 0x03 + +#define CXL_CMD_BUF_SIZE 256 + +enum cxl_mem_sparing_type { + CXL_CACHE_SPARING, + CXL_ROW_SPARING, + CXL_BANK_SPARING, + CXL_RANK_SPARING, +}; + +static const char *edac_bus_path =3D "/sys/bus/edac/devices/"; +#define EDAC_CXL_DEV_PREFIX "cxl_" + +/* + * Auto repair is disabled by default. + * 'export CXL_AUTO_REPAIR_ENABLE=3Dyes' to enable auto repair. 
+ */ +static bool auto_repair; + +static void check_config_status(void) +{ + char *env =3D getenv("CXL_AUTO_REPAIR_ENABLE"); + + if (!env || strcasecmp(env, "yes")) + return; + + auto_repair =3D true; +} + +static int get_sysfs_data_str(const char *dir, const char *file, char *out) +{ + char path[CXL_CMD_BUF_SIZE]; + char buf[CXL_CMD_BUF_SIZE]; + int fd; + + snprintf(path, CXL_CMD_BUF_SIZE, "%s/%s", dir, file); + fd =3D open(path, O_RDONLY); + if (fd =3D=3D -1) { + log(TERM, LOG_ERR, "[%s]:open file: %s failed\n", __func__, path); + return -1; + } + + memset(buf, 0, sizeof(buf)); + if (read(fd, buf, sizeof(buf) - 1) <=3D 0) + goto error; + + if (sscanf(buf, "%s", out) <=3D 0) + goto error; + + close(fd); + return 0; + +error: + close(fd); + return -1; +} + +static int set_sysfs_data_uint32(const char *dir, const char *file, uint32= _t data) +{ + char path[CXL_CMD_BUF_SIZE]; + int fd; + + snprintf(path, CXL_CMD_BUF_SIZE, "%s/%s", dir, file); + fd =3D open(path, O_WRONLY); + if (fd =3D=3D -1) { + log(TERM, LOG_ERR, "[%s]:open file: %s failed\n", __func__, path); + return -1; + } + + if (dprintf(fd, "%u", data) <=3D 0) { + log(TERM, LOG_ERR, "[%s]: write data to [%s] failed, errno:%d\n", + __func__, path, errno); + close(fd); + return -1; + } + close(fd); + + return 0; +} + +static int set_sysfs_data_uint64(const char *dir, const char *file, uint64= _t data) +{ + char path[CXL_CMD_BUF_SIZE]; + int fd; + + snprintf(path, CXL_CMD_BUF_SIZE, "%s/%s", dir, file); + fd =3D open(path, O_WRONLY); + if (fd =3D=3D -1) { + log(TERM, LOG_ERR, "[%s]:open file: %s failed\n", __func__, path); + return -1; + } + + if (dprintf(fd, "0x%llx", (unsigned long long)data) <=3D 0) { + log(TERM, LOG_ERR, "[%s]: write data to [%s] failed, errno:%d\n", + __func__, path, errno); + close(fd); + return -1; + } + close(fd); + + return 0; +} + +static int cxl_find_spare(const char *repair_dev, const char *repair_type) +{ + char dir[CXL_CMD_BUF_SIZE]; + char out[CXL_CMD_BUF_SIZE]; + int idx =3D 0; + + while (1) { + 
snprintf(dir, CXL_CMD_BUF_SIZE, "%s%s%s/mem_repair%d", + edac_bus_path, EDAC_CXL_DEV_PREFIX, repair_dev, idx); + + if (get_sysfs_data_str(dir, "repair_type", out)) + return -1; + + if (!strcmp(repair_type, out)) + return idx; + idx++; + } + + return -1; +} + int ras_cxl_generic_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context) @@ -1027,6 +1163,155 @@ static const char * const cxl_der_mem_event_type[] = =3D { "CKID Violation", }; =20 +/* + * Each type of sparing requires a superset of the info needed for + * coarser grained sparing. + */ +static int fill_rank_sparing_attrs(struct ras_cxl_dram_event *ev, + const char *dir) +{ + if (set_sysfs_data_uint64(dir, "dpa", ev->dpa)) + return -1; + + if (set_sysfs_data_uint32(dir, "channel", ev->channel)) + return -1; + + if (set_sysfs_data_uint32(dir, "rank", ev->rank)) + return -1; + + if (ev->validity_flags & CXL_DER_VALID_NIBBLE) { + if (set_sysfs_data_uint32(dir, "nibble_mask", ev->nibble_mask)) + return -1; + } + + return 0; +} + +static int fill_bank_sparing_attrs(struct ras_cxl_dram_event *ev, + const char *dir) +{ + if (fill_rank_sparing_attrs(ev, dir)) + return -1; + + if (set_sysfs_data_uint32(dir, "bank_group", ev->bank_group)) + return -1; + + if (set_sysfs_data_uint32(dir, "bank", ev->bank)) + return -1; + + return 0; +} + +static int fill_row_sparing_attrs(struct ras_cxl_dram_event *ev, + const char *dir) +{ + if (fill_bank_sparing_attrs(ev, dir)) + return -1; + + if (set_sysfs_data_uint32(dir, "row", ev->row)) + return -1; + + return 0; +} + +static int fill_cacheline_sparing_attrs(struct ras_cxl_dram_event *ev, + const char *dir) +{ + if (fill_row_sparing_attrs(ev, dir)) + return -1; + + if (set_sysfs_data_uint32(dir, "column", ev->column)) + return -1; + + if (ev->validity_flags & CXL_DER_VALID_SUB_CHANNEL) { + if (set_sysfs_data_uint32(dir, "sub_channel", ev->sub_channel)) + return -1; + } + + return 0; +} + +static int cxl_dram_sparing(struct 
ras_cxl_dram_event *ev) +{ + struct ras_cxl_event_common_hdr *hdr =3D &ev->hdr; + char dir[CXL_CMD_BUF_SIZE]; + char repair_type[256]; + uint8_t sparing_type; + int idx; + + check_config_status(); + if (!auto_repair) + return -1; + + if (!(ev->hdr.hdr_flags & CXL_EVENT_RECORD_FLAG_MAINT_NEEDED) || + !(ev->hdr.hdr_flags & CXL_EVENT_RECORD_FLAG_MAINT_OP_SUB_CLASS_VALID)= || + ev->hdr.hdr_maint_op_class !=3D CXL_MAINT_CLASS_SPARING || + ev->dpa_flags & CXL_DPA_NOT_REPAIRABLE) + return -1; + + if (!(ev->validity_flags & CXL_DER_VALID_CHANNEL) || + !(ev->validity_flags & CXL_DER_VALID_RANK)) + return -1; + + /* + * CXL device reports the type of the repair in the event record. + */ + switch (hdr->hdr_maint_op_sub_class) { + case CXL_MAINT_SUBCLASS_CACHE_SPARING: + if (!(ev->validity_flags & CXL_DER_VALID_BANK_GROUP) || + !(ev->validity_flags & CXL_DER_VALID_BANK) || + !(ev->validity_flags & CXL_DER_VALID_ROW) || + !(ev->validity_flags & CXL_DER_VALID_COLUMN)) + return -1; + snprintf(repair_type, CXL_CMD_BUF_SIZE, "cacheline-sparing"); + sparing_type =3D CXL_CACHE_SPARING; + break; + case CXL_MAINT_SUBCLASS_ROW_SPARING: + if (!(ev->validity_flags & CXL_DER_VALID_BANK_GROUP) || + !(ev->validity_flags & CXL_DER_VALID_BANK) || + !(ev->validity_flags & CXL_DER_VALID_ROW)) + return -1; + snprintf(repair_type, CXL_CMD_BUF_SIZE, "row-sparing"); + sparing_type =3D CXL_ROW_SPARING; + break; + case CXL_MAINT_SUBCLASS_BANK_SPARING: + if (!(ev->validity_flags & CXL_DER_VALID_BANK_GROUP) || + !(ev->validity_flags & CXL_DER_VALID_BANK)) + return -1; + snprintf(repair_type, CXL_CMD_BUF_SIZE, "bank-sparing"); + sparing_type =3D CXL_BANK_SPARING; + break; + case CXL_MAINT_SUBCLASS_RANK_SPARING: + snprintf(repair_type, CXL_CMD_BUF_SIZE, "rank-sparing"); + sparing_type =3D CXL_RANK_SPARING; + break; + default: + return -1; + } + + idx =3D cxl_find_spare(hdr->memdev, repair_type); + if (idx < 0) + return -1; + + snprintf(dir, CXL_CMD_BUF_SIZE, "%s%s%s/mem_repair%d", + edac_bus_path, 
EDAC_CXL_DEV_PREFIX, ev->hdr.memdev, idx); + + if (sparing_type =3D=3D CXL_CACHE_SPARING) + fill_cacheline_sparing_attrs(ev, dir); + else if (sparing_type =3D=3D CXL_ROW_SPARING) + fill_row_sparing_attrs(ev, dir); + else if (sparing_type =3D=3D CXL_BANK_SPARING) + fill_bank_sparing_attrs(ev, dir); + else if (sparing_type =3D=3D CXL_RANK_SPARING) + fill_rank_sparing_attrs(ev, dir); + else + return -1; + + set_sysfs_data_uint32(dir, "repair", 1); + + return 0; +} + int ras_cxl_dram_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context) @@ -1231,6 +1516,8 @@ int ras_cxl_dram_event_handler(struct trace_seq *s, if (trace_seq_printf(s, "CVME Count:%u ", ev.cvme_count) <=3D 0) return -1; =20 + cxl_dram_sparing(&ev); + /* Insert data into the SGBD */ #ifdef HAVE_SQLITE3 ras_store_cxl_dram_event(ras, &ev); --=20 2.43.0 From nobody Sun Feb 8 14:22:02 2026 Received: from frasgout.his.huawei.com (frasgout.his.huawei.com [185.176.79.56]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6E33E20B817; Fri, 7 Feb 2025 14:30:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=185.176.79.56 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738938660; cv=none; b=qHuz71J+wVEsv9JBbm0Y25leD3clCJPtFtgjAU1h7oAhbzc8jhEH34cbvrIrbSAaX0j2ulSutDw6HcbLo9GikXmbuPLLsgoit6IAgl7AMAqIgnNNhQtDWy/fL9kPcdRhQgh0m68LDMPDDWILOMzZAHxL5ojbMTWHqOjTdO2g6d0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738938660; c=relaxed/simple; bh=JPGQWfCGfHcibenALWxkM4pdFRT55xRV1cpvh0S04P4=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=Oai+vf7AnOXc9YOg55PQ4ex74T8yz6OoiuhfplaU9qxkP8QR9/UG6h0m5hR/xAHSby0TLN7iflDjVtMH2zfTidy2Oe8h0OeqmuZIpEu3MsVdn09IrtnM5xDXs1pQYoukRzHbEELPUwvlutwLLhfJkSxqRfjIRv9wf385BvXAZq8= 
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=185.176.79.56 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.18.186.216]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4YqGZq1g7Zz6HJZg; Fri, 7 Feb 2025 22:29:55 +0800 (CST) Received: from frapeml500007.china.huawei.com (unknown [7.182.85.172]) by mail.maildlp.com (Postfix) with ESMTPS id 8489C1406AC; Fri, 7 Feb 2025 22:30:56 +0800 (CST) Received: from P_UKIT01-A7bmah.china.huawei.com (10.126.173.5) by frapeml500007.china.huawei.com (7.182.85.172) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.1.2507.39; Fri, 7 Feb 2025 15:30:55 +0100 From: To: , , , , , , , , , , CC: , , , , Subject: [PATCH 2/4] rasdaemon: cxl: Add support for memory soft PPR operation Date: Fri, 7 Feb 2025 14:30:23 +0000 Message-ID: <20250207143028.1865-3-shiju.jose@huawei.com> X-Mailer: git-send-email 2.43.0.windows.1 In-Reply-To: <20250207143028.1865-1-shiju.jose@huawei.com> References: <20250207143028.1865-1-shiju.jose@huawei.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: lhrpeml100002.china.huawei.com (7.191.160.241) To frapeml500007.china.huawei.com (7.182.85.172) Content-Type: text/plain; charset="utf-8" From: Shiju Jose CXL spec 3.1, Section 8.2.9.2.1, Table 8-43, "Common Event Record Format", defines the Event Record Flags 'Maintenance Needed' and 'Maintenance Operation Subclass Valid Flag'. When these flags are set, they signal that the memory device requires maintenance. 
When the device sets the maintenance operation class and maintenance operation subclass for memory soft PPR (Post Package Repair), the CXL DRAM event handler and the CXL general media event handler set the memory PPR attributes via the EDAC memory repair sysfs interface, initiating the soft PPR operation in the CXL memory device. Add support for the memory soft PPR operation and enable it for the CXL DRAM event and the CXL general media event if auto repair is on. Signed-off-by: Shiju Jose --- ras-cxl-handler.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 96 insertions(+) diff --git a/ras-cxl-handler.c b/ras-cxl-handler.c index 3311949..ae49740 100644 --- a/ras-cxl-handler.c +++ b/ras-cxl-handler.c @@ -735,6 +735,10 @@ static int handle_ras_cxl_common_hdr(struct trace_seq = *s, #define CXL_MAINT_SUBCLASS_BANK_SPARING 0x02 #define CXL_MAINT_SUBCLASS_RANK_SPARING 0x03 =20 +#define CXL_MAINT_CLASS_PPR 0x01 +#define CXL_MAINT_SUBCLASS_SPPR 0x00 +#define CXL_MAINT_SUBCLASS_HPPR 0x01 + #define CXL_CMD_BUF_SIZE 256 =20 enum cxl_mem_sparing_type { @@ -791,6 +795,34 @@ error: return -1; } =20 +static int get_sysfs_data_uint32(const char *dir, const char *file) +{ + char path[CXL_CMD_BUF_SIZE]; + char buf[2] =3D ""; + int fd, num; + + snprintf(path, CXL_CMD_BUF_SIZE, "%s/%s", dir, file); + fd =3D open(path, O_RDONLY); + if (fd =3D=3D -1) { + log(TERM, LOG_ERR, "[%s]:open file: %s failed\n", __func__, path); + return -1; + } + + if (read(fd, buf, 1) <=3D 0) + goto error; + + if (sscanf(buf, "%d", &num) <=3D 0) + goto error; + + close(fd); + + return num; + +error: + close(fd); + return -1; +} + static int set_sysfs_data_uint32(const char *dir, const char *file, uint32= _t data) { char path[CXL_CMD_BUF_SIZE]; @@ -858,6 +890,34 @@ static int cxl_find_spare(const char *repair_dev, cons= t char *repair_type) return -1; } =20 +static int cxl_find_ppr(const char *repair_dev, const char *repair_type) +{ + char dir[CXL_CMD_BUF_SIZE]; + char out[CXL_CMD_BUF_SIZE]; + int idx =3D 
0; + int persist; + + for (;; idx++) { + snprintf(dir, CXL_CMD_BUF_SIZE, "%s%s%s/mem_repair%d", + edac_bus_path, EDAC_CXL_DEV_PREFIX, repair_dev, idx); + + persist =3D get_sysfs_data_uint32(dir, "persist_mode"); + if (persist < 0) + return -1; + /* Soft PPR only: skip persistent repair instances. */ + if (persist) + continue; + + if (get_sysfs_data_str(dir, "repair_type", out)) + return -1; + + if (!strcmp(repair_type, out)) + return idx; + } + + return -1; +} + int ras_cxl_generic_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context) @@ -973,6 +1033,36 @@ static const char * const cxl_gmer_trans_type[] =3D { "Media Initialization", }; =20 +static int cxl_ppr(struct ras_cxl_event_common_hdr *hdr, uint64_t dpa, uin= t32_t nibble_mask) +{ + char dir[CXL_CMD_BUF_SIZE]; + int idx =3D 0; + + if (!(hdr->hdr_flags & CXL_EVENT_RECORD_FLAG_MAINT_NEEDED) || + !(hdr->hdr_flags & CXL_EVENT_RECORD_FLAG_MAINT_OP_SUB_CLASS_VALID) || + hdr->hdr_maint_op_class !=3D CXL_MAINT_CLASS_PPR || + hdr->hdr_maint_op_sub_class !=3D CXL_MAINT_SUBCLASS_SPPR) + return -1; + + idx =3D cxl_find_ppr(hdr->memdev, "ppr"); + if (idx < 0) + return -1; + + snprintf(dir, CXL_CMD_BUF_SIZE, "%s%s%s/mem_repair%d", + edac_bus_path, EDAC_CXL_DEV_PREFIX, hdr->memdev, idx); + + if (set_sysfs_data_uint64(dir, "dpa", dpa)) + return -1; + + if (set_sysfs_data_uint32(dir, "nibble_mask", nibble_mask)) + return -1; + + if (set_sysfs_data_uint32(dir, "repair", 1)) + return -1; + + return 0; +} + int ras_cxl_general_media_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context) @@ -1133,6 +1223,9 @@ int ras_cxl_general_media_event_handler(struct trace_= seq *s, ras_report_cxl_general_media_event(ras, &ev); #endif =20 + if (!(ev.dpa_flags & CXL_DPA_NOT_REPAIRABLE)) + cxl_ppr(&ev.hdr, ev.dpa, 0); + return 0; } =20 @@ -1518,6 +1611,9 @@ int ras_cxl_dram_event_handler(struct trace_seq *s, =20 cxl_dram_sparing(&ev); =20 + if (!(ev.dpa_flags & CXL_DPA_NOT_REPAIRABLE)) + 
cxl_ppr(&ev.hdr, ev.dpa, ev.nibble_mask); + /* Insert data into the SGBD */ #ifdef HAVE_SQLITE3 ras_store_cxl_dram_event(ras, &ev); --=20 2.43.0 From nobody Sun Feb 8 14:22:02 2026 Received: from frasgout.his.huawei.com (frasgout.his.huawei.com [185.176.79.56]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8BC1323ED5A; Fri, 7 Feb 2025 14:30:59 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=185.176.79.56 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738938661; cv=none; b=qVulnyWRT3dWOzLjsdqDpP0ScK4/epYyXtccsrtYQfbjL9j1p3Uj/rYjN+8qXWam2ZRPSKxxTEkziLmMGeOo4D8g0yePj3Q3sCZn7OW08lI/sZIAmeGK27rgMaw0Y2mhF+SEKwygBpUc1WhyEsKp9zc6Anp8bqXgdQcoCWHoyiI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738938661; c=relaxed/simple; bh=C7JSQoQzYhI5wXqkE1GUE3HNs7V2oCGTnCERl6QP33M=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=kHUaAYWFhBkRdJUNUBwCnkP/ENWxl9tIrz69TsQO0V6MCk2XLt8Q/XvLHiQ999SGFeTpvQHhcTRlkNAfSnEFUXccGQBFge0ueoAz7J6s7uhdoHplNDBtRcGVUusNLIjwTtxI0C94yU8Jjk6MtGfLSDomnAkSTALAnHAE5LEmiiI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=185.176.79.56 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.18.186.231]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4YqGZr2jXMz6HJZN; Fri, 7 Feb 2025 22:29:56 +0800 (CST) Received: from frapeml500007.china.huawei.com (unknown [7.182.85.172]) by mail.maildlp.com (Postfix) with ESMTPS id A5842140B55; Fri, 7 Feb 2025 22:30:57 +0800 (CST) Received: from 
P_UKIT01-A7bmah.china.huawei.com (10.126.173.5) by frapeml500007.china.huawei.com (7.182.85.172) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.1.2507.39; Fri, 7 Feb 2025 15:30:56 +0100 From: To: , , , , , , , , , , CC: , , , , Subject: [PATCH 3/4] rasdaemon: cxl: Add storing memory repair needed info in the DRAM event record Date: Fri, 7 Feb 2025 14:30:24 +0000 Message-ID: <20250207143028.1865-4-shiju.jose@huawei.com> X-Mailer: git-send-email 2.43.0.windows.1 In-Reply-To: <20250207143028.1865-1-shiju.jose@huawei.com> References: <20250207143028.1865-1-shiju.jose@huawei.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: lhrpeml100002.china.huawei.com (7.191.160.241) To frapeml500007.china.huawei.com (7.182.85.172) Content-Type: text/plain; charset="utf-8" From: Shiju Jose Rasdaemon supports live memory repair for CXL DRAM errors reported with the 'maintenance needed' flag set. However, the kernel CXL driver rejects a live memory repair request in the following situations: 1. Memory is online and the repair is disruptive. 2. Memory is online and the event record does not match. In addition, live memory repair is not requested if the auto repair option is switched off for rasdaemon. In these unrepaired cases, the repair-needed information for CXL DRAM events must be stored in the CXL DRAM event record of the SQLite database. This allows a boot-up script to read the repair status and repair attributes in the next boot. If the memory has not been repaired, the script issues the memory repair operation that the memory device requested in the previous boot. The kernel CXL driver sends a repair command to the device if the memory to be repaired is offline. 
This change adds a field for storing the memory repair needed info in the CXL DRAM event record of the SQLite database and ensures that the repair needed status is stored. Signed-off-by: Shiju Jose --- ras-cxl-handler.c | 9 ++++++--- ras-record.c | 2 ++ ras-record.h | 1 + 3 files changed, 9 insertions(+), 3 deletions(-) diff --git a/ras-cxl-handler.c b/ras-cxl-handler.c index ae49740..25ef5c9 100644 --- a/ras-cxl-handler.c +++ b/ras-cxl-handler.c @@ -1609,10 +1609,13 @@ int ras_cxl_dram_event_handler(struct trace_seq *s, if (trace_seq_printf(s, "CVME Count:%u ", ev.cvme_count) <=3D 0) return -1; =20 - cxl_dram_sparing(&ev); + if (cxl_dram_sparing(&ev) < 0) + ev.repair_needed =3D true; =20 - if (!(ev.dpa_flags & CXL_DPA_NOT_REPAIRABLE)) - cxl_ppr(&ev.hdr, ev.dpa, ev.nibble_mask); + if (!(ev.dpa_flags & CXL_DPA_NOT_REPAIRABLE)) { + if (cxl_ppr(&ev.hdr, ev.dpa, ev.nibble_mask) < 0) + ev.repair_needed =3D true; + } =20 /* Insert data into the SGBD */ #ifdef HAVE_SQLITE3 diff --git a/ras-record.c b/ras-record.c index eed7aca..ed745f9 100644 --- a/ras-record.c +++ b/ras-record.c @@ -993,6 +993,7 @@ static const struct db_fields cxl_dram_event_fields[] = =3D { { .name =3D "sub_channel", .type =3D "INTEGER" }, { .name =3D "cme_threshold_ev_flags", .type =3D "INTEGER" }, { .name =3D "cvme_count", .type =3D "INTEGER" }, + { .name =3D "repair_needed", .type =3D "INTEGER" }, }; =20 static const struct db_table_descriptor cxl_dram_event_tab =3D { @@ -1043,6 +1044,7 @@ int ras_store_cxl_dram_event(struct ras_events *ras, = struct ras_cxl_dram_event * sqlite3_bind_int(priv->stmt_cxl_dram_event, idx++, ev->cme_threshold_ev_flags); sqlite3_bind_int(priv->stmt_cxl_dram_event, idx++, ev->cvme_count); + sqlite3_bind_int(priv->stmt_cxl_dram_event, idx++, ev->repair_needed); =20 rc =3D sqlite3_step(priv->stmt_cxl_dram_event); if (rc !=3D SQLITE_OK && rc !=3D SQLITE_DONE) diff --git a/ras-record.h b/ras-record.h index 5eab62c..35edae1 100644 --- a/ras-record.h +++ b/ras-record.h @@ 
-238,6 +238,7 @@ struct ras_cxl_dram_event { uint8_t res_id[CXL_PLDM_RES_ID_LEN]; uint8_t cme_threshold_ev_flags; uint32_t cvme_count; + bool repair_needed; }; =20 struct ras_cxl_memory_module_event { --=20 2.43.0 From nobody Sun Feb 8 14:22:02 2026 Received: from frasgout.his.huawei.com (frasgout.his.huawei.com [185.176.79.56]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B0C5F23F29F; Fri, 7 Feb 2025 14:31:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=185.176.79.56 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738938662; cv=none; b=VuMMbzvVeX6x+qRmzl9SXw+gqVI7V5Jw4cksNIpfoHmNkwtVDPHhVb++iugglTVlfMpJEgFSVm3BveduBpmKbyQZsq0Z7Dd5gD3iMiTNDKIQNnnsXblGmC8fdO9F/YLQoNBA8xheYNVB44Sc6jaIJZYMAT+z+q8Bs//o1DGgbZs= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738938662; c=relaxed/simple; bh=ZF9f7jGUfMLnxque3loxIfAs4Ce34nahszkMvVP6Nlo=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=hBFcBH7LWYIt6HPvmnRIId288c/7UhLYj+ojgOpfaSynqBdgIiLAQgXUtkZWTIaze6S2LLR4D+j7KOgEd7pAFDOZr/xfiadKwJvScKHF7ZyXePkFfw/HHQ5SyISNo88+YwT8wfJwUltYGL0wjMO1JwTrrzhxO1Ul9V3CVeHOOwg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=185.176.79.56 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.18.186.31]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4YqGXs5yHfz6L5Bh; Fri, 7 Feb 2025 22:28:13 +0800 (CST) Received: from frapeml500007.china.huawei.com (unknown [7.182.85.172]) by mail.maildlp.com (Postfix) with ESMTPS id C317F1402A5; Fri, 
7 Feb 2025 22:30:58 +0800 (CST) Received: from P_UKIT01-A7bmah.china.huawei.com (10.126.173.5) by frapeml500007.china.huawei.com (7.182.85.172) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.1.2507.39; Fri, 7 Feb 2025 15:30:57 +0100 From: To: , , , , , , , , , , CC: , , , , Subject: [PATCH 4/4] rasdaemon: cxl: Add CXL memory repair boot-up script for unrepaired memory errors Date: Fri, 7 Feb 2025 14:30:25 +0000 Message-ID: <20250207143028.1865-5-shiju.jose@huawei.com> X-Mailer: git-send-email 2.43.0.windows.1 In-Reply-To: <20250207143028.1865-1-shiju.jose@huawei.com> References: <20250207143028.1865-1-shiju.jose@huawei.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: lhrpeml100002.china.huawei.com (7.191.160.241) To frapeml500007.china.huawei.com (7.182.85.172) Content-Type: text/plain; charset="utf-8" From: Shiju Jose Rasdaemon supports live memory repair for CXL DRAM errors reported with the 'maintenance needed' flag set. However, the kernel CXL driver rejects a live memory repair request in the following situations: 1. Memory is online and the repair is disruptive. 2. Memory is online and the event record does not match. In addition, live memory repair is not requested if the auto repair option is switched off for rasdaemon. In these unrepaired cases, rasdaemon stores the repair-needed information in the DRAM event record of the SQLite database. This allows a boot-up script to read the repair-needed flag and the repair attributes from the database. If the memory has not been repaired, the script issues the memory repair operation that the CXL memory device requested in the previous boot. The kernel CXL driver sends a repair command to the device if the memory to be repaired is offline. 
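For illustration, the two lookups such a boot-up script needs (maintenance subclass to repair type, and repair type to a mem_repairN instance) can be sketched as small bash helpers. This is a minimal sketch, not the script added by this patch: the function names repair_type_for_subclass and find_repair_instance and the base-directory argument are hypothetical. The subclass values 0 to 3 and the repair_type strings are the ones used elsewhere in this series.

```shell
#!/bin/bash
# Sketch only: both helper names and the base-directory argument are
# hypothetical and are not part of util/cxl-mem-repair.sh.

# Map a maintenance operation subclass (0..3, as defined in this series)
# to the EDAC repair_type string. Prints nothing and returns 1 for any
# other value.
repair_type_for_subclass() {
	case "$1" in
	0) echo "cacheline-sparing" ;;
	1) echo "row-sparing" ;;
	2) echo "bank-sparing" ;;
	3) echo "rank-sparing" ;;
	*) return 1 ;;
	esac
}

# Print the index N of the first <base>/mem_repairN directory whose
# repair_type file holds the wanted type. Returns 1 if none matches.
find_repair_instance() {
	local base=$1 want=$2 idx=0 out
	while [ -r "$base/mem_repair$idx/repair_type" ]; do
		# read leaves "out" set even if the file lacks a trailing newline
		read -r out < "$base/mem_repair$idx/repair_type"
		if [ "$out" = "$want" ]; then
			echo "$idx"
			return 0
		fi
		idx=$((idx + 1))
	done
	return 1
}
```

A caller could pass the device directory (for example /sys/bus/edac/devices/cxl_ followed by the memdev name read from the database) as the first argument of find_repair_instance. Stopping at the first missing repair_type file ends the scan without printing a cat error.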
Add boot-up script for handling the unrepaired CXL DRAM memory errors from the previous boot. Signed-off-by: Shiju Jose --- util/cxl-mem-repair.sh | 189 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 189 insertions(+) create mode 100755 util/cxl-mem-repair.sh diff --git a/util/cxl-mem-repair.sh b/util/cxl-mem-repair.sh new file mode 100755 index 0000000..2e3d261 --- /dev/null +++ b/util/cxl-mem-repair.sh @@ -0,0 +1,189 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. +# +# Boot-up script for CXL memory repair features. +# + +CXL_MAINT_CLASS_SPARING=3D2 + +CXL_MAINT_SUBCLASS_CACHELINE_SPARING=3D0 +CXL_MAINT_SUBCLASS_ROW_SPARING=3D1 +CXL_MAINT_SUBCLASS_BANK_SPARING=3D2 +CXL_MAINT_SUBCLASS_RANK_SPARING=3D3 + +RASDAEMON_SQL_DB=3D/usr/local/var/lib/rasdaemon/ras-mc_event.db +EDAC_CXL_BUS_PATH=3D/sys/bus/edac/devices/cxl_ + +id=3D1 +idx=3D-1 +found_repair=3D-1 +repair_type=3D'' + +while [ "$id" ] +do + id=3D$(sqlite3 $RASDAEMON_SQL_DB "select id from cxl_dram_event where id= =3D$id") + if [ -z "$id" ] + then + break; + fi + + repair_needed=3D$(sqlite3 $RASDAEMON_SQL_DB "select repair_needed from cx= l_dram_event where id=3D$id") + if [[ -z "$repair_needed" || $repair_needed -eq 0 ]] + then + id=3D$((id+1)) + continue; + fi + + maint_op_class=3D$(sqlite3 $RASDAEMON_SQL_DB "select hdr_maint_op_class = from cxl_dram_event where id=3D$id") + if [ $maint_op_class -ne $CXL_MAINT_CLASS_SPARING ] + then + id=3D$((id+1)) + continue; + fi + + maint_op_sub_class=3D$(sqlite3 $RASDAEMON_SQL_DB "select hdr_maint_op_sub= _class from cxl_dram_event where id=3D$id") + if [ -z "$maint_op_sub_class" ] + then + id=3D$((id+1)) + continue; + fi + + repair_type=3D'' + if [ $maint_op_sub_class -eq $CXL_MAINT_SUBCLASS_CACHELINE_SPARING ] + then + repair_type=3D'cacheline-sparing' + fi + if [ $maint_op_sub_class -eq $CXL_MAINT_SUBCLASS_ROW_SPARING ] + then + repair_type=3D'row-sparing' + fi + if [ 
$maint_op_sub_class -eq $CXL_MAINT_SUBCLASS_BANK_SPARING ] + then + repair_type=3D'bank-sparing' + fi + if [ $maint_op_sub_class -eq $CXL_MAINT_SUBCLASS_RANK_SPARING ] + then + repair_type=3D'rank-sparing' + fi + + memdev=3D$(sqlite3 $RASDAEMON_SQL_DB "select memdev from cxl_dram_event w= here id=3D$id") + if [ -z "$memdev" ] + then + id=3D$((id+1)) + continue; + fi + + # find the matching sparing type in sysfs + idx=3D0 + found_repair=3D0 + while [ 1 ] + do + out=3D$(cat "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/repair_type") + if [ -z "$out" ] + then + break; + fi + + if [ "$repair_type" =3D "$out" ] + then + found_repair=3D1 + break; + fi + idx=3D$((idx+1)) + done + if [ $found_repair -eq 0 ] + then + id=3D$((id+1)) + continue; + fi + + if [[ $maint_op_sub_class =3D=3D $CXL_MAINT_SUBCLASS_CACHELINE_SPARING |= | $maint_op_sub_class =3D=3D $CXL_MAINT_SUBCLASS_ROW_SPARING || $maint_op_s= ub_class =3D=3D $CXL_MAINT_SUBCLASS_BANK_SPARING ]] + then + bank_group=3D$(sqlite3 $RASDAEMON_SQL_DB "select bank_group from cxl_dra= m_event where id=3D$id") + if [ "$bank_group" ] + then + echo $bank_group > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/bank= _group" + else + id=3D$((id+1)) + continue; + fi + + bank=3D$(sqlite3 $RASDAEMON_SQL_DB "select bank from cxl_dram_event wher= e id=3D$id") + if [ "$bank" ] + then + echo $bank > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/bank" + else + id=3D$((id+1)) + continue; + fi + + if [[ $maint_op_sub_class =3D=3D $CXL_MAINT_SUBCLASS_CACHELINE_SPARING |= | $maint_op_sub_class =3D=3D $CXL_MAINT_SUBCLASS_ROW_SPARING ]] + then + row=3D$(sqlite3 $RASDAEMON_SQL_DB "select row from cxl_dram_event where= id=3D$id") + if [ "$row" ] + then + echo $row > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/row" + else + id=3D$((id+1)) + continue; + fi + fi + + if [ $maint_op_sub_class -eq $CXL_MAINT_SUBCLASS_CACHELINE_SPARING ] + then + column=3D$(sqlite3 $RASDAEMON_SQL_DB "select column from cxl_dram_event= where id=3D$id") + if [ 
"$column" ] + then + echo $column > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/column" + else + id=3D$((id+1)) + continue; + fi + + sub_channel=3D$(sqlite3 $RASDAEMON_SQL_DB "select sub_channel from cxl_= dram_event where id=3D$id") + if [ "$sub_channel" ] + then + echo $sub_channel > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/su= b_channel" + else + id=3D$((id+1)) + continue; + fi + fi + fi + + channel=3D$(sqlite3 $RASDAEMON_SQL_DB "select channel from cxl_dram_event= where id=3D$id") + if [ "$channel" ] + then + echo $channel > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/channel" + else + id=3D$((id+1)) + continue; + fi + + rank=3D$(sqlite3 $RASDAEMON_SQL_DB "select rank from cxl_dram_event where= id=3D$id") + if [ "$rank" ] + then + echo $rank > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/rank" + else + id=3D$((id+1)) + continue; + fi + + nibble_mask=3D$(sqlite3 $RASDAEMON_SQL_DB "select nibble_mask from cxl_dr= am_event where id=3D$id") + if [ "$nibble_mask" ] + then + echo $nibble_mask > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/nibb= le_mask" + else + id=3D$((id+1)) + continue; + fi + + echo 1 > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/repair" + + #Clear repair_needed field of cxl_dram_event table + $(sqlite3 $RASDAEMON_SQL_DB "update cxl_dram_event set repair_needed =3D = 0 where id=3D$id") + + id=3D$((id+1)) +done --=20 2.43.0