Subject: [PATCH 4/4] rasdaemon: cxl: Add CXL memory repair boot-up script for unrepaired memory errors
Date: Fri, 7 Feb 2025 14:30:25 +0000
Message-ID: <20250207143028.1865-5-shiju.jose@huawei.com>
In-Reply-To: <20250207143028.1865-1-shiju.jose@huawei.com>
References: <20250207143028.1865-1-shiju.jose@huawei.com>
X-Mailer: git-send-email 2.43.0.windows.1
X-Mailing-List: linux-kernel@vger.kernel.org

From: Shiju Jose

Rasdaemon supports live memory repair for reported CXL DRAM errors that
have the 'maintenance needed' flag set. However, the kernel CXL driver
rejects a live memory repair request in the following situations:
1. Memory is online and the repair is disruptive.
2. Memory is online and the event record does not match.

In addition, live memory repair is not requested if the auto repair
option is switched off for rasdaemon.

In these unrepaired cases, rasdaemon stores the repair-needed
information in the DRAM event record of the SQLite database. This allows
a boot-up script to read the repair-needed flag and the repair
attributes from the database. If the memory has not been repaired, the
script issues the memory repair operation that the CXL memory device
needed in the previous boot. The kernel CXL driver sends a repair
command to the device if the memory to be repaired is offline.

Add a boot-up script for handling the unrepaired CXL DRAM memory errors
from the previous boot.

Signed-off-by: Shiju Jose
---
 util/cxl-mem-repair.sh | 189 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 189 insertions(+)
 create mode 100755 util/cxl-mem-repair.sh

diff --git a/util/cxl-mem-repair.sh b/util/cxl-mem-repair.sh
new file mode 100755
index 0000000..2e3d261
--- /dev/null
+++ b/util/cxl-mem-repair.sh
@@ -0,0 +1,189 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved.
+#
+# Boot-up script for CXL memory repair features.
+#
+
+CXL_MAINT_CLASS_SPARING=2
+
+CXL_MAINT_SUBCLASS_CACHELINE_SPARING=0
+CXL_MAINT_SUBCLASS_ROW_SPARING=1
+CXL_MAINT_SUBCLASS_BANK_SPARING=2
+CXL_MAINT_SUBCLASS_RANK_SPARING=3
+
+RASDAEMON_SQL_DB=/usr/local/var/lib/rasdaemon/ras-mc_event.db
+EDAC_CXL_BUS_PATH=/sys/bus/edac/devices/cxl_
+
+id=1
+idx=-1
+found_repair=-1
+repair_type=''
+
+while [ "$id" ]
+do
+	id=$(sqlite3 $RASDAEMON_SQL_DB "select id from cxl_dram_event where id=$id")
+	if [ -z "$id" ]
+	then
+		break;
+	fi
+
+	repair_needed=$(sqlite3 $RASDAEMON_SQL_DB "select repair_needed from cxl_dram_event where id=$id")
+	if [[ -z "$repair_needed" || $repair_needed -eq 0 ]]
+	then
+		id=$((id+1))
+		continue;
+	fi
+
+	maint_op_class=$(sqlite3 $RASDAEMON_SQL_DB "select hdr_maint_op_class from cxl_dram_event where id=$id")
+	if [ $maint_op_class -ne $CXL_MAINT_CLASS_SPARING ]
+	then
+		id=$((id+1))
+		continue;
+	fi
+
+	maint_op_sub_class=$(sqlite3 $RASDAEMON_SQL_DB "select hdr_maint_op_sub_class from cxl_dram_event where id=$id")
+	if [ -z "$maint_op_sub_class" ]
+	then
+		id=$((id+1))
+		continue;
+	fi
+
+	repair_type=''
+	if [ $maint_op_sub_class -eq $CXL_MAINT_SUBCLASS_CACHELINE_SPARING ]
+	then
+		repair_type='cacheline-sparing'
+	fi
+	if [ $maint_op_sub_class -eq $CXL_MAINT_SUBCLASS_ROW_SPARING ]
+	then
+		repair_type='row-sparing'
+	fi
+	if [ $maint_op_sub_class -eq $CXL_MAINT_SUBCLASS_BANK_SPARING ]
+	then
+		repair_type='bank-sparing'
+	fi
+	if [ $maint_op_sub_class -eq $CXL_MAINT_SUBCLASS_RANK_SPARING ]
+	then
+		repair_type='rank-sparing'
+	fi
+
+	memdev=$(sqlite3 $RASDAEMON_SQL_DB "select memdev from cxl_dram_event where id=$id")
+	if [ -z "$memdev" ]
+	then
+		id=$((id+1))
+		continue;
+	fi
+
+	# find the matching sparing type in sysfs
+	idx=0
+	found_repair=0
+	while [ 1 ]
+	do
+		out=$(cat "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/repair_type")
+		if [ -z "$out" ]
+		then
+			break;
+		fi
+
+		if [ "$repair_type" = "$out" ]
+		then
+			found_repair=1
+			break;
+		fi
+		idx=$((idx+1))
+	done
+	if [ $found_repair -eq 0 ]
+	then
+		id=$((id+1))
+		continue;
+	fi
+
+	if [[ $maint_op_sub_class == $CXL_MAINT_SUBCLASS_CACHELINE_SPARING || $maint_op_sub_class == $CXL_MAINT_SUBCLASS_ROW_SPARING || $maint_op_sub_class == $CXL_MAINT_SUBCLASS_BANK_SPARING ]]
+	then
+		bank_group=$(sqlite3 $RASDAEMON_SQL_DB "select bank_group from cxl_dram_event where id=$id")
+		if [ "$bank_group" ]
+		then
+			echo $bank_group > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/bank_group"
+		else
+			id=$((id+1))
+			continue;
+		fi
+
+		bank=$(sqlite3 $RASDAEMON_SQL_DB "select bank from cxl_dram_event where id=$id")
+		if [ "$bank" ]
+		then
+			echo $bank > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/bank"
+		else
+			id=$((id+1))
+			continue;
+		fi
+
+		if [[ $maint_op_sub_class == $CXL_MAINT_SUBCLASS_CACHELINE_SPARING || $maint_op_sub_class == $CXL_MAINT_SUBCLASS_ROW_SPARING ]]
+		then
+			row=$(sqlite3 $RASDAEMON_SQL_DB "select row from cxl_dram_event where id=$id")
+			if [ "$row" ]
+			then
+				echo $row > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/row"
+			else
+				id=$((id+1))
+				continue;
+			fi
+		fi
+
+		if [ $maint_op_sub_class -eq $CXL_MAINT_SUBCLASS_CACHELINE_SPARING ]
+		then
+			column=$(sqlite3 $RASDAEMON_SQL_DB "select column from cxl_dram_event where id=$id")
+			if [ "$column" ]
+			then
+				echo $column > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/column"
+			else
+				id=$((id+1))
+				continue;
+			fi
+
+			sub_channel=$(sqlite3 $RASDAEMON_SQL_DB "select sub_channel from cxl_dram_event where id=$id")
+			if [ "$sub_channel" ]
+			then
+				echo $sub_channel > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/sub_channel"
+			else
+				id=$((id+1))
+				continue;
+			fi
+		fi
+	fi
+
+	channel=$(sqlite3 $RASDAEMON_SQL_DB "select channel from cxl_dram_event where id=$id")
+	if [ "$channel" ]
+	then
+		echo $channel > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/channel"
+	else
+		id=$((id+1))
+		continue;
+	fi
+
+	rank=$(sqlite3 $RASDAEMON_SQL_DB "select rank from cxl_dram_event where id=$id")
+	if [ "$rank" ]
+	then
+		echo $rank > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/rank"
+	else
+		id=$((id+1))
+		continue;
+	fi
+
+	nibble_mask=$(sqlite3 $RASDAEMON_SQL_DB "select nibble_mask from cxl_dram_event where id=$id")
+	if [ "$nibble_mask" ]
+	then
+		echo $nibble_mask > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/nibble_mask"
+	else
+		id=$((id+1))
+		continue;
+	fi
+
+	echo 1 > "$EDAC_CXL_BUS_PATH""$memdev""/mem_repair"$idx"/repair"
+
+	#Clear repair_needed field of cxl_dram_event table
+	$(sqlite3 $RASDAEMON_SQL_DB "update cxl_dram_event set repair_needed = 0 where id=$id")
+
+	id=$((id+1))
+done
-- 
2.43.0
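
Note for readers of the script above: the nested conditionals encode a
fixed relationship between the maintenance sub-class, the sysfs
repair_type string, and the set of attributes written before 'repair'
is triggered. The sketch below only restates that relationship in table
form. It is not part of the patch; the helper names
repair_type_for_subclass and repair_attrs_for_subclass are hypothetical,
and the sub-class values and attribute names are taken from the script
as posted.

```shell
#!/bin/bash
# Illustrative sketch only; helper names are hypothetical, not from the patch.

# Map a CXL maintenance sub-class (0..3, as defined in the script)
# to the repair_type string the script looks for in sysfs.
repair_type_for_subclass() {
	case "$1" in
	0) echo 'cacheline-sparing' ;;
	1) echo 'row-sparing' ;;
	2) echo 'bank-sparing' ;;
	3) echo 'rank-sparing' ;;
	*) return 1 ;;
	esac
}

# List, in write order, the mem_repairN attributes the script sets
# for a given sub-class before writing 1 to 'repair'.
repair_attrs_for_subclass() {
	case "$1" in
	0) echo 'bank_group bank row column sub_channel channel rank nibble_mask' ;;
	1) echo 'bank_group bank row channel rank nibble_mask' ;;
	2) echo 'bank_group bank channel rank nibble_mask' ;;
	3) echo 'channel rank nibble_mask' ;;
	*) return 1 ;;
	esac
}

# Print the mapping for all four sparing sub-classes.
for sub in 0 1 2 3
do
	printf '%s: %s\n' "$(repair_type_for_subclass "$sub")" \
		"$(repair_attrs_for_subclass "$sub")"
done
```

Rank sparing needs only channel, rank and nibble_mask, while each finer
granularity adds the attributes that locate the failing region within
the rank.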