[PATCH 13/13] rasdaemon: ras-mc-ctl: Update logging of CXL memory module data to align with CXL spec rev 3.1

shiju.jose@huawei.com posted 13 patches 3 days, 1 hour ago
Posted by shiju.jose@huawei.com 3 days, 1 hour ago
From: Shiju Jose <shiju.jose@huawei.com>

CXL spec 3.1 section 8.2.9.2.1.3 Table 8-47, Memory Module Event Record,
has been updated with the following new fields, along with new information
for the Device Event Type and Device Health Information fields:
1. Validity Flags
2. Component Identifier
3. Device Event Sub-Type

This update modifies ras-mc-ctl to parse and log CXL memory module event
data stored in the RAS SQLite database table, reflecting the
specification changes introduced in revision 3.1.

Example output:

./util/ras-mc-ctl --errors
...
CXL memory module events:
1 2024-11-20 00:22:33 +0000 error: memdev=mem0, host=0000:0f:00.0, serial=0x3, \
log=Fatal, hdr_uuid=fe927475-dd59-4339-a586-79bab113b774, hdr_flags=0x1, , \
hdr_handle=0x1, hdr_related_handle=0x0, hdr_timestamp=1970-01-01 00:04:38 +0000, \
hdr_length=128, hdr_maint_op_class=0, hdr_maint_op_sub_class=1, \
event_type: Temperature Change, event_sub_type: Unsupported Config Data, \
health_status: 'MAINTENANCE_NEEDED' , 'REPLACEMENT_NEEDED' , \
media_status: All Data Loss in Event of Power Loss, life_used=8, \
dirty_shutdown_cnt=33, cor_vol_err_cnt=25, cor_per_err_cnt=45, \
device_temp=3, add_status=3 \
component_id:02 74 c5 08 9a 1a 0b fc d2 7e 2f 31 9b 3c 81 4d \
pldm_entity_id:00 00 00 00 00 00 pldm_resource_id:fc d2 7e 2f 
...

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 util/ras-mc-ctl.in | 46 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 41 insertions(+), 5 deletions(-)

diff --git a/util/ras-mc-ctl.in b/util/ras-mc-ctl.in
index dbb1607..0990da9 100755
--- a/util/ras-mc-ctl.in
+++ b/util/ras-mc-ctl.in
@@ -1439,11 +1439,12 @@ sub get_cxl_transaction_type
     return $types[$_[0]];
 }
 
+# CXL rev 3.1 section 8.2.9.2.1.3; Table 8-47
 sub get_cxl_dev_event_type
 {
     my @types;
 
-    if ($_[0] < 0 || $_[0] > 5) {
+    if ($_[0] < 0 || $_[0] > 8) {
 	return "unknown-type";
     }
 
@@ -1452,15 +1453,37 @@ sub get_cxl_dev_event_type
 	      "Life Used Change",
 	      "Temperature Change",
 	      "Data Path Error",
-	      "LSA Error");
+	      "LSA Error",
+	      "Unrecoverable Internal Sideband Bus Error",
+	      "Memory Media FRU Error",
+	      "Power Management Fault");
 
     return $types[$_[0]];
 }
 
+sub get_cxl_dev_event_sub_type
+{
+    my @types;
+
+    if ($_[0] < 0 || $_[0] > 3) {
+	return "unknown-type";
+    }
+
+    @types = ("Not Reported",
+	      "Invalid Config Data",
+	      "Unsupported Config Data",
+	      "Unsupported Memory Media FRU");
+
+    return $types[$_[0]];
+}
+
+# CXL rev 3.1 section 8.2.9.9.3.1; Table 8-133
 use constant {
     CXL_DHI_HS_MAINTENANCE_NEEDED => 0x0001,
     CXL_DHI_HS_PERFORMANCE_DEGRADED => 0x0002,
     CXL_DHI_HS_HW_REPLACEMENT_NEEDED => 0x0004,
+    CXL_DHI_HS_HW_REPLACEMENT_NEEDED => 0x0004,
+    CXL_DHI_HS_MEM_CAPACITY_DEGRADED => 0x0008,
 };
 
 sub get_cxl_health_status_text
@@ -1477,6 +1500,9 @@ sub get_cxl_health_status_text
     if ($flags & CXL_DHI_HS_HW_REPLACEMENT_NEEDED) {
 	push @out, (sprintf "\'REPLACEMENT_NEEDED\' ");
     }
+    if ($flags & CXL_DHI_HS_MEM_CAPACITY_DEGRADED) {
+	push @out, (sprintf "\'MEM_CAPACITY_DEGRADED\' ");
+    }
 
     return join (", ", @out);
 }
@@ -1821,7 +1847,7 @@ sub errors
     my ($hdr_uuid, $hdr_flags, $hdr_handle, $hdr_related_handle, $hdr_ts, $hdr_length, $hdr_maint_op_class, $hdr_maint_op_sub_class, $data);
     my ($dpa_flags, $descriptor, $mem_event_type, $mem_event_sub_type, $transaction_type, $channel, $rank, $device, $comp_id, $pldm_entity_id, $pldm_res_id);
     my ($nibble_mask, $bank_group, $row, $column, $cor_mask);
-    my ($event_type, $health_status, $media_status, $life_used, $dirty_shutdown_cnt, $cor_vol_err_cnt, $cor_per_err_cnt, $device_temp, $add_status);
+    my ($event_type, $event_sub_type, $health_status, $media_status, $life_used, $dirty_shutdown_cnt, $cor_vol_err_cnt, $cor_per_err_cnt, $device_temp, $add_status);
     my ($sub_channel, $cme_threshold_ev_flags, $cme_count, $cvme_count);
 
     my $dbh = DBI->connect("dbi:SQLite:dbname=$dbname", "", "", {});
@@ -2155,10 +2181,10 @@ sub errors
 	}
 
 	# CXL memory module errors
-	$query = "select id, timestamp, memdev, host, serial, log_type, hdr_uuid, hdr_flags, hdr_handle, hdr_related_handle, hdr_ts, hdr_length, hdr_maint_op_class, event_type, health_status, media_status, life_used, dirty_shutdown_cnt, cor_vol_err_cnt, cor_per_err_cnt, device_temp, add_status, hdr_maint_op_sub_class from cxl_memory_module_event$conf{opt}{since} order by id";
+	$query = "select id, timestamp, memdev, host, serial, log_type, hdr_uuid, hdr_flags, hdr_handle, hdr_related_handle, hdr_ts, hdr_length, hdr_maint_op_class, event_type, health_status, media_status, life_used, dirty_shutdown_cnt, cor_vol_err_cnt, cor_per_err_cnt, device_temp, add_status, hdr_maint_op_sub_class, event_sub_type, comp_id, pldm_entity_id, pldm_resource_id from cxl_memory_module_event$conf{opt}{since} order by id";
 	$query_handle = $dbh->prepare($query);
 	$query_handle->execute();
-	$query_handle->bind_columns(\($id, $timestamp, $memdev, $host, $serial, $log_type, $hdr_uuid, $hdr_flags, $hdr_handle, $hdr_related_handle, $hdr_ts, $hdr_length, $hdr_maint_op_class, $event_type, $health_status, $media_status, $life_used, $dirty_shutdown_cnt, $cor_vol_err_cnt, $cor_per_err_cnt, $device_temp, $add_status, $hdr_maint_op_sub_class));
+	$query_handle->bind_columns(\($id, $timestamp, $memdev, $host, $serial, $log_type, $hdr_uuid, $hdr_flags, $hdr_handle, $hdr_related_handle, $hdr_ts, $hdr_length, $hdr_maint_op_class, $event_type, $health_status, $media_status, $life_used, $dirty_shutdown_cnt, $cor_vol_err_cnt, $cor_per_err_cnt, $device_temp, $add_status, $hdr_maint_op_sub_class, $event_sub_type, $comp_id, $pldm_entity_id, $pldm_res_id));
 	$out = "";
 	while($query_handle->fetch()) {
 	    $out .= "$id $timestamp error: ";
@@ -2175,6 +2201,7 @@ sub errors
 	    $out .= sprintf "hdr_maint_op_class=%u, ", $hdr_maint_op_class if (defined $hdr_maint_op_class && length $hdr_maint_op_class);
 	    $out .= sprintf "hdr_maint_op_sub_class=%u, ", $hdr_maint_op_sub_class if (defined $hdr_maint_op_sub_class && length $hdr_maint_op_sub_class);
 	    $out .= sprintf "event_type: %s, ", get_cxl_dev_event_type($event_type)  if (defined $event_type && length $event_type);
+	    $out .= sprintf "event_sub_type: %s, ", get_cxl_dev_event_sub_type($event_sub_type)  if (defined $event_sub_type && length $event_sub_type);
 	    $out .= sprintf "health_status: %s, ", get_cxl_health_status_text($health_status)  if (defined $health_status && length $health_status);
 	    $out .= sprintf "media_status: %s, ", get_cxl_media_status($media_status)  if (defined $media_status && length $media_status);
 	    $out .= sprintf "life_used=%u, ", $life_used  if (defined $life_used && length $life_used);
@@ -2183,6 +2210,15 @@ sub errors
 	    $out .= sprintf "cor_per_err_cnt=%u, ", $cor_per_err_cnt  if (defined $cor_per_err_cnt && length $cor_per_err_cnt);
 	    $out .= sprintf "device_temp=%u, ", $device_temp  if (defined $device_temp && length $device_temp);
 	    $out .= sprintf "add_status=%u ", $add_status  if (defined $add_status && length $add_status);
+	    if (defined $comp_id && length $comp_id) {
+		print_cxl_dev_id("component_id", $comp_id, CXL_EVENT_GEN_MED_COMP_ID_SIZE, $out);
+	    }
+	    if (defined $pldm_entity_id && length $pldm_entity_id) {
+		print_cxl_dev_id("pldm_entity_id", $pldm_entity_id, CXL_EVENT_GEN_PLDM_ENTITY_ID_SIZE, $out);
+	    }
+	    if (defined $pldm_res_id && length $pldm_res_id) {
+		print_cxl_dev_id("pldm_resource_id", $pldm_res_id, CXL_EVENT_GEN_PLDM_RES_ID_SIZE, $out);
+	    }
 	    $out .= "\n";
 	}
 	if ($out ne "") {
-- 
2.43.0
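
For readers who want to check the decoding outside the tree, here is a
minimal standalone sketch of the two decoders this patch touches. The
sub-type table and the flag values are copied from the diff above. The
first two branches of the health-status decoder are not visible in the
hunk, so their strings are an assumption inferred from the example
output. The decode_* names are hypothetical and exist only for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Flag values as listed in the patch (CXL rev 3.1, Table 8-133).
use constant {
    CXL_DHI_HS_MAINTENANCE_NEEDED    => 0x0001,
    CXL_DHI_HS_PERFORMANCE_DEGRADED  => 0x0002,
    CXL_DHI_HS_HW_REPLACEMENT_NEEDED => 0x0004,
    CXL_DHI_HS_MEM_CAPACITY_DEGRADED => 0x0008,
};

# Hypothetical name; mirrors get_cxl_dev_event_sub_type from the patch.
sub decode_event_sub_type
{
    my ($idx) = @_;
    my @types = ("Not Reported",
		 "Invalid Config Data",
		 "Unsupported Config Data",
		 "Unsupported Memory Media FRU");

    # Bounded lookup: anything outside the table is reported as unknown.
    return "unknown-type" if ($idx < 0 || $idx > $#types);
    return $types[$idx];
}

# Hypothetical name; mirrors get_cxl_health_status_text. The first two
# strings are assumed, since those lines are outside the shown hunk.
sub decode_health_status
{
    my ($flags) = @_;
    my @out;

    push @out, "'MAINTENANCE_NEEDED' "    if ($flags & CXL_DHI_HS_MAINTENANCE_NEEDED);
    push @out, "'PERFORMANCE_DEGRADED' "  if ($flags & CXL_DHI_HS_PERFORMANCE_DEGRADED);
    push @out, "'REPLACEMENT_NEEDED' "    if ($flags & CXL_DHI_HS_HW_REPLACEMENT_NEEDED);
    push @out, "'MEM_CAPACITY_DEGRADED' " if ($flags & CXL_DHI_HS_MEM_CAPACITY_DEGRADED);

    # Each entry carries a trailing space, hence the " , " seen in the log.
    return join(", ", @out);
}

print decode_event_sub_type(2), "\n";
print decode_health_status(0x5), "\n";
```

With sub-type 2 and flags 0x5 this prints the same event_sub_type and
health_status text as the example output in the commit message.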
Re: [PATCH 13/13] rasdaemon: ras-mc-ctl: Update logging of CXL memory module data to align with CXL spec rev 3.1
Posted by Jonathan Cameron 1 day, 19 hours ago
On Wed, 20 Nov 2024 09:59:23 +0000
<shiju.jose@huawei.com> wrote:

> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Feels like there is a lot of duplication in here, but you aren't
really making it any worse and maybe it is hard to reduce it.
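
One possible way to cut the duplication, offered only as a sketch and not
something proposed in this series: build each index-to-string decoder from
its table with a small closure factory. make_index_decoder is a
hypothetical helper, not an existing ras-mc-ctl function; the table is
copied from the patch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a decoder that maps an index to a name, or
# "unknown-type" when the index is undefined, non-numeric or out of range.
sub make_index_decoder
{
    my @names = @_;

    return sub {
	my ($idx) = @_;
	return "unknown-type"
	    if (!defined $idx || $idx !~ /^\d+$/ || $idx > $#names);
	return $names[$idx];
    };
}

# The range check now follows the table length, so adding an entry for a
# later spec revision does not require touching a separate bound.
my $dev_event_sub_type = make_index_decoder(
    "Not Reported",
    "Invalid Config Data",
    "Unsupported Config Data",
    "Unsupported Memory Media FRU");

print $dev_event_sub_type->(3), "\n";
print $dev_event_sub_type->(7), "\n";
```

The other get_cxl_* tables could be passed to the same factory; whether
that is worth the churn in an offline script is a judgment call.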

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
RE: [PATCH 13/13] rasdaemon: ras-mc-ctl: Update logging of CXL memory module data to align with CXL spec rev 3.1
Posted by Shiju Jose 1 day ago
Hi Jonathan,

>-----Original Message-----
>From: Jonathan Cameron <jonathan.cameron@huawei.com>
>Sent: 21 November 2024 15:39
>To: Shiju Jose <shiju.jose@huawei.com>
>Cc: linux-edac@vger.kernel.org; linux-cxl@vger.kernel.org;
>mchehab@kernel.org; dave.jiang@intel.com; dan.j.williams@intel.com;
>alison.schofield@intel.com; nifan.cxl@gmail.com; vishal.l.verma@intel.com;
>ira.weiny@intel.com; dave@stgolabs.net; linux-kernel@vger.kernel.org;
>Linuxarm <linuxarm@huawei.com>; tanxiaofei <tanxiaofei@huawei.com>;
>Zengtao (B) <prime.zeng@hisilicon.com>
>Subject: Re: [PATCH 13/13] rasdaemon: ras-mc-ctl: Update logging of CXL
>memory module data to align with CXL spec rev 3.1
>
>On Wed, 20 Nov 2024 09:59:23 +0000
><shiju.jose@huawei.com> wrote:
>
>> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
>Feels like there is a lot of duplication in here, but you aren't really making it any
>worse and maybe it is hard to reduce it.
>
ras-mc-ctl is a tool (script), used offline, to read, decode and print the error event data that
rasdaemon stores in the SQLite database. The logging here is therefore similar to that done in rasdaemon.
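
To illustrate the offline flow described above, here is a minimal sketch of
how one fetched row turns into an output line, using the same
"defined ... && length ..." guard as the patch so that NULL or empty columns
are skipped. The row is hard-coded with values from the example output
rather than read through DBI, and format_fields is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: append one formatted field per column that holds data.
sub format_fields
{
    my ($row, @specs) = @_;
    my $out = "";

    foreach my $spec (@specs) {
	my ($col, $fmt) = @$spec;
	my $val = $row->{$col};

	# Same guard as in ras-mc-ctl: skip NULL (undef) and empty strings.
	# A value of 0 is still printed, because length("0") is 1.
	next unless (defined $val && length $val);
	$out .= sprintf $fmt, $val;
    }
    return $out;
}

# Stand-in for one row of cxl_memory_module_event (assumed values).
my %row = (
    life_used          => 8,
    dirty_shutdown_cnt => 33,
    cor_vol_err_cnt    => undef,	# NULL column, omitted from the output
    device_temp        => 3,
);

my @specs = (
    ["life_used",          "life_used=%u, "],
    ["dirty_shutdown_cnt", "dirty_shutdown_cnt=%u, "],
    ["cor_vol_err_cnt",    "cor_vol_err_cnt=%u, "],
    ["device_temp",        "device_temp=%u, "],
);

print format_fields(\%row, @specs), "\n";
```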

>Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Thanks,
Shiju