[PATCH v11 09/23] PCI/AER: Report CXL or PCIe bus error type in trace logging

Terry Bowman posted 23 patches 1 month, 1 week ago
There is a newer version of this series
[PATCH v11 09/23] PCI/AER: Report CXL or PCIe bus error type in trace logging
Posted by Terry Bowman 1 month, 1 week ago
The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
for all errors. Update the driver and aer_event tracing to log 'CXL Bus
Type' for CXL device errors.

This requires the AER can identify and distinguish between PCIe errors and
CXL errors.

Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in
aer_get_device_error_info() and pci_print_aer().

Update the aer_event trace routine to accept a bus type string parameter.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

---
Changes in v10->v11:
 - Remove duplicate call to trace_aer_event() (Shiju)
 - Added Dan William's and Dave Jiang's reviewed-by
---
 drivers/pci/pci.h       |  6 ++++++
 drivers/pci/pcie/aer.c  | 18 ++++++++++++------
 include/ras/ras_event.h |  9 ++++++---
 3 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index c8a0c0ec0073..417a088d815f 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -608,6 +608,7 @@ struct aer_err_info {
 	int ratelimit_print[AER_MAX_MULTI_ERR_DEVICES];
 	int error_dev_num;
 	const char *level;		/* printk level */
+	bool is_cxl;
 
 	unsigned int id:16;
 
@@ -628,6 +629,11 @@ struct aer_err_info {
 int aer_get_device_error_info(struct aer_err_info *info, int i);
 void aer_print_error(struct aer_err_info *info, int i);
 
+static inline const char *aer_err_bus(struct aer_err_info *info)
+{
+	return info->is_cxl ? "CXL" : "PCIe";
+}
+
 int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
 		      unsigned int tlp_len, bool flit,
 		      struct pcie_tlp_log *log);
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 29de7ee861f7..1b5f5b0cdc4f 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -837,6 +837,7 @@ void aer_print_error(struct aer_err_info *info, int i)
 	struct pci_dev *dev;
 	int layer, agent, id;
 	const char *level = info->level;
+	const char *bus_type = aer_err_bus(info);
 
 	if (WARN_ON_ONCE(i >= AER_MAX_MULTI_ERR_DEVICES))
 		return;
@@ -845,23 +846,23 @@ void aer_print_error(struct aer_err_info *info, int i)
 	id = pci_dev_id(dev);
 
 	pci_dev_aer_stats_incr(dev, info);
-	trace_aer_event(pci_name(dev), (info->status & ~info->mask),
+	trace_aer_event(pci_name(dev), bus_type, (info->status & ~info->mask),
 			info->severity, info->tlp_header_valid, &info->tlp);
 
 	if (!info->ratelimit_print[i])
 		return;
 
 	if (!info->status) {
-		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
-			aer_error_severity_string[info->severity]);
+		pci_err(dev, "%s Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
+			bus_type, aer_error_severity_string[info->severity]);
 		goto out;
 	}
 
 	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
 	agent = AER_GET_AGENT(info->severity, info->status);
 
-	aer_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
-		   aer_error_severity_string[info->severity],
+	aer_printk(level, dev, "%s Bus Error: severity=%s, type=%s, (%s)\n",
+		   bus_type, aer_error_severity_string[info->severity],
 		   aer_error_layer[layer], aer_agent_string[agent]);
 
 	aer_printk(level, dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
@@ -895,6 +896,7 @@ EXPORT_SYMBOL_GPL(cper_severity_to_aer);
 void pci_print_aer(struct pci_dev *dev, int aer_severity,
 		   struct aer_capability_regs *aer)
 {
+	const char *bus_type;
 	int layer, agent, tlp_header_valid = 0;
 	u32 status, mask;
 	struct aer_err_info info = {
@@ -915,9 +917,12 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 
 	info.status = status;
 	info.mask = mask;
+	info.is_cxl = pcie_is_cxl(dev);
+
+	bus_type = aer_err_bus(&info);
 
 	pci_dev_aer_stats_incr(dev, &info);
-	trace_aer_event(pci_name(dev), (status & ~mask),
+	trace_aer_event(pci_name(dev), bus_type, (status & ~mask),
 			aer_severity, tlp_header_valid, &aer->header_log);
 
 	if (!aer_ratelimit(dev, info.severity))
@@ -1277,6 +1282,7 @@ int aer_get_device_error_info(struct aer_err_info *info, int i)
 	/* Must reset in this function */
 	info->status = 0;
 	info->tlp_header_valid = 0;
+	info->is_cxl = pcie_is_cxl(dev);
 
 	/* The device might not support AER */
 	if (!aer)
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 14c9f943d53f..080829d59c36 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -297,15 +297,17 @@ TRACE_EVENT(non_standard_event,
 
 TRACE_EVENT(aer_event,
 	TP_PROTO(const char *dev_name,
+		 const char *bus_type,
 		 const u32 status,
 		 const u8 severity,
 		 const u8 tlp_header_valid,
 		 struct pcie_tlp_log *tlp),
 
-	TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp),
+	TP_ARGS(dev_name, bus_type, status, severity, tlp_header_valid, tlp),
 
 	TP_STRUCT__entry(
 		__string(	dev_name,	dev_name	)
+		__string(	bus_type,	bus_type	)
 		__field(	u32,		status		)
 		__field(	u8,		severity	)
 		__field(	u8, 		tlp_header_valid)
@@ -314,6 +316,7 @@ TRACE_EVENT(aer_event,
 
 	TP_fast_assign(
 		__assign_str(dev_name);
+		__assign_str(bus_type);
 		__entry->status		= status;
 		__entry->severity	= severity;
 		__entry->tlp_header_valid = tlp_header_valid;
@@ -325,8 +328,8 @@ TRACE_EVENT(aer_event,
 		}
 	),
 
-	TP_printk("%s PCIe Bus Error: severity=%s, %s, TLP Header=%s\n",
-		__get_str(dev_name),
+	TP_printk("%s %s Bus Error: severity=%s, %s, TLP Header=%s\n",
+		__get_str(dev_name), __get_str(bus_type),
 		__entry->severity == AER_CORRECTABLE ? "Corrected" :
 			__entry->severity == AER_FATAL ?
 			"Fatal" : "Uncorrected, non-fatal",
-- 
2.51.0.rc2.21.ge5ab6b3e5a
Re: [PATCH v11 09/23] PCI/AER: Report CXL or PCIe bus error type in trace logging
Posted by Lukas Wunner 3 weeks, 1 day ago
On Tue, Aug 26, 2025 at 08:35:24PM -0500, Terry Bowman wrote:
> +++ b/drivers/pci/pci.h
> @@ -608,6 +608,7 @@ struct aer_err_info {
>  	int ratelimit_print[AER_MAX_MULTI_ERR_DEVICES];
>  	int error_dev_num;
>  	const char *level;		/* printk level */
> +	bool is_cxl;
>  
>  	unsigned int id:16;
>  

Commit 273024ded7b3 ("PCI: pcie, aer: flags to bits") made an effort
to reduce memory size of struct aer_err_info by converting flags and
small integers to bitfields.

So it seems the proper approach would be to take away 1 bit from
__pad1 or __pad2 and use that for the is_cxl flag.

I know bitfields are somewhat controversial, but it is what this
struct is using, so... :)

Thanks,

Lukas
Re: [PATCH v11 09/23] PCI/AER: Report CXL or PCIe bus error type in trace logging
Posted by Lukas Wunner 1 month, 1 week ago
On Tue, Aug 26, 2025 at 08:35:24PM -0500, Terry Bowman wrote:
> The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
> for all errors. Update the driver and aer_event tracing to log 'CXL Bus
> Type' for CXL device errors.
> 
> This requires the AER can identify and distinguish between PCIe errors and
> CXL errors.
> 
> Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in
> aer_get_device_error_info() and pci_print_aer().
> 
> Update the aer_event trace routine to accept a bus type string parameter.

aer_print_error() has a pointer to the struct pci_dev and you've added
an is_cxl bit to that struct in the preceding patch.

Is there a reason why you can't just use that dev->is_cxl bit, in lieu of
adding another is_cxl bit to struct aer_err_info?

If so, please document it in a code comment or at least in the commit
message.  If there isn't, please use dev->is_cxl.

Thanks,

Lukas
Re: [PATCH v11 09/23] PCI/AER: Report CXL or PCIe bus error type in trace logging
Posted by Bowman, Terry 3 weeks, 2 days ago

On 8/27/2025 2:37 AM, Lukas Wunner wrote:
> On Tue, Aug 26, 2025 at 08:35:24PM -0500, Terry Bowman wrote:
>> The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
>> for all errors. Update the driver and aer_event tracing to log 'CXL Bus
>> Type' for CXL device errors.
>>
>> This requires the AER can identify and distinguish between PCIe errors and
>> CXL errors.
>>
>> Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in
>> aer_get_device_error_info() and pci_print_aer().
>>
>> Update the aer_event trace routine to accept a bus type string parameter.
> aer_print_error() has a pointer to the struct pci_dev and you've added
> an is_cxl bit to that struct in the preceding patch.
>
> Is there a reason why you can't just use that dev->is_cxl bit, in lieu of
> adding another is_cxl bit to struct aer_err_info?
>
> If so, please document it in a code comment or at least in the commit
> message.  If there isn't, please use dev->is_cxl.
>
> Thanks,
>
> Lukas
Hi Lukas,

The addition of 'is_cxl' member to 'struct aer_err_info' was requested by Dan Williams
during v7 review:
https://lore.kernel.org/linux-cxl/67abe1903a8ed_2d1e2942f@dwillia2-xfh.jf.intel.com.notmuch/

My understanding is the change was requested to encapsulate the bus error 
type with the actual AER status. This is helpful when considering the 
actual device bus state can change between capturing the AER status and 
handling/logging. An example is a training HW error. Caching the 'is_cxl' will allow 
the drivers to properly identify the error bus type for further logging and 
handling.

Hopefully Dan will add his thoughts here.

Regards,
Terry

Re: [PATCH v11 09/23] PCI/AER: Report CXL or PCIe bus error type in trace logging
Posted by Lukas Wunner 3 weeks, 2 days ago
On Wed, Sep 10, 2025 at 10:26:19AM -0500, Bowman, Terry wrote:
> On 8/27/2025 2:37 AM, Lukas Wunner wrote:
> > On Tue, Aug 26, 2025 at 08:35:24PM -0500, Terry Bowman wrote:
> >> The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
> >> for all errors. Update the driver and aer_event tracing to log 'CXL Bus
> >> Type' for CXL device errors.
> >>
> >> This requires the AER can identify and distinguish between PCIe errors and
> >> CXL errors.
> >>
> >> Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in
> >> aer_get_device_error_info() and pci_print_aer().
> >>
> >> Update the aer_event trace routine to accept a bus type string parameter.
> > aer_print_error() has a pointer to the struct pci_dev and you've added
> > an is_cxl bit to that struct in the preceding patch.
> >
> > Is there a reason why you can't just use that dev->is_cxl bit, in lieu of
> > adding another is_cxl bit to struct aer_err_info?
> >
> > If so, please document it in a code comment or at least in the commit
> > message.  If there isn't, please use dev->is_cxl.
> 
> [..] the
> actual device bus state can change between capturing the AER status and
> handling/logging. An example is a training HW error. Caching the 'is_cxl'
> will allow the drivers to properly identify the error bus type for
> further logging and handling.

Thanks for the explanation.  Could you document this in a code comment?
It's not obvious at least to me.

Lukas