[PATCH v14 14/34] PCI/AER: Report CXL or PCIe bus type in AER trace logging

Terry Bowman posted 34 patches 3 weeks, 4 days ago
There is a newer version of this series
[PATCH v14 14/34] PCI/AER: Report CXL or PCIe bus type in AER trace logging
Posted by Terry Bowman 3 weeks, 4 days ago
The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
for all errors. Update the driver and aer_event tracing to log 'CXL Bus
Type' for CXL device errors.

This requires that AER can identify and distinguish between PCIe errors and
CXL errors.

Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in
aer_get_device_error_info() and pci_print_aer().

Update the aer_event trace routine to accept a bus type string parameter.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Co-developed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>

---

Changes in v13->v14:
- Merged with Dan's commit. Changes are moving bus_type the last
  parameter in function calls (Dan)
- Removed all DCOs because of changes (Terry)
- Update commit message (Bjorn)
- Add Bjorn's ack-by

Changes in v12->v13:
- Remove duplicated aer_err_info inline comments. Is already in the
  kernel-doc header (Ben)

Changes in v11->v12:
 - Change aer_err_info::is_cxl to be bool a bitfield. Update structure
 padding. (Lukas)
 - Add kernel-doc for 'struct aer_err_info' (Lukas)

Changes in v10->v11:
 - Remove duplicate call to trace_aer_event() (Shiju)
 - Added Dan William's and Dave Jiang's reviewed-by
---
 drivers/pci/pci.h       |  8 +++++++-
 drivers/pci/pcie/aer.c  | 20 +++++++++++++-------
 include/ras/ras_event.h | 12 ++++++++----
 3 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 0e67014aa001..41ec38e82c08 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -738,7 +738,8 @@ struct aer_err_info {
 	unsigned int multi_error_valid:1;
 
 	unsigned int first_error:5;
-	unsigned int __pad2:2;
+	unsigned int __pad2:1;
+	unsigned int is_cxl:1;
 	unsigned int tlp_header_valid:1;
 
 	unsigned int status;		/* COR/UNCOR Error Status */
@@ -749,6 +750,11 @@ struct aer_err_info {
 int aer_get_device_error_info(struct aer_err_info *info, int i);
 void aer_print_error(struct aer_err_info *info, int i);
 
+static inline const char *aer_err_bus(struct aer_err_info *info)
+{
+	return info->is_cxl ? "CXL" : "PCIe";
+}
+
 int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
 		      unsigned int tlp_len, bool flit,
 		      struct pcie_tlp_log *log);
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index b1e6ee7468b9..d30a217fae46 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -870,6 +870,7 @@ void aer_print_error(struct aer_err_info *info, int i)
 	struct pci_dev *dev;
 	int layer, agent, id;
 	const char *level = info->level;
+	const char *bus_type = aer_err_bus(info);
 
 	if (WARN_ON_ONCE(i >= AER_MAX_MULTI_ERR_DEVICES))
 		return;
@@ -879,22 +880,22 @@ void aer_print_error(struct aer_err_info *info, int i)
 
 	pci_dev_aer_stats_incr(dev, info);
 	trace_aer_event(pci_name(dev), (info->status & ~info->mask),
-			info->severity, info->tlp_header_valid, &info->tlp);
+			info->severity, info->tlp_header_valid, &info->tlp, bus_type);
 
 	if (!info->ratelimit_print[i])
 		return;
 
 	if (!info->status) {
-		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
-			aer_error_severity_string[info->severity]);
+		pci_err(dev, "%s Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
+			bus_type, aer_error_severity_string[info->severity]);
 		goto out;
 	}
 
 	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
 	agent = AER_GET_AGENT(info->severity, info->status);
 
-	aer_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
-		   aer_error_severity_string[info->severity],
+	aer_printk(level, dev, "%s Bus Error: severity=%s, type=%s, (%s)\n",
+		   bus_type, aer_error_severity_string[info->severity],
 		   aer_error_layer[layer], aer_agent_string[agent]);
 
 	aer_printk(level, dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
@@ -928,6 +929,7 @@ EXPORT_SYMBOL_GPL(cper_severity_to_aer);
 void pci_print_aer(struct pci_dev *dev, int aer_severity,
 		   struct aer_capability_regs *aer)
 {
+	const char *bus_type;
 	int layer, agent, tlp_header_valid = 0;
 	u32 status, mask;
 	struct aer_err_info info = {
@@ -948,10 +950,13 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 
 	info.status = status;
 	info.mask = mask;
+	info.is_cxl = pcie_is_cxl(dev);
+
+	bus_type = aer_err_bus(&info);
 
 	pci_dev_aer_stats_incr(dev, &info);
-	trace_aer_event(pci_name(dev), (status & ~mask),
-			aer_severity, tlp_header_valid, &aer->header_log);
+	trace_aer_event(pci_name(dev), (status & ~mask), aer_severity,
+			tlp_header_valid, &aer->header_log, bus_type);
 
 	if (!aer_ratelimit(dev, info.severity))
 		return;
@@ -1301,6 +1306,7 @@ int aer_get_device_error_info(struct aer_err_info *info, int i)
 	/* Must reset in this function */
 	info->status = 0;
 	info->tlp_header_valid = 0;
+	info->is_cxl = pcie_is_cxl(dev);
 
 	/* The device might not support AER */
 	if (!aer)
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index eaecc3c5f772..fdb785fa4613 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -339,9 +339,11 @@ TRACE_EVENT(aer_event,
 		 const u32 status,
 		 const u8 severity,
 		 const u8 tlp_header_valid,
-		 struct pcie_tlp_log *tlp),
+		 struct pcie_tlp_log *tlp,
+		 const char *bus_type),
 
-	TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp),
+
+	TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp, bus_type),
 
 	TP_STRUCT__entry(
 		__string(	dev_name,	dev_name	)
@@ -349,10 +351,12 @@ TRACE_EVENT(aer_event,
 		__field(	u8,		severity	)
 		__field(	u8, 		tlp_header_valid)
 		__array(	u32, 		tlp_header, PCIE_STD_MAX_TLP_HEADERLOG)
+		__string(	bus_type,	bus_type	)
 	),
 
 	TP_fast_assign(
 		__assign_str(dev_name);
+		__assign_str(bus_type);
 		__entry->status		= status;
 		__entry->severity	= severity;
 		__entry->tlp_header_valid = tlp_header_valid;
@@ -364,8 +368,8 @@ TRACE_EVENT(aer_event,
 		}
 	),
 
-	TP_printk("%s PCIe Bus Error: severity=%s, %s, TLP Header=%s\n",
-		__get_str(dev_name),
+	TP_printk("%s %s Bus Error: severity=%s, %s, TLP Header=%s\n",
+		__get_str(dev_name), __get_str(bus_type),
 		__entry->severity == AER_CORRECTABLE ? "Corrected" :
 			__entry->severity == AER_FATAL ?
 			"Fatal" : "Uncorrected, non-fatal",
-- 
2.34.1
Re: [PATCH v14 14/34] PCI/AER: Report CXL or PCIe bus type in AER trace logging
Posted by Dave Jiang 3 weeks, 4 days ago

On 1/14/26 11:20 AM, Terry Bowman wrote:
> The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
> for all errors. Update the driver and aer_event tracing to log 'CXL Bus
> Type' for CXL device errors.
> 
> This requires that AER can identify and distinguish between PCIe errors and
> CXL errors.
> 
> Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in
> aer_get_device_error_info() and pci_print_aer().
> 
> Update the aer_event trace routine to accept a bus type string parameter.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Co-developed-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>


> 
> ---
> 
> Changes in v13->v14:
> - Merged with Dan's commit. Changes are moving bus_type the last
>   parameter in function calls (Dan)
> - Removed all DCOs because of changes (Terry)
> - Update commit message (Bjorn)
> - Add Bjorn's ack-by
> 
> Changes in v12->v13:
> - Remove duplicated aer_err_info inline comments. Is already in the
>   kernel-doc header (Ben)
> 
> Changes in v11->v12:
>  - Change aer_err_info::is_cxl to be bool a bitfield. Update structure
>  padding. (Lukas)
>  - Add kernel-doc for 'struct aer_err_info' (Lukas)
> 
> Changes in v10->v11:
>  - Remove duplicate call to trace_aer_event() (Shiju)
>  - Added Dan William's and Dave Jiang's reviewed-by
> ---
>  drivers/pci/pci.h       |  8 +++++++-
>  drivers/pci/pcie/aer.c  | 20 +++++++++++++-------
>  include/ras/ras_event.h | 12 ++++++++----
>  3 files changed, 28 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 0e67014aa001..41ec38e82c08 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -738,7 +738,8 @@ struct aer_err_info {
>  	unsigned int multi_error_valid:1;
>  
>  	unsigned int first_error:5;
> -	unsigned int __pad2:2;
> +	unsigned int __pad2:1;
> +	unsigned int is_cxl:1;
>  	unsigned int tlp_header_valid:1;
>  
>  	unsigned int status;		/* COR/UNCOR Error Status */
> @@ -749,6 +750,11 @@ struct aer_err_info {
>  int aer_get_device_error_info(struct aer_err_info *info, int i);
>  void aer_print_error(struct aer_err_info *info, int i);
>  
> +static inline const char *aer_err_bus(struct aer_err_info *info)
> +{
> +	return info->is_cxl ? "CXL" : "PCIe";
> +}
> +
>  int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
>  		      unsigned int tlp_len, bool flit,
>  		      struct pcie_tlp_log *log);
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index b1e6ee7468b9..d30a217fae46 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -870,6 +870,7 @@ void aer_print_error(struct aer_err_info *info, int i)
>  	struct pci_dev *dev;
>  	int layer, agent, id;
>  	const char *level = info->level;
> +	const char *bus_type = aer_err_bus(info);
>  
>  	if (WARN_ON_ONCE(i >= AER_MAX_MULTI_ERR_DEVICES))
>  		return;
> @@ -879,22 +880,22 @@ void aer_print_error(struct aer_err_info *info, int i)
>  
>  	pci_dev_aer_stats_incr(dev, info);
>  	trace_aer_event(pci_name(dev), (info->status & ~info->mask),
> -			info->severity, info->tlp_header_valid, &info->tlp);
> +			info->severity, info->tlp_header_valid, &info->tlp, bus_type);
>  
>  	if (!info->ratelimit_print[i])
>  		return;
>  
>  	if (!info->status) {
> -		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> -			aer_error_severity_string[info->severity]);
> +		pci_err(dev, "%s Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> +			bus_type, aer_error_severity_string[info->severity]);
>  		goto out;
>  	}
>  
>  	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
>  	agent = AER_GET_AGENT(info->severity, info->status);
>  
> -	aer_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> -		   aer_error_severity_string[info->severity],
> +	aer_printk(level, dev, "%s Bus Error: severity=%s, type=%s, (%s)\n",
> +		   bus_type, aer_error_severity_string[info->severity],
>  		   aer_error_layer[layer], aer_agent_string[agent]);
>  
>  	aer_printk(level, dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
> @@ -928,6 +929,7 @@ EXPORT_SYMBOL_GPL(cper_severity_to_aer);
>  void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  		   struct aer_capability_regs *aer)
>  {
> +	const char *bus_type;
>  	int layer, agent, tlp_header_valid = 0;
>  	u32 status, mask;
>  	struct aer_err_info info = {
> @@ -948,10 +950,13 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  
>  	info.status = status;
>  	info.mask = mask;
> +	info.is_cxl = pcie_is_cxl(dev);
> +
> +	bus_type = aer_err_bus(&info);
>  
>  	pci_dev_aer_stats_incr(dev, &info);
> -	trace_aer_event(pci_name(dev), (status & ~mask),
> -			aer_severity, tlp_header_valid, &aer->header_log);
> +	trace_aer_event(pci_name(dev), (status & ~mask), aer_severity,
> +			tlp_header_valid, &aer->header_log, bus_type);
>  
>  	if (!aer_ratelimit(dev, info.severity))
>  		return;
> @@ -1301,6 +1306,7 @@ int aer_get_device_error_info(struct aer_err_info *info, int i)
>  	/* Must reset in this function */
>  	info->status = 0;
>  	info->tlp_header_valid = 0;
> +	info->is_cxl = pcie_is_cxl(dev);
>  
>  	/* The device might not support AER */
>  	if (!aer)
> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
> index eaecc3c5f772..fdb785fa4613 100644
> --- a/include/ras/ras_event.h
> +++ b/include/ras/ras_event.h
> @@ -339,9 +339,11 @@ TRACE_EVENT(aer_event,
>  		 const u32 status,
>  		 const u8 severity,
>  		 const u8 tlp_header_valid,
> -		 struct pcie_tlp_log *tlp),
> +		 struct pcie_tlp_log *tlp,
> +		 const char *bus_type),
>  
> -	TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp),
> +
> +	TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp, bus_type),
>  
>  	TP_STRUCT__entry(
>  		__string(	dev_name,	dev_name	)
> @@ -349,10 +351,12 @@ TRACE_EVENT(aer_event,
>  		__field(	u8,		severity	)
>  		__field(	u8, 		tlp_header_valid)
>  		__array(	u32, 		tlp_header, PCIE_STD_MAX_TLP_HEADERLOG)
> +		__string(	bus_type,	bus_type	)
>  	),
>  
>  	TP_fast_assign(
>  		__assign_str(dev_name);
> +		__assign_str(bus_type);
>  		__entry->status		= status;
>  		__entry->severity	= severity;
>  		__entry->tlp_header_valid = tlp_header_valid;
> @@ -364,8 +368,8 @@ TRACE_EVENT(aer_event,
>  		}
>  	),
>  
> -	TP_printk("%s PCIe Bus Error: severity=%s, %s, TLP Header=%s\n",
> -		__get_str(dev_name),
> +	TP_printk("%s %s Bus Error: severity=%s, %s, TLP Header=%s\n",
> +		__get_str(dev_name), __get_str(bus_type),
>  		__entry->severity == AER_CORRECTABLE ? "Corrected" :
>  			__entry->severity == AER_FATAL ?
>  			"Fatal" : "Uncorrected, non-fatal",
Re: [PATCH v14 14/34] PCI/AER: Report CXL or PCIe bus type in AER trace logging
Posted by Jonathan Cameron 3 weeks, 4 days ago
On Wed, 14 Jan 2026 12:20:35 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
> for all errors. Update the driver and aer_event tracing to log 'CXL Bus
> Type' for CXL device errors.
> 
> This requires that AER can identify and distinguish between PCIe errors and
> CXL errors.
> 
> Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in
> aer_get_device_error_info() and pci_print_aer().
> 
> Update the aer_event trace routine to accept a bus type string parameter.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Co-developed-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>

I wonder if it is worth using __print_symbolic() etc and an integer
storage rather than a string for in the tracepoints.
However, not really that important to me as the strings are small anyway
and there is no precedence of this in ras trace events.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
RE: [PATCH v14 14/34] PCI/AER: Report CXL or PCIe bus type in AER trace logging
Posted by Mauro Carvalho Chehab 3 weeks, 3 days ago
> > The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
> > for all errors. Update the driver and aer_event tracing to log 'CXL 
> > Bus Type' for CXL device errors.
> > 
> > This requires that AER can identify and distinguish between PCIe 
> > errors and CXL errors.
> > 
> > Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in
> > aer_get_device_error_info() and pci_print_aer().
> > 
> > Update the aer_event trace routine to accept a bus type string parameter.
> > 
> > Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> > Co-developed-by: Dan Williams <dan.j.williams@intel.com>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > Acked-by: Bjorn Helgaas <bhelgaas@google.com>
>
> I wonder if it is worth using __print_symbolic() etc and an integer storage rather than a string for in the tracepoints.
> However, not really that important to me as the strings are small anyway and there is no precedence of this in ras trace events.
>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

It would be a lot better to pass integer values instead of strings. Right now, I'm working on a way
to have a CI to check if Kernel + rasdaemon is doing the right thing. The idea is to inject errors
on QEMU via QMP interface (main patches already there at QEMU tree).

By using integers, it sounds easier to veriy if everything was properly handled, as we can
ignore __print_symbolic at rasdaemon, picking the actual values directly.

Regards,
Mauro