[PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors

Terry Bowman posted 17 patches 3 months, 1 week ago
There is a newer version of this series
[PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
Posted by Terry Bowman 3 months, 1 week ago
CXL error handling will soon be moved from the AER driver into the CXL
driver. This requires a notification mechanism for the AER driver to share
the AER interrupt with the CXL driver. The notification will be used
as an indication for the CXL drivers to handle and log the CXL RAS errors.

First, introduce cxl/core/native_ras.c to contain changes for the CXL
driver's RAS native handling. This as an alternative to dropping the
changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
conditionally compile the new file.

Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
driver will be the sole kfifo producer adding work and the cxl_core will be
the sole kfifo consumer removing work. Add the boilerplate kfifo support.

Add CXL work queue handler registration functions in the AER driver. Export
the functions allowing CXL driver to access. Implement registration
functions for the CXL driver to assign or clear the work handler function.

Introduce 'struct cxl_proto_err_info' to serve as the kfifo work data. This
will contain the erring device's PCI SBDF details used to rediscover the
device after the CXL driver dequeues the kfifo work. The device rediscovery
will be introduced along with the CXL handling in future patches.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/Kconfig           | 14 ++++++++
 drivers/cxl/core/Makefile     |  1 +
 drivers/cxl/core/core.h       |  8 +++++
 drivers/cxl/core/native_ras.c | 26 +++++++++++++++
 drivers/cxl/core/port.c       |  2 ++
 drivers/cxl/core/ras.c        |  1 +
 drivers/cxl/cxlpci.h          |  1 +
 drivers/pci/pci.h             |  4 +++
 drivers/pci/pcie/aer.c        |  7 ++--
 drivers/pci/pcie/cxl_aer.c    | 60 +++++++++++++++++++++++++++++++++++
 include/linux/aer.h           | 31 ++++++++++++++++++
 11 files changed, 153 insertions(+), 2 deletions(-)
 create mode 100644 drivers/cxl/core/native_ras.c

diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 48b7314afdb8..57274de54a45 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -233,4 +233,18 @@ config CXL_MCE
 	def_bool y
 	depends on X86_MCE && MEMORY_FAILURE
 
+config CXL_NATIVE_RAS
+	bool "CXL: Enable CXL RAS native handling"
+	depends on PCIEAER_CXL
+	default CXL_BUS
+	help
+	  Enable native CXL RAS protocol error handling and logging in the CXL
+	  drivers. This functionality relies on the AER service driver being
+	  enabled, as the AER interrupt is used to inform the operating system
+	  of CXL RAS protocol errors. The platform must be configured to
+	  utilize AER reporting for interrupts.
+
+	  If unsure, or if this kernel is meant for production environments,
+	  say Y.
+
 endif
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index 79e2ef81fde8..16f5832e5cc4 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -21,3 +21,4 @@ cxl_core-$(CONFIG_CXL_REGION) += region.o
 cxl_core-$(CONFIG_CXL_MCE) += mce.o
 cxl_core-$(CONFIG_CXL_FEATURES) += features.o
 cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
+cxl_core-$(CONFIG_CXL_NATIVE_RAS) += native_ras.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 29b61828a847..4c08bb92e2f9 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -123,6 +123,14 @@ int cxl_gpf_port_setup(struct cxl_dport *dport);
 int cxl_acpi_get_extended_linear_cache_size(struct resource *backing_res,
 					    int nid, resource_size_t *size);
 
+#ifdef CONFIG_PCIEAER_CXL
+void cxl_native_ras_init(void);
+void cxl_native_ras_exit(void);
+#else
+static inline void cxl_native_ras_init(void) { };
+static inline void cxl_native_ras_exit(void) { };
+#endif
+
 #ifdef CONFIG_CXL_FEATURES
 struct cxl_feat_entry *
 cxl_feature_info(struct cxl_features_state *cxlfs, const uuid_t *uuid);
diff --git a/drivers/cxl/core/native_ras.c b/drivers/cxl/core/native_ras.c
new file mode 100644
index 000000000000..011815ddaae3
--- /dev/null
+++ b/drivers/cxl/core/native_ras.c
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
+
+#include <linux/pci.h>
+#include <linux/aer.h>
+#include <cxl/event.h>
+#include <cxlmem.h>
+#include <core/core.h>
+
+static void cxl_proto_err_work_fn(struct work_struct *work)
+{
+}
+
+static struct work_struct cxl_proto_err_work;
+static DECLARE_WORK(cxl_proto_err_work, cxl_proto_err_work_fn);
+
+void cxl_native_ras_init(void)
+{
+	cxl_register_proto_err_work(&cxl_proto_err_work);
+}
+
+void cxl_native_ras_exit(void)
+{
+	cxl_unregister_proto_err_work();
+	cancel_work_sync(&cxl_proto_err_work);
+}
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index eb46c6764d20..8e8f21197c86 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -2345,6 +2345,8 @@ static __init int cxl_core_init(void)
 	if (rc)
 		goto err_ras;
 
+	cxl_native_ras_init();
+
 	return 0;
 
 err_ras:
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 485a831695c7..962dc94fed8c 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -5,6 +5,7 @@
 #include <linux/aer.h>
 #include <cxl/event.h>
 #include <cxlmem.h>
+#include <cxlpci.h>
 #include "trace.h"
 
 static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 54e219b0049e..6f1396ef7b77 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -4,6 +4,7 @@
 #define __CXL_PCI_H__
 #include <linux/pci.h>
 #include "cxl.h"
+#include "linux/aer.h"
 
 #define CXL_MEMORY_PROGIF	0x10
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 91b583cf18eb..29c11c7136d3 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -1032,9 +1032,13 @@ static inline void pci_restore_aer_state(struct pci_dev *dev) { }
 #ifdef CONFIG_PCIEAER_CXL
 void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info);
 void cxl_rch_enable_rcec(struct pci_dev *rcec);
+bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info);
+void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info);
 #else
 static inline void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) { }
 static inline void cxl_rch_enable_rcec(struct pci_dev *rcec) { }
+static inline bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info) { return false; }
+static inline void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info) { }
 #endif
 
 #ifdef CONFIG_ACPI
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 0b4d721980ef..8417a49c71f2 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1130,8 +1130,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 
 static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
 {
-	cxl_rch_handle_error(dev, info);
-	pci_aer_handle_error(dev, info);
+	if (is_cxl_error(dev, info))
+		forward_cxl_error(dev, info);
+	else
+		pci_aer_handle_error(dev, info);
+
 	pci_dev_put(dev);
 }
 
diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
index b2ea14f70055..846ab55d747c 100644
--- a/drivers/pci/pcie/cxl_aer.c
+++ b/drivers/pci/pcie/cxl_aer.c
@@ -3,8 +3,11 @@
 
 #include <linux/pci.h>
 #include <linux/aer.h>
+#include <linux/kfifo.h>
 #include "../pci.h"
 
+#define CXL_ERROR_SOURCES_MAX          128
+
 /**
  * pci_aer_unmask_internal_errors - unmask internal errors
  * @dev: pointer to the pci_dev data structure
@@ -64,6 +67,19 @@ static bool is_internal_error(struct aer_err_info *info)
 	return info->status & PCI_ERR_UNC_INTN;
 }
 
+bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
+{
+	if (!info || !info->is_cxl)
+		return false;
+
+	/* Only CXL Endpoints are currently supported */
+	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
+	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_EC))
+		return false;
+
+	return is_internal_error(info);
+}
+
 static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
 {
 	struct aer_err_info *info = (struct aer_err_info *)data;
@@ -136,3 +152,47 @@ void cxl_rch_enable_rcec(struct pci_dev *rcec)
 	pci_info(rcec, "CXL: Internal errors unmasked");
 }
 
+static DEFINE_KFIFO(cxl_proto_err_fifo, struct cxl_proto_err_work_data,
+		    CXL_ERROR_SOURCES_MAX);
+static DEFINE_SPINLOCK(cxl_proto_err_fifo_lock);
+struct work_struct *cxl_proto_err_work;
+
+void cxl_register_proto_err_work(struct work_struct *work)
+{
+	guard(spinlock)(&cxl_proto_err_fifo_lock);
+	cxl_proto_err_work = work;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_register_proto_err_work, "CXL");
+
+void cxl_unregister_proto_err_work(void)
+{
+	guard(spinlock)(&cxl_proto_err_fifo_lock);
+	cxl_proto_err_work = NULL;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_unregister_proto_err_work, "CXL");
+
+int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd)
+{
+	return kfifo_get(&cxl_proto_err_fifo, wd);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_proto_err_kfifo_get, "CXL");
+
+void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info)
+{
+	struct cxl_proto_err_work_data wd;
+
+	wd.err_info = (struct cxl_proto_error_info) {
+		.severity = aer_err_info->severity,
+		.devfn = pdev->devfn,
+		.bus = pdev->bus->number,
+		.segment = pci_domain_nr(pdev->bus)
+	};
+
+	if (!kfifo_put(&cxl_proto_err_fifo, wd)) {
+		dev_err_ratelimited(&pdev->dev, "CXL kfifo overflow\n");
+		return;
+	}
+
+	schedule_work(cxl_proto_err_work);
+}
+
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 02940be66324..24c3d9e18ad5 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -10,6 +10,7 @@
 
 #include <linux/errno.h>
 #include <linux/types.h>
+#include <linux/workqueue_types.h>
 
 #define AER_NONFATAL			0
 #define AER_FATAL			1
@@ -53,6 +54,26 @@ struct aer_capability_regs {
 	u16 uncor_err_source;
 };
 
+/**
+ * struct cxl_proto_err_info - Error information used in CXL error handling
+ * @severity: AER severity
+ * @function: Device's PCI function
+ * @device: Device's PCI device
+ * @bus: Device's PCI bus
+ * @segment: Device's PCI segment
+ */
+struct cxl_proto_error_info {
+	int severity;
+
+	u8 devfn;
+	u8 bus;
+	u16 segment;
+};
+
+struct cxl_proto_err_work_data {
+	struct cxl_proto_error_info err_info;
+};
+
 #if defined(CONFIG_PCIEAER)
 int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
 int pcie_aer_is_native(struct pci_dev *dev);
@@ -64,6 +85,16 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
 static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
 #endif
 
+#if defined(CONFIG_PCIEAER_CXL)
+void cxl_register_proto_err_work(struct work_struct *work);
+void cxl_unregister_proto_err_work(void);
+int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd);
+#else
+static inline void cxl_register_proto_err_work(struct work_struct *work) { }
+static inline void cxl_unregister_proto_err_work(void) { }
+static inline int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd) { return 0; }
+#endif
+
 void pci_print_aer(struct pci_dev *dev, int aer_severity,
 		    struct aer_capability_regs *aer);
 int cper_severity_to_aer(int cper_severity);
-- 
2.34.1
Re: [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
Posted by dan.j.williams@intel.com 2 months, 2 weeks ago
Terry Bowman wrote:
> CXL error handling will soon be moved from the AER driver into the CXL
> driver. This requires a notification mechanism for the AER driver to share
> the AER interrupt with the CXL driver. The notification will be used
> as an indication for the CXL drivers to handle and log the CXL RAS errors.
> 
> First, introduce cxl/core/native_ras.c to contain changes for the CXL
> driver's RAS native handling. This as an alternative to dropping the
> changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
> Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
> conditionally compile the new file.

I see no daylight between CXL_NATIVE_RAS and PCIEAER_CXL, one of those
needs to subsume the other. I also do not understand the point of
"NATIVE" in the name. Will not CPER notified protocol errors be routed
to the same CXL error handling infrastructure as AER notified protocol
errors? I.e. the aer_recover_queue() path?

> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
> driver will be the sole kfifo producer adding work and the cxl_core will be
> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
> 
> Add CXL work queue handler registration functions in the AER driver. Export
> the functions allowing CXL driver to access. Implement registration
> functions for the CXL driver to assign or clear the work handler function.
> 
> Introduce 'struct cxl_proto_err_info' to serve as the kfifo work data. This
> will contain the erring device's PCI SBDF details used to rediscover the
> device after the CXL driver dequeues the kfifo work. The device rediscovery
> will be introduced along with the CXL handling in future patches.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/Kconfig           | 14 ++++++++
>  drivers/cxl/core/Makefile     |  1 +
>  drivers/cxl/core/core.h       |  8 +++++
>  drivers/cxl/core/native_ras.c | 26 +++++++++++++++
>  drivers/cxl/core/port.c       |  2 ++
>  drivers/cxl/core/ras.c        |  1 +
>  drivers/cxl/cxlpci.h          |  1 +
>  drivers/pci/pci.h             |  4 +++
>  drivers/pci/pcie/aer.c        |  7 ++--
>  drivers/pci/pcie/cxl_aer.c    | 60 +++++++++++++++++++++++++++++++++++
>  include/linux/aer.h           | 31 ++++++++++++++++++
>  11 files changed, 153 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/cxl/core/native_ras.c
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index 48b7314afdb8..57274de54a45 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -233,4 +233,18 @@ config CXL_MCE
>  	def_bool y
>  	depends on X86_MCE && MEMORY_FAILURE
>  
> +config CXL_NATIVE_RAS
> +	bool "CXL: Enable CXL RAS native handling"
> +	depends on PCIEAER_CXL

This nice helpful option is hidden if someone forgets to set the
PCIEAER_CXL option which does not have helpful text. Given the
dependencies, I am leaning towards drop this new option, move the
help text to PCIEAER_CXL... but let me read the rest of the patch first.

You can move PCIEAER_CXL to drivers/cxl/Kconfig if you want to keep all
the CXL options in the CXL menu.

> +	default CXL_BUS
> +	help
> +	  Enable native CXL RAS protocol error handling and logging in the CXL
> +	  drivers. This functionality relies on the AER service driver being
> +	  enabled,

No need to put dependencies in the help text the tool will tell them
that PCIEAER=y is a dependency.

>         as the AER interrupt is used to inform the operating system
> +	  of CXL RAS protocol errors. The platform must be configured to
> +	  utilize AER reporting for interrupts.

Per above, does any of CXL CPER reporting make its way into this path?

> +
> +	  If unsure, or if this kernel is meant for production environments,
> +	  say Y.

I think: "If unsure, say Y" is sufficient.

> +
>  endif
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index 79e2ef81fde8..16f5832e5cc4 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -21,3 +21,4 @@ cxl_core-$(CONFIG_CXL_REGION) += region.o
>  cxl_core-$(CONFIG_CXL_MCE) += mce.o
>  cxl_core-$(CONFIG_CXL_FEATURES) += features.o
>  cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
> +cxl_core-$(CONFIG_CXL_NATIVE_RAS) += native_ras.o
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 29b61828a847..4c08bb92e2f9 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -123,6 +123,14 @@ int cxl_gpf_port_setup(struct cxl_dport *dport);
>  int cxl_acpi_get_extended_linear_cache_size(struct resource *backing_res,
>  					    int nid, resource_size_t *size);
>  
> +#ifdef CONFIG_PCIEAER_CXL
> +void cxl_native_ras_init(void);
> +void cxl_native_ras_exit(void);
> +#else
> +static inline void cxl_native_ras_init(void) { };
> +static inline void cxl_native_ras_exit(void) { };
> +#endif
> +
>  #ifdef CONFIG_CXL_FEATURES
>  struct cxl_feat_entry *
>  cxl_feature_info(struct cxl_features_state *cxlfs, const uuid_t *uuid);
> diff --git a/drivers/cxl/core/native_ras.c b/drivers/cxl/core/native_ras.c
> new file mode 100644
> index 000000000000..011815ddaae3
> --- /dev/null
> +++ b/drivers/cxl/core/native_ras.c
> @@ -0,0 +1,26 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
> +
> +#include <linux/pci.h>
> +#include <linux/aer.h>
> +#include <cxl/event.h>
> +#include <cxlmem.h>
> +#include <core/core.h>
> +
> +static void cxl_proto_err_work_fn(struct work_struct *work)
> +{
> +}
> +
> +static struct work_struct cxl_proto_err_work;
> +static DECLARE_WORK(cxl_proto_err_work, cxl_proto_err_work_fn);
> +
> +void cxl_native_ras_init(void)
> +{
> +	cxl_register_proto_err_work(&cxl_proto_err_work);
> +}
> +
> +void cxl_native_ras_exit(void)
> +{
> +	cxl_unregister_proto_err_work();
> +	cancel_work_sync(&cxl_proto_err_work);
> +}
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index eb46c6764d20..8e8f21197c86 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -2345,6 +2345,8 @@ static __init int cxl_core_init(void)
>  	if (rc)
>  		goto err_ras;
>  
> +	cxl_native_ras_init();
> +
>  	return 0;
>  
>  err_ras:
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 485a831695c7..962dc94fed8c 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -5,6 +5,7 @@
>  #include <linux/aer.h>
>  #include <cxl/event.h>
>  #include <cxlmem.h>
> +#include <cxlpci.h>
>  #include "trace.h"
>  
>  static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 54e219b0049e..6f1396ef7b77 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -4,6 +4,7 @@
>  #define __CXL_PCI_H__
>  #include <linux/pci.h>
>  #include "cxl.h"
> +#include "linux/aer.h"
>  
>  #define CXL_MEMORY_PROGIF	0x10
>  
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 91b583cf18eb..29c11c7136d3 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -1032,9 +1032,13 @@ static inline void pci_restore_aer_state(struct pci_dev *dev) { }
>  #ifdef CONFIG_PCIEAER_CXL
>  void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info);
>  void cxl_rch_enable_rcec(struct pci_dev *rcec);
> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info);
> +void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info);
>  #else
>  static inline void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) { }
>  static inline void cxl_rch_enable_rcec(struct pci_dev *rcec) { }
> +static inline bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info) { return false; }
> +static inline void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info) { }
>  #endif
>  
>  #ifdef CONFIG_ACPI
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 0b4d721980ef..8417a49c71f2 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1130,8 +1130,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  
>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>  {
> -	cxl_rch_handle_error(dev, info);

No, can not just drop what was working before, even if you restore the
functionality in a later patch in the same series.

I would expect that this patch at a minimum maintains RCH handling and
forwards anything else to the CXL core for VH handling.

> -	pci_aer_handle_error(dev, info);
> +	if (is_cxl_error(dev, info))
> +		forward_cxl_error(dev, info);
> +	else
> +		pci_aer_handle_error(dev, info);
> +
>  	pci_dev_put(dev);
>  }
>  
> diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
> index b2ea14f70055..846ab55d747c 100644
> --- a/drivers/pci/pcie/cxl_aer.c
> +++ b/drivers/pci/pcie/cxl_aer.c

With the RCH bits moved to its own file then this file would be 100%
concerned with typical CXL VH error handling and deserve to carry the
"cxl_aer.c" name.

> @@ -3,8 +3,11 @@
>  
>  #include <linux/pci.h>
>  #include <linux/aer.h>
> +#include <linux/kfifo.h>
>  #include "../pci.h"
>  
> +#define CXL_ERROR_SOURCES_MAX          128
> +
>  /**
>   * pci_aer_unmask_internal_errors - unmask internal errors
>   * @dev: pointer to the pci_dev data structure
> @@ -64,6 +67,19 @@ static bool is_internal_error(struct aer_err_info *info)
>  	return info->status & PCI_ERR_UNC_INTN;
>  }
>  
> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
> +{
> +	if (!info || !info->is_cxl)
> +		return false;
> +
> +	/* Only CXL Endpoints are currently supported */
> +	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_EC))
> +		return false;
> +
> +	return is_internal_error(info);
> +}
> +
>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>  {
>  	struct aer_err_info *info = (struct aer_err_info *)data;
> @@ -136,3 +152,47 @@ void cxl_rch_enable_rcec(struct pci_dev *rcec)
>  	pci_info(rcec, "CXL: Internal errors unmasked");
>  }
>  
> +static DEFINE_KFIFO(cxl_proto_err_fifo, struct cxl_proto_err_work_data,
> +		    CXL_ERROR_SOURCES_MAX);
> +static DEFINE_SPINLOCK(cxl_proto_err_fifo_lock);
> +struct work_struct *cxl_proto_err_work;

Please make this one combo object with one registration entry point. 

struct cxl_prot_err_work {
        struct work_struct work;
        DECLARE_KFIFO(fifo, struct cxl_proto_err_work_data,
                      CXL_ERROR_SOURCES_MAX);
};      

Bonus points to go back and clean up the CPER code to do the same to
reduce the amount of "registration" APIs.

> +
> +void cxl_register_proto_err_work(struct work_struct *work)
> +{
> +	guard(spinlock)(&cxl_proto_err_fifo_lock);

This lock acquisition is not protecting anything. 'unsigned long'
assignments are already atomic and forward_cxl_error() looks like it
happily de-references NULL pointers without checking the lock.

I would make it an rwsem. Hold the rwsem for write at registration /
unregistration...

> +	cxl_proto_err_work = work;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_register_proto_err_work, "CXL");
> +
> +void cxl_unregister_proto_err_work(void)
> +{
> +	guard(spinlock)(&cxl_proto_err_fifo_lock);
> +	cxl_proto_err_work = NULL;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_unregister_proto_err_work, "CXL");
> +
> +int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd)
> +{
> +	return kfifo_get(&cxl_proto_err_fifo, wd);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_proto_err_kfifo_get, "CXL");
> +
> +void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info)
> +{
> +	struct cxl_proto_err_work_data wd;
> +
> +	wd.err_info = (struct cxl_proto_error_info) {
> +		.severity = aer_err_info->severity,
> +		.devfn = pdev->devfn,
> +		.bus = pdev->bus->number,
> +		.segment = pci_domain_nr(pdev->bus)
> +	};

...hold the rwsem for read when de-referencing a 'struct
cxl_prot_err_work *'

> +
> +	if (!kfifo_put(&cxl_proto_err_fifo, wd)) {
> +		dev_err_ratelimited(&pdev->dev, "CXL kfifo overflow\n");

In the case that 'struct cxl_prot_err_work *' is NULL, perhaps this
should be a dev_warn_once() to say "hey, we're seeing CXL errors, but
nobody registered the CXL core!?".

> +		return;
> +	}
> +
> +	schedule_work(cxl_proto_err_work);
> +}
> +
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 02940be66324..24c3d9e18ad5 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -10,6 +10,7 @@
>  
>  #include <linux/errno.h>
>  #include <linux/types.h>
> +#include <linux/workqueue_types.h>
>  
>  #define AER_NONFATAL			0
>  #define AER_FATAL			1
> @@ -53,6 +54,26 @@ struct aer_capability_regs {
>  	u16 uncor_err_source;
>  };
>  
> +/**
> + * struct cxl_proto_err_info - Error information used in CXL error handling
> + * @severity: AER severity
> + * @function: Device's PCI function
> + * @device: Device's PCI device
> + * @bus: Device's PCI bus
> + * @segment: Device's PCI segment
> + */
> +struct cxl_proto_error_info {
> +	int severity;
> +
> +	u8 devfn;
> +	u8 bus;
> +	u16 segment;
> +};
> +
> +struct cxl_proto_err_work_data {
> +	struct cxl_proto_error_info err_info;
> +};

Why not use cxl_proto_error_info directly?
Re: [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
Posted by Bowman, Terry 2 months, 2 weeks ago

On 7/23/2025 9:01 PM, dan.j.williams@intel.com wrote:
> Terry Bowman wrote:
>> CXL error handling will soon be moved from the AER driver into the CXL
>> driver. This requires a notification mechanism for the AER driver to share
>> the AER interrupt with the CXL driver. The notification will be used
>> as an indication for the CXL drivers to handle and log the CXL RAS errors.
>>
>> First, introduce cxl/core/native_ras.c to contain changes for the CXL
>> driver's RAS native handling. This as an alternative to dropping the
>> changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
>> Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
>> conditionally compile the new file.
> I see no daylight between CXL_NATIVE_RAS and PCIEAER_CXL, one of those
> needs to subsume the other. I also do not understand the point of
> "NATIVE" in the name. Will not CPER notified protocol errors be routed
> to the same CXL error handling infrastructure as AER notified protocol
> errors? I.e. the aer_recover_queue() path?

This change and comment is planned to be removed in v11. Instead of introducing this 
as a new file. The same changes will instead be added to pci_aer.c/pci_ras.c Dave Jiang
is introducing here:

https://lore.kernel.org/linux-cxl/20250721170415.285961-1-dave.jiang@intel.com/
>> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
>> driver will be the sole kfifo producer adding work and the cxl_core will be
>> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
>>
>> Add CXL work queue handler registration functions in the AER driver. Export
>> the functions allowing CXL driver to access. Implement registration
>> functions for the CXL driver to assign or clear the work handler function.
>>
>> Introduce 'struct cxl_proto_err_info' to serve as the kfifo work data. This
>> will contain the erring device's PCI SBDF details used to rediscover the
>> device after the CXL driver dequeues the kfifo work. The device rediscovery
>> will be introduced along with the CXL handling in future patches.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>>  drivers/cxl/Kconfig           | 14 ++++++++
>>  drivers/cxl/core/Makefile     |  1 +
>>  drivers/cxl/core/core.h       |  8 +++++
>>  drivers/cxl/core/native_ras.c | 26 +++++++++++++++
>>  drivers/cxl/core/port.c       |  2 ++
>>  drivers/cxl/core/ras.c        |  1 +
>>  drivers/cxl/cxlpci.h          |  1 +
>>  drivers/pci/pci.h             |  4 +++
>>  drivers/pci/pcie/aer.c        |  7 ++--
>>  drivers/pci/pcie/cxl_aer.c    | 60 +++++++++++++++++++++++++++++++++++
>>  include/linux/aer.h           | 31 ++++++++++++++++++
>>  11 files changed, 153 insertions(+), 2 deletions(-)
>>  create mode 100644 drivers/cxl/core/native_ras.c
>>
>> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
>> index 48b7314afdb8..57274de54a45 100644
>> --- a/drivers/cxl/Kconfig
>> +++ b/drivers/cxl/Kconfig
>> @@ -233,4 +233,18 @@ config CXL_MCE
>>  	def_bool y
>>  	depends on X86_MCE && MEMORY_FAILURE
>>  
>> +config CXL_NATIVE_RAS
>> +	bool "CXL: Enable CXL RAS native handling"
>> +	depends on PCIEAER_CXL
> This nice helpful option is hidden if someone forgets to set the
> PCIEAER_CXL option which does not have helpful text. Given the
> dependencies, I am leaning towards drop this new option, move the
> help text to PCIEAER_CXL... but let me read the rest of the patch first.
>
> You can move PCIEAER_CXL to drivers/cxl/Kconfig if you want to keep all
> the CXL options in the CXL menu.
>
>> +	default CXL_BUS
>> +	help
>> +	  Enable native CXL RAS protocol error handling and logging in the CXL
>> +	  drivers. This functionality relies on the AER service driver being
>> +	  enabled,
> No need to put dependencies in the help text the tool will tell them
> that PCIEAER=y is a dependency.
>
>>         as the AER interrupt is used to inform the operating system
>> +	  of CXL RAS protocol errors. The platform must be configured to
>> +	  utilize AER reporting for interrupts.
> Per above, does any of CXL CPER reporting make its way into this path?
>
>> +
>> +	  If unsure, or if this kernel is meant for production environments,
>> +	  say Y.
> I think: "If unsure, say Y" is sufficient.
>
>> +
>>  endif
>> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
>> index 79e2ef81fde8..16f5832e5cc4 100644
>> --- a/drivers/cxl/core/Makefile
>> +++ b/drivers/cxl/core/Makefile
>> @@ -21,3 +21,4 @@ cxl_core-$(CONFIG_CXL_REGION) += region.o
>>  cxl_core-$(CONFIG_CXL_MCE) += mce.o
>>  cxl_core-$(CONFIG_CXL_FEATURES) += features.o
>>  cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
>> +cxl_core-$(CONFIG_CXL_NATIVE_RAS) += native_ras.o
>> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
>> index 29b61828a847..4c08bb92e2f9 100644
>> --- a/drivers/cxl/core/core.h
>> +++ b/drivers/cxl/core/core.h
>> @@ -123,6 +123,14 @@ int cxl_gpf_port_setup(struct cxl_dport *dport);
>>  int cxl_acpi_get_extended_linear_cache_size(struct resource *backing_res,
>>  					    int nid, resource_size_t *size);
>>  
>> +#ifdef CONFIG_PCIEAER_CXL
>> +void cxl_native_ras_init(void);
>> +void cxl_native_ras_exit(void);
>> +#else
>> +static inline void cxl_native_ras_init(void) { };
>> +static inline void cxl_native_ras_exit(void) { };
>> +#endif
>> +
>>  #ifdef CONFIG_CXL_FEATURES
>>  struct cxl_feat_entry *
>>  cxl_feature_info(struct cxl_features_state *cxlfs, const uuid_t *uuid);
>> diff --git a/drivers/cxl/core/native_ras.c b/drivers/cxl/core/native_ras.c
>> new file mode 100644
>> index 000000000000..011815ddaae3
>> --- /dev/null
>> +++ b/drivers/cxl/core/native_ras.c
>> @@ -0,0 +1,26 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
>> +
>> +#include <linux/pci.h>
>> +#include <linux/aer.h>
>> +#include <cxl/event.h>
>> +#include <cxlmem.h>
>> +#include <core/core.h>
>> +
>> +static void cxl_proto_err_work_fn(struct work_struct *work)
>> +{
>> +}
>> +
>> +static struct work_struct cxl_proto_err_work;
>> +static DECLARE_WORK(cxl_proto_err_work, cxl_proto_err_work_fn);
>> +
>> +void cxl_native_ras_init(void)
>> +{
>> +	cxl_register_proto_err_work(&cxl_proto_err_work);
>> +}
>> +
>> +void cxl_native_ras_exit(void)
>> +{
>> +	cxl_unregister_proto_err_work();
>> +	cancel_work_sync(&cxl_proto_err_work);
>> +}
>> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
>> index eb46c6764d20..8e8f21197c86 100644
>> --- a/drivers/cxl/core/port.c
>> +++ b/drivers/cxl/core/port.c
>> @@ -2345,6 +2345,8 @@ static __init int cxl_core_init(void)
>>  	if (rc)
>>  		goto err_ras;
>>  
>> +	cxl_native_ras_init();
>> +
>>  	return 0;
>>  
>>  err_ras:
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index 485a831695c7..962dc94fed8c 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
>> @@ -5,6 +5,7 @@
>>  #include <linux/aer.h>
>>  #include <cxl/event.h>
>>  #include <cxlmem.h>
>> +#include <cxlpci.h>
>>  #include "trace.h"
>>  
>>  static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
>> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
>> index 54e219b0049e..6f1396ef7b77 100644
>> --- a/drivers/cxl/cxlpci.h
>> +++ b/drivers/cxl/cxlpci.h
>> @@ -4,6 +4,7 @@
>>  #define __CXL_PCI_H__
>>  #include <linux/pci.h>
>>  #include "cxl.h"
>> +#include "linux/aer.h"
>>  
>>  #define CXL_MEMORY_PROGIF	0x10
>>  
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index 91b583cf18eb..29c11c7136d3 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -1032,9 +1032,13 @@ static inline void pci_restore_aer_state(struct pci_dev *dev) { }
>>  #ifdef CONFIG_PCIEAER_CXL
>>  void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info);
>>  void cxl_rch_enable_rcec(struct pci_dev *rcec);
>> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info);
>> +void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info);
>>  #else
>>  static inline void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) { }
>>  static inline void cxl_rch_enable_rcec(struct pci_dev *rcec) { }
>> +static inline bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info) { return false; }
>> +static inline void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info) { }
>>  #endif
>>  
>>  #ifdef CONFIG_ACPI
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 0b4d721980ef..8417a49c71f2 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -1130,8 +1130,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>  
>>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>>  {
>> -	cxl_rch_handle_error(dev, info);
> No, can not just drop what was working before, even if you restore the
> functionality in a later patch in the same series.
>
> I would expect that this patch at a minimum maintains RCH handling and
> forwards anything else to the CXL core for VH handling.

You want RCH handling to stay here in the AER driver or does the patch changes need 
to be reworked to present the change better?

>> -	pci_aer_handle_error(dev, info);
>> +	if (is_cxl_error(dev, info))
>> +		forward_cxl_error(dev, info);
>> +	else
>> +		pci_aer_handle_error(dev, info);
>> +
>>  	pci_dev_put(dev);
>>  }
>>  
>> diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
>> index b2ea14f70055..846ab55d747c 100644
>> --- a/drivers/pci/pcie/cxl_aer.c
>> +++ b/drivers/pci/pcie/cxl_aer.c
> With the RCH bits moved to its own file then this file would be 100%
> concerned with typical CXL VH error handling and deserve to carry the
> "cxl_aer.c" name.
The plan was to move all the handling to cxl/core/pci_aer.c or whatever it is renamed.
>> @@ -3,8 +3,11 @@
>>  
>>  #include <linux/pci.h>
>>  #include <linux/aer.h>
>> +#include <linux/kfifo.h>
>>  #include "../pci.h"
>>  
>> +#define CXL_ERROR_SOURCES_MAX          128
>> +
>>  /**
>>   * pci_aer_unmask_internal_errors - unmask internal errors
>>   * @dev: pointer to the pci_dev data structure
>> @@ -64,6 +67,19 @@ static bool is_internal_error(struct aer_err_info *info)
>>  	return info->status & PCI_ERR_UNC_INTN;
>>  }
>>  
>> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
>> +{
>> +	if (!info || !info->is_cxl)
>> +		return false;
>> +
>> +	/* Only CXL Endpoints are currently supported */
>> +	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
>> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_EC))
>> +		return false;
>> +
>> +	return is_internal_error(info);
>> +}
>> +
>>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>  {
>>  	struct aer_err_info *info = (struct aer_err_info *)data;
>> @@ -136,3 +152,47 @@ void cxl_rch_enable_rcec(struct pci_dev *rcec)
>>  	pci_info(rcec, "CXL: Internal errors unmasked");
>>  }
>>  
>> +static DEFINE_KFIFO(cxl_proto_err_fifo, struct cxl_proto_err_work_data,
>> +		    CXL_ERROR_SOURCES_MAX);
>> +static DEFINE_SPINLOCK(cxl_proto_err_fifo_lock);
>> +struct work_struct *cxl_proto_err_work;
> Please make this one combo object with one registration entry point. 
>
> struct cxl_prot_err_work {
>         struct work_struct work;
>         DECLARE_KFIFO(fifo, struct cxl_proto_err_work_data,
>                       CXL_ERROR_SOURCES_MAX);
> };      
>
> Bonus points to go back and clean up the CPER code to do the same to
> reduce the amount of "registration" APIs.

Ok.

>> +
>> +void cxl_register_proto_err_work(struct work_struct *work)
>> +{
>> +	guard(spinlock)(&cxl_proto_err_fifo_lock);
> This lock acquisition is not protecting anything. 'unsigned long'
> assignments are already atomic and forward_cxl_error() looks like it
> happily de-references NULL pointers without checking the lock.
>
> I would make it an rwsem. Hold the rwsem for write at registration /
> unregistration...

Ok.

>> +	cxl_proto_err_work = work;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_register_proto_err_work, "CXL");
>> +
>> +void cxl_unregister_proto_err_work(void)
>> +{
>> +	guard(spinlock)(&cxl_proto_err_fifo_lock);
>> +	cxl_proto_err_work = NULL;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_unregister_proto_err_work, "CXL");
>> +
>> +int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd)
>> +{
>> +	return kfifo_get(&cxl_proto_err_fifo, wd);
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_proto_err_kfifo_get, "CXL");
>> +
>> +void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info)
>> +{
>> +	struct cxl_proto_err_work_data wd;
>> +
>> +	wd.err_info = (struct cxl_proto_error_info) {
>> +		.severity = aer_err_info->severity,
>> +		.devfn = pdev->devfn,
>> +		.bus = pdev->bus->number,
>> +		.segment = pci_domain_nr(pdev->bus)
>> +	};
> ...hold the rwsem for read when de-referencing a 'struct
> cxl_prot_err_work *'
>
>> +
>> +	if (!kfifo_put(&cxl_proto_err_fifo, wd)) {
>> +		dev_err_ratelimited(&pdev->dev, "CXL kfifo overflow\n");
> In the case that 'struct cxl_prot_err_work *' is NULL, perhaps this
> should be a dev_warn_once() to say "hey, we're seeing CXL errors, but
> nobody registered the CXL core!?".

Ok.

>> +		return;
>> +	}
>> +
>> +	schedule_work(cxl_proto_err_work);
>> +}
>> +
>> diff --git a/include/linux/aer.h b/include/linux/aer.h
>> index 02940be66324..24c3d9e18ad5 100644
>> --- a/include/linux/aer.h
>> +++ b/include/linux/aer.h
>> @@ -10,6 +10,7 @@
>>  
>>  #include <linux/errno.h>
>>  #include <linux/types.h>
>> +#include <linux/workqueue_types.h>
>>  
>>  #define AER_NONFATAL			0
>>  #define AER_FATAL			1
>> @@ -53,6 +54,26 @@ struct aer_capability_regs {
>>  	u16 uncor_err_source;
>>  };
>>  
>> +/**
>> + * struct cxl_proto_err_info - Error information used in CXL error handling
>> + * @severity: AER severity
>> + * @function: Device's PCI function
>> + * @device: Device's PCI device
>> + * @bus: Device's PCI bus
>> + * @segment: Device's PCI segment
>> + */
>> +struct cxl_proto_error_info {
>> +	int severity;
>> +
>> +	u8 devfn;
>> +	u8 bus;
>> +	u16 segment;
>> +};
>> +
>> +struct cxl_proto_err_work_data {
>> +	struct cxl_proto_error_info err_info;
>> +};
> Why not use cxl_proto_error_info directly?
At one point I thought there was a good reason for using it later in another case.
I'll use cxl_proto_error_info directly.

-Terry

Re: [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
Posted by dan.j.williams@intel.com 2 months, 2 weeks ago
Bowman, Terry wrote:
> On 7/23/2025 9:01 PM, dan.j.williams@intel.com wrote:
> > Terry Bowman wrote:
> >> CXL error handling will soon be moved from the AER driver into the CXL
> >> driver. This requires a notification mechanism for the AER driver to share
> >> the AER interrupt with the CXL driver. The notification will be used
> >> as an indication for the CXL drivers to handle and log the CXL RAS errors.
> >>
> >> First, introduce cxl/core/native_ras.c to contain changes for the CXL
> >> driver's RAS native handling. This as an alternative to dropping the
> >> changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
> >> Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
> >> conditionally compile the new file.
> > I see no daylight between CXL_NATIVE_RAS and PCIEAER_CXL, one of those
> > needs to subsume the other. I also do not understand the point of
> > "NATIVE" in the name. Will not CPER notified protocol errors be routed
> > to the same CXL error handling infrastructure as AER notified protocol
> > errors? I.e. the aer_recover_queue() path?
> 
> This change and comment is planned to be removed in v11. Instead of introducing this 
> as a new file. The same changes will instead be added to pci_aer.c/pci_ras.c Dave Jiang
> is introducing here:
> 
> https://lore.kernel.org/linux-cxl/20250721170415.285961-1-dave.jiang@intel.com/

Lets just put it all in cxl/core/ras.c, I don't think we need to have
fine grained file distinctions between the "native", "aer", and "ras
component registers" cases.

"Want CXL error handling? Need ras.c."

[..]
<trim three pages of context that had no new comments to respond, please
 trim your replies>

> >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> >> index 0b4d721980ef..8417a49c71f2 100644
> >> --- a/drivers/pci/pcie/aer.c
> >> +++ b/drivers/pci/pcie/aer.c
> >> @@ -1130,8 +1130,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
> >>  
> >>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
> >>  {
> >> -	cxl_rch_handle_error(dev, info);
> > No, can not just drop what was working before, even if you restore the
> > functionality in a later patch in the same series.
> >
> > I would expect that this patch at a minimum maintains RCH handling and
> > forwards anything else to the CXL core for VH handling.
> 
> You want RCH handling to stay here in the AER driver or does the patch changes need 
> to be reworked to present the change better?

Stay here. The cost of exporting PCI core functionality for this one-off
seems not worth it.

> >> -	pci_aer_handle_error(dev, info);
> >> +	if (is_cxl_error(dev, info))
> >> +		forward_cxl_error(dev, info);
> >> +	else
> >> +		pci_aer_handle_error(dev, info);
> >> +
> >>  	pci_dev_put(dev);
> >>  }
> >>  
> >> diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
> >> index b2ea14f70055..846ab55d747c 100644
> >> --- a/drivers/pci/pcie/cxl_aer.c
> >> +++ b/drivers/pci/pcie/cxl_aer.c
> > With the RCH bits moved to its own file then this file would be 100%
> > concerned with typical CXL VH error handling and deserve to carry the
> > "cxl_aer.c" name.
> The plan was to move all the handling to cxl/core/pci_aer.c or whatever it is renamed.

It would need more justification to overcome the perception of "new
exports for code that is already on a deprecation watch"

[..]
> >> diff --git a/include/linux/aer.h b/include/linux/aer.h
> >> index 02940be66324..24c3d9e18ad5 100644
> >> --- a/include/linux/aer.h
> >> +++ b/include/linux/aer.h
> >> @@ -10,6 +10,7 @@
> >>  
> >>  #include <linux/errno.h>
> >>  #include <linux/types.h>
> >> +#include <linux/workqueue_types.h>
> >>  
> >>  #define AER_NONFATAL			0
> >>  #define AER_FATAL			1
> >> @@ -53,6 +54,26 @@ struct aer_capability_regs {
> >>  	u16 uncor_err_source;
> >>  };
> >>  
> >> +/**
> >> + * struct cxl_proto_err_info - Error information used in CXL error handling
> >> + * @severity: AER severity
> >> + * @function: Device's PCI function
> >> + * @device: Device's PCI device
> >> + * @bus: Device's PCI bus
> >> + * @segment: Device's PCI segment
> >> + */
> >> +struct cxl_proto_error_info {
> >> +	int severity;
> >> +
> >> +	u8 devfn;
> >> +	u8 bus;
> >> +	u16 segment;
> >> +};
> >> +
> >> +struct cxl_proto_err_work_data {
> >> +	struct cxl_proto_error_info err_info;
> >> +};
> > Why not use cxl_proto_error_info directly?
> At one point I thought there was a good reason for using it later in another case.
> I'll use cxl_proto_error_info directly.

Maybe that was more the CPER case where the CXL standard mailbox payload
structure is wrapped with a CPER-record envelope? In any event, I think
it is safe to drop that indirection.
Re: [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
Posted by Dave Jiang 3 months, 1 week ago

On 6/26/25 3:42 PM, Terry Bowman wrote:
> CXL error handling will soon be moved from the AER driver into the CXL
> driver. This requires a notification mechanism for the AER driver to share
> the AER interrupt with the CXL driver. The notification will be used
> as an indication for the CXL drivers to handle and log the CXL RAS errors.
> 
> First, introduce cxl/core/native_ras.c to contain changes for the CXL
> driver's RAS native handling. This as an alternative to dropping the
> changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
> Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
> conditionally compile the new file.
> 
> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
> driver will be the sole kfifo producer adding work and the cxl_core will be
> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
> 
> Add CXL work queue handler registration functions in the AER driver. Export
> the functions allowing CXL driver to access. Implement registration
> functions for the CXL driver to assign or clear the work handler function.
> 
> Introduce 'struct cxl_proto_err_info' to serve as the kfifo work data. This
> will contain the erring device's PCI SBDF details used to rediscover the
> device after the CXL driver dequeues the kfifo work. The device rediscovery
> will be introduced along with the CXL handling in future patches.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/Kconfig           | 14 ++++++++
>  drivers/cxl/core/Makefile     |  1 +
>  drivers/cxl/core/core.h       |  8 +++++
>  drivers/cxl/core/native_ras.c | 26 +++++++++++++++

With the addition of a new file to cxl_core, can you please also fix up tools/testing/cxl/Kbuild?

DJ

>  drivers/cxl/core/port.c       |  2 ++
>  drivers/cxl/core/ras.c        |  1 +
>  drivers/cxl/cxlpci.h          |  1 +
>  drivers/pci/pci.h             |  4 +++
>  drivers/pci/pcie/aer.c        |  7 ++--
>  drivers/pci/pcie/cxl_aer.c    | 60 +++++++++++++++++++++++++++++++++++
>  include/linux/aer.h           | 31 ++++++++++++++++++
>  11 files changed, 153 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/cxl/core/native_ras.c
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index 48b7314afdb8..57274de54a45 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -233,4 +233,18 @@ config CXL_MCE
>  	def_bool y
>  	depends on X86_MCE && MEMORY_FAILURE
>  
> +config CXL_NATIVE_RAS
> +	bool "CXL: Enable CXL RAS native handling"
> +	depends on PCIEAER_CXL
> +	default CXL_BUS
> +	help
> +	  Enable native CXL RAS protocol error handling and logging in the CXL
> +	  drivers. This functionality relies on the AER service driver being
> +	  enabled, as the AER interrupt is used to inform the operating system
> +	  of CXL RAS protocol errors. The platform must be configured to
> +	  utilize AER reporting for interrupts.
> +
> +	  If unsure, or if this kernel is meant for production environments,
> +	  say Y.
> +
>  endif
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index 79e2ef81fde8..16f5832e5cc4 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -21,3 +21,4 @@ cxl_core-$(CONFIG_CXL_REGION) += region.o
>  cxl_core-$(CONFIG_CXL_MCE) += mce.o
>  cxl_core-$(CONFIG_CXL_FEATURES) += features.o
>  cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
> +cxl_core-$(CONFIG_CXL_NATIVE_RAS) += native_ras.o
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 29b61828a847..4c08bb92e2f9 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -123,6 +123,14 @@ int cxl_gpf_port_setup(struct cxl_dport *dport);
>  int cxl_acpi_get_extended_linear_cache_size(struct resource *backing_res,
>  					    int nid, resource_size_t *size);
>  
> +#ifdef CONFIG_PCIEAER_CXL
> +void cxl_native_ras_init(void);
> +void cxl_native_ras_exit(void);
> +#else
> +static inline void cxl_native_ras_init(void) { };
> +static inline void cxl_native_ras_exit(void) { };
> +#endif
> +
>  #ifdef CONFIG_CXL_FEATURES
>  struct cxl_feat_entry *
>  cxl_feature_info(struct cxl_features_state *cxlfs, const uuid_t *uuid);
> diff --git a/drivers/cxl/core/native_ras.c b/drivers/cxl/core/native_ras.c
> new file mode 100644
> index 000000000000..011815ddaae3
> --- /dev/null
> +++ b/drivers/cxl/core/native_ras.c
> @@ -0,0 +1,26 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
> +
> +#include <linux/pci.h>
> +#include <linux/aer.h>
> +#include <cxl/event.h>
> +#include <cxlmem.h>
> +#include <core/core.h>
> +
> +static void cxl_proto_err_work_fn(struct work_struct *work)
> +{
> +}
> +
> +static struct work_struct cxl_proto_err_work;
> +static DECLARE_WORK(cxl_proto_err_work, cxl_proto_err_work_fn);
> +
> +void cxl_native_ras_init(void)
> +{
> +	cxl_register_proto_err_work(&cxl_proto_err_work);
> +}
> +
> +void cxl_native_ras_exit(void)
> +{
> +	cxl_unregister_proto_err_work();
> +	cancel_work_sync(&cxl_proto_err_work);
> +}
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index eb46c6764d20..8e8f21197c86 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -2345,6 +2345,8 @@ static __init int cxl_core_init(void)
>  	if (rc)
>  		goto err_ras;
>  
> +	cxl_native_ras_init();
> +
>  	return 0;
>  
>  err_ras:
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 485a831695c7..962dc94fed8c 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -5,6 +5,7 @@
>  #include <linux/aer.h>
>  #include <cxl/event.h>
>  #include <cxlmem.h>
> +#include <cxlpci.h>
>  #include "trace.h"
>  
>  static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 54e219b0049e..6f1396ef7b77 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -4,6 +4,7 @@
>  #define __CXL_PCI_H__
>  #include <linux/pci.h>
>  #include "cxl.h"
> +#include "linux/aer.h"
>  
>  #define CXL_MEMORY_PROGIF	0x10
>  
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 91b583cf18eb..29c11c7136d3 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -1032,9 +1032,13 @@ static inline void pci_restore_aer_state(struct pci_dev *dev) { }
>  #ifdef CONFIG_PCIEAER_CXL
>  void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info);
>  void cxl_rch_enable_rcec(struct pci_dev *rcec);
> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info);
> +void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info);
>  #else
>  static inline void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) { }
>  static inline void cxl_rch_enable_rcec(struct pci_dev *rcec) { }
> +static inline bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info) { return false; }
> +static inline void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info) { }
>  #endif
>  
>  #ifdef CONFIG_ACPI
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 0b4d721980ef..8417a49c71f2 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1130,8 +1130,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  
>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>  {
> -	cxl_rch_handle_error(dev, info);
> -	pci_aer_handle_error(dev, info);
> +	if (is_cxl_error(dev, info))
> +		forward_cxl_error(dev, info);
> +	else
> +		pci_aer_handle_error(dev, info);
> +
>  	pci_dev_put(dev);
>  }
>  
> diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
> index b2ea14f70055..846ab55d747c 100644
> --- a/drivers/pci/pcie/cxl_aer.c
> +++ b/drivers/pci/pcie/cxl_aer.c
> @@ -3,8 +3,11 @@
>  
>  #include <linux/pci.h>
>  #include <linux/aer.h>
> +#include <linux/kfifo.h>
>  #include "../pci.h"
>  
> +#define CXL_ERROR_SOURCES_MAX          128
> +
>  /**
>   * pci_aer_unmask_internal_errors - unmask internal errors
>   * @dev: pointer to the pci_dev data structure
> @@ -64,6 +67,19 @@ static bool is_internal_error(struct aer_err_info *info)
>  	return info->status & PCI_ERR_UNC_INTN;
>  }
>  
> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
> +{
> +	if (!info || !info->is_cxl)
> +		return false;
> +
> +	/* Only CXL Endpoints are currently supported */
> +	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_EC))
> +		return false;
> +
> +	return is_internal_error(info);
> +}
> +
>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>  {
>  	struct aer_err_info *info = (struct aer_err_info *)data;
> @@ -136,3 +152,47 @@ void cxl_rch_enable_rcec(struct pci_dev *rcec)
>  	pci_info(rcec, "CXL: Internal errors unmasked");
>  }
>  
> +static DEFINE_KFIFO(cxl_proto_err_fifo, struct cxl_proto_err_work_data,
> +		    CXL_ERROR_SOURCES_MAX);
> +static DEFINE_SPINLOCK(cxl_proto_err_fifo_lock);
> +struct work_struct *cxl_proto_err_work;
> +
> +void cxl_register_proto_err_work(struct work_struct *work)
> +{
> +	guard(spinlock)(&cxl_proto_err_fifo_lock);
> +	cxl_proto_err_work = work;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_register_proto_err_work, "CXL");
> +
> +void cxl_unregister_proto_err_work(void)
> +{
> +	guard(spinlock)(&cxl_proto_err_fifo_lock);
> +	cxl_proto_err_work = NULL;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_unregister_proto_err_work, "CXL");
> +
> +int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd)
> +{
> +	return kfifo_get(&cxl_proto_err_fifo, wd);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_proto_err_kfifo_get, "CXL");
> +
> +void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info)
> +{
> +	struct cxl_proto_err_work_data wd;
> +
> +	wd.err_info = (struct cxl_proto_error_info) {
> +		.severity = aer_err_info->severity,
> +		.devfn = pdev->devfn,
> +		.bus = pdev->bus->number,
> +		.segment = pci_domain_nr(pdev->bus)
> +	};
> +
> +	if (!kfifo_put(&cxl_proto_err_fifo, wd)) {
> +		dev_err_ratelimited(&pdev->dev, "CXL kfifo overflow\n");
> +		return;
> +	}
> +
> +	schedule_work(cxl_proto_err_work);
> +}
> +
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 02940be66324..24c3d9e18ad5 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -10,6 +10,7 @@
>  
>  #include <linux/errno.h>
>  #include <linux/types.h>
> +#include <linux/workqueue_types.h>
>  
>  #define AER_NONFATAL			0
>  #define AER_FATAL			1
> @@ -53,6 +54,26 @@ struct aer_capability_regs {
>  	u16 uncor_err_source;
>  };
>  
> +/**
> + * struct cxl_proto_err_info - Error information used in CXL error handling
> + * @severity: AER severity
> + * @function: Device's PCI function
> + * @device: Device's PCI device
> + * @bus: Device's PCI bus
> + * @segment: Device's PCI segment
> + */
> +struct cxl_proto_error_info {
> +	int severity;
> +
> +	u8 devfn;
> +	u8 bus;
> +	u16 segment;
> +};
> +
> +struct cxl_proto_err_work_data {
> +	struct cxl_proto_error_info err_info;
> +};
> +
>  #if defined(CONFIG_PCIEAER)
>  int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
>  int pcie_aer_is_native(struct pci_dev *dev);
> @@ -64,6 +85,16 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
>  static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
>  #endif
>  
> +#if defined(CONFIG_PCIEAER_CXL)
> +void cxl_register_proto_err_work(struct work_struct *work);
> +void cxl_unregister_proto_err_work(void);
> +int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd);
> +#else
> +static inline void cxl_register_proto_err_work(struct work_struct *work) { }
> +static inline void cxl_unregister_proto_err_work(void) { }
> +static inline int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd) { return 0; }
> +#endif
> +
>  void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  		    struct aer_capability_regs *aer);
>  int cper_severity_to_aer(int cper_severity);
Re: [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
Posted by Bowman, Terry 3 months, 1 week ago

On 7/1/2025 4:53 PM, Dave Jiang wrote:
>
> On 6/26/25 3:42 PM, Terry Bowman wrote:
>> CXL error handling will soon be moved from the AER driver into the CXL
>> driver. This requires a notification mechanism for the AER driver to share
>> the AER interrupt with the CXL driver. The notification will be used
>> as an indication for the CXL drivers to handle and log the CXL RAS errors.
>>
>> First, introduce cxl/core/native_ras.c to contain changes for the CXL
>> driver's RAS native handling. This as an alternative to dropping the
>> changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
>> Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
>> conditionally compile the new file.
>>
>> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
>> driver will be the sole kfifo producer adding work and the cxl_core will be
>> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
>>
>> Add CXL work queue handler registration functions in the AER driver. Export
>> the functions allowing CXL driver to access. Implement registration
>> functions for the CXL driver to assign or clear the work handler function.
>>
>> Introduce 'struct cxl_proto_err_info' to serve as the kfifo work data. This
>> will contain the erring device's PCI SBDF details used to rediscover the
>> device after the CXL driver dequeues the kfifo work. The device rediscovery
>> will be introduced along with the CXL handling in future patches.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>>  drivers/cxl/Kconfig           | 14 ++++++++
>>  drivers/cxl/core/Makefile     |  1 +
>>  drivers/cxl/core/core.h       |  8 +++++
>>  drivers/cxl/core/native_ras.c | 26 +++++++++++++++
> With the addition of a new file to cxl_core, can you please also fix up tools/testing/cxl/Kbuild?
>
> DJ
Hi Dave,

Yes, I'll update the CXL test's Kbuild.

-Terry

>>  drivers/cxl/core/port.c       |  2 ++
>>  drivers/cxl/core/ras.c        |  1 +
>>  drivers/cxl/cxlpci.h          |  1 +
>>  drivers/pci/pci.h             |  4 +++
>>  drivers/pci/pcie/aer.c        |  7 ++--
>>  drivers/pci/pcie/cxl_aer.c    | 60 +++++++++++++++++++++++++++++++++++
>>  include/linux/aer.h           | 31 ++++++++++++++++++
>>  11 files changed, 153 insertions(+), 2 deletions(-)
>>  create mode 100644 drivers/cxl/core/native_ras.c
>>
>> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
>> index 48b7314afdb8..57274de54a45 100644
>> --- a/drivers/cxl/Kconfig
>> +++ b/drivers/cxl/Kconfig
>> @@ -233,4 +233,18 @@ config CXL_MCE
>>  	def_bool y
>>  	depends on X86_MCE && MEMORY_FAILURE
>>  
>> +config CXL_NATIVE_RAS
>> +	bool "CXL: Enable CXL RAS native handling"
>> +	depends on PCIEAER_CXL
>> +	default CXL_BUS
>> +	help
>> +	  Enable native CXL RAS protocol error handling and logging in the CXL
>> +	  drivers. This functionality relies on the AER service driver being
>> +	  enabled, as the AER interrupt is used to inform the operating system
>> +	  of CXL RAS protocol errors. The platform must be configured to
>> +	  utilize AER reporting for interrupts.
>> +
>> +	  If unsure, or if this kernel is meant for production environments,
>> +	  say Y.
>> +
>>  endif
>> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
>> index 79e2ef81fde8..16f5832e5cc4 100644
>> --- a/drivers/cxl/core/Makefile
>> +++ b/drivers/cxl/core/Makefile
>> @@ -21,3 +21,4 @@ cxl_core-$(CONFIG_CXL_REGION) += region.o
>>  cxl_core-$(CONFIG_CXL_MCE) += mce.o
>>  cxl_core-$(CONFIG_CXL_FEATURES) += features.o
>>  cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
>> +cxl_core-$(CONFIG_CXL_NATIVE_RAS) += native_ras.o
>> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
>> index 29b61828a847..4c08bb92e2f9 100644
>> --- a/drivers/cxl/core/core.h
>> +++ b/drivers/cxl/core/core.h
>> @@ -123,6 +123,14 @@ int cxl_gpf_port_setup(struct cxl_dport *dport);
>>  int cxl_acpi_get_extended_linear_cache_size(struct resource *backing_res,
>>  					    int nid, resource_size_t *size);
>>  
>> +#ifdef CONFIG_PCIEAER_CXL
>> +void cxl_native_ras_init(void);
>> +void cxl_native_ras_exit(void);
>> +#else
>> +static inline void cxl_native_ras_init(void) { };
>> +static inline void cxl_native_ras_exit(void) { };
>> +#endif
>> +
>>  #ifdef CONFIG_CXL_FEATURES
>>  struct cxl_feat_entry *
>>  cxl_feature_info(struct cxl_features_state *cxlfs, const uuid_t *uuid);
>> diff --git a/drivers/cxl/core/native_ras.c b/drivers/cxl/core/native_ras.c
>> new file mode 100644
>> index 000000000000..011815ddaae3
>> --- /dev/null
>> +++ b/drivers/cxl/core/native_ras.c
>> @@ -0,0 +1,26 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
>> +
>> +#include <linux/pci.h>
>> +#include <linux/aer.h>
>> +#include <cxl/event.h>
>> +#include <cxlmem.h>
>> +#include <core/core.h>
>> +
>> +static void cxl_proto_err_work_fn(struct work_struct *work)
>> +{
>> +}
>> +
>> +static struct work_struct cxl_proto_err_work;
>> +static DECLARE_WORK(cxl_proto_err_work, cxl_proto_err_work_fn);
>> +
>> +void cxl_native_ras_init(void)
>> +{
>> +	cxl_register_proto_err_work(&cxl_proto_err_work);
>> +}
>> +
>> +void cxl_native_ras_exit(void)
>> +{
>> +	cxl_unregister_proto_err_work();
>> +	cancel_work_sync(&cxl_proto_err_work);
>> +}
>> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
>> index eb46c6764d20..8e8f21197c86 100644
>> --- a/drivers/cxl/core/port.c
>> +++ b/drivers/cxl/core/port.c
>> @@ -2345,6 +2345,8 @@ static __init int cxl_core_init(void)
>>  	if (rc)
>>  		goto err_ras;
>>  
>> +	cxl_native_ras_init();
>> +
>>  	return 0;
>>  
>>  err_ras:
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index 485a831695c7..962dc94fed8c 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
>> @@ -5,6 +5,7 @@
>>  #include <linux/aer.h>
>>  #include <cxl/event.h>
>>  #include <cxlmem.h>
>> +#include <cxlpci.h>
>>  #include "trace.h"
>>  
>>  static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
>> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
>> index 54e219b0049e..6f1396ef7b77 100644
>> --- a/drivers/cxl/cxlpci.h
>> +++ b/drivers/cxl/cxlpci.h
>> @@ -4,6 +4,7 @@
>>  #define __CXL_PCI_H__
>>  #include <linux/pci.h>
>>  #include "cxl.h"
>> +#include "linux/aer.h"
>>  
>>  #define CXL_MEMORY_PROGIF	0x10
>>  
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index 91b583cf18eb..29c11c7136d3 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -1032,9 +1032,13 @@ static inline void pci_restore_aer_state(struct pci_dev *dev) { }
>>  #ifdef CONFIG_PCIEAER_CXL
>>  void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info);
>>  void cxl_rch_enable_rcec(struct pci_dev *rcec);
>> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info);
>> +void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info);
>>  #else
>>  static inline void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) { }
>>  static inline void cxl_rch_enable_rcec(struct pci_dev *rcec) { }
>> +static inline bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info) { return false; }
>> +static inline void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info) { }
>>  #endif
>>  
>>  #ifdef CONFIG_ACPI
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 0b4d721980ef..8417a49c71f2 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -1130,8 +1130,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>  
>>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>>  {
>> -	cxl_rch_handle_error(dev, info);
>> -	pci_aer_handle_error(dev, info);
>> +	if (is_cxl_error(dev, info))
>> +		forward_cxl_error(dev, info);
>> +	else
>> +		pci_aer_handle_error(dev, info);
>> +
>>  	pci_dev_put(dev);
>>  }
>>  
>> diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
>> index b2ea14f70055..846ab55d747c 100644
>> --- a/drivers/pci/pcie/cxl_aer.c
>> +++ b/drivers/pci/pcie/cxl_aer.c
>> @@ -3,8 +3,11 @@
>>  
>>  #include <linux/pci.h>
>>  #include <linux/aer.h>
>> +#include <linux/kfifo.h>
>>  #include "../pci.h"
>>  
>> +#define CXL_ERROR_SOURCES_MAX          128
>> +
>>  /**
>>   * pci_aer_unmask_internal_errors - unmask internal errors
>>   * @dev: pointer to the pci_dev data structure
>> @@ -64,6 +67,19 @@ static bool is_internal_error(struct aer_err_info *info)
>>  	return info->status & PCI_ERR_UNC_INTN;
>>  }
>>  
>> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
>> +{
>> +	if (!info || !info->is_cxl)
>> +		return false;
>> +
>> +	/* Only CXL Endpoints are currently supported */
>> +	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
>> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_EC))
>> +		return false;
>> +
>> +	return is_internal_error(info);
>> +}
>> +
>>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>  {
>>  	struct aer_err_info *info = (struct aer_err_info *)data;
>> @@ -136,3 +152,47 @@ void cxl_rch_enable_rcec(struct pci_dev *rcec)
>>  	pci_info(rcec, "CXL: Internal errors unmasked");
>>  }
>>  
>> +static DEFINE_KFIFO(cxl_proto_err_fifo, struct cxl_proto_err_work_data,
>> +		    CXL_ERROR_SOURCES_MAX);
>> +static DEFINE_SPINLOCK(cxl_proto_err_fifo_lock);
>> +struct work_struct *cxl_proto_err_work;
>> +
>> +void cxl_register_proto_err_work(struct work_struct *work)
>> +{
>> +	guard(spinlock)(&cxl_proto_err_fifo_lock);
>> +	cxl_proto_err_work = work;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_register_proto_err_work, "CXL");
>> +
>> +void cxl_unregister_proto_err_work(void)
>> +{
>> +	guard(spinlock)(&cxl_proto_err_fifo_lock);
>> +	cxl_proto_err_work = NULL;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_unregister_proto_err_work, "CXL");
>> +
>> +int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd)
>> +{
>> +	return kfifo_get(&cxl_proto_err_fifo, wd);
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_proto_err_kfifo_get, "CXL");
>> +
>> +void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info)
>> +{
>> +	struct cxl_proto_err_work_data wd;
>> +
>> +	wd.err_info = (struct cxl_proto_error_info) {
>> +		.severity = aer_err_info->severity,
>> +		.devfn = pdev->devfn,
>> +		.bus = pdev->bus->number,
>> +		.segment = pci_domain_nr(pdev->bus)
>> +	};
>> +
>> +	if (!kfifo_put(&cxl_proto_err_fifo, wd)) {
>> +		dev_err_ratelimited(&pdev->dev, "CXL kfifo overflow\n");
>> +		return;
>> +	}
>> +
>> +	schedule_work(cxl_proto_err_work);
>> +}
>> +
>> diff --git a/include/linux/aer.h b/include/linux/aer.h
>> index 02940be66324..24c3d9e18ad5 100644
>> --- a/include/linux/aer.h
>> +++ b/include/linux/aer.h
>> @@ -10,6 +10,7 @@
>>  
>>  #include <linux/errno.h>
>>  #include <linux/types.h>
>> +#include <linux/workqueue_types.h>
>>  
>>  #define AER_NONFATAL			0
>>  #define AER_FATAL			1
>> @@ -53,6 +54,26 @@ struct aer_capability_regs {
>>  	u16 uncor_err_source;
>>  };
>>  
>> +/**
>> + * struct cxl_proto_err_info - Error information used in CXL error handling
>> + * @severity: AER severity
>> + * @function: Device's PCI function
>> + * @device: Device's PCI device
>> + * @bus: Device's PCI bus
>> + * @segment: Device's PCI segment
>> + */
>> +struct cxl_proto_error_info {
>> +	int severity;
>> +
>> +	u8 devfn;
>> +	u8 bus;
>> +	u16 segment;
>> +};
>> +
>> +struct cxl_proto_err_work_data {
>> +	struct cxl_proto_error_info err_info;
>> +};
>> +
>>  #if defined(CONFIG_PCIEAER)
>>  int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
>>  int pcie_aer_is_native(struct pci_dev *dev);
>> @@ -64,6 +85,16 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
>>  static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
>>  #endif
>>  
>> +#if defined(CONFIG_PCIEAER_CXL)
>> +void cxl_register_proto_err_work(struct work_struct *work);
>> +void cxl_unregister_proto_err_work(void);
>> +int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd);
>> +#else
>> +static inline void cxl_register_proto_err_work(struct work_struct *work) { }
>> +static inline void cxl_unregister_proto_err_work(void) { }
>> +static inline int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd) { return 0; }
>> +#endif
>> +
>>  void pci_print_aer(struct pci_dev *dev, int aer_severity,
>>  		    struct aer_capability_regs *aer);
>>  int cper_severity_to_aer(int cper_severity);
Re: [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
Posted by Jonathan Cameron 3 months, 1 week ago
On Thu, 26 Jun 2025 17:42:40 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> CXL error handling will soon be moved from the AER driver into the CXL
> driver. This requires a notification mechanism for the AER driver to share
> the AER interrupt with the CXL driver. The notification will be used
> as an indication for the CXL drivers to handle and log the CXL RAS errors.
> 
> First, introduce cxl/core/native_ras.c to contain changes for the CXL
> driver's RAS native handling. This as an alternative to dropping the
> changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
> Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
> conditionally compile the new file.
> 
> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
> driver will be the sole kfifo producer adding work and the cxl_core will be
> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
> 
> Add CXL work queue handler registration functions in the AER driver. Export
> the functions allowing CXL driver to access. Implement registration
> functions for the CXL driver to assign or clear the work handler function.
> 
> Introduce 'struct cxl_proto_err_info' to serve as the kfifo work data. This
> will contain the erring device's PCI SBDF details used to rediscover the
> device after the CXL driver dequeues the kfifo work. The device rediscovery
> will be introduced along with the CXL handling in future patches.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Hi Terry,

Whilst it obviously makes patch preparation a bit more time consuming
for series like this with many patches it can be useful to add a brief
change log to the individual patches as well as the cover letter.
That helps reviewers figure out where they need to look again.

A few trivial things inline.

With those fixed up
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

Jonathan


> ---
>  drivers/cxl/Kconfig           | 14 ++++++++
>  drivers/cxl/core/Makefile     |  1 +
>  drivers/cxl/core/core.h       |  8 +++++
>  drivers/cxl/core/native_ras.c | 26 +++++++++++++++
>  drivers/cxl/core/port.c       |  2 ++
>  drivers/cxl/core/ras.c        |  1 +
>  drivers/cxl/cxlpci.h          |  1 +
>  drivers/pci/pci.h             |  4 +++
>  drivers/pci/pcie/aer.c        |  7 ++--
>  drivers/pci/pcie/cxl_aer.c    | 60 +++++++++++++++++++++++++++++++++++
>  include/linux/aer.h           | 31 ++++++++++++++++++
>  11 files changed, 153 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/cxl/core/native_ras.c


>  static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 54e219b0049e..6f1396ef7b77 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -4,6 +4,7 @@
>  #define __CXL_PCI_H__
>  #include <linux/pci.h>
>  #include "cxl.h"
> +#include "linux/aer.h"

Why?  There are no changes in this header other than the include and the changes
to linux/aer.h are new stuff so I can't see how it becomes necessary if it
wasn't before.

Might well have always been missing and should have been here. If so separate
patch to tidy that up.

>  
>  #define CXL_MEMORY_PROGIF	0x10
>  


> diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
> index b2ea14f70055..846ab55d747c 100644
> --- a/drivers/pci/pcie/cxl_aer.c
> +++ b/drivers/pci/pcie/cxl_aer.c

>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>  {
>  	struct aer_err_info *info = (struct aer_err_info *)data;
> @@ -136,3 +152,47 @@ void cxl_rch_enable_rcec(struct pci_dev *rcec)
>  	pci_info(rcec, "CXL: Internal errors unmasked");
>  }
>  
> +static DEFINE_KFIFO(cxl_proto_err_fifo, struct cxl_proto_err_work_data,
> +		    CXL_ERROR_SOURCES_MAX);
> +static DEFINE_SPINLOCK(cxl_proto_err_fifo_lock);
> +struct work_struct *cxl_proto_err_work;

I'm not seeing a declaration for this in the headers, so can it be static?

This is made a little more confusing as in this patch we have both
a structure called cxl_proto_err_work and a pointer to it with exactly the
same name.  Maybe rename this so it's subtly different.  cxl_protocol_err_work
or something silly like that just to make reviewers life a tiny bit easier!

> +

> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 02940be66324..24c3d9e18ad5 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -10,6 +10,7 @@
>  
>  #include <linux/errno.h>
>  #include <linux/types.h>
> +#include <linux/workqueue_types.h>
>  
>  #define AER_NONFATAL			0
>  #define AER_FATAL			1
> @@ -53,6 +54,26 @@ struct aer_capability_regs {
>  	u16 uncor_err_source;
>  };
>  
> +/**
> + * struct cxl_proto_err_info - Error information used in CXL error handling
> + * @severity: AER severity
> + * @function: Device's PCI function

Run kernel-doc over the files and fix errors / warning.
Missed updating this to devfn which it would have shouted about.

> + * @device: Device's PCI device
> + * @bus: Device's PCI bus
> + * @segment: Device's PCI segment
> + */
> +struct cxl_proto_error_info {
> +	int severity;
> +
> +	u8 devfn;
> +	u8 bus;
> +	u16 segment;
> +};
Re: [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
Posted by Bowman, Terry 3 months, 1 week ago

On 6/27/2025 5:24 AM, Jonathan Cameron wrote:
> On Thu, 26 Jun 2025 17:42:40 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> CXL error handling will soon be moved from the AER driver into the CXL
>> driver. This requires a notification mechanism for the AER driver to share
>> the AER interrupt with the CXL driver. The notification will be used
>> as an indication for the CXL drivers to handle and log the CXL RAS errors.
>>
>> First, introduce cxl/core/native_ras.c to contain changes for the CXL
>> driver's RAS native handling. This as an alternative to dropping the
>> changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
>> Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
>> conditionally compile the new file.
>>
>> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
>> driver will be the sole kfifo producer adding work and the cxl_core will be
>> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
>>
>> Add CXL work queue handler registration functions in the AER driver. Export
>> the functions allowing CXL driver to access. Implement registration
>> functions for the CXL driver to assign or clear the work handler function.
>>
>> Introduce 'struct cxl_proto_err_info' to serve as the kfifo work data. This
>> will contain the erring device's PCI SBDF details used to rediscover the
>> device after the CXL driver dequeues the kfifo work. The device rediscovery
>> will be introduced along with the CXL handling in future patches.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Hi Terry,
>
> Whilst it obviously makes patch preparation a bit more time consuming
> for series like this with many patches it can be useful to add a brief
> change log to the individual patches as well as the cover letter.
> That helps reviewers figure out where they need to look again.
>
> A few trivial things inline.
>
> With those fixed up
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>
> Jonathan

Hi Jonathan,

Do you have an example you can point me to with a change log in the
individual patch? I want to make certain I change correctly.

 
>
>> ---
>>  drivers/cxl/Kconfig           | 14 ++++++++
>>  drivers/cxl/core/Makefile     |  1 +
>>  drivers/cxl/core/core.h       |  8 +++++
>>  drivers/cxl/core/native_ras.c | 26 +++++++++++++++
>>  drivers/cxl/core/port.c       |  2 ++
>>  drivers/cxl/core/ras.c        |  1 +
>>  drivers/cxl/cxlpci.h          |  1 +
>>  drivers/pci/pci.h             |  4 +++
>>  drivers/pci/pcie/aer.c        |  7 ++--
>>  drivers/pci/pcie/cxl_aer.c    | 60 +++++++++++++++++++++++++++++++++++
>>  include/linux/aer.h           | 31 ++++++++++++++++++
>>  11 files changed, 153 insertions(+), 2 deletions(-)
>>  create mode 100644 drivers/cxl/core/native_ras.c
>
>>  static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
>> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
>> index 54e219b0049e..6f1396ef7b77 100644
>> --- a/drivers/cxl/cxlpci.h
>> +++ b/drivers/cxl/cxlpci.h
>> @@ -4,6 +4,7 @@
>>  #define __CXL_PCI_H__
>>  #include <linux/pci.h>
>>  #include "cxl.h"
>> +#include "linux/aer.h"
> Why?  There are no changes in this header other than the include and the changes
> to linux/aer.h are new stuff so I can't see how it becomes necessary if it
> wasn't before.
>
> Might well have always been missing and should have been here. If so separate
> patch to tidy that up.
You're correct, this can be removed and added later.

>>  
>>  #define CXL_MEMORY_PROGIF	0x10
>>  
>
>> diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
>> index b2ea14f70055..846ab55d747c 100644
>> --- a/drivers/pci/pcie/cxl_aer.c
>> +++ b/drivers/pci/pcie/cxl_aer.c
>>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>  {
>>  	struct aer_err_info *info = (struct aer_err_info *)data;
>> @@ -136,3 +152,47 @@ void cxl_rch_enable_rcec(struct pci_dev *rcec)
>>  	pci_info(rcec, "CXL: Internal errors unmasked");
>>  }
>>  
>> +static DEFINE_KFIFO(cxl_proto_err_fifo, struct cxl_proto_err_work_data,
>> +		    CXL_ERROR_SOURCES_MAX);
>> +static DEFINE_SPINLOCK(cxl_proto_err_fifo_lock);
>> +struct work_struct *cxl_proto_err_work;
> I'm not seeing a declaration for this in the headers, so can it be static?
>
> This is made a little more confusing as in this patch we have both
> a structure called cxl_proto_err_work and a pointer to it with exactly the
> same name.  Maybe rename this so it's subtly different.  cxl_protocol_err_work
> or something silly like that just to make reviewers life a tiny bit easier!
Yes, I'll make 'static' and rename to be cxl_protocol_err_work.

>> +
>> diff --git a/include/linux/aer.h b/include/linux/aer.h
>> index 02940be66324..24c3d9e18ad5 100644
>> --- a/include/linux/aer.h
>> +++ b/include/linux/aer.h
>> @@ -10,6 +10,7 @@
>>  
>>  #include <linux/errno.h>
>>  #include <linux/types.h>
>> +#include <linux/workqueue_types.h>
>>  
>>  #define AER_NONFATAL			0
>>  #define AER_FATAL			1
>> @@ -53,6 +54,26 @@ struct aer_capability_regs {
>>  	u16 uncor_err_source;
>>  };
>>  
>> +/**
>> + * struct cxl_proto_err_info - Error information used in CXL error handling
>> + * @severity: AER severity
>> + * @function: Device's PCI function
> Run kernel-doc over the files and fix errors / warning.
> Missed updating this to devfn which it would have shouted about.
I haven't used kernel-doc, obviously. :) Ill add that to the list of checks before sending. Thanks.

-Terry

>> + * @device: Device's PCI device
>> + * @bus: Device's PCI bus
>> + * @segment: Device's PCI segment
>> + */
>> +struct cxl_proto_error_info {
>> +	int severity;
>> +
>> +	u8 devfn;
>> +	u8 bus;
>> +	u16 segment;
>> +};

Re: [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
Posted by Jonathan Cameron 3 months, 1 week ago
On Wed, 2 Jul 2025 11:21:20 -0500
"Bowman, Terry" <terry.bowman@amd.com> wrote:

> On 6/27/2025 5:24 AM, Jonathan Cameron wrote:
> > On Thu, 26 Jun 2025 17:42:40 -0500
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >  
> >> CXL error handling will soon be moved from the AER driver into the CXL
> >> driver. This requires a notification mechanism for the AER driver to share
> >> the AER interrupt with the CXL driver. The notification will be used
> >> as an indication for the CXL drivers to handle and log the CXL RAS errors.
> >>
> >> First, introduce cxl/core/native_ras.c to contain changes for the CXL
> >> driver's RAS native handling. This as an alternative to dropping the
> >> changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
> >> Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
> >> conditionally compile the new file.
> >>
> >> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
> >> driver will be the sole kfifo producer adding work and the cxl_core will be
> >> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
> >>
> >> Add CXL work queue handler registration functions in the AER driver. Export
> >> the functions allowing CXL driver to access. Implement registration
> >> functions for the CXL driver to assign or clear the work handler function.
> >>
> >> Introduce 'struct cxl_proto_err_info' to serve as the kfifo work data. This
> >> will contain the erring device's PCI SBDF details used to rediscover the
> >> device after the CXL driver dequeues the kfifo work. The device rediscovery
> >> will be introduced along with the CXL handling in future patches.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>  
> > Hi Terry,
> >
> > Whilst it obviously makes patch preparation a bit more time consuming
> > for series like this with many patches it can be useful to add a brief
> > change log to the individual patches as well as the cover letter.
> > That helps reviewers figure out where they need to look again.
> >
> > A few trivial things inline.
> >
> > With those fixed up
> > Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> >
> > Jonathan  
> 
> Hi Jonathan,
> 
> Do you have an example you can point me to with a change log in the
> individual patch? I want to make certain I change correctly.
> 
>  
> >  
> >> ---

You just put it here.  No particular formatting standard for it other than making
sure its after the --- so it doesn't end up in the git log.
https://lore.kernel.org/all/2025070138-vigorous-negative-eae7@gregkh/

Is first example I scrolled down to.  This is a single patch series but
the principle is the same.
Re: [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
Posted by Dan Carpenter 3 months, 1 week ago
On Wed, Jul 02, 2025 at 11:21:20AM -0500, Bowman, Terry wrote:
> 
> 
> On 6/27/2025 5:24 AM, Jonathan Cameron wrote:
> > On Thu, 26 Jun 2025 17:42:40 -0500
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >
> >> CXL error handling will soon be moved from the AER driver into the CXL
> >> driver. This requires a notification mechanism for the AER driver to share
> >> the AER interrupt with the CXL driver. The notification will be used
> >> as an indication for the CXL drivers to handle and log the CXL RAS errors.
> >>
> >> First, introduce cxl/core/native_ras.c to contain changes for the CXL
> >> driver's RAS native handling. This as an alternative to dropping the
> >> changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
> >> Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
> >> conditionally compile the new file.
> >>
> >> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
> >> driver will be the sole kfifo producer adding work and the cxl_core will be
> >> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
> >>
> >> Add CXL work queue handler registration functions in the AER driver. Export
> >> the functions allowing CXL driver to access. Implement registration
> >> functions for the CXL driver to assign or clear the work handler function.
> >>
> >> Introduce 'struct cxl_proto_err_info' to serve as the kfifo work data. This
> >> will contain the erring device's PCI SBDF details used to rediscover the
> >> device after the CXL driver dequeues the kfifo work. The device rediscovery
> >> will be introduced along with the CXL handling in future patches.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> > Hi Terry,
> >
> > Whilst it obviously makes patch preparation a bit more time consuming
> > for series like this with many patches it can be useful to add a brief
> > change log to the individual patches as well as the cover letter.
> > That helps reviewers figure out where they need to look again.
> >
> > A few trivial things inline.
> >
> > With those fixed up
> > Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> >
> > Jonathan
> 
> Hi Jonathan,
> 
> Do you have an example you can point me to with a change log in the
> individual patch? I want to make certain I change correctly.
> 

https://staticthinking.wordpress.com/2022/07/27/how-to-send-a-v2-patch/

Just put a:
---
v2: white space changes

or whatever.

regards,
dan carpenter
Re: [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
Posted by Bowman, Terry 3 months, 1 week ago

On 7/2/2025 2:54 PM, Dan Carpenter wrote:
> On Wed, Jul 02, 2025 at 11:21:20AM -0500, Bowman, Terry wrote:
>>
>> On 6/27/2025 5:24 AM, Jonathan Cameron wrote:
>>> On Thu, 26 Jun 2025 17:42:40 -0500
>>> Terry Bowman <terry.bowman@amd.com> wrote:
>>>
>>>> CXL error handling will soon be moved from the AER driver into the CXL
>>>> driver. This requires a notification mechanism for the AER driver to share
>>>> the AER interrupt with the CXL driver. The notification will be used
>>>> as an indication for the CXL drivers to handle and log the CXL RAS errors.
>>>>
>>>> First, introduce cxl/core/native_ras.c to contain changes for the CXL
>>>> driver's RAS native handling. This as an alternative to dropping the
>>>> changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
>>>> Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
>>>> conditionally compile the new file.
>>>>
>>>> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
>>>> driver will be the sole kfifo producer adding work and the cxl_core will be
>>>> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
>>>>
>>>> Add CXL work queue handler registration functions in the AER driver. Export
>>>> the functions allowing CXL driver to access. Implement registration
>>>> functions for the CXL driver to assign or clear the work handler function.
>>>>
>>>> Introduce 'struct cxl_proto_err_info' to serve as the kfifo work data. This
>>>> will contain the erring device's PCI SBDF details used to rediscover the
>>>> device after the CXL driver dequeues the kfifo work. The device rediscovery
>>>> will be introduced along with the CXL handling in future patches.
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>> Hi Terry,
>>>
>>> Whilst it obviously makes patch preparation a bit more time consuming
>>> for series like this with many patches it can be useful to add a brief
>>> change log to the individual patches as well as the cover letter.
>>> That helps reviewers figure out where they need to look again.
>>>
>>> A few trivial things inline.
>>>
>>> With those fixed up
>>> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>>>
>>> Jonathan
>> Hi Jonathan,
>>
>> Do you have an example you can point me to with a change log in the
>> individual patch? I want to make certain I change correctly.
>>
> https://staticthinking.wordpress.com/2022/07/27/how-to-send-a-v2-patch/
>
> Just put a:
> ---
> v2: white space changes
>
> or whatever.
>
> regards,
> dan carpenter
>
Thanks Dan Carpenter.

-Terry