[v14] Enable CXL PCIe Port Protocol Error handling and logging

[PATCH v14 32/34] cxl: Update Endpoint uncorrectable protocol error handling

Posted by Terry Bowman 3 weeks, 4 days ago

The CXL drivers must support handling Endpoint CXL and PCI uncorrectable
(UCE) protocol errors. Update the drivers to support both.

Introduce cxl_pci_error_detected() to handle PCI correctable errors,
replacing cxl_error_detected(). Implement this new function to call
the existing CXL Port uncorrectable handler, cxl_port_error_detected().

Update cxl_port_error_detected() for Endpoint handling. Take the CXL
memory device lock, check for a valid driver, and handle restricted
CXL device (RCH) if needed. This is the same sequence initially in
cxl_error_detected(). But, the UCE handler's logic for the returned
result errors is simplified because recovery will not be tried and
instead UCE's will result in the CXL driver invoking system panic.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>

---

Changes in v13->v14:
- Update commit headline (Bjorn)
- Rename pci_error_detected()/pci_cor_error_detected() ->
  cxl_pci_error_detected/cxl_pci_cor_error_detected() (Jonathan)
- Remove now-invalid comment in cxl_error_detected() (Jonathan)
- Split into separate patches for UCE and CE (Terry)

Changes in v12->v13:
- Update commit messaqge (Terry)
- Updated all the implementation and commit message. (Terry)
- Refactored cxl_cor_error_detected()/cxl_error_detected() to remove
  pdev (Dave Jiang)

Changes in v11->v12:
- None

Changes in v10->v11:
- cxl_error_detected() - Change handlers' scoped_guard() to guard() (Jonathan)
- cxl_error_detected() - Remove extra line (Shiju)
- Changes moved to core/ras.c (Terry)
- cxl_error_detected(), remove 'ue' and return with function call. (Jonathan)
- Remove extra space in documentation for PCI_ERS_RESULT_PANIC definition
- Move #include "pci.h from cxl.h to core.h (Terry)
- Remove unnecessary includes of cxl.h and core.h in mem.c (Terry)
---
 drivers/cxl/core/core.h |  9 ++--
 drivers/cxl/core/ras.c  | 92 +++++++++++++++++++----------------------
 drivers/cxl/cxlpci.h    | 15 ++++---
 drivers/cxl/pci.c       |  6 +--
 4 files changed, 60 insertions(+), 62 deletions(-)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 39324e1b8940..96c6cf478427 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -4,6 +4,7 @@
 #ifndef __CXL_CORE_H__
 #define __CXL_CORE_H__
 
+#include <linux/pci.h>
 #include <cxl/mailbox.h>
 #include <linux/rwsem.h>
 
@@ -147,7 +148,7 @@ int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
 #ifdef CONFIG_CXL_RAS
 int cxl_ras_init(void);
 void cxl_ras_exit(void);
-bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base);
+pci_ers_result_t cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base);
 void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base);
 void cxl_dport_map_rch_aer(struct cxl_dport *dport);
 void cxl_disable_rch_root_ints(struct cxl_dport *dport);
@@ -158,11 +159,11 @@ static inline int cxl_ras_init(void)
 	return 0;
 }
 static inline void cxl_ras_exit(void) { }
-static inline bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
+static inline pci_ers_result_t cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
 {
-	return false;
+	return PCI_ERS_RESULT_NONE;
 }
-static inline void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base) { }
+static inline void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base) { }
 static inline void cxl_dport_map_rch_aer(struct cxl_dport *dport) { }
 static inline void cxl_disable_rch_root_ints(struct cxl_dport *dport) { }
 static inline void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds) { }
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 96ce85cc0a46..dc6e02d64821 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -218,6 +218,7 @@ static void __iomem *cxl_get_ras_base(struct device *dev)
 		return dport->regs.ras;
 	}
 	case PCI_EXP_TYPE_UPSTREAM:
+	case PCI_EXP_TYPE_ENDPOINT:
 	{
 		struct cxl_port *port __free(put_cxl_port) = find_cxl_port_by_uport(&pdev->dev);
 
@@ -302,20 +303,22 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
  * Log the state of the RAS status registers and prepare them to log the
  * next error status. Return 1 if reset needed.
  */
-bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
+pci_ers_result_t cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
 {
 	u32 hl[CXL_HEADERLOG_SIZE_U32];
 	void __iomem *addr;
 	u32 status;
 	u32 fe;
 
-	if (!ras_base)
-		return false;
+	if (!ras_base) {
+		dev_warn_once(dev, "CXL RAS register block is not mapped");
+		return PCI_ERS_RESULT_NONE;
+	}
 
 	addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
 	status = readl(addr);
 	if (!(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK))
-		return false;
+		return PCI_ERS_RESULT_NONE;
 
 	/* If multiple errors, log header points to first error from ctrl reg */
 	if (hweight32(status) > 1) {
@@ -337,7 +340,7 @@ bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
 
 	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
 
-	return true;
+	return PCI_ERS_RESULT_PANIC;
 }
 
 static void cxl_port_cor_error_detected(struct device *dev)
@@ -347,7 +350,30 @@ static void cxl_port_cor_error_detected(struct device *dev)
 
 static pci_ers_result_t cxl_port_error_detected(struct device *dev)
 {
-	return cxl_handle_ras(dev, 0, cxl_get_ras_base(dev));
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
+	u64 serial = 0;
+
+	if (is_cxl_endpoint(port)) {
+		struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
+		struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+		guard(device)(&cxlmd->dev);
+
+		if (!dev->driver) {
+			dev_warn(&pdev->dev,
+				 "%s: memdev disabled, abort error handling\n",
+				 dev_name(dev));
+			return PCI_ERS_RESULT_NONE;
+		}
+
+		if (cxlds->rcd)
+			cxl_handle_rdport_errors(cxlds);
+
+		serial = cxlds->serial;
+	}
+
+	return cxl_handle_ras(dev, serial, cxl_get_ras_base(dev));
 }
 
 void cxl_cor_error_detected(struct pci_dev *pdev)
@@ -373,55 +399,21 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
 
-pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
-				    pci_channel_state_t state)
+pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
+					pci_channel_state_t error)
 {
-	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
-	struct cxl_memdev *cxlmd = cxlds->cxlmd;
-	struct device *dev = &cxlmd->dev;
-	bool ue;
+	struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
+	pci_ers_result_t rc;
 
-	guard(device)(dev);
+	guard(device)(&port->dev);
 
-	if (!dev->driver) {
-		dev_warn(&pdev->dev,
-			 "%s: memdev disabled, abort error handling\n",
-			 dev_name(dev));
-		return PCI_ERS_RESULT_DISCONNECT;
-	}
+	rc = cxl_port_error_detected(&pdev->dev);
+	if (rc == PCI_ERS_RESULT_PANIC)
+		panic("CXL cachemem error.");
 
-	if (cxlds->rcd)
-		cxl_handle_rdport_errors(cxlds);
-	/*
-	 * A frozen channel indicates an impending reset which is fatal to
-	 * CXL.mem operation, and will likely crash the system. On the off
-	 * chance the situation is recoverable dump the status of the RAS
-	 * capability registers and bounce the active state of the memdev.
-	 */
-	ue = cxl_handle_ras(&cxlmd->dev, cxlds->serial,
-			    cxlmd->endpoint->regs.ras);
-
-	switch (state) {
-	case pci_channel_io_normal:
-		if (ue) {
-			device_release_driver(dev);
-			return PCI_ERS_RESULT_NEED_RESET;
-		}
-		return PCI_ERS_RESULT_CAN_RECOVER;
-	case pci_channel_io_frozen:
-		dev_warn(&pdev->dev,
-			 "%s: frozen state error detected, disable CXL.mem\n",
-			 dev_name(dev));
-		device_release_driver(dev);
-		return PCI_ERS_RESULT_NEED_RESET;
-	case pci_channel_io_perm_failure:
-		dev_warn(&pdev->dev,
-			 "failure state error detected, request disconnect\n");
-		return PCI_ERS_RESULT_DISCONNECT;
-	}
-	return PCI_ERS_RESULT_NEED_RESET;
+	return rc;
 }
-EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
+EXPORT_SYMBOL_NS_GPL(cxl_pci_error_detected, "CXL");
 
 static void cxl_handle_proto_error(struct cxl_proto_err_work_data *err_info)
 {
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 532506595d0f..f218b343e179 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -79,15 +79,20 @@ void read_cdat_data(struct cxl_port *port);
 
 #ifdef CONFIG_CXL_RAS
 void cxl_cor_error_detected(struct pci_dev *pdev);
-pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
-				    pci_channel_state_t state);
+pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
+					pci_channel_state_t error);
 void devm_cxl_dport_ras_setup(struct cxl_dport *dport);
 void devm_cxl_port_ras_setup(struct cxl_port *port);
+void __cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host);
+void __cxl_uport_init_ras_reporting(struct cxl_port *port,
+				    struct device *host);
+int __cxl_await_media_ready(struct cxl_dev_state *cxlds);
+resource_size_t __cxl_rcd_component_reg_phys(struct device *dev,
+					     struct cxl_dport *dport);
 #else
 static inline void cxl_cor_error_detected(struct pci_dev *pdev) { }
-
-static inline pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
-						  pci_channel_state_t state)
+static inline pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
+						      pci_channel_state_t state)
 {
 	return PCI_ERS_RESULT_NONE;
 }
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index acb0eb2a13c3..ff741adc7c7f 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -1051,8 +1051,8 @@ static void cxl_reset_done(struct pci_dev *pdev)
 	}
 }
 
-static const struct pci_error_handlers cxl_error_handlers = {
-	.error_detected	= cxl_error_detected,
+static const struct pci_error_handlers pci_error_handlers = {
+	.error_detected	= cxl_pci_error_detected,
 	.slot_reset	= cxl_slot_reset,
 	.resume		= cxl_error_resume,
 	.cor_error_detected	= cxl_cor_error_detected,
@@ -1063,7 +1063,7 @@ static struct pci_driver cxl_pci_driver = {
 	.name			= KBUILD_MODNAME,
 	.id_table		= cxl_mem_pci_tbl,
 	.probe			= cxl_pci_probe,
-	.err_handler		= &cxl_error_handlers,
+	.err_handler		= &pci_error_handlers,
 	.dev_groups		= cxl_rcd_groups,
 	.driver	= {
 		.probe_type	= PROBE_PREFER_ASYNCHRONOUS,
-- 
2.34.1

Re: [PATCH v14 32/34] cxl: Update Endpoint uncorrectable protocol error handling

Posted by dan.j.williams@intel.com 3 weeks, 3 days ago

Terry Bowman wrote:
> The CXL drivers must support handling Endpoint CXL and PCI uncorrectable
> (UCE) protocol errors. Update the drivers to support both.
> 
> Introduce cxl_pci_error_detected() to handle PCI correctable errors,
> replacing cxl_error_detected(). Implement this new function to call
> the existing CXL Port uncorrectable handler, cxl_port_error_detected().
> 
> Update cxl_port_error_detected() for Endpoint handling. Take the CXL
> memory device lock, check for a valid driver, and handle restricted
> CXL device (RCH) if needed. This is the same sequence initially in
> cxl_error_detected(). But, the UCE handler's logic for the returned
> result errors is simplified because recovery will not be tried and
> instead UCE's will result in the CXL driver invoking system panic.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> 
> ---
> 
> Changes in v13->v14:
> - Update commit headline (Bjorn)
> - Rename pci_error_detected()/pci_cor_error_detected() ->
>   cxl_pci_error_detected/cxl_pci_cor_error_detected() (Jonathan)
> - Remove now-invalid comment in cxl_error_detected() (Jonathan)
> - Split into separate patches for UCE and CE (Terry)
> 
> Changes in v12->v13:
> - Update commit messaqge (Terry)
> - Updated all the implementation and commit message. (Terry)
> - Refactored cxl_cor_error_detected()/cxl_error_detected() to remove
>   pdev (Dave Jiang)
> 
> Changes in v11->v12:
> - None
> 
> Changes in v10->v11:
> - cxl_error_detected() - Change handlers' scoped_guard() to guard() (Jonathan)
> - cxl_error_detected() - Remove extra line (Shiju)
> - Changes moved to core/ras.c (Terry)
> - cxl_error_detected(), remove 'ue' and return with function call. (Jonathan)
> - Remove extra space in documentation for PCI_ERS_RESULT_PANIC definition
> - Move #include "pci.h from cxl.h to core.h (Terry)
> - Remove unnecessary includes of cxl.h and core.h in mem.c (Terry)
[..]
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 96ce85cc0a46..dc6e02d64821 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
[..]
> @@ -373,55 +399,21 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
>  
> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> -				    pci_channel_state_t state)
> +pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
> +					pci_channel_state_t error)
>  {
> -	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> -	struct cxl_memdev *cxlmd = cxlds->cxlmd;
> -	struct device *dev = &cxlmd->dev;
> -	bool ue;
> +	struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
> +	pci_ers_result_t rc;
>  
> -	guard(device)(dev);
> +	guard(device)(&port->dev);
>  
> -	if (!dev->driver) {
> -		dev_warn(&pdev->dev,
> -			 "%s: memdev disabled, abort error handling\n",
> -			 dev_name(dev));
> -		return PCI_ERS_RESULT_DISCONNECT;
> -	}
> +	rc = cxl_port_error_detected(&pdev->dev);
> +	if (rc == PCI_ERS_RESULT_PANIC)
> +		panic("CXL cachemem error.");
[..]
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index acb0eb2a13c3..ff741adc7c7f 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -1051,8 +1051,8 @@ static void cxl_reset_done(struct pci_dev *pdev)
>  	}
>  }
>  
> -static const struct pci_error_handlers cxl_error_handlers = {
> -	.error_detected	= cxl_error_detected,
> +static const struct pci_error_handlers pci_error_handlers = {
> +	.error_detected	= cxl_pci_error_detected,

I still feel like we are disconnected on the fundamental question of who
is responsible for invoking CXL protocol error handling.

To be clear, all of this:

  cxl/port: Remove "enumerate dports" helpers
  cxl/port: Fix devm resource leaks around with dport management
  cxl/port: Move dport operations to a driver event
  cxl/port: Move dport RAS reporting to a port resource
  cxl/port: Move endpoint component register management to cxl_port
  cxl/port: Unify endpoint and switch port lookup

Was with the intent that cxl_pci and any other driver that creates a
cxl_memdev never needs to worry about CXL protocol error handling. It
comes "for free" by registering a "struct cxl_memdev".

This is the rationale for "struct pci_dev" to grow an "is_cxl"
attribute, and for the PCI core to learn how to forward PCIE internal
errors on CXL devices to the CXL core.

The only errors that cxl_pci needs to worry about are non-internal /
native PCI errors. All CXL errors will have already been routed to the
CXL core for generic handling based on a port lookup.

So the end state I am looking for is no call to
cxl_port_error_detected() from any 'struct pci_error_handlers'
implementation. Untangle that ambiguity in the AER core and do not
inflict it on every CXL driver that comes after.

I think we are close to that outcome if not already there by simply
deleting this last cxl_pci_error_detected() -> cxl_port_error_detected()
"false dependency".

Now, if an endpoint driver ever thinks it can do anything sane with CXL
protocol error beyond what the core is already handling, then we can
think about complications like passing a cxl_port error handler
template. I struggle to think of a case like that.

Re: [PATCH v14 32/34] cxl: Update Endpoint uncorrectable protocol error handling

Posted by Bowman, Terry 3 weeks, 3 days ago

On 1/14/2026 4:07 PM, dan.j.williams@intel.com wrote:
> Terry Bowman wrote:
>> The CXL drivers must support handling Endpoint CXL and PCI uncorrectable
>> (UCE) protocol errors. Update the drivers to support both.
>>
>> Introduce cxl_pci_error_detected() to handle PCI correctable errors,
>> replacing cxl_error_detected(). Implement this new function to call
>> the existing CXL Port uncorrectable handler, cxl_port_error_detected().
>>
>> Update cxl_port_error_detected() for Endpoint handling. Take the CXL
>> memory device lock, check for a valid driver, and handle restricted
>> CXL device (RCH) if needed. This is the same sequence initially in
>> cxl_error_detected(). But, the UCE handler's logic for the returned
>> result errors is simplified because recovery will not be tried and
>> instead UCE's will result in the CXL driver invoking system panic.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>
>> ---
>>
>> Changes in v13->v14:
>> - Update commit headline (Bjorn)
>> - Rename pci_error_detected()/pci_cor_error_detected() ->
>>   cxl_pci_error_detected/cxl_pci_cor_error_detected() (Jonathan)
>> - Remove now-invalid comment in cxl_error_detected() (Jonathan)
>> - Split into separate patches for UCE and CE (Terry)
>>
>> Changes in v12->v13:
>> - Update commit messaqge (Terry)
>> - Updated all the implementation and commit message. (Terry)
>> - Refactored cxl_cor_error_detected()/cxl_error_detected() to remove
>>   pdev (Dave Jiang)
>>
>> Changes in v11->v12:
>> - None
>>
>> Changes in v10->v11:
>> - cxl_error_detected() - Change handlers' scoped_guard() to guard() (Jonathan)
>> - cxl_error_detected() - Remove extra line (Shiju)
>> - Changes moved to core/ras.c (Terry)
>> - cxl_error_detected(), remove 'ue' and return with function call. (Jonathan)
>> - Remove extra space in documentation for PCI_ERS_RESULT_PANIC definition
>> - Move #include "pci.h from cxl.h to core.h (Terry)
>> - Remove unnecessary includes of cxl.h and core.h in mem.c (Terry)
> [..]
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index 96ce85cc0a46..dc6e02d64821 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
> [..]
>> @@ -373,55 +399,21 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
>>  }
>>  EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
>>  
>> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>> -				    pci_channel_state_t state)
>> +pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
>> +					pci_channel_state_t error)
>>  {
>> -	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>> -	struct cxl_memdev *cxlmd = cxlds->cxlmd;
>> -	struct device *dev = &cxlmd->dev;
>> -	bool ue;
>> +	struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
>> +	pci_ers_result_t rc;
>>  
>> -	guard(device)(dev);
>> +	guard(device)(&port->dev);
>>  
>> -	if (!dev->driver) {
>> -		dev_warn(&pdev->dev,
>> -			 "%s: memdev disabled, abort error handling\n",
>> -			 dev_name(dev));
>> -		return PCI_ERS_RESULT_DISCONNECT;
>> -	}
>> +	rc = cxl_port_error_detected(&pdev->dev);
>> +	if (rc == PCI_ERS_RESULT_PANIC)
>> +		panic("CXL cachemem error.");
> [..]
>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>> index acb0eb2a13c3..ff741adc7c7f 100644
>> --- a/drivers/cxl/pci.c
>> +++ b/drivers/cxl/pci.c
>> @@ -1051,8 +1051,8 @@ static void cxl_reset_done(struct pci_dev *pdev)
>>  	}
>>  }
>>  
>> -static const struct pci_error_handlers cxl_error_handlers = {
>> -	.error_detected	= cxl_error_detected,
>> +static const struct pci_error_handlers pci_error_handlers = {
>> +	.error_detected	= cxl_pci_error_detected,
> 
> I still feel like we are disconnected on the fundamental question of who
> is responsible for invoking CXL protocol error handling.
> 
> To be clear, all of this:
> 
>   cxl/port: Remove "enumerate dports" helpers
>   cxl/port: Fix devm resource leaks around with dport management
>   cxl/port: Move dport operations to a driver event
>   cxl/port: Move dport RAS reporting to a port resource
>   cxl/port: Move endpoint component register management to cxl_port
>   cxl/port: Unify endpoint and switch port lookup
> 
> Was with the intent that cxl_pci and any other driver that creates a
> cxl_memdev never needs to worry about CXL protocol error handling. It
> comes "for free" by registering a "struct cxl_memdev".
> 
> This is the rationale for "struct pci_dev" to grow an "is_cxl"
> attribute, and for the PCI core to learn how to forward PCIE internal
> errors on CXL devices to the CXL core.
> 
> The only errors that cxl_pci needs to worry about are non-internal /
> native PCI errors. All CXL errors will have already been routed to the
> CXL core for generic handling based on a port lookup.
> 
> So the end state I am looking for is no call to
> cxl_port_error_detected() from any 'struct pci_error_handlers'
> implementation. Untangle that ambiguity in the AER core and do not
> inflict it on every CXL driver that comes after.
> 
> I think we are close to that outcome if not already there by simply
> deleting this last cxl_pci_error_detected() -> cxl_port_error_detected()
> "false dependency".
> 
> Now, if an endpoint driver ever thinks it can do anything sane with CXL
> protocol error beyond what the core is already handling, then we can
> think about complications like passing a cxl_port error handler
> template. I struggle to think of a case like that.

Thanks for explaining. If I understand correctly the CXL PCI error handlers
should only look at AER (no CXL RAS). We probably don't need a CXL PCI CE 
handler in this case either because the AER is already handled & logged by 
the AER driver. The UCE CXL PCI handler is needed to return a pci_ers_result 
to the AER driver. How does this sound ?

-Terry

Re: [PATCH v14 32/34] cxl: Update Endpoint uncorrectable protocol error handling

Posted by Bowman, Terry 3 weeks, 3 days ago

On 1/14/2026 4:07 PM, dan.j.williams@intel.com wrote:
> Terry Bowman wrote:
>> The CXL drivers must support handling Endpoint CXL and PCI uncorrectable
>> (UCE) protocol errors. Update the drivers to support both.
>>
>> Introduce cxl_pci_error_detected() to handle PCI correctable errors,
>> replacing cxl_error_detected(). Implement this new function to call
>> the existing CXL Port uncorrectable handler, cxl_port_error_detected().
>>
>> Update cxl_port_error_detected() for Endpoint handling. Take the CXL
>> memory device lock, check for a valid driver, and handle restricted
>> CXL device (RCH) if needed. This is the same sequence initially in
>> cxl_error_detected(). But, the UCE handler's logic for the returned
>> result errors is simplified because recovery will not be tried and
>> instead UCE's will result in the CXL driver invoking system panic.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>
>> ---
>>
>> Changes in v13->v14:
>> - Update commit headline (Bjorn)
>> - Rename pci_error_detected()/pci_cor_error_detected() ->
>>   cxl_pci_error_detected/cxl_pci_cor_error_detected() (Jonathan)
>> - Remove now-invalid comment in cxl_error_detected() (Jonathan)
>> - Split into separate patches for UCE and CE (Terry)
>>
>> Changes in v12->v13:
>> - Update commit messaqge (Terry)
>> - Updated all the implementation and commit message. (Terry)
>> - Refactored cxl_cor_error_detected()/cxl_error_detected() to remove
>>   pdev (Dave Jiang)
>>
>> Changes in v11->v12:
>> - None
>>
>> Changes in v10->v11:
>> - cxl_error_detected() - Change handlers' scoped_guard() to guard() (Jonathan)
>> - cxl_error_detected() - Remove extra line (Shiju)
>> - Changes moved to core/ras.c (Terry)
>> - cxl_error_detected(), remove 'ue' and return with function call. (Jonathan)
>> - Remove extra space in documentation for PCI_ERS_RESULT_PANIC definition
>> - Move #include "pci.h from cxl.h to core.h (Terry)
>> - Remove unnecessary includes of cxl.h and core.h in mem.c (Terry)
> [..]
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index 96ce85cc0a46..dc6e02d64821 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
> [..]
>> @@ -373,55 +399,21 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
>>  }
>>  EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
>>  
>> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>> -				    pci_channel_state_t state)
>> +pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
>> +					pci_channel_state_t error)
>>  {
>> -	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>> -	struct cxl_memdev *cxlmd = cxlds->cxlmd;
>> -	struct device *dev = &cxlmd->dev;
>> -	bool ue;
>> +	struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
>> +	pci_ers_result_t rc;
>>  
>> -	guard(device)(dev);
>> +	guard(device)(&port->dev);
>>  
>> -	if (!dev->driver) {
>> -		dev_warn(&pdev->dev,
>> -			 "%s: memdev disabled, abort error handling\n",
>> -			 dev_name(dev));
>> -		return PCI_ERS_RESULT_DISCONNECT;
>> -	}
>> +	rc = cxl_port_error_detected(&pdev->dev);
>> +	if (rc == PCI_ERS_RESULT_PANIC)
>> +		panic("CXL cachemem error.");
> [..]
>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>> index acb0eb2a13c3..ff741adc7c7f 100644
>> --- a/drivers/cxl/pci.c
>> +++ b/drivers/cxl/pci.c
>> @@ -1051,8 +1051,8 @@ static void cxl_reset_done(struct pci_dev *pdev)
>>  	}
>>  }
>>  
>> -static const struct pci_error_handlers cxl_error_handlers = {
>> -	.error_detected	= cxl_error_detected,
>> +static const struct pci_error_handlers pci_error_handlers = {
>> +	.error_detected	= cxl_pci_error_detected,
> 
> I still feel like we are disconnected on the fundamental question of who
> is responsible for invoking CXL protocol error handling.
> 
> To be clear, all of this:
> 
>   cxl/port: Remove "enumerate dports" helpers
>   cxl/port: Fix devm resource leaks around with dport management
>   cxl/port: Move dport operations to a driver event
>   cxl/port: Move dport RAS reporting to a port resource
>   cxl/port: Move endpoint component register management to cxl_port
>   cxl/port: Unify endpoint and switch port lookup
> 
> Was with the intent that cxl_pci and any other driver that creates a
> cxl_memdev never needs to worry about CXL protocol error handling. It
> comes "for free" by registering a "struct cxl_memdev".
> 
> This is the rationale for "struct pci_dev" to grow an "is_cxl"
> attribute, and for the PCI core to learn how to forward PCIE internal
> errors on CXL devices to the CXL core.
> 
> The only errors that cxl_pci needs to worry about are non-internal /
> native PCI errors. All CXL errors will have already been routed to the
> CXL core for generic handling based on a port lookup.
> 
> So the end state I am looking for is no call to
> cxl_port_error_detected() from any 'struct pci_error_handlers'
> implementation. Untangle that ambiguity in the AER core and do not
> inflict it on every CXL driver that comes after.
> 
> I think we are close to that outcome if not already there by simply
> deleting this last cxl_pci_error_detected() -> cxl_port_error_detected()
> "false dependency".
> 
> Now, if an endpoint driver ever thinks it can do anything sane with CXL
> protocol error beyond what the core is already handling, then we can
> think about complications like passing a cxl_port error handler
> template. I struggle to think of a case like that.

Thanks for explaining. If I understand correctly the CXL PCI error handlers
should only look at AER (no CXL RAS). We probably don't need a CXL PCI CE 
handler in this case either because the AER is already handled & logged by 
the AER driver. The UCE CXL PCI handler is needed to return a pci_ers_result 
to the AER driver. How does this sound ?

-Terry