[PATCH v15 4/9] PCI/AER: Dequeue forwarded CXL error

Terry Bowman posted 9 patches 5 days, 15 hours ago
[PATCH v15 4/9] PCI/AER: Dequeue forwarded CXL error
Posted by Terry Bowman 5 days, 15 hours ago
The AER driver now forwards CXL protocol errors to the CXL driver via a
kfifo. The CXL driver must consume these work items and initiate protocol
error handling while ensuring the device's RAS mappings remain valid
throughout processing.

Implement cxl_proto_err_work_fn() to dequeue work items forwarded by the
AER service driver. Lock the parent CXL Port device to ensure the CXL
device's RAS registers are accessible during handling. Add pdev reference-put
to match reference-get in AER driver. This will ensure pdev access after
kfifo dequeue. These changes apply to CXL Ports and CXL Endpoints.

Update is_cxl_error() to recognize CXL Port devices with errors.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>

---

Changes in v14->v15:
- Move pci_dev_get() to first patch (Dave)
- Move in is_cxl_error() change from later patch (Terry)
- Use pr_err_ratelimited() with PCI device name (Terry)

Changes in v13->v14:
- Update commit title's prefix (Bjorn)
- Add pdev ref get in AER driver before enqueue and add pdev ref put in
  CXL driver after dequeue and handling (Dan)
- Removed handling to simplify patch context (Terry)

Changes in v12->v13:
- Add cxlmd lock using guard() (Terry)
- Remove exporting of unused function, pci_aer_clear_fatal_status() (Dave Jiang)
- Change pr_err() calls to ratelimited. (Terry)
- Update commit message. (Terry)
- Remove namespace qualifier from pcie_clear_device_status()
  export (Dave Jiang)
- Move locks into cxl_proto_err_work_fn() (Dave)
- Update log messages in cxl_forward_error() (Ben)

Changes in v11->v12:
- Add guard for CE case in cxl_handle_proto_error() (Dave)

Changes in v10->v11:
- Reword patch commit message to remove RCiEP details (Jonathan)
- Add #include <linux/bitfield.h> (Terry)
- is_cxl_rcd() - Fix short comment message wrap  (Jonathan)
- is_cxl_rcd() - Combine return calls into 1  (Jonathan)
- cxl_handle_proto_error() - Move comment earlier  (Jonathan)
- Use FIELD_GET() in discovering class code (Jonathan)
- Remove BDF from cxl_proto_err_work_data. Use 'struct
pci_dev *' (Dan)
---
 drivers/cxl/core/core.h       |   3 +
 drivers/cxl/core/port.c       |   6 +-
 drivers/cxl/core/ras.c        | 106 ++++++++++++++++++++++++++++++----
 drivers/pci/pcie/aer_cxl_vh.c |   5 +-
 4 files changed, 105 insertions(+), 15 deletions(-)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index c6cfaf2720e1..92aea110817d 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -182,6 +182,9 @@ static inline void devm_cxl_dport_ras_setup(struct cxl_dport *dport) { }
 #endif /* CONFIG_CXL_RAS */
 
 int cxl_gpf_port_setup(struct cxl_dport *dport);
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+			       struct cxl_dport **dport);
+struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev);
 
 struct cxl_hdm;
 int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index ee7d14528867..8e30a3e7f610 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1402,8 +1402,8 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
 	return NULL;
 }
 
-static struct cxl_port *find_cxl_port(struct device *dport_dev,
-				      struct cxl_dport **dport)
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+			       struct cxl_dport **dport)
 {
 	struct cxl_find_port_ctx ctx = {
 		.dport_dev = dport_dev,
@@ -1607,7 +1607,7 @@ static int match_port_by_uport(struct device *dev, const void *data)
  * Function takes a device reference on the port device. Caller should do a
  * put_device() when done.
  */
-static struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
+struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
 {
 	struct device *dev;
 
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 74df561ed32e..a6c0bc6d7203 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -118,17 +118,6 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
 }
 static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
 
-int cxl_ras_init(void)
-{
-	return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
-}
-
-void cxl_ras_exit(void)
-{
-	cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
-	cancel_work_sync(&cxl_cper_prot_err_work);
-}
-
 static void cxl_dport_map_ras(struct cxl_dport *dport)
 {
 	struct cxl_register_map *map = &dport->reg_map;
@@ -185,6 +174,50 @@ void devm_cxl_port_ras_setup(struct cxl_port *port)
 }
 EXPORT_SYMBOL_NS_GPL(devm_cxl_port_ras_setup, "CXL");
 
+/*
+ * get_cxl_port - Return the parent CXL Port of a PCI device
+ * @pdev: PCI device whose parent CXL Port is being queried
+ *
+ * Looks up and returns the parent CXL Port associated with @pdev. On
+ * success, the returned port has its reference count incremented and must
+ * be released by the caller. Returns NULL if no associated CXL port is
+ * found.
+ *
+ * Return: Pointer to the parent &struct cxl_port or NULL on failure
+ */
+static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
+{
+	switch (pci_pcie_type(pdev)) {
+	case PCI_EXP_TYPE_ROOT_PORT:
+	case PCI_EXP_TYPE_DOWNSTREAM:
+	{
+		struct cxl_dport *dport;
+		struct cxl_port *port = find_cxl_port(&pdev->dev, &dport);
+
+		if (!port) {
+			pci_err(pdev, "Failed to find the CXL device");
+			return NULL;
+		}
+		return port;
+	}
+	case PCI_EXP_TYPE_UPSTREAM:
+	case PCI_EXP_TYPE_ENDPOINT:
+	{
+		struct cxl_port *port = find_cxl_port_by_uport(&pdev->dev);
+
+		if (!port) {
+			pci_err(pdev, "Failed to find the CXL device");
+			return NULL;
+		}
+		return port;
+	}
+	}
+
+	pr_err_ratelimited("%s: Error - Unsupported device type (%#x)",
+			   pci_name(pdev), pci_pcie_type(pdev));
+	return NULL;
+}
+
 void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
 {
 	void __iomem *addr;
@@ -327,3 +360,54 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 	return PCI_ERS_RESULT_NEED_RESET;
 }
 EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
+
+static void cxl_handle_proto_error(struct cxl_proto_err_work_data *err_info)
+{
+}
+
+static void cxl_proto_err_work_fn(struct work_struct *work)
+{
+	struct cxl_proto_err_work_data wd;
+
+	while (cxl_proto_err_kfifo_get(&wd)) {
+		struct pci_dev *pdev __free(pci_dev_put) = wd.pdev;
+
+		if (!pdev) {
+			pr_err_ratelimited("%s: NULL PCI device passed in AER-CXL KFifo\n",
+					   pci_name(pdev));
+			continue;
+		}
+
+		struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
+		if (!port) {
+			pr_err_ratelimited("%s: Failed to find parent port device in CXL topology\n",
+					   pci_name(pdev));
+			continue;
+		}
+		guard(device)(&port->dev);
+
+		cxl_handle_proto_error(&wd);
+	}
+}
+
+static struct work_struct cxl_proto_err_work;
+static DECLARE_WORK(cxl_proto_err_work, cxl_proto_err_work_fn);
+
+int cxl_ras_init(void)
+{
+	if (cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work))
+		pr_err("Failed to initialize CXL RAS CPER\n");
+
+	cxl_register_proto_err_work(&cxl_proto_err_work);
+
+	return 0;
+}
+
+void cxl_ras_exit(void)
+{
+	cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
+	cancel_work_sync(&cxl_cper_prot_err_work);
+
+	cxl_unregister_proto_err_work();
+	cancel_work_sync(&cxl_proto_err_work);
+}
diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
index de8bca383159..6bcd6271afbf 100644
--- a/drivers/pci/pcie/aer_cxl_vh.c
+++ b/drivers/pci/pcie/aer_cxl_vh.c
@@ -34,7 +34,10 @@ bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
 	if (!info || !info->is_cxl)
 		return false;
 
-	if (pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT)
+	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
+	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) &&
+	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_UPSTREAM) &&
+	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_DOWNSTREAM))
 		return false;
 
 	return is_aer_internal_error(info);
-- 
2.34.1
Re: [PATCH v15 4/9] PCI/AER: Dequeue forwarded CXL error
Posted by dan.j.williams@intel.com 4 days, 13 hours ago
Terry Bowman wrote:
> The AER driver now forwards CXL protocol errors to the CXL driver via a
> kfifo. The CXL driver must consume these work items and initiate protocol
> error handling while ensuring the device's RAS mappings remain valid
> throughout processing.
> 
> Implement cxl_proto_err_work_fn() to dequeue work items forwarded by the
> AER service driver. Lock the parent CXL Port device to ensure the CXL
> device's RAS registers are accessible during handling. Add pdev reference-put
> to match reference-get in AER driver. This will ensure pdev access after
> kfifo dequeue. These changes apply to CXL Ports and CXL Endpoints.
> 
> Update is_cxl_error() to recognize CXL Port devices with errors.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> 
> ---
> 
> Changes in v14->v15:
> - Move pci_dev_get() to first patch (Dave)
> - Move in is_cxl_error() change from later patch (Terry)
> - Use pr_err_ratelimited() with PCI device name (Terry)
> 
> Changes in v13->v14:
> - Update commit title's prefix (Bjorn)
> - Add pdev ref get in AER driver before enqueue and add pdev ref put in
>   CXL driver after dequeue and handling (Dan)
> - Removed handling to simplify patch context (Terry)
> 
> Changes in v12->v13:
> - Add cxlmd lock using guard() (Terry)
> - Remove exporting of unused function, pci_aer_clear_fatal_status() (Dave Jiang)
> - Change pr_err() calls to ratelimited. (Terry)
> - Update commit message. (Terry)
> - Remove namespace qualifier from pcie_clear_device_status()
>   export (Dave Jiang)
> - Move locks into cxl_proto_err_work_fn() (Dave)
> - Update log messages in cxl_forward_error() (Ben)
> 
> Changes in v11->v12:
> - Add guard for CE case in cxl_handle_proto_error() (Dave)
> 
> Changes in v10->v11:
> - Reword patch commit message to remove RCiEP details (Jonathan)
> - Add #include <linux/bitfield.h> (Terry)
> - is_cxl_rcd() - Fix short comment message wrap  (Jonathan)
> - is_cxl_rcd() - Combine return calls into 1  (Jonathan)
> - cxl_handle_proto_error() - Move comment earlier  (Jonathan)
> - Use FIELD_GET() in discovering class code (Jonathan)
> - Remove BDF from cxl_proto_err_work_data. Use 'struct
> pci_dev *' (Dan)
> ---
>  drivers/cxl/core/core.h       |   3 +
>  drivers/cxl/core/port.c       |   6 +-
>  drivers/cxl/core/ras.c        | 106 ++++++++++++++++++++++++++++++----
>  drivers/pci/pcie/aer_cxl_vh.c |   5 +-
>  4 files changed, 105 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index c6cfaf2720e1..92aea110817d 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -182,6 +182,9 @@ static inline void devm_cxl_dport_ras_setup(struct cxl_dport *dport) { }
>  #endif /* CONFIG_CXL_RAS */
>  
>  int cxl_gpf_port_setup(struct cxl_dport *dport);
> +struct cxl_port *find_cxl_port(struct device *dport_dev,
> +			       struct cxl_dport **dport);
> +struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev);
>  
>  struct cxl_hdm;
>  int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index ee7d14528867..8e30a3e7f610 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -1402,8 +1402,8 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
>  	return NULL;
>  }
>  
> -static struct cxl_port *find_cxl_port(struct device *dport_dev,
> -				      struct cxl_dport **dport)
> +struct cxl_port *find_cxl_port(struct device *dport_dev,
> +			       struct cxl_dport **dport)
>  {
>  	struct cxl_find_port_ctx ctx = {
>  		.dport_dev = dport_dev,
> @@ -1607,7 +1607,7 @@ static int match_port_by_uport(struct device *dev, const void *data)
>   * Function takes a device reference on the port device. Caller should do a
>   * put_device() when done.
>   */
> -static struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
> +struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
>  {
>  	struct device *dev;
>  
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 74df561ed32e..a6c0bc6d7203 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -118,17 +118,6 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
>  }
>  static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
>  
> -int cxl_ras_init(void)
> -{
> -	return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
> -}
> -
> -void cxl_ras_exit(void)
> -{
> -	cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
> -	cancel_work_sync(&cxl_cper_prot_err_work);
> -}
> -
>  static void cxl_dport_map_ras(struct cxl_dport *dport)
>  {
>  	struct cxl_register_map *map = &dport->reg_map;
> @@ -185,6 +174,50 @@ void devm_cxl_port_ras_setup(struct cxl_port *port)
>  }
>  EXPORT_SYMBOL_NS_GPL(devm_cxl_port_ras_setup, "CXL");
>  
> +/*
> + * get_cxl_port - Return the parent CXL Port of a PCI device
> + * @pdev: PCI device whose parent CXL Port is being queried
> + *
> + * Looks up and returns the parent CXL Port associated with @pdev. On
> + * success, the returned port has its reference count incremented and must
> + * be released by the caller. Returns NULL if no associated CXL port is
> + * found.
> + *
> + * Return: Pointer to the parent &struct cxl_port or NULL on failure
> + */
> +static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
> +{
> +	switch (pci_pcie_type(pdev)) {
> +	case PCI_EXP_TYPE_ROOT_PORT:
> +	case PCI_EXP_TYPE_DOWNSTREAM:
> +	{
> +		struct cxl_dport *dport;
> +		struct cxl_port *port = find_cxl_port(&pdev->dev, &dport);
> +
> +		if (!port) {
> +			pci_err(pdev, "Failed to find the CXL device");
> +			return NULL;
> +		}
> +		return port;
> +	}
> +	case PCI_EXP_TYPE_UPSTREAM:
> +	case PCI_EXP_TYPE_ENDPOINT:
> +	{
> +		struct cxl_port *port = find_cxl_port_by_uport(&pdev->dev);
> +
> +		if (!port) {
> +			pci_err(pdev, "Failed to find the CXL device");
> +			return NULL;
> +		}
> +		return port;
> +	}
> +	}
> +
> +	pr_err_ratelimited("%s: Error - Unsupported device type (%#x)",
> +			   pci_name(pdev), pci_pcie_type(pdev));
> +	return NULL;
> +}
> +
>  void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
>  {
>  	void __iomem *addr;
> @@ -327,3 +360,54 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>  	return PCI_ERS_RESULT_NEED_RESET;
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
> +
> +static void cxl_handle_proto_error(struct cxl_proto_err_work_data *err_info)
> +{
> +}
> +
> +static void cxl_proto_err_work_fn(struct work_struct *work)
> +{
> +	struct cxl_proto_err_work_data wd;
> +
> +	while (cxl_proto_err_kfifo_get(&wd)) {
> +		struct pci_dev *pdev __free(pci_dev_put) = wd.pdev;

This is a bit clever, might want a comment that it pairs with the
pci_dev_get() in cxl_forward_error().

> +
> +		if (!pdev) {

There is no way for pdev to be NULL in this path.
cxl_cper_handle_prot_err() is different because the CPER record is not
100% reliable and pci_get_domain_bus_and_slot() can fail. No worries
about that here.

> +			pr_err_ratelimited("%s: NULL PCI device passed in AER-CXL KFifo\n",
> +					   pci_name(pdev));
> +			continue;
> +		}
> +
> +		struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
> +		if (!port) {
> +			pr_err_ratelimited("%s: Failed to find parent port device in CXL topology\n",
> +					   pci_name(pdev));
> +			continue;
> +		}
> +		guard(device)(&port->dev);

I expect this also wants to check port->dev.driver?

> +		cxl_handle_proto_error(&wd);

It feels odd to keep passing @wd when the pdev and port have already been
extracted. ...but that is minor.

> +	}
> +}
> +
> +static struct work_struct cxl_proto_err_work;
> +static DECLARE_WORK(cxl_proto_err_work, cxl_proto_err_work_fn);
> +
> +int cxl_ras_init(void)
> +{
> +	if (cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work))
> +		pr_err("Failed to initialize CXL RAS CPER\n");

cxl_cper_register_prot_err_work() should return void like
cxl_register_proto_err_work(). If someone registers a NULL work that is
the same as not registering anything, caller gets to keep the pieces.

No real bugs that need addressing, so:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Re: [PATCH v15 4/9] PCI/AER: Dequeue forwarded CXL error
Posted by Jonathan Cameron 5 days, 2 hours ago
On Mon, 2 Feb 2026 20:52:39 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> The AER driver now forwards CXL protocol errors to the CXL driver via a
> kfifo. The CXL driver must consume these work items and initiate protocol
> error handling while ensuring the device's RAS mappings remain valid
> throughout processing.
> 
> Implement cxl_proto_err_work_fn() to dequeue work items forwarded by the
> AER service driver. Lock the parent CXL Port device to ensure the CXL
> device's RAS registers are accessible during handling. Add pdev reference-put
> to match reference-get in AER driver. This will ensure pdev access after
> kfifo dequeue. These changes apply to CXL Ports and CXL Endpoints.
> 
> Update is_cxl_error() to recognize CXL Port devices with errors.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>

There are some small functional changes to existing paths (maybe)
that I think need explanations in this commit message.

Otherwise, one suggests small simplification.

> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 74df561ed32e..a6c0bc6d7203 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -118,17 +118,6 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
>  }
>  static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
>  
> -int cxl_ras_init(void)
> -{
> -	return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
> -}
> -
> -void cxl_ras_exit(void)
> -{
> -	cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
> -	cancel_work_sync(&cxl_cper_prot_err_work);
> -}
> -
>  static void cxl_dport_map_ras(struct cxl_dport *dport)
>  {
>  	struct cxl_register_map *map = &dport->reg_map;
> @@ -185,6 +174,50 @@ void devm_cxl_port_ras_setup(struct cxl_port *port)
>  }
>  EXPORT_SYMBOL_NS_GPL(devm_cxl_port_ras_setup, "CXL");
>  
> +/*
> + * get_cxl_port - Return the parent CXL Port of a PCI device
> + * @pdev: PCI device whose parent CXL Port is being queried
> + *
> + * Looks up and returns the parent CXL Port associated with @pdev. On
> + * success, the returned port has its reference count incremented and must
> + * be released by the caller. Returns NULL if no associated CXL port is
> + * found.
> + *
> + * Return: Pointer to the parent &struct cxl_port or NULL on failure
> + */
> +static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
> +{
> +	switch (pci_pcie_type(pdev)) {
> +	case PCI_EXP_TYPE_ROOT_PORT:
> +	case PCI_EXP_TYPE_DOWNSTREAM:
> +	{
> +		struct cxl_dport *dport;
> +		struct cxl_port *port = find_cxl_port(&pdev->dev, &dport);

Can you pass NULL for dport?  Looks like it to me as that ultimately ends
up in match_port_by_dport() and 
if (ctx->dport)
	*ctx->dport = dport;

where with this as null means ctx->dport == NULL.

> +
> +		if (!port) {
> +			pci_err(pdev, "Failed to find the CXL device");
> +			return NULL;
> +		}
> +		return port;
> +	}
> +	case PCI_EXP_TYPE_UPSTREAM:
> +	case PCI_EXP_TYPE_ENDPOINT:
> +	{
> +		struct cxl_port *port = find_cxl_port_by_uport(&pdev->dev);
> +
> +		if (!port) {
> +			pci_err(pdev, "Failed to find the CXL device");
> +			return NULL;
> +		}
> +		return port;
> +	}
> +	}
> +
> +	pr_err_ratelimited("%s: Error - Unsupported device type (%#x)",
> +			   pci_name(pdev), pci_pcie_type(pdev));
> +	return NULL;
> +}


> +int cxl_ras_init(void)
> +{
> +	if (cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work))
> +		pr_err("Failed to initialize CXL RAS CPER\n");

Why introduce a new error print?  I don't particularly mind
but wasn't obvious to me why one has become appropriate and why only
for the first call here.

More importantly - if this failed it would previously have resulted
in cxl_core_init() failing and things getting torn down.

> +
> +	cxl_register_proto_err_work(&cxl_proto_err_work);
> +
> +	return 0;
> +}
> +
> +void cxl_ras_exit(void)
> +{
> +	cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
> +	cancel_work_sync(&cxl_cper_prot_err_work);
> +
> +	cxl_unregister_proto_err_work();
> +	cancel_work_sync(&cxl_proto_err_work);
> +}
Re: [PATCH v15 4/9] PCI/AER: Dequeue forwarded CXL error
Posted by Bowman, Terry 5 days, 1 hour ago
On 2/3/2026 9:26 AM, Jonathan Cameron wrote:
> On Mon, 2 Feb 2026 20:52:39 -0600
> Terry Bowman <terry.bowman@amd.com> wrote:
> 
>> The AER driver now forwards CXL protocol errors to the CXL driver via a
>> kfifo. The CXL driver must consume these work items and initiate protocol
>> error handling while ensuring the device's RAS mappings remain valid
>> throughout processing.
>>
>> Implement cxl_proto_err_work_fn() to dequeue work items forwarded by the
>> AER service driver. Lock the parent CXL Port device to ensure the CXL
>> device's RAS registers are accessible during handling. Add pdev reference-put
>> to match reference-get in AER driver. This will ensure pdev access after
>> kfifo dequeue. These changes apply to CXL Ports and CXL Endpoints.
>>
>> Update is_cxl_error() to recognize CXL Port devices with errors.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> 
> There are some small functional changes to existing paths (maybe)
> that I think need explanations in this commit message.
> 
> Otherwise, one suggests small simplification.
> 
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index 74df561ed32e..a6c0bc6d7203 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
>> @@ -118,17 +118,6 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
>>  }
>>  static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
>>  
>> -int cxl_ras_init(void)
>> -{
>> -	return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
>> -}
>> -
>> -void cxl_ras_exit(void)
>> -{
>> -	cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
>> -	cancel_work_sync(&cxl_cper_prot_err_work);
>> -}
>> -
>>  static void cxl_dport_map_ras(struct cxl_dport *dport)
>>  {
>>  	struct cxl_register_map *map = &dport->reg_map;
>> @@ -185,6 +174,50 @@ void devm_cxl_port_ras_setup(struct cxl_port *port)
>>  }
>>  EXPORT_SYMBOL_NS_GPL(devm_cxl_port_ras_setup, "CXL");
>>  
>> +/*
>> + * get_cxl_port - Return the parent CXL Port of a PCI device
>> + * @pdev: PCI device whose parent CXL Port is being queried
>> + *
>> + * Looks up and returns the parent CXL Port associated with @pdev. On
>> + * success, the returned port has its reference count incremented and must
>> + * be released by the caller. Returns NULL if no associated CXL port is
>> + * found.
>> + *
>> + * Return: Pointer to the parent &struct cxl_port or NULL on failure
>> + */
>> +static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
>> +{
>> +	switch (pci_pcie_type(pdev)) {
>> +	case PCI_EXP_TYPE_ROOT_PORT:
>> +	case PCI_EXP_TYPE_DOWNSTREAM:
>> +	{
>> +		struct cxl_dport *dport;
>> +		struct cxl_port *port = find_cxl_port(&pdev->dev, &dport);
> 
> Can you pass NULL for dport?  Looks like it to me as that ultimately ends
> up in match_port_by_dport() and 
> if (ctx->dport)
> 	*ctx->dport = dport;
> 
> where with this as null means ctx->dport == NULL.
> 

Yes.


>> +
>> +		if (!port) {
>> +			pci_err(pdev, "Failed to find the CXL device");
>> +			return NULL;
>> +		}
>> +		return port;
>> +	}
>> +	case PCI_EXP_TYPE_UPSTREAM:
>> +	case PCI_EXP_TYPE_ENDPOINT:
>> +	{
>> +		struct cxl_port *port = find_cxl_port_by_uport(&pdev->dev);
>> +
>> +		if (!port) {
>> +			pci_err(pdev, "Failed to find the CXL device");
>> +			return NULL;
>> +		}
>> +		return port;
>> +	}
>> +	}
>> +
>> +	pr_err_ratelimited("%s: Error - Unsupported device type (%#x)",
>> +			   pci_name(pdev), pci_pcie_type(pdev));
>> +	return NULL;
>> +}
> 
> 
>> +int cxl_ras_init(void)
>> +{
>> +	if (cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work))
>> +		pr_err("Failed to initialize CXL RAS CPER\n");
> 
> Why introduce a new error print?  I don't particularly mind
> but wasn't obvious to me why one has become appropriate and why only
> for the first call here.
> 

This was introduced before v10. 

RAS initialization failure should not fail cxl_core probe.

OSfirst AER support was added in this series in this file next to CPER. 
CPER initialization can fail and OSFirst can not is the reason for only 
one log. 

When I look at this block of code I'm drawn to the return value. It looks 
like it should be a void function. Thoughts?

- Terry


> More importantly - if this failed it would previously have resulted
> in cxl_core_init() failing and things getting torn down.
> 
>> +
>> +	cxl_register_proto_err_work(&cxl_proto_err_work);
>> +
>> +	return 0;
>> +}
>> +
>> +void cxl_ras_exit(void)
>> +{
>> +	cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
>> +	cancel_work_sync(&cxl_cper_prot_err_work);
>> +
>> +	cxl_unregister_proto_err_work();
>> +	cancel_work_sync(&cxl_proto_err_work);
>> +}
> 
>
Re: [PATCH v15 4/9] PCI/AER: Dequeue forwarded CXL error
Posted by Jonathan Cameron 3 days, 1 hour ago
On Tue, 3 Feb 2026 11:00:40 -0600
"Bowman, Terry" <terry.bowman@amd.com> wrote:

> On 2/3/2026 9:26 AM, Jonathan Cameron wrote:
> > On Mon, 2 Feb 2026 20:52:39 -0600
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >   
> >> The AER driver now forwards CXL protocol errors to the CXL driver via a
> >> kfifo. The CXL driver must consume these work items and initiate protocol
> >> error handling while ensuring the device's RAS mappings remain valid
> >> throughout processing.
> >>
> >> Implement cxl_proto_err_work_fn() to dequeue work items forwarded by the
> >> AER service driver. Lock the parent CXL Port device to ensure the CXL
> >> device's RAS registers are accessible during handling. Add pdev reference-put
> >> to match reference-get in AER driver. This will ensure pdev access after
> >> kfifo dequeue. These changes apply to CXL Ports and CXL Endpoints.
> >>
> >> Update is_cxl_error() to recognize CXL Port devices with errors.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> Acked-by: Bjorn Helgaas <bhelgaas@google.com>  
> > 
> > There are some small functional changes to existing paths (maybe)
> > that I think need explanations in this commit message.
> > 
> > Otherwise, one suggests small simplification.
> >   
> >> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> >> index 74df561ed32e..a6c0bc6d7203 100644
> >> --- a/drivers/cxl/core/ras.c
> >> +++ b/drivers/cxl/core/ras.c
> >> @@ -118,17 +118,6 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
> >>  }
> >>  static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
> >>  
> >> -int cxl_ras_init(void)
> >> -{
> >> -	return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
> >> -}
> >> -
> >> -void cxl_ras_exit(void)
> >> -{
> >> -	cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
> >> -	cancel_work_sync(&cxl_cper_prot_err_work);
> >> -}
> >> -
> >>  static void cxl_dport_map_ras(struct cxl_dport *dport)
> >>  {
> >>  	struct cxl_register_map *map = &dport->reg_map;
> >> @@ -185,6 +174,50 @@ void devm_cxl_port_ras_setup(struct cxl_port *port)
> >>  }
> >>  EXPORT_SYMBOL_NS_GPL(devm_cxl_port_ras_setup, "CXL");
> >>  
> >> +/*
> >> + * get_cxl_port - Return the parent CXL Port of a PCI device
> >> + * @pdev: PCI device whose parent CXL Port is being queried
> >> + *
> >> + * Looks up and returns the parent CXL Port associated with @pdev. On
> >> + * success, the returned port has its reference count incremented and must
> >> + * be released by the caller. Returns NULL if no associated CXL port is
> >> + * found.
> >> + *
> >> + * Return: Pointer to the parent &struct cxl_port or NULL on failure
> >> + */
> >> +static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
> >> +{
> >> +	switch (pci_pcie_type(pdev)) {
> >> +	case PCI_EXP_TYPE_ROOT_PORT:
> >> +	case PCI_EXP_TYPE_DOWNSTREAM:
> >> +	{
> >> +		struct cxl_dport *dport;
> >> +		struct cxl_port *port = find_cxl_port(&pdev->dev, &dport);  
> > 
> > Can you pass NULL for dport?  Looks like it to me as that ultimately ends
> > up in match_port_by_dport() and 
> > if (ctx->dport)
> > 	*ctx->dport = dport;
> > 
> > where with this as null means ctx->dport == NULL.
> >   
> 
> Yes.
> 
> 
> >> +
> >> +		if (!port) {
> >> +			pci_err(pdev, "Failed to find the CXL device");
> >> +			return NULL;
> >> +		}
> >> +		return port;
> >> +	}
> >> +	case PCI_EXP_TYPE_UPSTREAM:
> >> +	case PCI_EXP_TYPE_ENDPOINT:
> >> +	{
> >> +		struct cxl_port *port = find_cxl_port_by_uport(&pdev->dev);
> >> +
> >> +		if (!port) {
> >> +			pci_err(pdev, "Failed to find the CXL device");
> >> +			return NULL;
> >> +		}
> >> +		return port;
> >> +	}
> >> +	}
> >> +
> >> +	pr_err_ratelimited("%s: Error - Unsupported device type (%#x)",
> >> +			   pci_name(pdev), pci_pcie_type(pdev));
> >> +	return NULL;
> >> +}  
> > 
> >   
> >> +int cxl_ras_init(void)
> >> +{
> >> +	if (cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work))
> >> +		pr_err("Failed to initialize CXL RAS CPER\n");  
> > 
> > Why introduce a new error print?  I don't particularly mind
> > but wasn't obvious to me why one has become appropriate and why only
> > for the first call here.
> >   
> 
> This was introduced before v10. 
> 
> RAS initialization failure should not fail cxl_core probe.
> 
> OSfirst AER support was added in this series in this file next to CPER. 
> CPER initialization can fail and OSFirst can not is the reason for only 
> one log. 
> 
> When I look at this block of code I'm drawn to the return value. It looks 
> like it should be a void function. Thoughts?

I'd return an error code, then at caller decide to not treat that
as a failure case. That gives a clear place to add a print + maybe
a comment that says - yes it's an error, but for 'reasons' we carry on
anyway

Jonathan


> 
> - Terry
> 
> 
> > More importantly - if this failed it would previously have resulted
> > in cxl_core_init() failing and things getting torn down.
> >   
> >> +
> >> +	cxl_register_proto_err_work(&cxl_proto_err_work);
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +void cxl_ras_exit(void)
> >> +{
> >> +	cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
> >> +	cancel_work_sync(&cxl_cper_prot_err_work);
> >> +
> >> +	cxl_unregister_proto_err_work();
> >> +	cancel_work_sync(&cxl_proto_err_work);
> >> +}  
> > 
> >   
> 
>