[v14] Enable CXL PCIe Port Protocol Error handling and logging

[PATCH v14 10/34] PCI/AER: Update is_internal_error() to be non-static is_aer_internal_error()

Posted by Terry Bowman 3 weeks, 4 days ago

The AER driver includes significant logic for handling CXL protocol errors.
The AER driver will be updated in the future to separate the AER and CXL
logic.

Rename the is_internal_error() function to is_aer_internal_error() as it
gives a more precise indication of the purpose. Make is_aer_internal_error()
non-static to allow for other PCI drivers to access.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>

---

Changes in v13->v14:
- New patch
---
 drivers/pci/pcie/aer.c     | 4 ++--
 drivers/pci/pcie/portdrv.h | 9 +++++++++
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 63658e691aa2..2527e8370186 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1166,7 +1166,7 @@ static bool is_cxl_mem_dev(struct pci_dev *dev)
 	return true;
 }
 
-static bool is_internal_error(struct aer_err_info *info)
+bool is_aer_internal_error(struct aer_err_info *info)
 {
 	if (info->severity == AER_CORRECTABLE)
 		return info->status & PCI_ERR_COR_INTERNAL;
@@ -1211,7 +1211,7 @@ static void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 	 * device driver.
 	 */
 	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
-	    is_internal_error(info))
+	    is_aer_internal_error(info))
 		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
 }
 
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index bd29d1cc7b8b..e7a0a2cffea9 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -123,4 +123,13 @@ static inline void pcie_pme_interrupt_enable(struct pci_dev *dev, bool en) {}
 #endif /* !CONFIG_PCIE_PME */
 
 struct device *pcie_port_find_device(struct pci_dev *dev, u32 service);
+
+struct aer_err_info;
+
+#ifdef CONFIG_PCIEAER_CXL
+bool is_aer_internal_error(struct aer_err_info *info);
+#else
+static inline bool is_aer_internal_error(struct aer_err_info *info) { return false; }
+#endif /* CONFIG_PCIEAER_CXL */
+
 #endif /* _PORTDRV_H_ */
-- 
2.34.1
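
For readers without the kernel tree at hand, the check performed by the renamed helper can be sketched in user space. This is a minimal sketch, not the kernel source: the struct is a trimmed, hypothetical stand-in for the kernel's struct aer_err_info, the constant values are assumed to match the kernel's PCI register definitions and AER severity encoding, and the uncorrectable branch is an assumption because the hunk above ends before that part of the function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for kernel definitions; values are assumed, not taken from the patch. */
#define AER_CORRECTABLE		2		/* assumed severity encoding */
#define PCI_ERR_COR_INTERNAL	0x00004000u	/* Corrected Internal Error Status */
#define PCI_ERR_UNC_INTN	0x00400000u	/* Uncorrectable Internal Error Status */

/* Trimmed, hypothetical version of the kernel's struct aer_err_info. */
struct aer_err_info {
	int severity;		/* AER_CORRECTABLE or an uncorrectable severity */
	uint32_t status;	/* raw error status register value */
};

bool is_aer_internal_error(const struct aer_err_info *info)
{
	if (info->severity == AER_CORRECTABLE)
		return info->status & PCI_ERR_COR_INTERNAL;

	/* Assumed: any other severity is checked against the uncorrectable bit. */
	return info->status & PCI_ERR_UNC_INTN;
}
```

The predicate only inspects the severity and status fields, so it gives the same answer whether or not any CXL code is built.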

Re: [PATCH v14 10/34] PCI/AER: Update is_internal_error() to be non-static is_aer_internal_error()

Posted by Bjorn Helgaas 2 weeks, 3 days ago

On Wed, Jan 14, 2026 at 12:20:31PM -0600, Terry Bowman wrote:
> The AER driver includes significant logic for handling CXL protocol errors.
> The AER driver will be updated in the future to separate the AER and CXL
> logic.
> 
> Rename the is_internal_error() function to is_aer_internal_error() as it
> gives a more precise indication of the purpose. Make is_aer_internal_error()
> non-static to allow for other PCI drivers to access.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

Personally I would put "aer_" at the beginning, i.e.,
"aer_is_internal_error()" to match other AER functions.
But either is OK.

> ---
> 
> Changes in v13->v14:
> - New patch
> ---
>  drivers/pci/pcie/aer.c     | 4 ++--
>  drivers/pci/pcie/portdrv.h | 9 +++++++++
>  2 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 63658e691aa2..2527e8370186 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1166,7 +1166,7 @@ static bool is_cxl_mem_dev(struct pci_dev *dev)
>  	return true;
>  }
>  
> -static bool is_internal_error(struct aer_err_info *info)
> +bool is_aer_internal_error(struct aer_err_info *info)
>  {
>  	if (info->severity == AER_CORRECTABLE)
>  		return info->status & PCI_ERR_COR_INTERNAL;
> @@ -1211,7 +1211,7 @@ static void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  	 * device driver.
>  	 */
>  	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> -	    is_internal_error(info))
> +	    is_aer_internal_error(info))
>  		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
>  }
>  
> diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
> index bd29d1cc7b8b..e7a0a2cffea9 100644
> --- a/drivers/pci/pcie/portdrv.h
> +++ b/drivers/pci/pcie/portdrv.h
> @@ -123,4 +123,13 @@ static inline void pcie_pme_interrupt_enable(struct pci_dev *dev, bool en) {}
>  #endif /* !CONFIG_PCIE_PME */
>  
>  struct device *pcie_port_find_device(struct pci_dev *dev, u32 service);
> +
> +struct aer_err_info;
> +
> +#ifdef CONFIG_PCIEAER_CXL
> +bool is_aer_internal_error(struct aer_err_info *info);
> +#else
> +static inline bool is_aer_internal_error(struct aer_err_info *info) { return false; }
> +#endif /* CONFIG_PCIEAER_CXL */
> +
>  #endif /* _PORTDRV_H_ */
> -- 
> 2.34.1
>

Re: [PATCH v14 10/34] PCI/AER: Update is_internal_error() to be non-static is_aer_internal_error()

Posted by dan.j.williams@intel.com 2 weeks, 5 days ago

Terry Bowman wrote:
> The AER driver includes significant logic for handling CXL protocol errors.
> The AER driver will be updated in the future to separate the AER and CXL
> logic.
> 
> Rename the is_internal_error() function to is_aer_internal_error() as it
> gives a more precise indication of the purpose. Make is_aer_internal_error()
> non-static to allow for other PCI drivers to access.

Not even sure this rename is needed given that it is private to
drivers/pci/pcie/ and the sharing is only for cxl_{rch,vh}.c, not for
"other PCI drivers". Consistent with the idea that internal errors are
not going to become a first-class citizen let us keep this a CXL-only
consideration.

I'll update the changelog to drop the "other PCI drivers" comment.

Re: [PATCH v14 10/34] PCI/AER: Update is_internal_error() to be non-static is_aer_internal_error()

Posted by Bowman, Terry 2 weeks, 5 days ago

On 1/19/2026 8:20 PM, dan.j.williams@intel.com wrote:
> Terry Bowman wrote:
>> The AER driver includes significant logic for handling CXL protocol errors.
>> The AER driver will be updated in the future to separate the AER and CXL
>> logic.
>>
>> Rename the is_internal_error() function to is_aer_internal_error() as it
>> gives a more precise indication of the purpose. Make is_aer_internal_error()
>> non-static to allow for other PCI drivers to access.
> 
> Not even sure this rename is needed given that it is private to
> drivers/pci/pcie/ and the sharing is only for cxl_{rch,vh}.c, not for
> "other PCI drivers". Consistent with the idea that internal errors are
> not going to become a first-class citizen let us keep this a CXL-only
> consideration.
> 
> I'll update the changelog to drop the "other PCI drivers" comment.

The name choice was addressed by Bjorn here:

https://lore.kernel.org/linux-cxl/20251208180624.GA3300935@bhelgaas/ 

Terry

Re: [PATCH v14 10/34] PCI/AER: Update is_internal_error() to be non-static is_aer_internal_error()

Posted by dan.j.williams@intel.com 2 weeks, 5 days ago

Bowman, Terry wrote:
> On 1/19/2026 8:20 PM, dan.j.williams@intel.com wrote:
> > Terry Bowman wrote:
> >> The AER driver includes significant logic for handling CXL protocol errors.
> >> The AER driver will be updated in the future to separate the AER and CXL
> >> logic.
> >>
> >> Rename the is_internal_error() function to is_aer_internal_error() as it
> >> gives a more precise indication of the purpose. Make is_aer_internal_error()
> >> non-static to allow for other PCI drivers to access.
> > 
> > Not even sure this rename is needed given that it is private to
> > drivers/pci/pcie/ and the sharing is only for cxl_{rch,vh}.c, not for
> > "other PCI drivers". Consistent with the idea that internal errors are
> > not going to become a first-class citizen let us keep this a CXL-only
> > consideration.
> > 
> > I'll update the changelog to drop the "other PCI drivers" comment.
> 
> The name choice was addressed by Bjorn here:
> 
> https://lore.kernel.org/linux-cxl/20251208180624.GA3300935@bhelgaas/ 

Thanks, yes, I only folded in the following changes to the changelog:

10:  417535d35e9f ! 11:  098f14e1d884 PCI/AER: Update is_internal_error() to be non-static is_aer_internal_error()
    @@ Commit message
         logic.
     
         Rename the is_internal_error() function to is_aer_internal_error() as it
    -    gives a more precise indication of the purpose. Make is_aer_internal_error()
    -    non-static to allow for other PCI drivers to access.
    +    gives a more precise indication of the purpose. Make
    +    is_aer_internal_error() non-static to allow for the 2 different CXL
    +    topology error model implementations (RCH and VH) to share this helper.
     
         Signed-off-by: Terry Bowman <terry.bowman@amd.com>
    -
    -    ---
    -
    -    Changes in v13->v14:
    -    - New patch
    +    Link: https://patch.msgid.link/20260114182055.46029-11-terry.bowman@amd.com
    +    Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Re: [PATCH v14 10/34] PCI/AER: Update is_internal_error() to be non-static is_aer_internal_error()

Posted by Jonathan Cameron 3 weeks, 4 days ago

On Wed, 14 Jan 2026 12:20:31 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> The AER driver includes significant logic for handling CXL protocol errors.
> The AER driver will be updated in the future to separate the AER and CXL
> logic.
> 
> Rename the is_internal_error() function to is_aer_internal_error() as it
> gives a more precise indication of the purpose. Make is_aer_internal_error()
> non-static to allow for other PCI drivers to access.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Hi Terry,

I don't see it as sensible to have is_aer_internal_error()
return false if CXL is not built. That question has nothing to
do with CXL.  Hence if we are doing generic naming, I think we
should just always have the function available.  Gating on CXL
belongs at whatever called it.  Which is the case already for
cxl_rch_handle_error() which has a stub that doesn't call this for
when CXL stuff isn't built.

Should just be a case of moving it out of the ifdef in aer.c
as part of this patch.

Jonathan

> 
> ---
> 
> Changes in v13->v14:
> - New patch
> ---
>  drivers/pci/pcie/aer.c     | 4 ++--
>  drivers/pci/pcie/portdrv.h | 9 +++++++++
>  2 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 63658e691aa2..2527e8370186 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1166,7 +1166,7 @@ static bool is_cxl_mem_dev(struct pci_dev *dev)
>  	return true;
>  }
>  
> -static bool is_internal_error(struct aer_err_info *info)
> +bool is_aer_internal_error(struct aer_err_info *info)
>  {
>  	if (info->severity == AER_CORRECTABLE)
>  		return info->status & PCI_ERR_COR_INTERNAL;
> @@ -1211,7 +1211,7 @@ static void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  	 * device driver.
>  	 */
>  	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> -	    is_internal_error(info))
> +	    is_aer_internal_error(info))
>  		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
>  }
>  
> diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
> index bd29d1cc7b8b..e7a0a2cffea9 100644
> --- a/drivers/pci/pcie/portdrv.h
> +++ b/drivers/pci/pcie/portdrv.h
> @@ -123,4 +123,13 @@ static inline void pcie_pme_interrupt_enable(struct pci_dev *dev, bool en) {}
>  #endif /* !CONFIG_PCIE_PME */
>  
>  struct device *pcie_port_find_device(struct pci_dev *dev, u32 service);
> +
> +struct aer_err_info;
> +
> +#ifdef CONFIG_PCIEAER_CXL
> +bool is_aer_internal_error(struct aer_err_info *info);
> +#else
> +static inline bool is_aer_internal_error(struct aer_err_info *info) { return false; }

This seems odd.  It's either an AER internal error or it isn't, whether
or not CXL is enabled. That stubbing out should I think go up to the
caller that can decide whether it cares or not.

> +#endif /* CONFIG_PCIEAER_CXL */
> +
>  #endif /* _PORTDRV_H_ */
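
The alternative raised in this review, keeping the predicate unconditional and stubbing out only the CXL caller, could look roughly like the sketch below. It is a user-space illustration under stated assumptions: SKETCH_CXL_ENABLED stands in for CONFIG_PCIEAER_CXL, the struct and constant values are simplified stand-ins for the kernel definitions, and the cxl_forwarded counter is a hypothetical replacement for the pcie_walk_rcec() call.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for kernel definitions (assumed values, simplified types). */
#define AER_CORRECTABLE		2
#define PCI_ERR_COR_INTERNAL	0x00004000u
#define PCI_ERR_UNC_INTN	0x00400000u

struct aer_err_info {
	int severity;
	uint32_t status;
};

/* Built unconditionally: the answer does not depend on CXL support. */
bool is_aer_internal_error(const struct aer_err_info *info)
{
	if (info->severity == AER_CORRECTABLE)
		return info->status & PCI_ERR_COR_INTERNAL;
	return info->status & PCI_ERR_UNC_INTN;
}

/* Hypothetical counter standing in for forwarding to the CXL handlers. */
int cxl_forwarded;

#ifdef SKETCH_CXL_ENABLED	/* stand-in for CONFIG_PCIEAER_CXL */
void cxl_rch_handle_error(const struct aer_err_info *info)
{
	if (is_aer_internal_error(info))
		cxl_forwarded++;
}
#else
/* The gating lives here: without CXL, the caller is a no-op. */
static inline void cxl_rch_handle_error(const struct aer_err_info *info)
{
	(void)info;
}
#endif
```

With SKETCH_CXL_ENABLED undefined, is_aer_internal_error() still reports the error truthfully while cxl_rch_handle_error() does nothing, which is the separation of concerns argued for above.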

Re: [PATCH v14 10/34] PCI/AER: Update is_internal_error() to be non-static is_aer_internal_error()

Posted by dan.j.williams@intel.com 3 weeks, 3 days ago

Jonathan Cameron wrote:
> On Wed, 14 Jan 2026 12:20:31 -0600
> Terry Bowman <terry.bowman@amd.com> wrote:
> 
> > The AER driver includes significant logic for handling CXL protocol errors.
> > The AER driver will be updated in the future to separate the AER and CXL
> > logic.
> > 
> > Rename the is_internal_error() function to is_aer_internal_error() as it
> > gives a more precise indication of the purpose. Make is_aer_internal_error()
> > non-static to allow for other PCI drivers to access.
> > 
> > Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Hi Terry,
> 
> I don't see it as sensible to have is_aer_internal_error()
> return false if CXL is not built. That question has nothing to
> do with CXL.  Hence if we are doing generic naming, I think we
> should just always have the function available.  Gating on CXL
> belongs at whatever called it.  Which is the case already for
> cxl_rch_handle_error() which has a stub that doesn't call this for
> when CXL stuff isn't built.
> 
> Should just be a case of moving it out of the ifdef in aer.c
> as part of this patch.

I agree with the general sentiment, but not the conclusion, especially
because this is a private detail. Linux has long ignored internal
errors. The only reason to consider them now is because CXL decided to
multiplex its error model on top of this oft-ignored feature of PCIe
AER.

Specifically, portdrv.h is not in the global include namespace, this is
a private detail of the only consumer of internal errors:
drivers/pci/pcie/aer_cxl_{rch,vh}.c

At most we should have this as a comment to clarify:

/*
 * Note, internal errors are only considered for the CXL error model,
 * not for other implementations.
 */

...and the pci_aer_unmask_internal_errors() export should be:

EXPORT_SYMBOL_FOR_MODULES(pci_aer_unmask_internal_errors, "cxl_core")

...for the same reason. Steer folks away from thinking that it is open
season for adding more internal error support.

Re: [PATCH v14 10/34] PCI/AER: Update is_internal_error() to be non-static is_aer_internal_error()

Posted by Lukas Wunner 2 weeks, 3 days ago

On Thu, Jan 15, 2026 at 12:42:36PM -0800, dan.j.williams@intel.com wrote:
> I agree with the general sentiment, but not the conclusion, especially
> because this is a private detail. Linux has long ignored internal
> errors. The only reason to consider them now is because CXL decided to
> multiplex its error model on top of this oft-ignored feature of PCIe
> AER.
> 
> Specifically, portdrv.h is not in the global include namespace, this is
> a private detail of the only consumer of internal errors:
> drivers/pci/pcie/aer_cxl_{rch,vh}.c
> 
> At most we should have this as a comment to clarify:
> 
> /*
>  * Note, internal errors are only considered for the CXL error model,
>  * not for other implementations.
>  */
> 
> ...and the pci_aer_unmask_internal_errors() export should be:
> 
> EXPORT_SYMBOL_FOR_MODULES(pci_aer_unmask_internal_errors, "cxl_core")
> 
> ...for the same reason. Steer folks away from thinking that it is open
> season for adding more internal error support.

It's not like Internal Errors are a bad thing per se.  They're a way
to signal "other" errors besides the spec-defined ones.

As an example, and I'm keeping this in general terms to avoid divulging
information about future products, a device possessing ECC RAM may raise
a Correctable Internal Error when ECC successfully recovers from flipped
bits because it allows alerting the user in advance that the device might
need to be replaced in the near future.  If ECC recovery fails, the device
might try to use a reserved spare portion of RAM in lieu of the failing one
and instruct the AER driver to recover through a bus reset.  Such errors
are not covered by the spec-defined types.  Using the Internal Error type
is the only possibility it seems.

My point is, there are valid (upcoming, not theoretical) use cases for
Internal Errors and creating infrastructure in the kernel to take advantage
of them is a good thing.  Hence my continued pushing back on hiding or
discouraging their use.

Thanks,

Lukas
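
The flow described in this message can be sketched as a small decision function. Everything here is hypothetical: the enum names, the ecc_event type, and the policy itself are illustrative assumptions rather than an existing kernel or driver API, and the "no spare RAM left" outcome is an added assumption that the message does not discuss.

```c
#include <stdbool.h>

/* Hypothetical outcomes a driver might choose; not kernel identifiers. */
enum ecc_action {
	ECC_ACTION_LOG_ONLY,	/* corrected: warn that the device may need replacing */
	ECC_ACTION_BUS_RESET,	/* uncorrected, spare RAM swapped in: reset to recover */
	ECC_ACTION_FAIL_DEVICE,	/* assumed fallback: uncorrected and no spare RAM left */
};

/* Hypothetical description of one device-internal ECC event. */
struct ecc_event {
	bool corrected;		/* ECC recovered the flipped bits */
	bool spare_available;	/* a reserved spare region can replace the bad one */
};

enum ecc_action ecc_internal_error_action(const struct ecc_event *ev)
{
	if (ev->corrected)
		return ECC_ACTION_LOG_ONLY;

	return ev->spare_available ? ECC_ACTION_BUS_RESET
				   : ECC_ACTION_FAIL_DEVICE;
}
```

In the scenario above, the first outcome would correspond to a Correctable Internal Error and the second to an uncorrectable one that asks the AER driver for a bus reset.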

Re: [PATCH v14 10/34] PCI/AER: Update is_internal_error() to be non-static is_aer_internal_error()

Posted by dan.j.williams@intel.com 2 weeks, 3 days ago

Lukas Wunner wrote:
> On Thu, Jan 15, 2026 at 12:42:36PM -0800, dan.j.williams@intel.com wrote:
> > I agree with the general sentiment, but not the conclusion, especially
> > because this is a private detail. Linux has long ignored internal
> > errors. The only reason to consider them now is because CXL decided to
> > multiplex its error model on top of this oft-ignored feature of PCIe
> > AER.
> > 
> > Specifically, portdrv.h is not in the global include namespace, this is
> > a private detail of the only consumer of internal errors:
> > drivers/pci/pcie/aer_cxl_{rch,vh}.c
> > 
> > At most we should have this as a comment to clarify:
> > 
> > /*
> >  * Note, internal errors are only considered for the CXL error model,
> >  * not for other implementations.
> >  */
> > 
> > ...and the pci_aer_unmask_internal_errors() export should be:
> > 
> > EXPORT_SYMBOL_FOR_MODULES(pci_aer_unmask_internal_errors, "cxl_core")
> > 
> > ...for the same reason. Steer folks away from thinking that it is open
> > season for adding more internal error support.
> 
> It's not like Internal Errors are a bad thing per se.  They're a way
> to signal "other" errors besides the spec-defined ones.
> 
> As an example, and I'm keeping this in general terms to avoid divulging
> information about future products, a device possessing ECC RAM may raise
> a Correctable Internal Error when ECC successfully recovers from flipped
> bits because it allows alerting the user in advance that the device might
> need to be replaced in the near future.  If ECC recovery fails, the device
> might try to use a reserved spare portion of RAM in lieu of the failing one
> and instruct the AER driver to recover through a bus reset.  Such errors
> are not covered by the spec-defined types.  Using the Internal Error type
> is the only possibility it seems.

The Internal Error type is a poor fit for that. This ECC RAM scenario is simply
an internal device event, not a PCIe visible error case. Consider that CXL
Memory Expanders are nothing if not "devices possessing ECC RAM" that may
encounter correctable errors in that RAM. Yes, the user has need for those
correctable errors to be reported, and no, PCIe AER has no reason to care about
conveying those reports. CXL bypasses AER for internal ECC RAM events.

PCIe AER only notices device-internal ECC RAM events in the case where a PCIe
transaction encounters an error. For example, a completer abort attempting to
pull from bad RAM.

So if CXL saw no need to architect internal ECC events into AER, why does Xe
think it is special in this regard?

The CXL solution is simply a typical device interrupt that notifies new entries
in the device event log. See trace_cxl_dram() and trace_cxl_general_media() for
that event handling.

> My point is, there are valid (upcoming, not theoretical) use cases for
> Internal Errors and creating infrastructure in the kernel to take advantage
> of them is a good thing.  Hence my continued pushing back on hiding or
> discouraging their use.

It is fine to look ahead, but I would not go so far as to pull in future
requirements into a present patch set. Especially when those future
requirements are suspect.

Re: [PATCH v14 10/34] PCI/AER: Update is_internal_error() to be non-static is_aer_internal_error()

Posted by Lukas Wunner 2 weeks, 3 days ago

On Thu, Jan 22, 2026 at 11:09:48AM -0800, dan.j.williams@intel.com wrote:
> Lukas Wunner wrote:
> > a device possessing ECC RAM may raise
> > a Correctable Internal Error when ECC successfully recovers from flipped
> > bits because it allows alerting the user in advance that the device might
> > need to be replaced in the near future.  If ECC recovery fails, the device
> > might try to use a reserved spare portion of RAM in lieu of the failing one
> > and instruct the AER driver to recover through a bus reset.  Such errors
> > are not covered by the spec-defined types.  Using the Internal Error type
> > is the only possibility it seems.
> 
> The Internal Error type is a poor fit for that. This ECC RAM scenario is
> simply an internal device event, not a PCIe visible error case. Consider
> that CXL Memory Expanders are nothing if not "devices possessing ECC RAM"
> that may encounter correctable errors in that RAM. Yes, the user has need
> for those correctable errors to be reported, and no, PCIe AER has no reason
> to care about conveying those reports.

I'm not aware of a better PCIe spec-defined mechanism to report such
errors besides AER (Advanced Error *Reporting*), so I'm not sure why
you consider it a poor fit.

However, reporting corrected ECC errors is only half of the equation.
As stated above, if the ECC error is not correctable, the device may
choose to replace the faulty memory region with reserved spare memory,
but then a reset is required to recover from the error.  Precisely what
the AER driver provides, so again I'm not sure why it's a poor fit.

> So if CXL saw no need to architect internal ECC events into AER, why does Xe
> think it is special in this regard?

The most charitable interpretation is that it's just the first mover
and others will follow.  Well actually CXL is the first mover. ;)

> The CXL solution is simply a typical device interrupt that notifies
> new entries in the device event log. See trace_cxl_dram() and
> trace_cxl_general_media() for that event handling.

This seems to be based on CPER, which is not part of the PCIe Base Spec.
I can only guess that xe devices are intended to be used on non-ACPI
platforms as well, which may have led to the decision to use a
PCIe spec-defined mechanism.

Thanks,

Lukas

Re: [PATCH v14 10/34] PCI/AER: Update is_internal_error() to be non-static is_aer_internal_error()

Posted by dan.j.williams@intel.com 2 weeks, 2 days ago

Lukas Wunner wrote:
> On Thu, Jan 22, 2026 at 11:09:48AM -0800, dan.j.williams@intel.com wrote:
> > Lukas Wunner wrote:
> > > a device possessing ECC RAM may raise
> > > a Correctable Internal Error when ECC successfully recovers from flipped
> > > bits because it allows alerting the user in advance that the device might
> > > need to be replaced in the near future.  If ECC recovery fails, the device
> > > might try to use a reserved spare portion of RAM in lieu of the failing one
> > > and instruct the AER driver to recover through a bus reset.  Such errors
> > > are not covered by the spec-defined types.  Using the Internal Error type
> > > is the only possibility it seems.
> > 
> > The Internal Error type is a poor fit for that. This ECC RAM scenario is
> > simply an internal device event, not a PCIe visible error case. Consider
> > that CXL Memory Expanders are nothing if not "devices possessing ECC RAM"
> > that may encounter correctable errors in that RAM. Yes, the user has need
> > for those correctable errors to be reported, and no, PCIe AER has no reason
> > to care about conveying those reports.
> 
> I'm not aware of a better PCIe spec-defined mechanism to report such
> errors besides AER (Advanced Error *Reporting*), so I'm not sure why
> you consider it a poor fit.

PCIe spec has no role defining the internal error model of devices.
Linux has reason to not endorse a blurring of the lines of where the
PCIe error model ends and the device-specific error model begins. CXL
respects those boundaries, Xe is pushing the boundary.

> However, reporting corrected ECC errors is only half of the equation.
> As stated above, if the ECC error is not correctable, the device may
> choose to replace the faulty memory region with reserved spare memory,
> but then a reset is required to recover from the error.  Precisely what
> the AER driver provides, so again I'm not sure why it's a poor fit.

Again CXL has a model for this, those are the "post-package repair"
events handled internally to the device / driver either transparently or
user coordinated. No AER needed. In general devices have plenty of
reasons that the driver determines they need to be reset, they do not
need AER core help to reset themselves on error.

AER is there for link recovery.

> > So if CXL saw no need to architect internal ECC events into AER, why does Xe
> > think it is special in this regard?
> 
> The most charitable interpretation is that it's just the first mover
> and others will follow.  Well actually CXL is the first mover. ;)

...first mover that helps clarify the role of AER that just happens to
match the status quo that the PCIe AER core ignores internal errors.

> > The CXL solution is simply a typical device interrupt that notifies
> > new entries in the device event log. See trace_cxl_dram() and
> > trace_cxl_general_media() for that event handling.
> 
> This seems to be based on CPER, which is not part of the PCIe Base Spec.
> I can only guess that xe devices are intended to be used on non-ACPI
> platforms as well, which may have led to the decision to use a
> PCIe spec-defined mechanism.

CPER is a compatibility hack for operating systems that do not have native
CXL drivers. The native support is just an interrupt fronting an event
log retrieved with mailbox commands.

Re: [PATCH v14 10/34] PCI/AER: Update is_internal_error() to be non-static is_aer_internal_error()

Posted by Jonathan Cameron 2 weeks, 2 days ago

On Thu, 22 Jan 2026 13:32:08 -0800
dan.j.williams@intel.com wrote:

> Lukas Wunner wrote:
> > On Thu, Jan 22, 2026 at 11:09:48AM -0800, dan.j.williams@intel.com wrote:  
> > > Lukas Wunner wrote:  
> > > > a device possessing ECC RAM may raise
> > > > a Correctable Internal Error when ECC successfully recovers from flipped
> > > > bits because it allows alerting the user in advance that the device might
> > > > need to be replaced in the near future.  If ECC recovery fails, the device
> > > > might try to use a reserved spare portion of RAM in lieu of the failing one
> > > > and instruct the AER driver to recover through a bus reset.  Such errors
> > > > are not covered by the spec-defined types.  Using the Internal Error type
> > > > is the only possibility it seems.  
> > > 
> > > The Internal Error type is a poor fit for that. This ECC RAM scenario is
> > > simply an internal device event, not a PCIe visible error case. Consider
> > > that CXL Memory Expanders are nothing if not "devices possessing ECC RAM"
> > > that may encounter correctable errors in that RAM. Yes, the user has need
> > > for those correctable errors to be reported, and no, PCIe AER has no reason
> > > to care about conveying those reports.  
> > 
> > I'm not aware of a better PCIe spec-defined mechanism to report such
> > errors besides AER (Advanced Error *Reporting*), so I'm not sure why
> > you consider it a poor fit.  
> 
> PCIe spec has no role defining the internal error model of devices.
> Linux has reason to not endorse a blurring of the lines of where the
> PCIe error model ends and the device-specific error model begins. CXL
> respects those boundaries, Xe is pushing the boundary.

FWIW we have a bunch of older hardware where we could report this sort
of error either via AER or via an MSI. After some push back years
ago, we flipped them all to the MSI path. That includes stuff that
triggers device resets.  I don't think it caused us too much trouble
to make that switch.

> 
> > However, reporting corrected ECC errors is only half of the equation.
> > As stated above, if the ECC error is not correctable, the device may
> > choose to replace the faulty memory region with reserved spare memory,
> > but then a reset is required to recover from the error.  Precisely what
> > the AER driver provides, so again I'm not sure why it's a poor fit.  
> 
> Again CXL has a model for this, those are the "post-package repair"
> events handled internally to the device / driver either transparently or
> user coordinated. No AER needed. In general devices have plenty of
> reasons that the driver determines they need to be reset, they do not
> need AER core help to reset themselves on error.
> 
> AER is there for link recovery.
> 
> > > So if CXL saw no need to architect internal ECC events into AER, why does Xe
> > > think it is special in this regard?  
> > 
> > The most charitable interpretation is that it's just the first mover
> > and others will follow.  Well actually CXL is the first mover. ;)  
> 
> ...first mover that helps clarify the role of AER that just happens to
> match the status quo that the PCIe AER core ignores internal errors.
> 
> > > The CXL solution is simply a typical device interrupt that notifies
> > > new entries in the device event log. See trace_cxl_dram() and
> > > trace_cxl_general_media() for that event handling.  
> > 
> > This seems to be based on CPER, which is not part of the PCIe Base Spec.
> > I can only guess that xe devices are intended to be used on non-ACPI
> > platforms as well, which may have led to the decision to use a
> > PCIe spec-defined mechanism.  
> 
> CPER is a compatibility hack for operating systems that do not have native
> CXL drivers. The native support is just an interrupt fronting an event
> log retrieved with mailbox commands.
Just as a side note, CXL also has FW-specific interrupts, with a negotiation
process for whether they are used or MSI-X is used for event queues.

Jonathan