[v1] Enable CXL PCIe port protocol error handling and logging

[PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver

Posted by Terry Bowman 1 year, 4 months ago

CXL protocol errors are reported to the OS through PCIe correctable and
uncorrectable internal errors. However, since CXL PCIe port devices
are currently bound to the portdrv driver, there is no mechanism to
notify the CXL driver, which is necessary for proper logging and
handling.

To address this, introduce CXL PCIe port error callbacks along with
register/unregister and accessor functions. The callbacks will be
invoked by the AER driver in the case protocol errors are reported by
a CXL port device.

The AER driver callbacks will be used in future patches implementing
CXL PCIe port error handling.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/pci/pcie/aer.c | 22 ++++++++++++++++++++++
 include/linux/aer.h    | 14 ++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 13b8586924ea..a9792b9576b4 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -50,6 +50,8 @@ struct aer_rpc {
 	DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
 };
 
+static struct cxl_port_err_hndlrs cxl_port_hndlrs;
+
 /* AER stats for the device */
 struct aer_stats {
 
@@ -1078,6 +1080,26 @@ static inline void cxl_rch_handle_error(struct pci_dev *dev,
 					struct aer_err_info *info) { }
 #endif
 
+void register_cxl_port_hndlrs(struct cxl_port_err_hndlrs *_cxl_port_hndlrs)
+{
+	cxl_port_hndlrs.error_detected = _cxl_port_hndlrs->error_detected;
+	cxl_port_hndlrs.cor_error_detected = _cxl_port_hndlrs->cor_error_detected;
+}
+EXPORT_SYMBOL_NS_GPL(register_cxl_port_hndlrs, CXL);
+
+void unregister_cxl_port_hndlrs(void)
+{
+	cxl_port_hndlrs.error_detected = NULL;
+	cxl_port_hndlrs.cor_error_detected = NULL;
+}
+EXPORT_SYMBOL_NS_GPL(unregister_cxl_port_hndlrs, CXL);
+
+struct cxl_port_err_hndlrs *find_cxl_port_hndlrs(void)
+{
+	return &cxl_port_hndlrs;
+}
+EXPORT_SYMBOL_NS_GPL(find_cxl_port_hndlrs, CXL);
+
 /**
  * pci_aer_handle_error - handle logging error into an event log
  * @dev: pointer to pci_dev data structure of error source device
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 4b97f38f3fcf..67fd04c5ae2b 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -10,6 +10,7 @@
 
 #include <linux/errno.h>
 #include <linux/types.h>
+#include <linux/pci.h>
 
 #define AER_NONFATAL			0
 #define AER_FATAL			1
@@ -55,5 +56,18 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 int cper_severity_to_aer(int cper_severity);
 void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
 		       int severity, struct aer_capability_regs *aer_regs);
+
+struct cxl_port_err_hndlrs {
+
+	/* CXL uncorrectable error detected on this device */
+	pci_ers_result_t (*error_detected)(struct pci_dev *dev,
+					   pci_channel_state_t error);
+
+	/* CXL corrected error detected on this device */
+	void (*cor_error_detected)(struct pci_dev *dev);
+};
+void register_cxl_port_hndlrs(struct cxl_port_err_hndlrs *_cxl_port_hndlrs);
+void unregister_cxl_port_hndlrs(void);
+struct cxl_port_err_hndlrs *find_cxl_port_hndlrs(void);
 #endif //_AER_H_
 
-- 
2.34.1

Re: [PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver

Posted by Dan Williams 1 year, 3 months ago

Terry Bowman wrote:
> CXL protocol errors are reported to the OS through PCIe correctable and
> uncorrectable internal errors. However, since CXL PCIe port devices
> are currently bound to the portdrv driver, there is no mechanism to
> notify the CXL driver, which is necessary for proper logging and
> handling.
> 
> To address this, introduce CXL PCIe port error callbacks along with
> register/unregister and accessor functions. The callbacks will be
> invoked by the AER driver in the case protocol errors are reported by
> a CXL port device.
> 
> The AER driver callbacks will be used in future patches implementing
> CXL PCIe port error handling.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/pci/pcie/aer.c | 22 ++++++++++++++++++++++
>  include/linux/aer.h    | 14 ++++++++++++++
>  2 files changed, 36 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 13b8586924ea..a9792b9576b4 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -50,6 +50,8 @@ struct aer_rpc {
>  	DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
>  };
>  
> +static struct cxl_port_err_hndlrs cxl_port_hndlrs;

I think this can afford to splurge on a few more letters and make this

static struct cxl_port_error_handlers cxl_port_error_handlers;


> +
>  /* AER stats for the device */
>  struct aer_stats {
>  
> @@ -1078,6 +1080,26 @@ static inline void cxl_rch_handle_error(struct pci_dev *dev,
>  					struct aer_err_info *info) { }
>  #endif
>  
> +void register_cxl_port_hndlrs(struct cxl_port_err_hndlrs *_cxl_port_hndlrs)
> +{
> +	cxl_port_hndlrs.error_detected = _cxl_port_hndlrs->error_detected;
> +	cxl_port_hndlrs.cor_error_detected = _cxl_port_hndlrs->cor_error_detected;
> +}
> +EXPORT_SYMBOL_NS_GPL(register_cxl_port_hndlrs, CXL);
> +
> +void unregister_cxl_port_hndlrs(void)
> +{
> +	cxl_port_hndlrs.error_detected = NULL;
> +	cxl_port_hndlrs.cor_error_detected = NULL;
> +}
> +EXPORT_SYMBOL_NS_GPL(unregister_cxl_port_hndlrs, CXL);
> +
> +struct cxl_port_err_hndlrs *find_cxl_port_hndlrs(void)
> +{
> +	return &cxl_port_hndlrs;
> +}
> +EXPORT_SYMBOL_NS_GPL(find_cxl_port_hndlrs, CXL);

I guess I will need to go deeper into the code, but I would not have
expected that new registration interfaces are needed. Each 'struct
pci_driver' could optionally include CXL error handlers alongside their
PCIe error handlers and when CXL AER errors are broadcast only the CXL
handlers are invoked. I.e. the registration is something like:

diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
index 6af5e0425872..42db26195bda 100644
--- a/drivers/pci/pcie/portdrv.c
+++ b/drivers/pci/pcie/portdrv.c
@@ -793,6 +793,7 @@ static struct pci_driver pcie_portdriver = {
        .shutdown       = pcie_portdrv_shutdown,
 
        .err_handler    = &pcie_portdrv_err_handler,
+       .cxl_err_handler = &cxl_portdrv_err_handler,
 
        .driver_managed_dma = true,

Re: [PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver

Posted by Terry Bowman 1 year, 3 months ago

Hi Dan,

On 10/21/24 20:53, Dan Williams wrote:
> Terry Bowman wrote:
>> CXL protocol errors are reported to the OS through PCIe correctable and
>> uncorrectable internal errors. However, since CXL PCIe port devices
>> are currently bound to the portdrv driver, there is no mechanism to
>> notify the CXL driver, which is necessary for proper logging and
>> handling.
>>
>> To address this, introduce CXL PCIe port error callbacks along with
>> register/unregister and accessor functions. The callbacks will be
>> invoked by the AER driver in the case protocol errors are reported by
>> a CXL port device.
>>
>> The AER driver callbacks will be used in future patches implementing
>> CXL PCIe port error handling.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>>  drivers/pci/pcie/aer.c | 22 ++++++++++++++++++++++
>>  include/linux/aer.h    | 14 ++++++++++++++
>>  2 files changed, 36 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 13b8586924ea..a9792b9576b4 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -50,6 +50,8 @@ struct aer_rpc {
>>  	DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
>>  };
>>  
>> +static struct cxl_port_err_hndlrs cxl_port_hndlrs;
> 
> I think this can afford to splurge on a few more letters and make this
> 
> static struct cxl_port_error_handlers cxl_port_error_handlers;
> 
> 

Ok.

>> +
>>  /* AER stats for the device */
>>  struct aer_stats {
>>  
>> @@ -1078,6 +1080,26 @@ static inline void cxl_rch_handle_error(struct pci_dev *dev,
>>  					struct aer_err_info *info) { }
>>  #endif
>>  
>> +void register_cxl_port_hndlrs(struct cxl_port_err_hndlrs *_cxl_port_hndlrs)
>> +{
>> +	cxl_port_hndlrs.error_detected = _cxl_port_hndlrs->error_detected;
>> +	cxl_port_hndlrs.cor_error_detected = _cxl_port_hndlrs->cor_error_detected;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(register_cxl_port_hndlrs, CXL);
>> +
>> +void unregister_cxl_port_hndlrs(void)
>> +{
>> +	cxl_port_hndlrs.error_detected = NULL;
>> +	cxl_port_hndlrs.cor_error_detected = NULL;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(unregister_cxl_port_hndlrs, CXL);
>> +
>> +struct cxl_port_err_hndlrs *find_cxl_port_hndlrs(void)
>> +{
>> +	return &cxl_port_hndlrs;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(find_cxl_port_hndlrs, CXL);
> 
> I guess I will need to go deeper into the code, but I would not have
> expected that new registration interfaces are needed. Each 'struct
> pci_driver' could optionally include CXL error handlers alongside their
> PCIe error handlers and when CXL AER errors are broadcast only the CXL
> handlers are invoked. I.e. the registration is something like:
> 
> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> index 6af5e0425872..42db26195bda 100644
> --- a/drivers/pci/pcie/portdrv.c
> +++ b/drivers/pci/pcie/portdrv.c
> @@ -793,6 +793,7 @@ static struct pci_driver pcie_portdriver = {
>         .shutdown       = pcie_portdrv_shutdown,
>  
>         .err_handler    = &pcie_portdrv_err_handler,
> +       .cxl_err_handler = &cxl_portdrv_err_handler,
>  
>         .driver_managed_dma = true,

Ok. I'm thinking to add a definition for 'pci_dev::cxl_err_handler' of type 
'struct pci_error_handler'. 

'struct pci_error_handler' contains a slot reset(), resume(), and mmio_enabled() fn 
pointers that are used in PCIe recovery if available. The plan is for CXL devices to
call panic for UCE fatal and non-fatal but it might be good to use the 
'struct pci_error_handler' type in case there are needs for the other handlers in 
the future. It also makes the logic to access and use the error handlers common, 
requiring less code.

Regards,
Terry

Re: [PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver

Posted by Dan Williams 1 year, 3 months ago

Terry Bowman wrote:
[..]
> > diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> > index 6af5e0425872..42db26195bda 100644
> > --- a/drivers/pci/pcie/portdrv.c
> > +++ b/drivers/pci/pcie/portdrv.c
> > @@ -793,6 +793,7 @@ static struct pci_driver pcie_portdriver = {
> >         .shutdown       = pcie_portdrv_shutdown,
> >  
> >         .err_handler    = &pcie_portdrv_err_handler,
> > +       .cxl_err_handler = &cxl_portdrv_err_handler,
> >  
> >         .driver_managed_dma = true,
> 
> Ok. I'm thinking to add a definition for 'pci_dev::cxl_err_handler' of type 
> 'struct pci_error_handler'. 
> 
> 'struct pci_error_handler' contains a slot reset(), resume(), and mmio_enabled() fn 
> pointers that are used in PCIe recovery if available. The plan is for CXL devices to
> call panic for UCE fatal and non-fatal but it might be good to use the 
> 'struct pci_error_handler' type in case there are needs for the other handlers in 
> the future. It also makes the logic to access and use the error handlers common, 
> requiring less code.

Can you give an example where CXL can reuse 'struct pci_error_handlers`
infrastructure? The PCI error handlers are built around the idea that
operations can be paused and recovered, CXL operations assume near
constant device participation in CPU cache and memory coherency
protocol.

About the only reuse I can think of is cases where a CXL error could be
sent down the PCI error handler path, i.e. ones that would send a
'pci_channel_io_normal' notice to ->error_detected(). Otherwise,
pci_channel_state_t and pci_ers_result_t seem to be a poor fit for CXL
error handling.

Re: [PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver

Posted by Terry Bowman 1 year, 3 months ago

Hi Dan,

On 10/22/24 12:09, Dan Williams wrote:
> Terry Bowman wrote:
> [..]
>>> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
>>> index 6af5e0425872..42db26195bda 100644
>>> --- a/drivers/pci/pcie/portdrv.c
>>> +++ b/drivers/pci/pcie/portdrv.c
>>> @@ -793,6 +793,7 @@ static struct pci_driver pcie_portdriver = {
>>>         .shutdown       = pcie_portdrv_shutdown,
>>>  
>>>         .err_handler    = &pcie_portdrv_err_handler,
>>> +       .cxl_err_handler = &cxl_portdrv_err_handler,
>>>  
>>>         .driver_managed_dma = true,
>>
>> Ok. I'm thinking to add a definition for 'pci_dev::cxl_err_handler' of type 
>> 'struct pci_error_handler'. 
>>
>> 'struct pci_error_handler' contains a slot reset(), resume(), and mmio_enabled() fn 
>> pointers that are used in PCIe recovery if available. The plan is for CXL devices to
>> call panic for UCE fatal and non-fatal but it might be good to use the 
>> 'struct pci_error_handler' type in case there are needs for the other handlers in 
>> the future. It also makes the logic to access and use the error handlers common, 
>> requiring less code.
> 
> Can you give an example where CXL can reuse 'struct pci_error_handlers`
> infrastructure? The PCI error handlers are built around the idea that
> operations can be paused and recovered, CXL operations assume near
> constant device participation in CPU cache and memory coherency
> protocol.
> 
> About the only reuse I can think of is cases where a CXL error could be
> sent down the PCI error handler path, i.e. ones that would send a
> 'pci_channel_io_normal' notice to ->error_detected(). Otherwise,
> pci_channel_state_t and pci_ers_result_t seem to be a poor fit for CXL
> error handling.


I was referring to reusing separate instance of 'struct pci_error_handlers' for CXL 
UCE-CE errors. 

One example where it can be reused in infrastructure is in err.c's 
report_error_detected(). If both PCIe and CXL errors use 'struct pci_error_handlers' 
then the updated report_error_detected() becomes a bit simpler with less helper 
function logic. But, it's not a reason by itself to choose to reuse 'struct 
pci_error_handlers' for CXL errors.

Looking closer at aer,c shows there is no advantage in this file for using 'struct 
pci_error_handlers' for CXL errors.

If I understand correctly you want a new type introduced, 'struct cxl_error_handlers'.
And will contain 2 function pointers for CE and UCE handling.

Regards,
Terry

Re: [PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver

Posted by Dan Williams 1 year, 3 months ago

Terry Bowman wrote:
[..]
> I was referring to reusing separate instance of 'struct pci_error_handlers' for CXL
> UCE-CE errors.
>
> One example where it can be reused in infrastructure is in err.c's
> report_error_detected(). If both PCIe and CXL errors use 'struct pci_error_handlers'
> then the updated report_error_detected() becomes a bit simpler with less helper
> function logic.

report_error_detected() is concerned with link and i/o state
(pci_dev_is_disconnected() and pci_dev_set_io_state()). For device
disconnects, CXL recovery potentially needs to span multiple devices.
For i/o state, CXL.io could be fully operational while CXL.cache and
CXL.mem are in fatal state.

CXL considerations do not feel welcome in that function.

Ideally a PCIe developer never needs to see or understand the CXL error
model because it is off in its own path. In other words, if someone
maintaining pcie_do_recovery=>report_error_detected() for the PCIe case
needs to go find a CXL expert each time they want to touch that path,
that feels like a regression in PCIe error handling maintainability.

> But, it's not a reason by itself to choose to reuse 'struct
> pci_error_handlers' for CXL errors.
>
> Looking closer at aer,c shows there is no advantage in this file for using 'struct
> pci_error_handlers' for CXL errors.
>
> If I understand correctly you want a new type introduced, 'struct cxl_error_handlers'.

Yes, mainly because the bus state and the result of the recovery tend to
be a different operational model. If a CXL error fits the PCIe model
then it can be sent via pcie_do_recovery(), but I expect that only
applies to a handful of correctable errors like CRC_Threshold,
Retry_Threshold, or Physical_Layer_Error. Almost everything else *seems*
like it has a CXL specific response that would confuse
pcie_do_recovery(). 

So, in general new operational models == new data structures and types.

> And will contain 2 function pointers for CE and UCE handling.

Unless and until we define a CXL Reset flow, then yes, I assume you
mean ->error_detected() and ->cor_error_detected()?

I do think there will be some limited fatal cases with CXL accelerators
that could be recoverable if the accelerator knows that the memory error
can be recovered by resetting the device without surprise data-loss.
That work can wait until those failure cases become clearer.

I imagine something like CXL error isolation for a host-bridge dedicated
to a single accelerator might be recoverable, but anything with
general-purpose memory is likely better off with a kernel-panic (see the
CXL error isolation discussion:
http://lore.kernel.org/e7d4a31a-bd5e-41d9-9b51-fbbd5e8fc9b2@amd.com).

Re: [PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver

Posted by Bowman, Terry 1 year, 3 months ago

Hi Dan,

I added a question below.

On 10/22/2024 6:43 PM, Dan Williams wrote:
> Terry Bowman wrote:
> [..]
>> I was referring to reusing separate instance of 'struct pci_error_handlers' for CXL
>> UCE-CE errors.
>>
>> One example where it can be reused in infrastructure is in err.c's
>> report_error_detected(). If both PCIe and CXL errors use 'struct pci_error_handlers'
>> then the updated report_error_detected() becomes a bit simpler with less helper
>> function logic.
> report_error_detected() is concerned with link and i/o state
> (pci_dev_is_disconnected() and pci_dev_set_io_state()). For device
> disconnects, CXL recovery potentially needs to span multiple devices.
> For i/o state, CXL.io could be fully operational while CXL.cache and
> CXL.mem are in fatal state.
>
> CXL considerations do not feel welcome in that function.
>
> Ideally a PCIe developer never needs to see or understand the CXL error
> model because it is off in its own path. In other words, if someone
> maintaining pcie_do_recovery=>report_error_detected() for the PCIe case
> needs to go find a CXL expert each time they want to touch that path,
> that feels like a regression in PCIe error handling maintainability.
>
>> But, it's not a reason by itself to choose to reuse 'struct
>> pci_error_handlers' for CXL errors.
>>
>> Looking closer at aer,c shows there is no advantage in this file for using 'struct
>> pci_error_handlers' for CXL errors.
>>
>> If I understand correctly you want a new type introduced, 'struct cxl_error_handlers'.
> Yes, mainly because the bus state and the result of the recovery tend to
> be a different operational model. If a CXL error fits the PCIe model
> then it can be sent via pcie_do_recovery(), but I expect that only
> applies to a handful of correctable errors like CRC_Threshold,
> Retry_Threshold, or Physical_Layer_Error. Almost everything else *seems*
> like it has a CXL specific response that would confuse
> pcie_do_recovery(). 
>
> So, in general new operational models == new data structures and types.

Would you like to continue to use the pci_error_handlers for the CXL PCIe 
endpoint device driver? Or do we change the CXL PCIe endpoint driver to 
use the cxl_error_handlers ?
Regards,
Terry

Re: [PATCH 01/15] cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver

Posted by Dan Williams 1 year, 3 months ago

Bowman, Terry wrote:
[..]
> > So, in general new operational models == new data structures and types.
> 
> Would you like to continue to use the pci_error_handlers for the CXL PCIe 
> endpoint device driver? Or do we change the CXL PCIe endpoint driver to 
> use the cxl_error_handlers ?

I would expect to extend with new 'cxl_error_handlers' support. 'Extend'
not 'replace' because a CXL endpoint could in fact be operating in pure
PCIe mode.

Otherwise, it is currently ambiguous what an error handler invocation is
signaling. In the new scheme fatal CXL errors are panics, while fatal
PCIe errors remain device resets.

So the question is how to distinguish those events without separate
entry points. Either the error (bus) type needs to be added as a
parameter to the callbacks or the error type is implied by the 'error
handlers' type. I would rather not go disturb the PCIe error handler
world by making them deal with an error type passed to existing
callbacks, so a new invocation regime seems appropriate.

I am open to other thoughts, but this seems the most maintainable way to
handle these parallel universes that both happen to fork from an AER
error source.