PCI/AER: Add option to panic on unrecoverable errors

[PATCH] PCI/AER: Add option to panic on unrecoverable errors

Posted by Breno Leitao 1 day, 6 hours ago

When a device lacks an error_detected callback, AER recovery fails and
the device is left in a disconnected state. This can mask serious
hardware issues during development and testing.

Add a module parameter 'aer_unrecoverable_fatal' that panics the kernel
instead, making such failures immediately visible. The parameter
defaults to false to preserve existing behavior.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
In environments where all hardware must be fully operational, silently
leaving a device in a disconnected state after an AER recovery failure
is unacceptable. This is common in high-reliability systems, production
servers, and testing infrastructure where a degraded system should not
continue running.

This patch adds a module parameter that allows administrators to enforce
a strict policy: if a device cannot recover from an AER error, the
kernel panics instead of continuing with degraded hardware. This ensures
that hardware failures are immediately visible and can trigger
appropriate remediation (restart, failover, alerting).
---
 Documentation/admin-guide/kernel-parameters.txt | 9 +++++++++
 drivers/pci/pcie/err.c                          | 3 +++
 drivers/pci/pcie/portdrv.c                      | 7 +++++++
 drivers/pci/pcie/portdrv.h                      | 1 +
 4 files changed, 20 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 1058f2a6d6a8c..ff95c24280e3c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5240,6 +5240,15 @@ Kernel parameters
 		nomsi	Do not use MSI for native PCIe PME signaling (this makes
 			all PCIe root ports use INTx for all services).
 
+	pcieportdrv.aer_unrecoverable_fatal=
+			[PCIE] Panic on unrecoverable AER errors:
+		0	Log the error and leave the device in a disconnected
+			state (default).
+		1	Panic the kernel when a device cannot recover from an
+			AER error (no error_detected callback). Useful for
+			high-reliability systems where degraded hardware is
+			unacceptable.
+
 	pcmv=		[HW,PCMCIA] BadgePAD 4
 
 	pd_ignore_unused
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index bebe4bc111d75..788484791902e 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -73,6 +73,9 @@ static int report_error_detected(struct pci_dev *dev,
 		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
 			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
 			pci_info(dev, "can't recover (no error_detected callback)\n");
+			if (aer_unrecoverable_fatal)
+				panic("AER: %s: no error_detected callback\n",
+				      pci_name(dev));
 		} else {
 			vote = PCI_ERS_RESULT_NONE;
 		}
diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
index 38a41ccf79b9a..a411f60ff50ce 100644
--- a/drivers/pci/pcie/portdrv.c
+++ b/drivers/pci/pcie/portdrv.c
@@ -22,6 +22,13 @@
 #include "../pci.h"
 #include "portdrv.h"
 
+#ifdef CONFIG_PCIEAER
+bool aer_unrecoverable_fatal;
+module_param(aer_unrecoverable_fatal, bool, 0644);
+MODULE_PARM_DESC(aer_unrecoverable_fatal,
+		 "Panic if a device cannot recover from an AER error (default: false)");
+#endif
+
 /*
  * The PCIe Capability Interrupt Message Number (PCIe r3.1, sec 7.8.2) must
  * be one of the first 32 MSI-X entries.  Per PCI r3.0, sec 6.8.3.1, MSI
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index bd29d1cc7b8bd..6c67b18de93c9 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -29,6 +29,7 @@ extern bool pcie_ports_dpc_native;
 
 #ifdef CONFIG_PCIEAER
 int pcie_aer_init(void);
+extern bool aer_unrecoverable_fatal;
 #else
 static inline int pcie_aer_init(void) { return 0; }
 #endif

---
base-commit: 6bd9ed02871f22beb0e50690b0c3caf457104f7c
change-id: 20260206-pci-362cf172187f

Best regards,
--  
Breno Leitao <leitao@debian.org>

Re: [PATCH] PCI/AER: Add option to panic on unrecoverable errors

Posted by Bjorn Helgaas 1 day, 6 hours ago

On Fri, Feb 06, 2026 at 10:23:11AM -0800, Breno Leitao wrote:
> When a device lacks an error_detected callback, AER recovery fails and
> the device is left in a disconnected state. This can mask serious
> hardware issues during development and testing.
> 
> Add a module parameter 'aer_unrecoverable_fatal' that panics the kernel
> instead, making such failures immediately visible. The parameter
> defaults to false to preserve existing behavior.
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> In environments where all hardware must be fully operational, silently
> leaving a device in a disconnected state after an AER recovery failure
> is unacceptable. This is common in high-reliability systems, production
> servers, and testing infrastructure where a degraded system should not
> continue running.
> 
> This patch adds a module parameter that allows administrators to enforce
> a strict policy: if a device cannot recover from an AER error, the
> kernel panics instead of continuing with degraded hardware. This ensures
> that hardware failures are immediately visible and can trigger
> appropriate remediation (restart, failover, alerting).
> ---
>  Documentation/admin-guide/kernel-parameters.txt | 9 +++++++++
>  drivers/pci/pcie/err.c                          | 3 +++
>  drivers/pci/pcie/portdrv.c                      | 7 +++++++
>  drivers/pci/pcie/portdrv.h                      | 1 +
>  4 files changed, 20 insertions(+)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 1058f2a6d6a8c..ff95c24280e3c 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5240,6 +5240,15 @@ Kernel parameters
>  		nomsi	Do not use MSI for native PCIe PME signaling (this makes
>  			all PCIe root ports use INTx for all services).
>  
> +	pcieportdrv.aer_unrecoverable_fatal=
> +			[PCIE] Panic on unrecoverable AER errors:
> +		0	Log the error and leave the device in a disconnected
> +			state (default).
> +		1	Panic the kernel when a device cannot recover from an
> +			AER error (no error_detected callback). Useful for
> +			high-reliability systems where degraded hardware is
> +			unacceptable.

Just from an overall complexity point of view, I'm a little hesitant
to add new kernel parameters because this seems like a very specific
case.

Is there anything we could do to improve the logging to make the issue
more recognizable?  I assume you already look for KERN_CRIT, KERN_ERR,
etc., but it looks like the current message is just KERN_INFO.  I
think we could make a good case for at least KERN_WARNING.

But I guess you probably want something that's just impossible to
ignore.

Are there any other similar flags you already use that we could
piggy-back on?  E.g., if we raised the level to KERN_WARNING, maybe
the existing "panic_on_warn" would be enough?

> +++ b/drivers/pci/pcie/err.c
> @@ -73,6 +73,9 @@ static int report_error_detected(struct pci_dev *dev,
>  		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
>  			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
>  			pci_info(dev, "can't recover (no error_detected callback)\n");
> +			if (aer_unrecoverable_fatal)
> +				panic("AER: %s: no error_detected callback\n",
> +				      pci_name(dev));
>  		} else {
>  			vote = PCI_ERS_RESULT_NONE;
>  		}

Re: [PATCH] PCI/AER: Add option to panic on unrecoverable errors

Posted by Keith Busch 1 day, 5 hours ago

On Fri, Feb 06, 2026 at 12:52:32PM -0600, Bjorn Helgaas wrote:
> Just from an overall complexity point of view, I'm a little hesitant
> to add new kernel parameters because this seems like a very specific
> case.
> 
> Is there anything we could do to improve the logging to make the issue
> more recognizable?  I assume you already look for KERN_CRIT, KERN_ERR,
> etc., but it looks like the current message is just KERN_INFO.  I
> think we could make a good case for at least KERN_WARNING.
> 
> But I guess you probably want something that's just impossible to
> ignore.

It's not necessarily about improving visibility with a higher alert
level. It's more that the system can't be trusted to operate correctly
from here on. Consider an interconnected GPU setup and only one
experiences an unrecoverable error. We don't want to leave the system
limping along with this unresolved error as it can't perform anything
useful. A panic-induced reboot is the least bad option to return the
system to operation, or to crash the system close in time to the failure
so we can get logs for the vendor if we're actively debugging.

> Are there any other similar flags you already use that we could
> piggy-back on?  E.g., if we raised the level to KERN_WARNING, maybe
> the existing "panic_on_warn" would be enough?

There are many KERN_WARNING messages that don't rise to the level of
warranting a 'panic', so we don't want to enable such an option in
production. It looks like panic_on_warn was introduced for developer
debugging.

I agree the current INFO level is too low for the generic unrecovered
condition, though.

Re: [PATCH] PCI/AER: Add option to panic on unrecoverable errors

Posted by Lukas Wunner 1 day, 4 hours ago

On Fri, Feb 06, 2026 at 12:22:44PM -0700, Keith Busch wrote:
> On Fri, Feb 06, 2026 at 12:52:32PM -0600, Bjorn Helgaas wrote:
> > Are there any other similar flags you already use that we could
> > piggy-back on?  E.g., if we raised the level to KERN_WARNING, maybe
> > the existing "panic_on_warn" would be enough?
> 
> There are many KERN_WARNING messages that don't rise to the level of
> warranting a 'panic', so we don't want to enable such an option in
> production. It looks like panic_on_warn was introduced for developer
> debugging.

panic_on_warn springs into action on WARN() splats, not arbitrary
messages with KERN_WARNING severity.  Also, sysctl kernel.warn_limit
may be used to grant a certain number of panic-free WARNs.
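
The distinction drawn above can be sketched as a small userspace model.
This is not kernel code: the struct, the enum and both functions are
hypothetical names invented for illustration, and the counting rule is an
assumption about how panic_on_warn and kernel.warn_limit interact (a limit
of 0 is taken to mean "no limit").

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; the field names echo the kernel settings. */
struct warn_policy {
	bool panic_on_warn;      /* models the panic_on_warn setting */
	unsigned int warn_limit; /* models kernel.warn_limit; 0 = no limit (assumed) */
	unsigned int warn_count; /* WARN() splats seen so far */
};

enum outcome { CONTINUE, PANIC };

/* A log message at any severity, KERN_WARNING included, ignores the policy. */
enum outcome log_message(struct warn_policy *p, int level)
{
	(void)p;
	(void)level;
	return CONTINUE;
}

/* Only a WARN() splat consults the policy. */
enum outcome warn_splat(struct warn_policy *p)
{
	assert(p != NULL);
	if (p->panic_on_warn)
		return PANIC;
	p->warn_count++;
	if (p->warn_limit && p->warn_count >= p->warn_limit)
		return PANIC;
	return CONTINUE;
}
```

In this model a warn_limit of 3 lets two splats pass and turns the third
into a panic, while log_message() never panics whatever the level.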

FWIW, the "pcieportdrv.aer_unrecoverable_fatal" parameter introduced
by this patch feels somewhat oddly named.  Something like
"pci.panic_on_fatal" might be clearer and more succinct.

> I agree the current INFO level is too low for the generic unrecovered
> condition, though.

At least for unbound devices, I think 918b4053184c went way too far.
I think an unbound device should generally be considered recoverable
through a reset.

As for bound devices whose drivers lack pci_error_handlers, it has been
painful in practice that they're considered unrecoverable wholesale.
E.g. GPUs often expose an audio device as well as telemetry devices,
all arranged below an integrated PCIe switch.  All of these devices
need drivers with pci_error_handlers in order for the GPU to be
recoverable.  In some cases, dummy callbacks were added to render
the whole thing recoverable.

So I wouldn't consider 918b4053184c to have been a universally successful
approach and I fear that this patch goes even further.

Thanks,

Lukas
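
The GPU example in the message above can be illustrated with a simplified
userspace model of vote merging. It is a sketch under assumptions, not the
kernel's implementation: the enum, its values and merge_votes() are
hypothetical, and the only rule taken from the thread is that one device
without an error_detected callback makes the whole hierarchy unrecoverable.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative votes; the names echo the kernel's pci_ers_result values. */
enum vote {
	VOTE_NONE,          /* no opinion */
	VOTE_CAN_RECOVER,
	VOTE_NEED_RESET,
	VOTE_NO_AER_DRIVER, /* device has no error_detected callback */
};

/* Simplified merge: a single NO_AER_DRIVER vote overrides all others. */
enum vote merge_votes(const enum vote *votes, size_t n)
{
	enum vote result = VOTE_CAN_RECOVER;

	assert(votes != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (votes[i] == VOTE_NO_AER_DRIVER)
			return VOTE_NO_AER_DRIVER;
		if (votes[i] == VOTE_NEED_RESET)
			result = VOTE_NEED_RESET;
	}
	return result;
}
```

With votes from a GPU, an audio function and a telemetry function, the
aggregate stays recoverable only if none of the three reports
VOTE_NO_AER_DRIVER, which is why adding dummy callbacks changes the outcome.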

Re: [PATCH] PCI/AER: Add option to panic on unrecoverable errors

Posted by Keith Busch 19 hours ago

On Fri, Feb 06, 2026 at 09:53:39PM +0100, Lukas Wunner wrote:
> On Fri, Feb 06, 2026 at 12:22:44PM -0700, Keith Busch wrote:
> > On Fri, Feb 06, 2026 at 12:52:32PM -0600, Bjorn Helgaas wrote:
> > > Are there any other similar flags you already use that we could
> > > piggy-back on?  E.g., if we raised the level to KERN_WARNING, maybe
> > > the existing "panic_on_warn" would be enough?
> > 
> > There are many KERN_WARNING messages that don't rise to the level of
> > warranting a 'panic', so we don't want to enable such an option in
> > production. It looks like panic_on_warn was introduced for developer
> > debugging.
> 
> panic_on_warn springs into action on WARN() splats, not arbitrary
> messages with KERN_WARNING severity.  Also, sysctl kernel.warn_limit
> may be used to grant a certain number of panic-free WARNs.

Okay, but the warn panic param still isn't an option for production.

> FWIW, the "pcieportdrv.aer_unrecoverable_fatal" parameter introduced
> by this patch feels somewhat oddly named.  Something like
> "pci.panic_on_fatal" might be clearer and more succinct.

Naming is hard; thanks for the suggestion.

> > I agree the current INFO level is too low for the generic unrecovered
> > condition, though.
> 
> At least for unbound devices, I think 918b4053184c went way too far.
> I think an unbound device should generally be considered recoverable
> through a reset.

Yes, I agree, especially considering the generic probe saves a
checkpoint of the state that we can restore to that is consistent with
the kernel's view. There's no clear reason to fail recovery when there's
no bound driver, so changing that behavior is a good idea.

> As for bound devices whose drivers lack pci_error_handlers, it has been
> painful in practice that they're considered unrecoverable wholesale.

Yes, it gets tricky when there is a bound driver; there's no telling
whether or not it may initiate a broken transaction with cascading
consequences for the rest of the system if anything in the chain is not
cooperating with the error recovery orchestration. I don't know if there
is a best default action, so allowing it to be user defined seems okay.

> E.g. GPUs often expose an audio device as well as telemetry devices,
> all arranged below an integrated PCIe switch.  All of these devices
> need drivers with pci_error_handlers in order for the GPU to be
> recoverable.  In some cases, dummy callbacks were added to render
> the whole thing recoverable.

This experience sounds familiar, and it really does appear that a hard
reboot is the best outcome in many cases because orchestrating all the
components to recover is not going to happen. Hence the reboot param.

> So I wouldn't consider 918b4053184c to have been a universally successful
> approach and I fear that this patch goes even further.

If anyone goes through the effort of fixing that, will it be considered?
You told me in Vienna LPC '24 that you'd help resolve the pci hotplug
deadlocks that have been plaguing pci for the last 10 years, but not a
single comment has happened despite multiple complete and validated
solutions offered.

Re: [PATCH] PCI/AER: Add option to panic on unrecoverable errors

Posted by Lukas Wunner 1 day, 4 hours ago

On Fri, Feb 06, 2026 at 09:53:39PM +0100, Lukas Wunner wrote:
> So I wouldn't consider 918b4053184c to have been a universally successful
> approach and I fear that this patch goes even further.

Forgot to mention -- there's another problem:

PCI_ERS_RESULT_NO_AER_DRIVER is obviously AER-specific.

powerpc (EEH) and s390 have error recovery mechanisms separate from AER
and we've been trying to align them more closely so that drivers don't
need to be aware of platform-specific behavior.

eeh_pe_report_edev() does not modify the pci_ers_result for unbound
drivers and those without pci_error_handlers.  And the default is
PCI_ERS_RESULT_NONE.  eeh_report_error() also returns PCI_ERS_RESULT_NONE
for drivers without ->error_detected() callback.

In the PCI_ERS_RESULT_NONE case, EEH seems to perform a reset and
assume successful recovery.

It's only AER that is this strict about unbound devices and drivers that
lack pci_error_handlers.

If anything we should try to *reduce* deviations between the various
error recovery mechanisms, not double down on increasing them.

Thanks,

Lukas

Re: [PATCH] PCI/AER: Add option to panic on unrecoverable errors

Posted by Keith Busch 1 day, 6 hours ago

On Fri, Feb 06, 2026 at 10:23:11AM -0800, Breno Leitao wrote:
> When a device lacks an error_detected callback, AER recovery fails and
> the device is left in a disconnected state. This can mask serious
> hardware issues during development and testing.
> 
> Add a module parameter 'aer_unrecoverable_fatal' that panics the kernel
> instead, making such failures immediately visible. The parameter
> defaults to false to preserve existing behavior.

Sounds like a good idea. There used to be a code comment suggesting
there are probably conditions where you want this panic behavior but it
was removed with commit:

  b06d125e6280603a34d9064cd9c12748ca2edb04

Which I'm not sure was an accurate thing to do as it assumes the system
can remain operational without recovering, and that's just not always
the case.

> @@ -73,6 +73,9 @@ static int report_error_detected(struct pci_dev *dev,
>  		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
>  			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
>  			pci_info(dev, "can't recover (no error_detected callback)\n");
> +			if (aer_unrecoverable_fatal)
> +				panic("AER: %s: no error_detected callback\n",
> +				      pci_name(dev));

Is this the only condition that the panic behavior should apply to? I feel
like we may want to defer the panic to the recovery failed case and even
include the "disconnect" condition. Maybe something like this?

---
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index bebe4bc111d75..c5a631e2b565b 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -295,5 +295,9 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 
 	pci_info(bridge, "device recovery failed\n");
 
+	if (aer_unrecoverable_fatal &&
+	    (status == PCI_ERS_RESULT_DISCONNECT ||
+	     status == PCI_ERS_RESULT_NO_AER_DRIVER))
+		panic("AER: %s: can not continue, status:%d\n", pci_name(dev), status);
+
 	return status;
 }
--
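
The condition in the suggestion above can be isolated as a predicate and
checked in userspace. This is a minimal sketch, not the proposed kernel
change itself: the enum and should_panic() are hypothetical names, and the
enum values are illustrative rather than the kernel's numeric values.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative final recovery statuses; values are not the kernel's. */
enum recovery_status {
	STATUS_RECOVERED,
	STATUS_DISCONNECT,
	STATUS_NO_AER_DRIVER,
};

/* Panic only when the opt-in flag is set and recovery ended in failure. */
bool should_panic(bool aer_unrecoverable_fatal, enum recovery_status status)
{
	assert((int)status >= STATUS_RECOVERED &&
	       (int)status <= STATUS_NO_AER_DRIVER);
	if (!aer_unrecoverable_fatal)
		return false;
	return status == STATUS_DISCONNECT ||
	       status == STATUS_NO_AER_DRIVER;
}
```

Evaluating the predicate once, after recovery has finished, covers both the
missing-callback case and the disconnect case, whereas the original hunk
fires as soon as the first device without a callback is visited.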

Re: [PATCH] PCI/AER: Add option to panic on unrecoverable errors

Posted by Lukas Wunner 1 day, 6 hours ago

On Fri, Feb 06, 2026 at 10:23:11AM -0800, Breno Leitao wrote:
> When a device lacks an error_detected callback, AER recovery fails and
> the device is left in a disconnected state. This can mask serious
> hardware issues during development and testing.
> 
> Add a module parameter 'aer_unrecoverable_fatal' that panics the kernel
> instead, making such failures immediately visible. The parameter
> defaults to false to preserve existing behavior.

There's a parallel effort by Terry Bowman (+cc) to introduce a
PCI_ERS_RESULT_PANIC return value for error handling:

https://lore.kernel.org/all/20260203025244.3093805-4-terry.bowman@amd.com/

Please consider using that as the basis for your needs.

Thanks,

Lukas