[PATCH 0/4] pci: implement "pci=aer_panic"

Hans Zhang posted 4 patches 8 months, 3 weeks ago


The following series introduces a new kernel command-line option, aer_panic,
to enhance error handling for PCIe Advanced Error Reporting (AER) in
mission-critical environments. This feature ensures deterministic recovery
from fatal PCIe errors by triggering a controlled kernel panic when device
recovery fails, avoiding indefinite system hangs.

Problem Statement
In systems where unresolved PCIe errors (e.g., bus hangs) occur,
traditional error recovery mechanisms may leave the system unresponsive
indefinitely. This is unacceptable for high-availability environments
that require prompt recovery via reboot.

Solution
The aer_panic option forces a kernel panic on unrecoverable AER errors.
This bypasses prolonged recovery attempts and ensures immediate reboot.

Patch Summary:
Documentation Update: Adds aer_panic to kernel-parameters.txt, explaining
its purpose and usage.

Command-Line Handling: Implements pci=aer_panic parsing and state
management in PCI core.

State Exposure: Introduces pci_aer_panic_enabled() to check if the panic
mode is active.

Panic Trigger: Modifies recovery logic to panic the system when recovery
fails and aer_panic is enabled.
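
As an illustration of how the pieces listed above might fit together, here
is a minimal user-space sketch in C. It is not the code from the patches:
only the names aer_panic and pci_aer_panic_enabled() come from this cover
letter, while model_pci_setup(), model_recovery_result() and the enum are
hypothetical. Where the series is described as panicking the system, this
model only returns MODEL_PANIC.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model state; the real flag lives in the PCI core. */
static bool aer_panic_flag;

/*
 * Parse the comma-separated value of a "pci=" option, for example
 * "noaer,aer_panic", and record whether aer_panic was requested.
 */
static void model_pci_setup(const char *str)
{
	static const char opt[] = "aer_panic";

	while (str && *str) {
		const char *end = strchr(str, ',');
		size_t len = end ? (size_t)(end - str) : strlen(str);

		/* Match the whole token only, so "aer_panicx" is ignored. */
		if (len == strlen(opt) && !strncmp(str, opt, len))
			aer_panic_flag = true;

		str = end ? end + 1 : NULL;
	}
}

/* Name taken from the cover letter; the body is an assumption. */
static bool pci_aer_panic_enabled(void)
{
	return aer_panic_flag;
}

enum model_action { MODEL_RECOVERED, MODEL_DEVICE_LOST, MODEL_PANIC };

/* Decide what happens once error recovery has finished. */
static enum model_action model_recovery_result(bool recovered)
{
	if (recovered)
		return MODEL_RECOVERED;

	/* Recovery failed: panic only if the option was given. */
	return pci_aer_panic_enabled() ? MODEL_PANIC : MODEL_DEVICE_LOST;
}
```

Without the option, a failed recovery in this model only loses the device,
which matches the "no default behavior change" point under Impact.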

Impact
Controlled Recovery: Reduces downtime by replacing hangs with immediate
reboots.

Optional: Enabled via pci=aer_panic; no default behavior change.

Dependency: Requires CONFIG_PCIEAER.

For example, on mobile phones and tablets, when the PCIe link has a problem
and cannot be restored, there should be a way to make the system panic and
restart instead of waiting for the battery to be completely exhausted.

---
For example, the Qualcomm sm8250 and sm8350 kernels panic and restart the
system when the link goes down.

https://github.com/DOITfit/xiaomi_kernel_sm8250/blob/d42aa408e8cef14f4ec006554fac67ef80b86d0d/drivers/pci/controller/pci-msm.c#L5440

https://github.com/OnePlusOSS/android_kernel_oneplus_sm8350/blob/13ca08fdf0979fdd61d5e8991661874bb2d19150/drivers/net/wireless/cnss2/pci.c#L950

SoC designs differ between manufacturers, and the AXI and other buses
connected to PCIe may not be designed to prevent hangs. Once a FATAL error
occurs on the PCIe link and the link cannot be restored, the system needs
to be restarted.

Dear Mani,

Do you know how other Qualcomm SoCs handle FATAL errors that occur on the
PCIe link?
---

Hans Zhang (4):
  pci: implement "pci=aer_panic"
  PCI/AER: Introduce aer_panic kernel command-line option
  PCI/AER: Expose AER panic state via pci_aer_panic_enabled()
  PCI/AER: Trigger kernel panic on recovery failure if aer_panic is set

 .../admin-guide/kernel-parameters.txt          |  7 +++++++
 drivers/pci/pci.c                              |  2 ++
 drivers/pci/pci.h                              |  4 ++++
 drivers/pci/pcie/aer.c                         | 18 ++++++++++++++++++
 drivers/pci/pcie/err.c                         |  8 ++++++--
 5 files changed, 37 insertions(+), 2 deletions(-)

base-commit: fee3e843b309444f48157e2188efa6818bae85cf
prerequisite-patch-id: 299f33d3618e246cd7c04de10e591ace2d0116e6
prerequisite-patch-id: 482ad0609459a7654a4100cdc9f9aa4b671be50b
--
2.25.1

Re: [PATCH 0/4] pci: implement "pci=aer_panic"

Posted by Bjorn Helgaas 8 months, 3 weeks ago

On Sat, May 17, 2025 at 12:55:14AM +0800, Hans Zhang wrote:
> The following series introduces a new kernel command-line option aer_panic
> to enhance error handling for PCIe Advanced Error Reporting (AER) in
> mission-critical environments. This feature ensures deterministic recovery
> from fatal PCIe errors by triggering a controlled kernel panic when device
> recovery fails, avoiding indefinite system hangs.

We try very hard not to add new kernel parameters.

It sounds like part of the problem is the use of SPI interrupts rather
than the PCIe-architected INTx/MSI/MSI-X.  I'm not sure this warrants
generic upstream code changes.  This might be something you need to
maintain out-of-tree.


Re: [PATCH 0/4] pci: implement "pci=aer_panic"

Posted by Hans Zhang 8 months, 3 weeks ago


On 2025/5/20 06:03, Bjorn Helgaas wrote:
> On Sat, May 17, 2025 at 12:55:14AM +0800, Hans Zhang wrote:
>> The following series introduces a new kernel command-line option aer_panic
>> to enhance error handling for PCIe Advanced Error Reporting (AER) in
>> mission-critical environments. This feature ensures deterministic recovery
>> from fatal PCIe errors by triggering a controlled kernel panic when device
>> recovery fails, avoiding indefinite system hangs.
> 
> We try very hard not to add new kernel parameters.
> 
> It sounds like part of the problem is the use of SPI interrupts rather
> than the PCIe-architected INTx/MSI/MSI-X.  I'm not sure this warrants
> generic upstream code changes.  This might be something you need to
> maintain out-of-tree.
> 

Dear Bjorn,

This does not seem to depend on whether AER uses the INTx/MSI/MSI-X
interrupts specified in the PCIe spec, as the example I gave earlier
shows.

Our next-generation SoC already delivers AER interrupts as INTx and
reports them to the GIC interrupt controller, but the following problem
still cannot be solved.

```
Supplementary reasons:

drivers/pci/controller/cadence/pcie-cadence-host.c
cdns_pci_map_bus
      /* Clear AXI link-down status */
      cdns_pcie_writel(pcie, CDNS_PCIE_AT_LINKDOWN, 0x0);

https://elixir.bootlin.com/linux/v6.15-rc6/source/drivers/pci/controller/cadence/pcie-cadence-host.c#L52

If a link down has occurred on this PCIe port, the register
CDNS_PCIE_AT_LINKDOWN must be set to 0 before AXI transfers can
continue.  This is different from Synopsys.

If CPU Core0 reaches line 52 while CPU Core1 is writing files to an NVMe
SSD, the CDNS_PCIE_AT_LINKDOWN register is still 1, so CPU Core1 cannot
send TLPs and hangs.  This is an extreme corner case.
(The current upstream Cadence code is for the legacy PCIe IP; the HPA IP
is still being upstreamed.)

Radxa O6 uses Cadence's PCIe HPA IP.
http://radxa.com/products/orion/o6/
```


If we maintain the corresponding driver out-of-tree, the file
arch/arm64/configs/defconfig still has "CONFIG_PCIEAER=y", so we cannot
modify the AER common code. It also cannot be built as a module (aer.ko),
because CONFIG_PCIEAER is a bool and can only be y or n:
config PCIEAER
	bool "PCI Express Advanced Error Reporting support"
	depends on PCIEPORTBUS
	select RAS
	help
	  This enables PCI Express Root Port Advanced Error Reporting
	  (AER) driver support. Error reporting messages sent to Root
	  Port will be handled by PCI Express AER driver.

Furthermore, an out-of-tree driver cannot use the API of the AER common
code, and many of its variables are not exported. If we wrote our own set
of AER drivers, we would duplicate a lot of the processing logic.


I believe that the Qualcomm platform and many other platforms also have 
similar problems.


So could we add a config option instead, for example CONFIG_PCIEAER_PANIC
in place of the aer_panic command-line option? Alternatively, the AER
driver could be made tristate so that it can be built as a module, which
would let SoC manufacturers modify it.


Best regards,
Hans


Re: [PATCH 0/4] pci: implement "pci=aer_panic"

Posted by Sathyanarayanan Kuppuswamy 8 months, 3 weeks ago

On 5/16/25 9:55 AM, Hans Zhang wrote:
> The following series introduces a new kernel command-line option aer_panic
> to enhance error handling for PCIe Advanced Error Reporting (AER) in
> mission-critical environments. This feature ensures deterministic recovery
> from fatal PCIe errors by triggering a controlled kernel panic when device
> recovery fails, avoiding indefinite system hangs.

Why would a device recovery failure lead to a system hang? Worst case
that device may not be accessible, right?  Any real use case?


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Re: [PATCH 0/4] pci: implement "pci=aer_panic"

Posted by Hans Zhang 8 months, 3 weeks ago


On 2025/5/17 02:10, Sathyanarayanan Kuppuswamy wrote:
> 
> On 5/16/25 9:55 AM, Hans Zhang wrote:
>> The following series introduces a new kernel command-line option 
>> aer_panic
>> to enhance error handling for PCIe Advanced Error Reporting (AER) in
>> mission-critical environments. This feature ensures deterministic recovery
>> from fatal PCIe errors by triggering a controlled kernel panic when 
>> device
>> recovery fails, avoiding indefinite system hangs.
> 
> Why would a device recovery failure lead to a system hang? Worst case
> that device may not be accessible, right?  Any real use case?
> 


Dear Sathyanarayanan,

With Synopsys and Cadence PCIe IP, AER interrupts are usually delivered as
SPI interrupts, not INTx/MSI/MSI-X interrupts.  (Some customers route them
as MSI/MSI-X interrupts, e.g. on the RK3588, but not all designs do.)  For
example, when many Qualcomm mobile phone SoCs handle AER interrupts and a
link down has occurred, that is, a fatal problem on the current PCIe
physical link, the system cannot recover.  At that point, a system restart
is needed to solve the problem.

The SoC our company designed (http://radxa.com/products/orion/o6/) has
five PCIe ports and has the same problem.  If one of the PCIe ports fails,
the entire system hangs.  So I hope Linux can offer an option that lets
SoC manufacturers choose to restart the system when a fatal PCIe hardware
error occurs.

There are also products such as mobile phones and tablets.  We don't 
want to wait until the battery is completely used up before restarting them.

For the specific code of Qualcomm, please refer to the email I sent.

Best regards,
Hans


Re: [PATCH 0/4] pci: implement "pci=aer_panic"

Posted by Hans Zhang 8 months, 3 weeks ago


On 2025/5/19 22:21, Hans Zhang wrote:
> 
> 
> On 2025/5/17 02:10, Sathyanarayanan Kuppuswamy wrote:
>>
>> On 5/16/25 9:55 AM, Hans Zhang wrote:
>>> The following series introduces a new kernel command-line option 
>>> aer_panic
>>> to enhance error handling for PCIe Advanced Error Reporting (AER) in
>>> mission-critical environments. This feature ensures deterministic
>>> recovery
>>> from fatal PCIe errors by triggering a controlled kernel panic when 
>>> device
>>> recovery fails, avoiding indefinite system hangs.
>>
>> Why would a device recovery failure lead to a system hang? Worst case
>> that device may not be accessible, right?  Any real use case?
>>
> 
> 
> Dear Sathyanarayanan,
> 
> With Synopsys and Cadence PCIe IP, AER interrupts are usually delivered as
> SPI interrupts, not INTx/MSI/MSI-X interrupts.  (Some customers route them
> as MSI/MSI-X interrupts, e.g. on the RK3588, but not all designs do.)  For
> example, when many Qualcomm mobile phone SoCs handle AER interrupts and a
> link down has occurred, that is, a fatal problem on the current PCIe
> physical link, the system cannot recover.  At that point, a system restart
> is needed to solve the problem.
>
> The SoC our company designed (http://radxa.com/products/orion/o6/) has
> five PCIe ports and has the same problem.  If one of the PCIe ports fails,
> the entire system hangs.  So I hope Linux can offer an option that lets
> SoC manufacturers choose to restart the system when a fatal PCIe hardware
> error occurs.
> 
> There are also products such as mobile phones and tablets.  We don't 
> want to wait until the battery is completely used up before restarting 
> them.
> 
> For the specific code of Qualcomm, please refer to the email I sent.
> 


Dear Sathyanarayanan,

Supplementary reasons:

drivers/pci/controller/cadence/pcie-cadence-host.c
cdns_pci_map_bus
     /* Clear AXI link-down status */
     cdns_pcie_writel(pcie, CDNS_PCIE_AT_LINKDOWN, 0x0);

https://elixir.bootlin.com/linux/v6.15-rc6/source/drivers/pci/controller/cadence/pcie-cadence-host.c#L52

If a link down has occurred on this PCIe port, the register
CDNS_PCIE_AT_LINKDOWN must be set to 0 before AXI transfers can
continue.  This is different from Synopsys.

If CPU Core0 reaches line 52 while CPU Core1 is writing files to an NVMe
SSD, the CDNS_PCIE_AT_LINKDOWN register is still 1, so CPU Core1 cannot
send TLPs and hangs.  This is an extreme corner case.
(The current upstream Cadence code is for the legacy PCIe IP; the HPA IP
is still being upstreamed.)

Radxa O6 uses Cadence's PCIe HPA IP.
http://radxa.com/products/orion/o6/
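
To make the sequence above concrete, here is a small user-space model in C
of the latch behaviour as described in this message. It is only a sketch:
struct model_port and the model_* functions are hypothetical and are not
the Cadence driver's data structures, and on real hardware the blocked
transfer stalls the CPU instead of returning false.

```c
#include <stdbool.h>

/* Hypothetical model of one root port. */
struct model_port {
	/* Stands in for CDNS_PCIE_AT_LINKDOWN holding a non-zero value. */
	bool at_linkdown;
};

/* Hardware side: a link-down event latches the status. */
static void model_link_down(struct model_port *port)
{
	port->at_linkdown = true;
}

/*
 * Any AXI-originated transfer, e.g. Core1 writing to an NVMe SSD.
 * Returns false while the latch is set; real hardware would hang here.
 */
static bool model_axi_transfer(const struct model_port *port)
{
	return !port->at_linkdown;
}

/* Config-access path: mirrors the write of 0x0 in cdns_pci_map_bus(). */
static void model_map_bus(struct model_port *port)
{
	port->at_linkdown = false;
}
```

In this model, a call to model_axi_transfer() between model_link_down() and
model_map_bus() is the window in which Core1 would hang.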

Best regards,
Hans


Re: [PATCH 0/4] pci: implement "pci=aer_panic"

Posted by Sathyanarayanan Kuppuswamy 8 months, 3 weeks ago

On 5/19/25 7:41 AM, Hans Zhang wrote:
>
>
> On 2025/5/19 22:21, Hans Zhang wrote:
>>
>>
>> On 2025/5/17 02:10, Sathyanarayanan Kuppuswamy wrote:
>>>
>>> On 5/16/25 9:55 AM, Hans Zhang wrote:
>>>> The following series introduces a new kernel command-line option aer_panic
>>>> to enhance error handling for PCIe Advanced Error Reporting (AER) in
>>>> mission-critical environments. This feature ensures deterministic recovery
>>>> from fatal PCIe errors by triggering a controlled kernel panic when device
>>>> recovery fails, avoiding indefinite system hangs.
>>>
>>> Why would a device recovery failure lead to a system hang? Worst case
>>> that device may not be accessible, right?  Any real use case?
>>>
>>
>>
>> Dear Sathyanarayanan,
>>
>> With Synopsys and Cadence PCIe IP, AER interrupts are usually delivered as SPI interrupts, not INTx/MSI/MSI-X interrupts.  (Some customers route them as MSI/MSI-X interrupts, e.g. on the RK3588, but not all designs do.)  For example, when many Qualcomm mobile phone SoCs handle AER interrupts and a link down has occurred, that is, a fatal problem on the current PCIe physical link, the system cannot recover.  At that point, a system restart is needed to solve the problem.
>>
>> The SoC our company designed (http://radxa.com/products/orion/o6/) has five PCIe ports and has the same problem.  If one of the PCIe ports fails, the entire system hangs.  So I hope Linux can offer an option that lets SoC manufacturers choose to restart the system when a fatal PCIe hardware error occurs.
>>
>> There are also products such as mobile phones and tablets.  We don't want to wait until the battery is completely used up before restarting them.
>>
>> For the specific code of Qualcomm, please refer to the email I sent.
>>
>
>
> Dear Sathyanarayanan,
>
> Supplementary reasons:
>
> drivers/pci/controller/cadence/pcie-cadence-host.c
> cdns_pci_map_bus
>     /* Clear AXI link-down status */
>     cdns_pcie_writel(pcie, CDNS_PCIE_AT_LINKDOWN, 0x0);
>
> https://elixir.bootlin.com/linux/v6.15-rc6/source/drivers/pci/controller/cadence/pcie-cadence-host.c#L52
>
> If a link down has occurred on this PCIe port, the register CDNS_PCIE_AT_LINKDOWN must be set to 0 before AXI transfers can continue.  This is different from Synopsys.
>
> If CPU Core0 reaches line 52 while CPU Core1 is writing files to an NVMe SSD, the CDNS_PCIE_AT_LINKDOWN register is still 1, so CPU Core1 cannot send TLPs and hangs. This is an extreme corner case.
> (The current upstream Cadence code is for the legacy PCIe IP; the HPA IP is still being upstreamed.)
>
> Radxa O6 uses Cadence's PCIe HPA IP.
> http://radxa.com/products/orion/o6/
>

It sounds like a system-level issue to me. Why not rely on a watchdog to reboot in
this case?

Even if you want to add this support, I think it is more appropriate to add it to your
specific PCIe controller driver.  I don't see why you want to add it as part of the
generic AER driver.
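
One way to picture the controller-driver alternative suggested here is an
optional per-controller hook that the generic code consults when recovery
fails. The C sketch below is purely hypothetical: struct model_controller,
the recovery_failed callback and the model_* functions are assumptions made
for illustration and do not correspond to any existing kernel interface.

```c
#include <stddef.h>

enum model_verdict { MODEL_KEEP_RUNNING, MODEL_REBOOT };

/* Hypothetical per-controller description. */
struct model_controller {
	const char *name;
	/* Optional hook; NULL means no platform-specific policy. */
	enum model_verdict (*recovery_failed)(const struct model_controller *ctrl);
};

/* A SoC whose bus cannot survive a dead link asks for a reboot. */
static enum model_verdict model_soc_recovery_failed(const struct model_controller *ctrl)
{
	(void)ctrl;
	return MODEL_REBOOT;
}

/* Generic side: defer to the controller if it registered a policy. */
static enum model_verdict model_report_recovery_failure(const struct model_controller *ctrl)
{
	if (ctrl->recovery_failed)
		return ctrl->recovery_failed(ctrl);

	/* Default: the device is lost but the system keeps running. */
	return MODEL_KEEP_RUNNING;
}
```

With this shape, the policy stays in the platform's own driver and the
generic path keeps its current default for every other controller.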

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Re: [PATCH 0/4] pci: implement "pci=aer_panic"

Posted by Hans Zhang 8 months, 3 weeks ago


On 2025/5/21 00:09, Sathyanarayanan Kuppuswamy wrote:
> 
> On 5/19/25 7:41 AM, Hans Zhang wrote:
>>
>>
>> On 2025/5/19 22:21, Hans Zhang wrote:
>>>
>>>
>>> On 2025/5/17 02:10, Sathyanarayanan Kuppuswamy wrote:
>>>>
>>>> On 5/16/25 9:55 AM, Hans Zhang wrote:
>>>>> The following series introduces a new kernel command-line option 
>>>>> aer_panic
>>>>> to enhance error handling for PCIe Advanced Error Reporting (AER) in
>>>>> mission-critical environments. This feature ensures deterministic 
>>>>> recover
>>>>> from fatal PCIe errors by triggering a controlled kernel panic when 
>>>>> device
>>>>> recovery fails, avoiding indefinite system hangs.
>>>>
>>>> Why would a device recovery failure lead to a system hang? Worst case
>>>> that device may not be accessible, right?  Any real use case?
>>>>
>>>
>>>
>>> Dear Sathyanarayanan,
>>>
>>> Due to Synopsys and Cadence PCIe IP, their AER interrupts are usually 
>>> SPI interrupts, not INTx/MSI/MSIx interrupts.  (Some customers will 
>>> design it as an MSI/MSIx interrupt, e.g.: RK3588, but not all 
>>> customers have designed it this way.)  For example, when many mobile 
>>> phone SoCs of Qualcomm handle AER interrupts and there is a link 
>>> down, that is, a fatal problem occurs in the current PCIe physical 
>>> link, the system cannot recover.  At this point, a system restart is 
>>> needed to solve the problem.
>>>
>>> And our company design of SOC: http://radxa.com/products/orion/o6/, 
>>> it has 5 road PCIe port.
>>> There is also the same problem.  If there is a problem with one of 
>>> the PCIe ports, it will cause the entire system to hang.  So I hope 
>>> linux OS can offer an option that enables SOC manufacturers to choose 
>>> to restart the system in case of fatal hardware errors occurring in 
>>> PCIe.
>>>
>>> There are also products such as mobile phones and tablets.  We don't 
>>> want to wait until the battery is completely used up before 
>>> restarting them.
>>>
>>> For the specific code of Qualcomm, please refer to the email I sent.
>>>
>>
>>
>> Dear Sathyanarayanan,
>>
>> Supplementary reasons:
>>
>> drivers/pci/controller/cadence/pcie-cadence-host.c
>> cdns_pci_map_bus
>>     /* Clear AXI link-down status */
>>     cdns_pcie_writel(pcie, CDNS_PCIE_AT_LINKDOWN, 0x0);
>>
>> https://elixir.bootlin.com/linux/v6.15-rc6/source/drivers/pci/controller/cadence/pcie-cadence-host.c#L52
>>
>> If there has been a link down in this PCIe port, the register 
>> CDNS_PCIE_AT_LINKDOWN must be set to 0 for the AXI transmission to 
>> continue.  This is different from Synopsys.
>>
>> If CPU Core0 runs to code L52 and CPU Core1 is executing NVMe SSD 
>> saving files, since the CDNS_PCIE_AT_LINKDOWN register is still 1, it 
>> causes CPU Core1 to be unable to send TLP transfers and hang. This is 
>> a very extreme situation.
>> (The current Cadence code is Legacy PCIe IP, and the HPA IP is still 
>> in the upstream process at present.)
>>
>> Radxa O6 uses Cadence's PCIe HPA IP.
>> http://radxa.com/products/orion/o6/
>>
> 
> It sounds like a system level issue to me. Why not they rely on watchdog 
> to reboot for
> this case ?

Dear Sathyanarayanan,

Thank you for your reply. Yes, I also think it is a problem at the
system level. In a local test, directly unplugging the EP device from
the slot made the system hang, and I reproduced this many times. Since
we have no bus timeout response mechanism for PCIe, it hangs easily.

> 
> Even if you want to add this support, I think it is more appropriate to 
> add this to your
> specific PCIe controller driver.  I don't see why you want to add it 
> part of generic
> AER driver.
> 
Because we want to reuse the processing logic of the generic AER driver.
If recovery succeeds, there is no problem. If recovery fails, my
intention was to restart the system.
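
As a rough illustration of that intent, the policy can be modelled in
plain userspace C. This is a sketch, not the code from the series: the
function bodies, the recovery_result and recovery_action enums, and
pci_parse_options() are assumptions made for illustration. Only the
option name aer_panic and the helper name pci_aer_panic_enabled() come
from the cover letter.

```c
#include <stdbool.h>
#include <string.h>

/* Userspace model only; the kernel implementation may differ. */
enum recovery_result { RECOVERY_OK, RECOVERY_FAILED };
enum recovery_action {
    ACTION_CONTINUE,             /* recovery worked, nothing else to do */
    ACTION_LEAVE_DEVICE_OFFLINE, /* model of the default: no panic */
    ACTION_PANIC                 /* model of the proposed aer_panic path */
};

static bool aer_panic_enabled;

/* Scan a comma-separated "pci=" option string such as "noaer,aer_panic". */
void pci_parse_options(const char *opts)
{
    const char *p = opts;

    while (*p) {
        size_t len = strcspn(p, ",");

        if (len == strlen("aer_panic") && strncmp(p, "aer_panic", len) == 0)
            aer_panic_enabled = true;
        p += len;
        if (*p == ',')
            p++;
    }
}

bool pci_aer_panic_enabled(void)
{
    return aer_panic_enabled;
}

/* Decide what to do once the generic recovery attempt has finished. */
enum recovery_action after_recovery(enum recovery_result res)
{
    if (res == RECOVERY_OK)
        return ACTION_CONTINUE;
    return pci_aer_panic_enabled() ? ACTION_PANIC
                                   : ACTION_LEAVE_DEVICE_OFFLINE;
}
```

In this model the behaviour only changes when the option is present: a
failed recovery maps to ACTION_PANIC with "aer_panic" on the command
line and to ACTION_LEAVE_DEVICE_OFFLINE otherwise.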

Adding this to the specific PCIe controller driver would mean
duplicating a lot of AER processing logic. So I was wondering whether
the AER driver could be changed so that it can be built as a loadable
module (.ko).


If this series is not reasonable, I'll drop it.


Best regards,
Hans


Re: [PATCH 0/4] pci: implement "pci=aer_panic"

Posted by Sathyanarayanan Kuppuswamy 8 months, 3 weeks ago

On 5/21/25 7:54 AM, Hans Zhang wrote:
>
>
> On 2025/5/21 00:09, Sathyanarayanan Kuppuswamy wrote:
>>
>> On 5/19/25 7:41 AM, Hans Zhang wrote:
>>>
>>>
>>> On 2025/5/19 22:21, Hans Zhang wrote:
>>>>
>>>>
>>>> On 2025/5/17 02:10, Sathyanarayanan Kuppuswamy wrote:
>>>>>
>>>>> On 5/16/25 9:55 AM, Hans Zhang wrote:
>>>>>> The following series introduces a new kernel command-line option aer_panic
>>>>>> to enhance error handling for PCIe Advanced Error Reporting (AER) in
>>>>>> mission-critical environments. This feature ensures deterministic recover
>>>>>> from fatal PCIe errors by triggering a controlled kernel panic when device
>>>>>> recovery fails, avoiding indefinite system hangs.
>>>>>
>>>>> Why would a device recovery failure lead to a system hang? Worst case
>>>>> that device may not be accessible, right?  Any real use case?
>>>>>
>>>>
>>>>
>>>> Dear Sathyanarayanan,
>>>>
>>>> Due to Synopsys and Cadence PCIe IP, their AER interrupts are usually SPI interrupts, not INTx/MSI/MSIx interrupts. (Some customers will design it as an MSI/MSIx interrupt, e.g.: RK3588, but not all customers have designed it this way.)  For example, when many mobile phone SoCs of Qualcomm handle AER interrupts and there is a link down, that is, a fatal problem occurs in the current PCIe physical link, the system cannot recover.  At this point, a system restart is needed to solve the problem.
>>>>
>>>> And our company design of SOC: http://radxa.com/products/orion/o6/, it has 5 road PCIe port.
>>>> There is also the same problem.  If there is a problem with one of the PCIe ports, it will cause the entire system to hang.  So I hope linux OS can offer an option that enables SOC manufacturers to choose to restart the system in case of fatal hardware errors occurring in PCIe.
>>>>
>>>> There are also products such as mobile phones and tablets. We don't want to wait until the battery is completely used up before restarting them.
>>>>
>>>> For the specific code of Qualcomm, please refer to the email I sent.
>>>>
>>>
>>>
>>> Dear Sathyanarayanan,
>>>
>>> Supplementary reasons:
>>>
>>> drivers/pci/controller/cadence/pcie-cadence-host.c
>>> cdns_pci_map_bus
>>>     /* Clear AXI link-down status */
>>>     cdns_pcie_writel(pcie, CDNS_PCIE_AT_LINKDOWN, 0x0);
>>>
>>> https://elixir.bootlin.com/linux/v6.15-rc6/source/drivers/pci/controller/cadence/pcie-cadence-host.c#L52
>>>
>>> If there has been a link down in this PCIe port, the register CDNS_PCIE_AT_LINKDOWN must be set to 0 for the AXI transmission to continue.  This is different from Synopsys.
>>>
>>> If CPU Core0 runs to code L52 and CPU Core1 is executing NVMe SSD saving files, since the CDNS_PCIE_AT_LINKDOWN register is still 1, it causes CPU Core1 to be unable to send TLP transfers and hang. This is a very extreme situation.
>>> (The current Cadence code is Legacy PCIe IP, and the HPA IP is still in the upstream process at present.)
>>>
>>> Radxa O6 uses Cadence's PCIe HPA IP.
>>> http://radxa.com/products/orion/o6/
>>>
>>
>> It sounds like a system level issue to me. Why not they rely on watchdog to reboot for
>> this case ?
>
> Dear Sathyanarayanan,
>
> Thank you for your reply. Yes, personally, I think it's also a problem at the system level. I conducted a local test. When I directly unplugged the EP device on the slot, the system would hang. It has been tested many times. Since we don't have a bus timeout response mechanism for PCIe, it hangs easily.

Any comment on why a watchdog is not used to reboot the unresponsive system?

>
>>
>> Even if you want to add this support, I think it is more appropriate to add this to your
>> specific PCIe controller driver.  I don't see why you want to add it part of generic
>> AER driver.
>>
> Because we want to use the processing logic of the general AER driver. If the recovery is successful, there will be no problem. If the recovery fails, my original intention was to restart the system.
>
> If added to the specific PCIe controller driver, a lot of repetitive AER processing logic will be written. So I was thinking whether the AER driver could be changed to be compiled as a KO module.

Maybe you can rely on the error handler callbacks to get notified of fatal errors, or you can even use a uevent handler to detect the device-disconnected event and handle it there.
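
One way to picture the callback approach is the userspace model below. The kernel's real interface is struct pci_error_handlers, with callbacks such as error_detected and slot_reset; everything here (model_dev, model_err_handlers, notify_driver, the enum names) is a hypothetical stand-in so that the flow can be compiled and run outside the kernel.

```c
#include <stddef.h>

/* Userspace stand-ins for kernel types; only the callback shape is modelled. */
enum ers_result { ERS_CAN_RECOVER, ERS_NEED_RESET, ERS_DISCONNECT };
enum channel_state { CHANNEL_NORMAL, CHANNEL_FROZEN, CHANNEL_PERM_FAILURE };

struct model_dev {
    const char *name;
    int fatal_seen;   /* set by the driver callback when a fatal error arrives */
};

struct model_err_handlers {
    enum ers_result (*error_detected)(struct model_dev *dev,
                                      enum channel_state state);
};

/* Hypothetical driver callback: note the fatal error and pick a response. */
enum ers_result my_error_detected(struct model_dev *dev,
                                  enum channel_state state)
{
    if (state == CHANNEL_PERM_FAILURE) {
        dev->fatal_seen = 1;
        return ERS_DISCONNECT;   /* device is gone, give up on it */
    }
    if (state == CHANNEL_FROZEN) {
        dev->fatal_seen = 1;
        return ERS_NEED_RESET;   /* ask for a reset instead of panicking */
    }
    return ERS_CAN_RECOVER;
}

const struct model_err_handlers my_handlers = {
    .error_detected = my_error_detected,
};

/* Stand-in for the core notifying a bound driver of an error. */
enum ers_result notify_driver(const struct model_err_handlers *h,
                              struct model_dev *dev,
                              enum channel_state state)
{
    /* Model choice: a driver without a callback is treated as disconnected. */
    if (h == NULL || h->error_detected == NULL)
        return ERS_DISCONNECT;
    return h->error_detected(dev, state);
}
```

The point of the sketch is that the driver gets the notification itself, so any platform-specific reaction (including a reboot request) can live in the driver rather than behind a global kernel parameter.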

>
>
> If this series is not reasonable, I'll drop it.

Adding a new kernel parameter to solve a specific system issue is not recommended. Try to find a custom solution for your chip/controller.

>
>
> Best regards,
> Hans
>
-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Re: [PATCH 0/4] pci: implement "pci=aer_panic"

Posted by Hans Zhang 8 months, 3 weeks ago


On 2025/5/22 00:17, Sathyanarayanan Kuppuswamy wrote:
> 
> On 5/21/25 7:54 AM, Hans Zhang wrote:
>>
>>
>> On 2025/5/21 00:09, Sathyanarayanan Kuppuswamy wrote:
>>>
>>> On 5/19/25 7:41 AM, Hans Zhang wrote:
>>>>
>>>>
>>>> On 2025/5/19 22:21, Hans Zhang wrote:
>>>>>
>>>>>
>>>>> On 2025/5/17 02:10, Sathyanarayanan Kuppuswamy wrote:
>>>>>>
>>>>>> On 5/16/25 9:55 AM, Hans Zhang wrote:
>>>>>>> The following series introduces a new kernel command-line option 
>>>>>>> aer_panic
>>>>>>> to enhance error handling for PCIe Advanced Error Reporting (AER) in
>>>>>>> mission-critical environments. This feature ensures deterministic 
>>>>>>> recover
>>>>>>> from fatal PCIe errors by triggering a controlled kernel panic 
>>>>>>> when device
>>>>>>> recovery fails, avoiding indefinite system hangs.
>>>>>>
>>>>>> Why would a device recovery failure lead to a system hang? Worst case
>>>>>> that device may not be accessible, right?  Any real use case?
>>>>>>
>>>>>
>>>>>
>>>>> Dear Sathyanarayanan,
>>>>>
>>>>> Due to Synopsys and Cadence PCIe IP, their AER interrupts are 
>>>>> usually SPI interrupts, not INTx/MSI/MSIx interrupts. (Some 
>>>>> customers will design it as an MSI/MSIx interrupt, e.g.: RK3588, 
>>>>> but not all customers have designed it this way.)  For example, 
>>>>> when many mobile phone SoCs of Qualcomm handle AER interrupts and 
>>>>> there is a link down, that is, a fatal problem occurs in the 
>>>>> current PCIe physical link, the system cannot recover.  At this 
>>>>> point, a system restart is needed to solve the problem.
>>>>>
>>>>> And our company design of SOC: http://radxa.com/products/orion/o6/, 
>>>>> it has 5 road PCIe port.
>>>>> There is also the same problem.  If there is a problem with one of 
>>>>> the PCIe ports, it will cause the entire system to hang.  So I hope 
>>>>> linux OS can offer an option that enables SOC manufacturers to 
>>>>> choose to restart the system in case of fatal hardware errors 
>>>>> occurring in PCIe.
>>>>>
>>>>> There are also products such as mobile phones and tablets. We don't 
>>>>> want to wait until the battery is completely used up before 
>>>>> restarting them.
>>>>>
>>>>> For the specific code of Qualcomm, please refer to the email I sent.
>>>>>
>>>>
>>>>
>>>> Dear Sathyanarayanan,
>>>>
>>>> Supplementary reasons:
>>>>
>>>> drivers/pci/controller/cadence/pcie-cadence-host.c
>>>> cdns_pci_map_bus
>>>>     /* Clear AXI link-down status */
>>>>     cdns_pcie_writel(pcie, CDNS_PCIE_AT_LINKDOWN, 0x0);
>>>>
>>>> https://elixir.bootlin.com/linux/v6.15-rc6/source/drivers/pci/controller/cadence/pcie-cadence-host.c#L52
>>>>
>>>> If there has been a link down in this PCIe port, the register 
>>>> CDNS_PCIE_AT_LINKDOWN must be set to 0 for the AXI transmission to 
>>>> continue.  This is different from Synopsys.
>>>>
>>>> If CPU Core0 runs to code L52 and CPU Core1 is executing NVMe SSD 
>>>> saving files, since the CDNS_PCIE_AT_LINKDOWN register is still 1, 
>>>> it causes CPU Core1 to be unable to send TLP transfers and hang. 
>>>> This is a very extreme situation.
>>>> (The current Cadence code is Legacy PCIe IP, and the HPA IP is still 
>>>> in the upstream process at present.)
>>>>
>>>> Radxa O6 uses Cadence's PCIe HPA IP.
>>>> http://radxa.com/products/orion/o6/
>>>>
>>>
>>> It sounds like a system level issue to me. Why not they rely on 
>>> watchdog to reboot for
>>> this case ?
>>
>> Dear Sathyanarayanan,
>>
>> Thank you for your reply. Yes, personally, I think it's also a problem 
>> at the system level. I conducted a local test. When I directly 
>> unplugged the EP device on the slot, the system would hang. It has 
>> been tested many times. Since we don't have a bus timeout response 
>> mechanism for PCIe, it hangs easily.
> 
> Any comment on why watchdog is not used to reboot the unresponsive system?

Dear Sathyanarayanan,

Thank you very much for your reply.

In my testing, the watchdog does not work reliably every time. There
might be other reasons causing the entire system to hang.


> 
>>
>>>
>>> Even if you want to add this support, I think it is more appropriate 
>>> to add this to your
>>> specific PCIe controller driver.  I don't see why you want to add it 
>>> part of generic
>>> AER driver.
>>>
>> Because we want to use the processing logic of the general AER driver. 
>> If the recovery is successful, there will be no problem. If the 
>> recovery fails, my original intention was to restart the system.
>>
>> If added to the specific PCIe controller driver, a lot of repetitive 
>> AER processing logic will be written. So I was thinking whether the 
>> AER driver could be changed to be compiled as a KO module.
> 
> May be you can rely on err handler callbacks to get notification on 
> fatal errors or you can even use uevent handler to detect the 
> disconnected device event and handle it there.

I will try the method you suggested.

> 
>>
>>
>> If this series is not reasonable, I'll drop it.
> 
> Adding new kernel param to solve a specific system issue is not 
> recommended. Try to find some custom solution for your chip/controller.
> 

Ok. Understood. Thank you again for your reply.

Best regards,
Hans

Re: [PATCH 0/4] pci: implement "pci=aer_panic"

Posted by Hans Zhang 8 months, 3 weeks ago


On 2025/5/19 22:21, Hans Zhang wrote:
> 
> 
> On 2025/5/17 02:10, Sathyanarayanan Kuppuswamy wrote:
>>
>> On 5/16/25 9:55 AM, Hans Zhang wrote:
>>> The following series introduces a new kernel command-line option 
>>> aer_panic
>>> to enhance error handling for PCIe Advanced Error Reporting (AER) in
>>> mission-critical environments. This feature ensures deterministic 
>>> recover
>>> from fatal PCIe errors by triggering a controlled kernel panic when 
>>> device
>>> recovery fails, avoiding indefinite system hangs.
>>
>> Why would a device recovery failure lead to a system hang? Worst case
>> that device may not be accessible, right?  Any real use case?
>>
> 
> 
> Dear Sathyanarayanan,
> 
> Due to Synopsys and Cadence PCIe IP, their AER interrupts are usually 
> SPI interrupts, not INTx/MSI/MSIx interrupts.  (Some customers will 
> design it as an MSI/MSIx interrupt, e.g.: RK3588, but not all customers 
> have designed it this way.)  For example, when many mobile phone SoCs of 
> Qualcomm handle AER interrupts and there is a link down, that is, a 
> fatal problem occurs in the current PCIe physical link, the system 
> cannot recover.  At this point, a system restart is needed to solve the 
> problem.
> 
> And our company design of SOC: http://radxa.com/products/orion/o6/, it 
> has 5 road PCIe port.
> There is also the same problem.  If there is a problem with one of the 
> PCIe ports, it will cause the entire system to hang.  So I hope linux OS 
> can offer an option that enables SOC manufacturers to choose to restart 
> the system in case of fatal hardware errors occurring in PCIe.
> 
> There are also products such as mobile phones and tablets.  We don't 
> want to wait until the battery is completely used up before restarting 
> them.
> 
> For the specific code of Qualcomm, please refer to the email I sent.
> 

Dear Sathyanarayanan,

Supplementary reasons:

drivers/pci/controller/cadence/pcie-cadence-host.c
cdns_pci_map_bus
     /* Clear AXI link-down status */
     cdns_pcie_writel(pcie, CDNS_PCIE_AT_LINKDOWN, 0x0);

https://elixir.bootlin.com/linux/v6.15-rc6/source/drivers/pci/controller/cadence/pcie-cadence-host.c#L52

If a link down has occurred on this PCIe port, the register
CDNS_PCIE_AT_LINKDOWN must be written to 0 before AXI transactions can
continue.  This is different from Synopsys.

If CPU Core0 is running up to line 52 (L52 in the link above) while CPU
Core1 is saving files to an NVMe SSD, the CDNS_PCIE_AT_LINKDOWN register
is still 1, so CPU Core1 cannot send TLPs and hangs.  This is a very
extreme situation.
(The current upstream Cadence code is for the legacy PCIe IP; the HPA IP
is still in the process of being upstreamed.)
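
The dependency described above can be shown with a tiny userspace
model. It is only a sketch: struct model_pcie and both functions are
hypothetical, and a real stalled AXI access would hang the CPU rather
than return false.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the controller state discussed above. */
struct model_pcie {
    uint32_t at_linkdown;   /* stands in for CDNS_PCIE_AT_LINKDOWN */
};

/* Models the quoted write: clear the AXI link-down status. */
void model_clear_linkdown(struct model_pcie *pcie)
{
    pcie->at_linkdown = 0;
}

/*
 * Models an outbound AXI access. It may proceed only after the status
 * has been cleared; the model reports false where hardware would stall.
 */
bool model_axi_can_send(const struct model_pcie *pcie)
{
    return pcie->at_linkdown == 0;
}
```

Any access issued by another core between the link down and the
clearing write falls into the window where model_axi_can_send() is
false.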

Radxa O6 uses Cadence's PCIe HPA IP.
http://radxa.com/products/orion/o6/

Best regards,
Hans

Re: [PATCH 0/4] pci: implement "pci=aer_panic"

Posted by Manivannan Sadhasivam 8 months, 3 weeks ago

On Sat, May 17, 2025 at 12:55:14AM +0800, Hans Zhang wrote:
> The following series introduces a new kernel command-line option aer_panic
> to enhance error handling for PCIe Advanced Error Reporting (AER) in
> mission-critical environments. This feature ensures deterministic recover
> from fatal PCIe errors by triggering a controlled kernel panic when device
> recovery fails, avoiding indefinite system hangs.
> 
> Problem Statement
> In systems where unresolved PCIe errors (e.g., bus hangs) occur,
> traditional error recovery mechanisms may leave the system unresponsive
> indefinitely. This is unacceptable for high-availability environment
> requiring prompt recovery via reboot.
> 
> Solution
> The aer_panic option forces a kernel panic on unrecoverable AER errors.
> This bypasses prolonged recovery attempts and ensures immediate reboot.
> 

You should not panic the kernel when a PCI error occurs (even if it is a fatal
one). You should instead try to reset the root complex. For that you need this
series that got merged recently:
https://lore.kernel.org/all/20250508-pcie-reset-slot-v4-0-7050093e2b50@linaro.org

PS: You need to populate the slot_reset callback in your controller driver to
reset the controller in the event of a fatal AER error or link down.
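
A minimal userspace sketch of that idea follows, assuming a controller-provided
reset hook. The names model_bridge, reset_slot (used here as a struct member),
model_handle_fatal, and model_controller_reset are illustrative assumptions and
are not taken from the linked series.

```c
#include <stddef.h>

enum model_result { MODEL_RECOVERED, MODEL_DISCONNECT };

struct model_bridge {
    /* Hypothetical controller hook; returns 0 when the reset succeeded. */
    int (*reset_slot)(struct model_bridge *bridge);
    int resets;   /* how many times the controller was reset */
};

/* On a fatal error or link down, try a controller reset instead of a panic. */
enum model_result model_handle_fatal(struct model_bridge *bridge)
{
    if (bridge->reset_slot == NULL)
        return MODEL_DISCONNECT;   /* no hook populated by the controller */
    if (bridge->reset_slot(bridge) != 0)
        return MODEL_DISCONNECT;   /* reset was attempted but failed */
    return MODEL_RECOVERED;
}

/* Hypothetical controller implementation of the hook. */
int model_controller_reset(struct model_bridge *bridge)
{
    bridge->resets++;
    return 0;
}
```

In this model the failure case ends with the device disconnected rather than
with the whole system going down.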

- Mani

-- 
மணிவண்ணன் சதாசிவம்

Re: [PATCH 0/4] pci: implement "pci=aer_panic"

Posted by Hans Zhang 8 months, 3 weeks ago


On 2025/5/22 19:47, Manivannan Sadhasivam wrote:
> On Sat, May 17, 2025 at 12:55:14AM +0800, Hans Zhang wrote:
>> The following series introduces a new kernel command-line option aer_panic
>> to enhance error handling for PCIe Advanced Error Reporting (AER) in
>> mission-critical environments. This feature ensures deterministic recover
>> from fatal PCIe errors by triggering a controlled kernel panic when device
>> recovery fails, avoiding indefinite system hangs.
>>
>> Problem Statement
>> In systems where unresolved PCIe errors (e.g., bus hangs) occur,
>> traditional error recovery mechanisms may leave the system unresponsive
>> indefinitely. This is unacceptable for high-availability environment
>> requiring prompt recovery via reboot.
>>
>> Solution
>> The aer_panic option forces a kernel panic on unrecoverable AER errors.
>> This bypasses prolonged recovery attempts and ensures immediate reboot.
>>
> 
> You should not panic the kernel when a PCI error occurs (even if it is a fatal
> one). You should instead try to reset the root complex. For that you need this
> series that got merged recently:
> https://lore.kernel.org/all/20250508-pcie-reset-slot-v4-0-7050093e2b50@linaro.org
> 
> PS: You need to populate the slot_reset callback in your controller driver to
> reset the controller in the event of a fatal AER error or link down.

Dear Mani,

Thank you for your reply. I will take a look at the series you
linked.

Best regards,
Hans