[PATCH v7 4/5] PCI/AER: Clear both AER fatal and non-fatal status

Shuai Xue posted 5 patches 2 weeks, 2 days ago
[PATCH v7 4/5] PCI/AER: Clear both AER fatal and non-fatal status
Posted by Shuai Xue 2 weeks, 2 days ago
The DPC driver clears AER fatal status for the port that reported the
error, but not for the downstream device that deteced the error.  The
current recovery code only clears non-fatal AER status, leaving fatal
status bits set in the error device.

Use pci_aer_raw_clear_status() to clear both fatal and non-fatal error
status in the error device, ensuring all AER status bits are properly
cleared after recovery.

Fixes: aa344bc8b727 ("PCI/ERR: Clear AER status only when we control AER")
Cc: stable@vger.kernel.org
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 drivers/pci/pcie/err.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 0780ea09478b..5e463efc3d05 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -285,7 +285,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	 */
 	if (host->native_aer || pcie_ports_native) {
 		pcie_clear_device_status(dev);
-		pci_aer_clear_nonfatal_status(dev);
+		pci_aer_raw_clear_status(dev);
 	}
 
 	pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);
-- 
2.39.3
Re: [PATCH v7 4/5] PCI/AER: Clear both AER fatal and non-fatal status
Posted by Lukas Wunner 6 days, 3 hours ago
On Sat, Jan 24, 2026 at 03:45:56PM +0800, Shuai Xue wrote:
> The DPC driver clears AER fatal status for the port that reported the
> error, but not for the downstream device that deteced the error.  The
> current recovery code only clears non-fatal AER status, leaving fatal
> status bits set in the error device.

That's not quite accurate:

The error device has undergone a Hot Reset as a result of the Link Down
event.  To be able to use it again, pci_restore_state() is invoked by
the driver's ->slot_reset() callback.  And pci_restore_state() does
clear fatal status bits.

pci_restore_state()
  pci_aer_clear_status()
    pci_aer_raw_clear_status()

> Use pci_aer_raw_clear_status() to clear both fatal and non-fatal error
> status in the error device, ensuring all AER status bits are properly
> cleared after recovery.

Well, pci_restore_state() already clears all AER status bits so why
is this patch necessary?

> +++ b/drivers/pci/pcie/err.c
> @@ -285,7 +285,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>  	 */
>  	if (host->native_aer || pcie_ports_native) {
>  		pcie_clear_device_status(dev);
> -		pci_aer_clear_nonfatal_status(dev);
> +		pci_aer_raw_clear_status(dev);
>  	}

This code path is for the case when pcie_do_recovery() is called
with state=pci_channel_io_normal, i.e. in the nonfatal case.
That's why only the nonfatal bits need to be cleared here.

In the fatal case clearing of the error bits is done by
pci_restore_state().

I understand that this is subtle and should probably be changed
to improve clarity, but this patch doesn't look like a step
in that direction.

Thanks,

Lukas
Re: [PATCH v7 4/5] PCI/AER: Clear both AER fatal and non-fatal status
Posted by Shuai Xue 2 days, 3 hours ago

On 2/3/26 4:06 PM, Lukas Wunner wrote:
> On Sat, Jan 24, 2026 at 03:45:56PM +0800, Shuai Xue wrote:
>> The DPC driver clears AER fatal status for the port that reported the
>> error, but not for the downstream device that deteced the error.  The
>> current recovery code only clears non-fatal AER status, leaving fatal
>> status bits set in the error device.
> 
> That's not quite accurate:
> 
> The error device has undergone a Hot Reset as a result of the Link Down
> event.  To be able to use it again, pci_restore_state() is invoked by
> the driver's ->slot_reset() callback.  And pci_restore_state() does
> clear fatal status bits.
> 
> pci_restore_state()
>    pci_aer_clear_status()
>      pci_aer_raw_clear_status()
> 
>> Use pci_aer_raw_clear_status() to clear both fatal and non-fatal error
>> status in the error device, ensuring all AER status bits are properly
>> cleared after recovery.
> 
> Well, pci_restore_state() already clears all AER status bits so why
> is this patch necessary?

You're right that many drivers call pci_restore_state() in their
->slot_reset() callback, which clears all AER status bits.  However,
since ->slot_reset() is driver-defined and not all drivers invoke
pci_restore_state(), there could be cases where fatal AER status bits
remain set after the frozen recovery completes.

> 
>> +++ b/drivers/pci/pcie/err.c
>> @@ -285,7 +285,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>   	 */
>>   	if (host->native_aer || pcie_ports_native) {
>>   		pcie_clear_device_status(dev);
>> -		pci_aer_clear_nonfatal_status(dev);
>> +		pci_aer_raw_clear_status(dev);
>>   	}
> 
> This code path is for the case when pcie_do_recovery() is called
> with state=pci_channel_io_normal, i.e. in the nonfatal case.
> That's why only the nonfatal bits need to be cleared here.
> 
> In the fatal case clearing of the error bits is done by
> pci_restore_state().
> 
> I understand that this is subtle and should probably be changed
> to improve clarity, but this patch doesn't look like a step
> in that direction.
> 

I notice that pcie_clear_device_status()
(called just before pci_aer_clear_nonfatal_status()) already clears
*all* error bits in the Device Status register (CED, NFED, FED, URD),
including the Fatal Error Detected (FED) bit, regardless of whether
we're in a fatal or nonfatal recovery path.

So there's an inconsistency in the current design:
- pcie_clear_device_status() clears all error bits (including FED)
- pci_aer_clear_nonfatal_status() only clears nonfatal AER status

This means that even in the nonfatal case, the FED bit in Device
Status is cleared, even though we preserve the fatal bits in the AER
Uncorrectable Error Status register.

Would you prefer that I:
1. Make both pcie_clear_device_status() and the AER clearing
    conditional on the 'state' parameter, so that nonfatal recovery
    truly preserves fatal bits in both registers, or
2. Drop this patch and instead move the AER-specific clearing logic
    out of pcie_do_recovery() entirely (as you suggested earlier), so
    that each caller can handle the status clearing explicitly for the
    appropriate error sources and severity?

Thanks,
Shuai
Re: [PATCH v7 4/5] PCI/AER: Clear both AER fatal and non-fatal status
Posted by Kuppuswamy Sathyanarayanan 1 week, 4 days ago

On 1/23/2026 11:45 PM, Shuai Xue wrote:
> The DPC driver clears AER fatal status for the port that reported the
> error, but not for the downstream device that deteced the error.  The
> current recovery code only clears non-fatal AER status, leaving fatal
> status bits set in the error device.
> 
> Use pci_aer_raw_clear_status() to clear both fatal and non-fatal error
> status in the error device, ensuring all AER status bits are properly
> cleared after recovery.
> 
> Fixes: aa344bc8b727 ("PCI/ERR: Clear AER status only when we control AER")
> Cc: stable@vger.kernel.org
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> ---

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>


>  drivers/pci/pcie/err.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 0780ea09478b..5e463efc3d05 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -285,7 +285,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>  	 */
>  	if (host->native_aer || pcie_ports_native) {
>  		pcie_clear_device_status(dev);
> -		pci_aer_clear_nonfatal_status(dev);
> +		pci_aer_raw_clear_status(dev);
>  	}
>  
>  	pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
Re: [PATCH v7 4/5] PCI/AER: Clear both AER fatal and non-fatal status
Posted by Jonathan Cameron 1 week, 6 days ago
On Sat, 24 Jan 2026 15:45:56 +0800
Shuai Xue <xueshuai@linux.alibaba.com> wrote:

> The DPC driver clears AER fatal status for the port that reported the
> error, but not for the downstream device that deteced the error.  The
> current recovery code only clears non-fatal AER status, leaving fatal
> status bits set in the error device.
> 
> Use pci_aer_raw_clear_status() to clear both fatal and non-fatal error
> status in the error device, ensuring all AER status bits are properly
> cleared after recovery.
> 
> Fixes: aa344bc8b727 ("PCI/ERR: Clear AER status only when we control AER")
> Cc: stable@vger.kernel.org
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Shouldn't this be first patch in series to make it easier to backport?

Otherwise seems reasonable to me, but others know these flows better than me
so hopefully we'll get some more review.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

> ---
>  drivers/pci/pcie/err.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 0780ea09478b..5e463efc3d05 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -285,7 +285,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>  	 */
>  	if (host->native_aer || pcie_ports_native) {
>  		pcie_clear_device_status(dev);
> -		pci_aer_clear_nonfatal_status(dev);
> +		pci_aer_raw_clear_status(dev);
>  	}
>  
>  	pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);
Re: [PATCH v7 4/5] PCI/AER: Clear both AER fatal and non-fatal status
Posted by Shuai Xue 1 week, 4 days ago

On 1/27/26 6:39 PM, Jonathan Cameron wrote:
> On Sat, 24 Jan 2026 15:45:56 +0800
> Shuai Xue <xueshuai@linux.alibaba.com> wrote:
> 
>> The DPC driver clears AER fatal status for the port that reported the
>> error, but not for the downstream device that deteced the error.  The
>> current recovery code only clears non-fatal AER status, leaving fatal
>> status bits set in the error device.
>>
>> Use pci_aer_raw_clear_status() to clear both fatal and non-fatal error
>> status in the error device, ensuring all AER status bits are properly
>> cleared after recovery.
>>
>> Fixes: aa344bc8b727 ("PCI/ERR: Clear AER status only when we control AER")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> Shouldn't this be first patch in series to make it easier to backport?

Yes, you are right. Will move it as the first one.

> 
> Otherwise seems reasonable to me, but others know these flows better than me
> so hopefully we'll get some more review.
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Thanks.

Best Regards,
Shuai