When an Advisory Non-Fatal error(ANFE) triggers, both correctable error(CE)
status and ANFE related uncorrectable error(UE) status will be printed:
AER: Correctable error message received from 0000:b7:02.0
PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
device [8086:0db0] error status/mask=00002000/00000000
[13] NonFatalErr
Uncorrectable errors that may cause Advisory Non-Fatal:
[12] TLP
Tested-by: Yudong Wang <yudong.wang@intel.com>
Co-developed-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
Signed-off-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
drivers/pci/pcie/aer.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 3dcfa0191169..ba3a54092f2c 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -681,6 +681,7 @@ static void __aer_print_error(struct pci_dev *dev,
{
const char **strings;
unsigned long status = info->status & ~info->mask;
+ unsigned long anfe_status = info->anfe_status;
const char *level, *errmsg;
int i;
@@ -701,6 +702,20 @@ static void __aer_print_error(struct pci_dev *dev,
info->first_error == i ? " (First)" : "");
}
pci_dev_aer_stats_incr(dev, info);
+
+ if (!anfe_status)
+ return;
+
+ strings = aer_uncorrectable_error_string;
+ pci_printk(level, dev, "Uncorrectable errors that may cause Advisory Non-Fatal:\n");
+
+ for_each_set_bit(i, &anfe_status, 32) {
+ errmsg = strings[i];
+ if (!errmsg)
+ errmsg = "Unknown Error Bit";
+
+ pci_printk(level, dev, " [%2d] %s\n", i, errmsg);
+ }
}
void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
--
2.34.1
[+cc Matt]
On Thu, Jun 20, 2024 at 10:58:57AM +0800, Zhenzhong Duan wrote:
> When an Advisory Non-Fatal error(ANFE) triggers, both correctable error(CE)
> status and ANFE related uncorrectable error(UE) status will be printed:
>
> AER: Correctable error message received from 0000:b7:02.0
> PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> device [8086:0db0] error status/mask=00002000/00000000
> [13] NonFatalErr
> Uncorrectable errors that may cause Advisory Non-Fatal:
> [12] TLP
>
> Tested-by: Yudong Wang <yudong.wang@intel.com>
> Co-developed-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
> Signed-off-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
> drivers/pci/pcie/aer.c | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 3dcfa0191169..ba3a54092f2c 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -681,6 +681,7 @@ static void __aer_print_error(struct pci_dev *dev,
> {
> const char **strings;
> unsigned long status = info->status & ~info->mask;
> + unsigned long anfe_status = info->anfe_status;
> const char *level, *errmsg;
> int i;
>
> @@ -701,6 +702,20 @@ static void __aer_print_error(struct pci_dev *dev,
> info->first_error == i ? " (First)" : "");
> }
> pci_dev_aer_stats_incr(dev, info);
> +
> + if (!anfe_status)
> + return;
__aer_print_error() is used by both native AER handling, where Linux
fields the AER interrupt and reads the AER status registers directly,
and APEI GHES firmware-first error handling, where platform firmware
fields the AER interrupt, reads the AER status registers, and packages
them up to hand off to Linux via aer_recover_queue().
But the previous patch only sets info->anfe_status for the native
path, so the APEI GHES path doesn't get the benefit of this change.
I think both paths should log the same ANFE information.
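One way to let both paths log the same thing would be a helper that works on raw register values, regardless of whether Linux read them or firmware supplied them. A minimal userspace sketch, assuming the ANFE candidates are the unmasked, non-fatal-severity bits of the uncorrectable status and that they only matter when bit 13 (NonFatalErr in the log above) is set in the correctable status. The struct, macro, and function names are hypothetical, not the kernel's:

```c
#include <stdint.h>

/*
 * Hypothetical container for raw AER register values. In the native path
 * these would be read from config space; in the firmware-first path they
 * would come from the record that firmware hands to Linux.
 */
struct aer_regs_sketch {
	uint32_t cor_status;
	uint32_t uncor_status;
	uint32_t uncor_mask;
	uint32_t uncor_severity;	/* bit set = fatal, clear = non-fatal */
};

/* Bit 13 of the correctable status, "[13] NonFatalErr" in the log above. */
#define COR_ADV_NFAT_SKETCH	(1u << 13)

/* Assumption: ANFE candidates are unmasked, non-fatal uncorrectable bits. */
static uint32_t anfe_status_sketch(const struct aer_regs_sketch *r)
{
	if (!(r->cor_status & COR_ADV_NFAT_SKETCH))
		return 0;

	return r->uncor_status & ~r->uncor_mask & ~r->uncor_severity;
}
```

With a helper of this shape, the native and APEI GHES paths would only differ in how they fill in the register values.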
> +
> + strings = aer_uncorrectable_error_string;
> + pci_printk(level, dev, "Uncorrectable errors that may cause Advisory Non-Fatal:\n");
> +
> + for_each_set_bit(i, &anfe_status, 32) {
> + errmsg = strings[i];
> + if (!errmsg)
> + errmsg = "Unknown Error Bit";
> +
> + pci_printk(level, dev, " [%2d] %s\n", i, errmsg);
> + }
> }
>
> void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> --
> 2.34.1
>
On Thu, Jun 20, 2024 at 10:58:57AM +0800, Zhenzhong Duan wrote:
> When an Advisory Non-Fatal error(ANFE) triggers, both correctable error(CE)
> status and ANFE related uncorrectable error(UE) status will be printed:
>
> AER: Correctable error message received from 0000:b7:02.0
> PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> device [8086:0db0] error status/mask=00002000/00000000
> [13] NonFatalErr
> Uncorrectable errors that may cause Advisory Non-Fatal:
> [12] TLP
I forgot to mention this on the other patch, but please add a space
between each spelled-out term and its "()" abbreviation, e.g.,
"Correctable Error (CE)".
Also, can you update this commit log to say what the patch does? It's
OK if it repeats and/or expands on the subject.
> Tested-by: Yudong Wang <yudong.wang@intel.com>
> Co-developed-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
> Signed-off-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
> drivers/pci/pcie/aer.c | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 3dcfa0191169..ba3a54092f2c 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -681,6 +681,7 @@ static void __aer_print_error(struct pci_dev *dev,
> {
> const char **strings;
> unsigned long status = info->status & ~info->mask;
> + unsigned long anfe_status = info->anfe_status;
> const char *level, *errmsg;
> int i;
>
> @@ -701,6 +702,20 @@ static void __aer_print_error(struct pci_dev *dev,
> info->first_error == i ? " (First)" : "");
> }
> pci_dev_aer_stats_incr(dev, info);
> +
> + if (!anfe_status)
> + return;
> +
> + strings = aer_uncorrectable_error_string;
> + pci_printk(level, dev, "Uncorrectable errors that may cause Advisory Non-Fatal:\n");
Will have to look at the spec more, but I don't think "may cause" is
quite the right wording here. It's not that an Uncorrectable Error
causes a separate Advisory Non-Fatal Error; IIUC there's only a single
error and it's just *treated* and signaled differently.
> +
> + for_each_set_bit(i, &anfe_status, 32) {
> + errmsg = strings[i];
> + if (!errmsg)
> + errmsg = "Unknown Error Bit";
> +
> + pci_printk(level, dev, " [%2d] %s\n", i, errmsg);
I think we might have removed pci_printk() recently, so this might
need adjustment.
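Which print helper to use depends on the tree the patch is rebased onto, so the following is only a userspace sketch of the decode loop itself, with snprintf() standing in for the kernel print call. The name table is hypothetical and deliberately incomplete: only bit 12 is filled in, taken from the example log in the commit message.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical table; only the entry visible in the example log is set. */
static const char *unc_names_sketch[32] = {
	[12] = "TLP",
};

/*
 * Append one " [bit] name" line per set bit to buf.
 * Returns the number of bytes of complete lines written (excluding NUL).
 */
static size_t format_anfe_sketch(uint32_t status, char *buf, size_t len)
{
	size_t off = 0;

	if (len)
		buf[0] = '\0';

	for (int i = 0; i < 32 && off < len; i++) {
		const char *msg;
		int n;

		if (!(status & (1u << i)))
			continue;

		msg = unc_names_sketch[i];
		if (!msg)
			msg = "Unknown Error Bit";

		n = snprintf(buf + off, len - off, " [%2d] %s\n", i, msg);
		if (n < 0 || (size_t)n >= len - off) {
			buf[off] = '\0';	/* drop the truncated line */
			break;
		}
		off += (size_t)n;
	}
	return off;
}
```

Keeping the formatting separate from the print call means only one line has to change when the print helper is swapped.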
> + }
> }
>
> void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> --
> 2.34.1
>