Hi,
This continues Qingshun's v2 [1], but is changed to focus on ANFE
processing, as the subject suggests, and drops the trace event for now. I
think it's a bit heavy to do extra I/O to read PCIe registers only for
tracing, and I don't see a community request for it at the moment.
According to PCIe Base Specification Revision 6.1, Sections 6.2.3.2.4 and
6.2.4.3, certain uncorrectable errors will signal ERR_COR instead of
ERR_NONFATAL, be logged as Advisory Non-Fatal Error (ANFE), and set bits in
both the Correctable Error (CE) Status register and the Uncorrectable Error
(UE) Status register. Currently, when handling AER events, the kernel looks
only at CE status or UE status, but never both. In the ANFE case, bits set
in the UE status register are neither reported nor cleared until the next
FE/NFE arrives.
For instance, when the kernel previously received an ANFE with Poisoned
TLP in OS-native AER mode, only the CE status was reported and cleared:
  AER: Correctable error message received from 0000:b7:02.0
  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00002000/00000000
     [13] NonFatalErr
If the kernel receives a Malformed TLP after that, two UEs are reported,
which is unexpected. The Malformed TLP header is lost because the previous
ANFE gated the TLP header logs:
  PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00041000/00180020
     [12] TLP (First)
     [18] MalfTLP
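The status word in the report above can be checked by hand: 0x00041000 has exactly bits 12 and 18 set, which is why two UEs show up. The following is only an illustrative sketch in user-space C; the helper name `aer_decode_bits` is hypothetical and is not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the kernel): store the positions of the
 * set bits of a 32-bit AER status word in out[], lowest bit first, and
 * return how many bits were set.
 */
size_t aer_decode_bits(uint32_t status, unsigned int out[32])
{
	size_t n = 0;

	assert(out != NULL);
	for (unsigned int bit = 0; bit < 32; bit++) {
		if (status & (UINT32_C(1) << bit))
			out[n++] = bit;
	}
	return n;
}
```

Applied to 0x00041000 this yields bits 12 and 18; applied to the CE status 0x00002000 from the first report it yields bit 13.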
To handle this case properly, calculate the potentially ANFE-related status
bits and save them in aer_err_info. Use this information to determine the
status bits that need to be cleared.
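The calculation described above might look roughly like the sketch below. It is a user-space illustration under stated assumptions, not the code from the patches. The struct `anfe_info`, the function `anfe_fill`, and the macro `ANFE_UNC_CANDIDATES` are hypothetical names. The set of candidate error types is one reading of Section 6.2.3.2.4, and filtering by the UE mask and severity registers is likewise an assumption. The register bit values follow include/uapi/linux/pci_regs.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* AER status bits; values as in include/uapi/linux/pci_regs.h */
#define PCI_ERR_COR_ADV_NFAT	0x00002000u	/* Advisory Non-Fatal */
#define PCI_ERR_UNC_POISON_TLP	0x00001000u	/* Poisoned TLP */
#define PCI_ERR_UNC_COMP_TIME	0x00004000u	/* Completion Timeout */
#define PCI_ERR_UNC_COMP_ABORT	0x00008000u	/* Completer Abort */
#define PCI_ERR_UNC_UNX_COMP	0x00010000u	/* Unexpected Completion */
#define PCI_ERR_UNC_UNSUP	0x00100000u	/* Unsupported Request */

/* Assumption: UE types that may be signaled as advisory (hypothetical name) */
#define ANFE_UNC_CANDIDATES	(PCI_ERR_UNC_POISON_TLP | PCI_ERR_UNC_COMP_TIME | \
				 PCI_ERR_UNC_COMP_ABORT | PCI_ERR_UNC_UNX_COMP | \
				 PCI_ERR_UNC_UNSUP)

/* Hypothetical stand-in for the fields kept in aer_err_info */
struct anfe_info {
	uint32_t cor_status;	/* CE status as read from the device */
	uint32_t anfe_status;	/* UE status bits that may belong to the ANFE */
};

/*
 * Record which UE status bits could have been raised as ANFE: the bit is
 * set, not masked, configured as Non-Fatal, and of a candidate type. If the
 * CE status does not flag Advisory Non-Fatal, nothing is attributed.
 */
void anfe_fill(struct anfe_info *info, uint32_t cor_status,
	       uint32_t uncor_status, uint32_t uncor_mask,
	       uint32_t uncor_sever)
{
	assert(info != NULL);
	info->cor_status = cor_status;
	info->anfe_status = 0;

	if (!(cor_status & PCI_ERR_COR_ADV_NFAT))
		return;

	info->anfe_status = uncor_status & ~uncor_mask & ~uncor_sever &
			    ANFE_UNC_CANDIDATES;
}
```

With the register values from the logs above (CE status 0x00002000, UE status 0x00001000, UE mask 0x00180020) and an assumed severity register of 0x00062030, `anfe_status` comes out as 0x00001000, i.e. only the Poisoned TLP bit would be cleared along with the CE status.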
Now, for the previous scenario, both CE status and related UE status will
be reported and cleared after ANFE:
  AER: Correctable error message received from 0000:b7:02.0
  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00002000/00000000
     [13] NonFatalErr
    Uncorrectable errors that may cause Advisory Non-Fatal:
     [12] TLP
Note:
checkpatch.pl produces the following warnings on PATCH 1 and 2:
WARNING: 'UE' may be misspelled - perhaps 'USE'?
#22:
uncorrectable error(UE) status should be cleared. However, there is no
...similar warnings omitted...
This is a false-positive, so not fixed.
WARNING: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
#10:
PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
...similar warnings omitted...
For readability reasons, these warnings are not fixed.
[1] https://lore.kernel.org/linux-pci/20240125062802.50819-1-qingshun.wang@linux.intel.com
Thanks
Qingshun, Zhenzhong
Changelog:
v5:
- squash patch 1 and 3 (Kuppuswamy)
- add comment about avoiding race and fix typo error (Kuppuswamy)
- collect Jonathan and Kuppuswamy's R-b
v4:
- Fix a race in anfe_get_uc_status() (Jonathan)
- Add a comment to explain side effect of processing ANFE as NFE (Jonathan)
- Drop the check for PCI_EXP_DEVSTA_NFED
v3:
- Split ANFE print and processing to two patches (Bjorn)
- Simplify ANFE handling, drop trace event
- Polish comments and patch description
- Add Tested-by
v2:
- Reference to the latest PCIe Specification in both commit messages
and comments, as suggested by Bjorn Helgaas.
- Describe the reason for storing additional information in
aer_err_info in the commit message of PATCH 1, as suggested by Bjorn
Helgaas.
- Add more details of behavior changes in the commit message of PATCH
2, as suggested by Bjorn Helgaas.
v4: https://lkml.org/lkml/2024/5/9/247
v3: https://lore.kernel.org/lkml/20240417061407.1491361-1-zhenzhong.duan@intel.com
v2: https://lore.kernel.org/linux-pci/20240125062802.50819-1-qingshun.wang@linux.intel.com
v1: https://lore.kernel.org/linux-pci/20240111073227.31488-1-qingshun.wang@linux.intel.com
Zhenzhong Duan (2):
PCI/AER: Clear UNCOR_STATUS bits that might be ANFE
PCI/AER: Print UNCOR_STATUS bits that might be ANFE
drivers/pci/pci.h | 1 +
drivers/pci/pcie/aer.c | 79 +++++++++++++++++++++++++++++++++++++++++-
2 files changed, 79 insertions(+), 1 deletion(-)
--
2.34.1
Hi Bjorn,

Kindly ping. This series has received Reviewed-by tags and no further
comments for a month. Would you consider picking it up, or are further
improvements needed? I look forward to your suggestions.

Thanks
Zhenzhong

>-----Original Message-----
>From: Duan, Zhenzhong <zhenzhong.duan@intel.com>
>Subject: [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error
>
>[...]
Hello. My team had independently started to make a change similar to this
before realizing that someone had already taken a stab at it. It is highly
desirable in my mind to have an improved handling of Advisory Errors in
the upstream kernel. Is there anything we can do to help move this effort
along? Perhaps testing? We have a decent variety of system configurations &
are able to inject various kinds of errors via special devices/commands etc.

Thanks,
-Matt
Hi Matthew,

Feel free to take it over if you are interested. Maintainer didn't
respond to this series, perhaps he expects some improvement in the
series.

Thanks
Zhenzhong

>-----Original Message-----
>From: Matthew W Carlis <mattc@purestorage.com>
>Subject: [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error
>
>[...]
On Fri, Aug 22, 2025 at 01:45:30AM +0000, Duan, Zhenzhong wrote:
> Hi Matthew,
>
> Feel free to take it over if you are interested. Maintainer didn't
> respond to this series, perhaps he expects some improvement in the
> series.

I'm terribly sorry, this is my fault. It just fell off my list for no
good reason. Matthew, if you are able to test and/or provide a
Reviewed-by, that would be the best thing you can do to move this
forward (although neither is actually necessary).

Bjorn
On Fri, 22 Aug 2025 11:51:12 -0500, Bjorn Helgaas wrote
> Matthew, if you are able to test and/or provide a Reviewed-by, that would
> be the best thing you can do to move this forward ...

I spent some time looking at the patch and thinking about it a little more
carefully. The only thing I don't really like in this revision of the patch
is the logging for "may cause Advisory". Example below from "[PATCH v5 2/2]
PCI/AER: Print UNCOR_STATUS bits that might be ANFE".

  AER: Correctable error message received from 0000:b7:02.0
  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00002000/00000000
     [13] NonFatalErr
    Uncorrectable errors that may cause Advisory Non-Fatal:
     [12] TLP

I don't think we really need to log the UE caused by ANF any differently
than any other UE & in fact I would prefer not to. In my mind we should log
all the UE status bits via the same format as before. Taking from the
example above, in my mind it would be nice if the logging looked like this.

  AER: Correctable error message received from 0000:b7:02.0
  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00002000/00000000
     [13] NonFatalErr
  PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer
     [12] TLP

If there was only one error (that triggered ANF handling) then we would know
that the Non-Fatal UE was what triggered the NonFatalErr. If some other
Non-Fatal errors are happening at the same time then it doesn't really
matter which was sent via ERR_COR vs ERR_NONFATAL since we would also know
from Root Error Status that we had received at least one of each message
type. The objective in my mind being to free up header-logs & log status
details without making the error recovery worse.

Does this sound reasonable or unreasonable? I can update the patch-set &
re-submit if 'reasonable'.

Cheers!
-Matt
On Fri, 22 Aug 2025 11:51:12 -0500, Bjorn Helgaas wrote
> I'm terribly sorry, this is my fault. It just fell off my list for no
> good reason. Matthew, if you are able to test and/or provide a
> Reviewed-by, that would be the best thing you can do to move this
> forward (although neither is actually necessary).

It seems for pci there is always a massive list of things in flight..
Difficult for any mortal to keep up with. We pulled the patch into our
kernel & have started testing it. I'll sync-up with my team internally to
see exactly what the plan is & how long we think it will take.

Cheers!
-Matt
>-----Original Message-----
>From: Matthew W Carlis <mattc@purestorage.com>
>Subject: [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error
>
>On Fri, 22 Aug 2025 11:51:12 -0500, Bjorn Helgaas wrote
>> I'm terribly sorry, this is my fault. It just fell off my list for no
>> good reason. Matthew, if you are able to test and/or provide a
>> Reviewed-by, that would be the best thing you can do to move this
>> forward (although neither is actually necessary).
>
>It seems for pci there is always a massive list of things in flight..
>Difficult for any mortal to keep up with.

Fully agree, never mind, Bjorn.

BRs,
Zhenzhong