This is a continuation of the CXL port error handling RFC from earlier.[1]
The RFC resulted in the decision to add CXL PCIe port error handling to
the existing RCH downstream port handling. This patchset adds the CXL PCIe
port handling and logging.

The first 7 patches update the existing AER service driver to support CXL
PCIe port protocol error handling and reporting. This includes AER service
driver changes for adding correctable and uncorrectable error support, CXL
specific recovery handling, and addition of CXL driver callback handlers.

The following 8 patches address CXL driver support for CXL PCIe port
protocol errors. This includes the following changes to the CXL drivers:
mapping CXL port and downstream port RAS registers, interface updates for
common RCH and VH, adding port specific error handlers, and protocol error
logging.

[1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554
-1-terry.bowman@amd.com/

Testing:

Below are test results for this patchset. This is using QEMU with a root
port (0c:00.0), upstream switch port (0d:00.0), and downstream switch port
(0e:00.0).

This was tested using aer-inject updated to support CE and UCE internal
error injection. CXL RAS was set using a test patch (not upstreamed).

Root port UCE:
root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
[ 27.318920] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
[ 27.320164] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
[ 27.321518] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
[ 27.322483] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000
[ 27.323243] pcieport 0000:0c:00.0: [22] UncorrIntErr
[ 27.325584] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
[ 27.325584]
[ 27.327171] cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error'
first_error: 'Memory Address Parity Error'
[ 27.333277] Kernel panic - not syncing: CXL cachemem error. Invoking panic
[ 27.333872] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3857
[ 27.334761] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 27.335716] Call Trace:
[ 27.335985] <TASK>
[ 27.336226] panic+0x2ed/0x320
[ 27.336547] ? __pfx_cxl_report_normal_detected+0x10/0x10
[ 27.337037] ? __pfx_aer_root_reset+0x10/0x10
[ 27.337453] cxl_do_recovery+0x304/0x310
[ 27.337833] aer_isr+0x3fd/0x700
[ 27.338154] ? __pfx_irq_thread_fn+0x10/0x10
[ 27.338572] irq_thread_fn+0x1f/0x60
[ 27.338923] irq_thread+0x102/0x1b0
[ 27.339267] ? __pfx_irq_thread_dtor+0x10/0x10
[ 27.339683] ? __pfx_irq_thread+0x10/0x10
[ 27.340059] kthread+0xcd/0x100
[ 27.340387] ? __pfx_kthread+0x10/0x10
[ 27.340748] ret_from_fork+0x2f/0x50
[ 27.341100] ? __pfx_kthread+0x10/0x10
[ 27.341466] ret_from_fork_asm+0x1a/0x30
[ 27.341842] </TASK>
[ 27.342281] Kernel Offset: 0x1ba00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 27.343221] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---

Root port CE:
root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
[ 19.444339] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
[ 19.445530] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
[ 19.446750] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
[ 19.447742] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00004000/0000a000
[ 19.448549] pcieport 0000:0c:00.0: [14] CorrIntErr
[ 19.449223] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
[ 19.449223]
[ 19.451415] cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'

Upstream switch port UCE:
root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
[ 45.236853] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
[ 45.238101] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
[ 45.239416] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
[ 45.240412] pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00400000/02000000
[ 45.241159] pcieport 0000:0d:00.0: [22] UncorrIntErr
[ 45.242448] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
[ 45.242448]
[ 45.244008] cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error'
first_error: 'Memory Address Parity Error'
[ 45.249129] Kernel panic - not syncing: CXL cachemem error. Invoking panic
[ 45.249800] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3855
[ 45.250795] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 45.251907] Call Trace:
[ 45.253284] <TASK>
[ 45.253564] panic+0x2ed/0x320
[ 45.253909] ? __pfx_cxl_report_normal_detected+0x10/0x10
[ 45.255455] ? __pfx_aer_root_reset+0x10/0x10
[ 45.255915] cxl_do_recovery+0x304/0x310
[ 45.257219] aer_isr+0x3fd/0x700
[ 45.257572] ? __pfx_irq_thread_fn+0x10/0x10
[ 45.258006] irq_thread_fn+0x1f/0x60
[ 45.258383] irq_thread+0x102/0x1b0
[ 45.258748] ? __pfx_irq_thread_dtor+0x10/0x10
[ 45.259196] ? __pfx_irq_thread+0x10/0x10
[ 45.259605] kthread+0xcd/0x100
[ 45.259956] ? __pfx_kthread+0x10/0x10
[ 45.260386] ret_from_fork+0x2f/0x50
[ 45.260879] ? __pfx_kthread+0x10/0x10
[ 45.261418] ret_from_fork_asm+0x1a/0x30
[ 45.261936] </TASK>
[ 45.262451] Kernel Offset: 0xc600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 45.263467] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---

Upstream switch port CE:
root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
[ 37.504029] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
[ 37.506076] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
[ 37.507599] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
[ 37.508759] pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00004000/0000a000
[ 37.509574] pcieport 0000:0d:00.0: [14] CorrIntErr
[ 37.510180] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
[ 37.510180]
[ 37.512057] cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'

Downstream switch port UCE:
root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
[ 29.421532] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
[ 29.422812] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
[ 29.424551] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
[ 29.425670] pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00400000/02000000
[ 29.426487] pcieport 0000:0e:00.0: [22] UncorrIntErr
[ 29.427111] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
[ 29.427111]
[ 29.428688] cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error'
first_error: 'Memory Address Parity Error'
[ 29.430173] Kernel panic - not syncing: CXL cachemem error. Invoking panic
[ 29.430862] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g844fd2319372 #3851
[ 29.431874] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 29.433031] Call Trace:
[ 29.433354] <TASK>
[ 29.433631] panic+0x2ed/0x320
[ 29.434010] ? __pfx_cxl_report_normal_detected+0x10/0x10
[ 29.434653] ? __pfx_aer_root_reset+0x10/0x10
[ 29.435179] cxl_do_recovery+0x304/0x310
[ 29.435626] aer_isr+0x3fd/0x700
[ 29.436027] ? __pfx_irq_thread_fn+0x10/0x10
[ 29.436507] irq_thread_fn+0x1f/0x60
[ 29.436898] irq_thread+0x102/0x1b0
[ 29.437293] ? __pfx_irq_thread_dtor+0x10/0x10
[ 29.437758] ? __pfx_irq_thread+0x10/0x10
[ 29.438189] kthread+0xcd/0x100
[ 29.438551] ? __pfx_kthread+0x10/0x10
[ 29.438959] ret_from_fork+0x2f/0x50
[ 29.439362] ? __pfx_kthread+0x10/0x10
[ 29.439771] ret_from_fork_asm+0x1a/0x30
[ 29.440221] </TASK>
[ 29.440738] Kernel Offset: 0x10a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 29.441812] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---

Downstream switch port CE:
root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
[ 177.114442] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
[ 177.115602] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
[ 177.116973] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
[ 177.117985] pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00004000/0000a000
[ 177.118809] pcieport 0000:0e:00.0: [14] CorrIntErr
[ 177.119521] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
[ 177.119521]
[ 177.122037] cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'

Changes RFC->v1:
[Dan] Rename cxl_rch_handle_error() to cxl_handle_error()
[Dan] Add cxl_do_recovery()
[Jonathan] Flatten cxl_setup_parent_uport()
[Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
[Jonathan] Rename cxl_dev_is_pci_type()
[Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
replace these find_cxl_port() and device_find_child().
[Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
[Ming] Don't use endpoint as host to cxl_map_component_regs()
[Bjorn] Use "PCIe UIR/CIE" instead of "AER UI/CIE"
[TODO][Bjorn] Don't use Kconfig to enable/disable a CXL external interface

Terry Bowman (15):
  cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service driver
  cxl/aer/pci: Update is_internal_error() to be callable w/o CONFIG_PCIEAER_CXL
  cxl/aer/pci: Refactor AER driver's existing interfaces to support CXL PCIe ports
  cxl/aer/pci: Add CXL PCIe port correctable error support in AER service driver
  cxl/aer/pci: Update AER driver to read UCE fatal status for all CXL PCIe port devices
  cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type
  cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER service driver
  cxl/pci: Change find_cxl_ports() to be non-static
  cxl/pci: Map CXL PCIe downstream port RAS registers
  cxl/pci: Map CXL PCIe upstream port RAS registers
  cxl/pci: Update RAS handler interfaces to support CXL PCIe ports
  cxl/pci: Add error handler for CXL PCIe port RAS errors
  cxl/pci: Add trace logging for CXL PCIe port RAS errors
  cxl/aer/pci: Export pci_aer_unmask_internal_errors()
  cxl/pci: Enable internal CE/UCE interrupts for CXL PCIe port devices

 drivers/cxl/core/core.h  |   3 +
 drivers/cxl/core/pci.c   | 172 +++++++++++++++++++++++++++++++--------
 drivers/cxl/core/port.c  |   4 +-
 drivers/cxl/core/trace.h |  47 +++++++++++
 drivers/cxl/cxl.h        |  14 +++-
 drivers/cxl/mem.c        |  30 ++++++-
 drivers/cxl/pci.c        |   8 ++
 drivers/pci/pci.h        |   5 ++
 drivers/pci/pcie/aer.c   | 123 ++++++++++++++++++++--------
 drivers/pci/pcie/err.c   | 150 ++++++++++++++++++++++++++++++++++
 include/linux/aer.h      |  16 ++++
 include/linux/pci.h      |   3 +
 12 files changed, 503 insertions(+), 72 deletions(-)

base-commit: f7982d85e136ba7e26b31a725c1841373f81f84a
--
2.34.1
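For readers matching the injected values in the test logs above to the reported bits: aer_inject prints the correctable and uncorrectable status words as a pair (for example 00000000/00400000), and the AER driver then reports the set bit by index ([22] UncorrIntErr, [14] CorrIntErr). The following is a minimal userspace sketch of that decoding, not code from this series. The two bit values are the ones defined as PCI_ERR_COR_INTERNAL and PCI_ERR_UNC_INTN in include/uapi/linux/pci_regs.h; the helper names lowest_set_bit() and is_reported() are hypothetical and exist only for illustration.

```c
#include <stdint.h>

/* Values as defined in include/uapi/linux/pci_regs.h. */
#define PCI_ERR_COR_INTERNAL 0x00004000u /* Corrected Internal Error, bit 14 */
#define PCI_ERR_UNC_INTN     0x00400000u /* Uncorrectable Internal Error, bit 22 */

/* Hypothetical helper: index of the lowest set bit, or -1 if no bit is set. */
static int lowest_set_bit(uint32_t status)
{
	for (int bit = 0; bit < 32; bit++)
		if (status & ((uint32_t)1 << bit))
			return bit;
	return -1;
}

/*
 * Hypothetical helper: nonzero when 'bit' is set in 'status' and is not
 * masked off by 'mask' (the "error status/mask=" pair in the logs).
 */
static int is_reported(uint32_t status, uint32_t mask, uint32_t bit)
{
	return (status & ~mask & bit) != 0;
}
```

With the logged pairs, status/mask=00400000/02000000 leaves bit 22 unmasked and status/mask=00004000/0000a000 leaves bit 14 unmasked, which is consistent with both internal errors being reported in the runs above.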
On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
> This is a continuation of the CXL port error handling RFC from earlier.[1]
> The RFC resulted in the decision to add CXL PCIe port error handling to
> the existing RCH downstream port handling. This patchset adds the CXL PCIe
> port handling and logging.
>
> The first 7 patches update the existing AER service driver to support CXL
> PCIe port protocol error handling and reporting. This includes AER service
> driver changes for adding correctable and uncorrectable error support, CXL
> specific recovery handling, and addition of CXL driver callback handlers.
>
> The following 8 patches address CXL driver support for CXL PCIe port
> protocol errors. This includes the following changes to the CXL drivers:
> mapping CXL port and downstream port RAS registers, interface updates for
> common RCH and VH, adding port specific error handlers, and protocol error
> logging.
>
> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554
> -1-terry.bowman@amd.com/

Makes life easier if URLs are all on one line so they still work.

> Testing:
>
> Below are test results for this patchset. This is using Qemu with a root
> port (0c:00.0), upstream switch port (0d:00.0),and downstream switch port
> (0e:00.0).
>
> This was tested using aer-inject updated to support CE and UCE internal
> error injection. CXL RAS was set using a test patch (not upstreamed).
>
> Root port UCE:
> root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
> [ 27.318920] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> [ 27.320164] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> [ 27.321518] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> [ 27.322483] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000
> [ 27.323243] pcieport 0000:0c:00.0: [22] UncorrIntErr
> [ 27.325584] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> [ 27.325584]
> [ 27.327171] cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error'
> first_error: 'Memory Address Parity Error'
> [ 27.333277] Kernel panic - not syncing: CXL cachemem error. Invoking panic
> [ 27.333872] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3857
> [ 27.334761] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [ 27.335716] Call Trace:
> [ 27.335985] <TASK>
> [ 27.336226] panic+0x2ed/0x320
> [ 27.336547] ? __pfx_cxl_report_normal_detected+0x10/0x10
> [ 27.337037] ? __pfx_aer_root_reset+0x10/0x10
> [ 27.337453] cxl_do_recovery+0x304/0x310
> [ 27.337833] aer_isr+0x3fd/0x700
> [ 27.338154] ? __pfx_irq_thread_fn+0x10/0x10
> [ 27.338572] irq_thread_fn+0x1f/0x60
> [ 27.338923] irq_thread+0x102/0x1b0
> [ 27.339267] ? __pfx_irq_thread_dtor+0x10/0x10
> [ 27.339683] ? __pfx_irq_thread+0x10/0x10
> [ 27.340059] kthread+0xcd/0x100
> [ 27.340387] ? __pfx_kthread+0x10/0x10
> [ 27.340748] ret_from_fork+0x2f/0x50
> [ 27.341100] ? __pfx_kthread+0x10/0x10
> [ 27.341466] ret_from_fork_asm+0x1a/0x30
> [ 27.341842] </TASK>
> [ 27.342281] Kernel Offset: 0x1ba00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 27.343221] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
> Root port CE:
> root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
> [ 19.444339] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
> [ 19.445530] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
> [ 19.446750] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> [ 19.447742] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00004000/0000a000
> [ 19.448549] pcieport 0000:0c:00.0: [14] CorrIntErr
> [ 19.449223] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> [ 19.449223]
> [ 19.451415] cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'
>
> Upstream switch port UCE:
> root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
> [ 45.236853] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
> [ 45.238101] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
> [ 45.239416] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> [ 45.240412] pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00400000/02000000
> [ 45.241159] pcieport 0000:0d:00.0: [22] UncorrIntErr
> [ 45.242448] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> [ 45.242448]
> [ 45.244008] cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error'
> first_error: 'Memory Address Parity Error'
> [ 45.249129] Kernel panic - not syncing: CXL cachemem error. Invoking panic
> [ 45.249800] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3855
> [ 45.250795] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [ 45.251907] Call Trace:
> [ 45.253284] <TASK>
> [ 45.253564] panic+0x2ed/0x320
> [ 45.253909] ? __pfx_cxl_report_normal_detected+0x10/0x10
> [ 45.255455] ? __pfx_aer_root_reset+0x10/0x10
> [ 45.255915] cxl_do_recovery+0x304/0x310
> [ 45.257219] aer_isr+0x3fd/0x700
> [ 45.257572] ? __pfx_irq_thread_fn+0x10/0x10
> [ 45.258006] irq_thread_fn+0x1f/0x60
> [ 45.258383] irq_thread+0x102/0x1b0
> [ 45.258748] ? __pfx_irq_thread_dtor+0x10/0x10
> [ 45.259196] ? __pfx_irq_thread+0x10/0x10
> [ 45.259605] kthread+0xcd/0x100
> [ 45.259956] ? __pfx_kthread+0x10/0x10
> [ 45.260386] ret_from_fork+0x2f/0x50
> [ 45.260879] ? __pfx_kthread+0x10/0x10
> [ 45.261418] ret_from_fork_asm+0x1a/0x30
> [ 45.261936] </TASK>
> [ 45.262451] Kernel Offset: 0xc600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 45.263467] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
> Upstream switch port CE:
> root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
> [ 37.504029] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
> [ 37.506076] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
> [ 37.507599] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> [ 37.508759] pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00004000/0000a000
> [ 37.509574] pcieport 0000:0d:00.0: [14] CorrIntErr
> [ 37.510180] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> [ 37.510180]
> [ 37.512057] cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'
>
> Downstream switch port UCE:
> root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
> [ 29.421532] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
> [ 29.422812] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
> [ 29.424551] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> [ 29.425670] pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00400000/02000000
> [ 29.426487] pcieport 0000:0e:00.0: [22] UncorrIntErr
> [ 29.427111] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> [ 29.427111]
> [ 29.428688] cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error'
> first_error: 'Memory Address Parity Error'
> [ 29.430173] Kernel panic - not syncing: CXL cachemem error. Invoking panic
> [ 29.430862] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g844fd2319372 #3851
> [ 29.431874] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [ 29.433031] Call Trace:
> [ 29.433354] <TASK>
> [ 29.433631] panic+0x2ed/0x320
> [ 29.434010] ? __pfx_cxl_report_normal_detected+0x10/0x10
> [ 29.434653] ? __pfx_aer_root_reset+0x10/0x10
> [ 29.435179] cxl_do_recovery+0x304/0x310
> [ 29.435626] aer_isr+0x3fd/0x700
> [ 29.436027] ? __pfx_irq_thread_fn+0x10/0x10
> [ 29.436507] irq_thread_fn+0x1f/0x60
> [ 29.436898] irq_thread+0x102/0x1b0
> [ 29.437293] ? __pfx_irq_thread_dtor+0x10/0x10
> [ 29.437758] ? __pfx_irq_thread+0x10/0x10
> [ 29.438189] kthread+0xcd/0x100
> [ 29.438551] ? __pfx_kthread+0x10/0x10
> [ 29.438959] ret_from_fork+0x2f/0x50
> [ 29.439362] ? __pfx_kthread+0x10/0x10
> [ 29.439771] ret_from_fork_asm+0x1a/0x30
> [ 29.440221] </TASK>
> [ 29.440738] Kernel Offset: 0x10a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 29.441812] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
> Downstream switch port CE:
> root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
> [ 177.114442] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
> [ 177.115602] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
> [ 177.116973] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> [ 177.117985] pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00004000/0000a000
> [ 177.118809] pcieport 0000:0e:00.0: [14] CorrIntErr
> [ 177.119521] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> [ 177.119521]
> [ 177.122037] cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'

Thanks for the hints about how to test this; it's helpful to have
those in the email archives.

Remove the timestamps and non-relevant call trace entries unless they
add useful information. AFAICT they're just distractions in this
case.

> Changes RFC->v1:
> [Dan] Rename cxl_rch_handle_error() becomes cxl_handle_error()
> [Dan] Add cxl_do_recovery()
> [Jonathan] Flatten cxl_setup_parent_uport()
> [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
> [Jonathan] Rename cxl_dev_is_pci_type()
> [Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
> replace these find_cxl_port() and device_find_child().
> [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
> [Ming] Dont use endpoint as host to cxl_map_component_regs()
> [Bjorn] Use "PCIe UIR/CIE" instesad of "AER UI/CIE"
> [TODO][Bjorn] Dont use Kconfig to enable/disable a CXL external interface
>
> Terry Bowman (15):
>   cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service
>     driver
>   cxl/aer/pci: Update is_internal_error() to be callable w/o
>     CONFIG_PCIEAER_CXL
>   cxl/aer/pci: Refactor AER driver's existing interfaces to support CXL
>     PCIe ports
>   cxl/aer/pci: Add CXL PCIe port correctable error support in AER
>     service driver
>   cxl/aer/pci: Update AER driver to read UCE fatal status for all CXL
>     PCIe port devices
>   cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type
>   cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER
>     service driver

I had to look at the patches to learn that all the above only touch
drivers/pci, aer.h, and pci.h. Can you use the PCI subject line
conventions (e.g., "PCI/AER: ...") to make this more obvious? Almost
all already include "CXL", so I don't think we'd really lose any
information.

>   cxl/pci: Change find_cxl_ports() to be non-static
>   cxl/pci: Map CXL PCIe downstream port RAS registers
>   cxl/pci: Map CXL PCIe upstream port RAS registers
>   cxl/pci: Update RAS handler interfaces to support CXL PCIe ports
>   cxl/pci: Add error handler for CXL PCIe port RAS errors
>   cxl/pci: Add trace logging for CXL PCIe port RAS errors
>   cxl/aer/pci: Export pci_aer_unmask_internal_errors()

Ditto here, and add something about CXL in the subject since this
doesn't export universally.

>   cxl/pci: Enable internal CE/UCE interrupts for CXL PCIe port devices
>
>  drivers/cxl/core/core.h  |   3 +
>  drivers/cxl/core/pci.c   | 172 +++++++++++++++++++++++++++++++--------
>  drivers/cxl/core/port.c  |   4 +-
>  drivers/cxl/core/trace.h |  47 +++++++++++
>  drivers/cxl/cxl.h        |  14 +++-
>  drivers/cxl/mem.c        |  30 ++++++-
>  drivers/cxl/pci.c        |   8 ++
>  drivers/pci/pci.h        |   5 ++
>  drivers/pci/pcie/aer.c   | 123 ++++++++++++++++++++--------
>  drivers/pci/pcie/err.c   | 150 ++++++++++++++++++++++++++++++++++
>  include/linux/aer.h      |  16 ++++
>  include/linux/pci.h      |   3 +
>  12 files changed, 503 insertions(+), 72 deletions(-)
>
>
> base-commit: f7982d85e136ba7e26b31a725c1841373f81f84a

This doesn't apply cleanly on v6.12-rc1, and
f7982d85e136ba7e26b31a725c1841373f81f84a isn't upstream yet. Where is
it? I guess it relies on some other series that hasn't been merged
yet?

Bjorn
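As an illustration of the review request above to drop timestamps from the pasted logs, here is a minimal C sketch that skips a leading dmesg-style timestamp such as "[ 27.318920] " on a log line. It is an assumption-laden example rather than a tool used in this thread: the helper name skip_dmesg_timestamp() is hypothetical, and the sketch only recognizes the "[seconds.fraction]" form seen in the logs, so bracketed fields like "[22] UncorrIntErr" are left untouched.

```c
#include <ctype.h>

/*
 * Hypothetical helper: return a pointer just past a leading timestamp of
 * the form "[ 27.318920] ". If the line does not start with such a
 * timestamp, return the line unchanged.
 */
static const char *skip_dmesg_timestamp(const char *line)
{
	const char *p = line;

	if (*p != '[')
		return line;
	p++;
	while (*p == ' ')			/* padding before the seconds */
		p++;
	if (!isdigit((unsigned char)*p))
		return line;
	while (isdigit((unsigned char)*p))	/* seconds */
		p++;
	if (*p != '.')				/* "[22]" style fields stop here */
		return line;
	p++;
	if (!isdigit((unsigned char)*p))
		return line;
	while (isdigit((unsigned char)*p))	/* fractional part */
		p++;
	if (*p != ']')
		return line;
	p++;
	if (*p == ' ')				/* single separator after ']' */
		p++;
	return p;
}
```

A filter built on this would read each line (for example with fgets()) and print the returned pointer; lines without a timestamp, such as the shell prompt lines, pass through unchanged.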
Hi Bjorn,

Thanks for taking the time to review. I added comments below.

On 10/10/24 14:07, Bjorn Helgaas wrote:
> On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
>> This is a continuation of the CXL port error handling RFC from earlier.[1]
>> The RFC resulted in the decision to add CXL PCIe port error handling to
>> the existing RCH downstream port handling. This patchset adds the CXL PCIe
>> port handling and logging.
>>
>> The first 7 patches update the existing AER service driver to support CXL
>> PCIe port protocol error handling and reporting. This includes AER service
>> driver changes for adding correctable and uncorrectable error support, CXL
>> specific recovery handling, and addition of CXL driver callback handlers.
>>
>> The following 8 patches address CXL driver support for CXL PCIe port
>> protocol errors. This includes the following changes to the CXL drivers:
>> mapping CXL port and downstream port RAS registers, interface updates for
>> common RCH and VH, adding port specific error handlers, and protocol error
>> logging.
>>
>> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554
>> -1-terry.bowman@amd.com/
>
> Makes life easier if URLs are all on one line so they still work.
>

Ok.

>> Testing:
>>
>> Below are test results for this patchset. This is using Qemu with a root
>> port (0c:00.0), upstream switch port (0d:00.0),and downstream switch port
>> (0e:00.0).
>>
>> This was tested using aer-inject updated to support CE and UCE internal
>> error injection. CXL RAS was set using a test patch (not upstreamed).
>>
>> Root port UCE:
>> root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
>> [ 27.318920] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>> [ 27.320164] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>> [ 27.321518] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>> [ 27.322483] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000
>> [ 27.323243] pcieport 0000:0c:00.0: [22] UncorrIntErr
>> [ 27.325584] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>> [ 27.325584]
>> [ 27.327171] cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error'
>> first_error: 'Memory Address Parity Error'
>> [ 27.333277] Kernel panic - not syncing: CXL cachemem error. Invoking panic
>> [ 27.333872] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3857
>> [ 27.334761] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>> [ 27.335716] Call Trace:
>> [ 27.335985] <TASK>
>> [ 27.336226] panic+0x2ed/0x320
>> [ 27.336547] ? __pfx_cxl_report_normal_detected+0x10/0x10
>> [ 27.337037] ? __pfx_aer_root_reset+0x10/0x10
>> [ 27.337453] cxl_do_recovery+0x304/0x310
>> [ 27.337833] aer_isr+0x3fd/0x700
>> [ 27.338154] ? __pfx_irq_thread_fn+0x10/0x10
>> [ 27.338572] irq_thread_fn+0x1f/0x60
>> [ 27.338923] irq_thread+0x102/0x1b0
>> [ 27.339267] ? __pfx_irq_thread_dtor+0x10/0x10
>> [ 27.339683] ? __pfx_irq_thread+0x10/0x10
>> [ 27.340059] kthread+0xcd/0x100
>> [ 27.340387] ? __pfx_kthread+0x10/0x10
>> [ 27.340748] ret_from_fork+0x2f/0x50
>> [ 27.341100] ? __pfx_kthread+0x10/0x10
>> [ 27.341466] ret_from_fork_asm+0x1a/0x30
>> [ 27.341842] </TASK>
>> [ 27.342281] Kernel Offset: 0x1ba00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>> [ 27.343221] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>>
>> Root port CE:
>> root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
>> [ 19.444339] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
>> [ 19.445530] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
>> [ 19.446750] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>> [ 19.447742] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00004000/0000a000
>> [ 19.448549] pcieport 0000:0c:00.0: [14] CorrIntErr
>> [ 19.449223] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>> [ 19.449223]
>> [ 19.451415] cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'
>>
>> Upstream switch port UCE:
>> root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
>> [ 45.236853] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
>> [ 45.238101] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
>> [ 45.239416] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>> [ 45.240412] pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00400000/02000000
>> [ 45.241159] pcieport 0000:0d:00.0: [22] UncorrIntErr
>> [ 45.242448] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>> [ 45.242448]
>> [ 45.244008] cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error'
>> first_error: 'Memory Address Parity Error'
>> [ 45.249129] Kernel panic - not syncing: CXL cachemem error. Invoking panic
>> [ 45.249800] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3855
>> [ 45.250795] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>> [ 45.251907] Call Trace:
>> [ 45.253284] <TASK>
>> [ 45.253564] panic+0x2ed/0x320
>> [ 45.253909] ? __pfx_cxl_report_normal_detected+0x10/0x10
>> [ 45.255455] ? __pfx_aer_root_reset+0x10/0x10
>> [ 45.255915] cxl_do_recovery+0x304/0x310
>> [ 45.257219] aer_isr+0x3fd/0x700
>> [ 45.257572] ? __pfx_irq_thread_fn+0x10/0x10
>> [ 45.258006] irq_thread_fn+0x1f/0x60
>> [ 45.258383] irq_thread+0x102/0x1b0
>> [ 45.258748] ? __pfx_irq_thread_dtor+0x10/0x10
>> [ 45.259196] ? __pfx_irq_thread+0x10/0x10
>> [ 45.259605] kthread+0xcd/0x100
>> [ 45.259956] ? __pfx_kthread+0x10/0x10
>> [ 45.260386] ret_from_fork+0x2f/0x50
>> [ 45.260879] ? __pfx_kthread+0x10/0x10
>> [ 45.261418] ret_from_fork_asm+0x1a/0x30
>> [ 45.261936] </TASK>
>> [ 45.262451] Kernel Offset: 0xc600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>> [ 45.263467] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>>
>> Upstream switch port CE:
>> root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
>> [ 37.504029] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
>> [ 37.506076] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
>> [ 37.507599] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>> [ 37.508759] pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00004000/0000a000
>> [ 37.509574] pcieport 0000:0d:00.0: [14] CorrIntErr
>> [ 37.510180] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>> [ 37.510180]
>> [ 37.512057] cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'
>>
>> Downstream switch port UCE:
>> root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
>> [ 29.421532] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
>> [ 29.422812] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
>> [ 29.424551] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>> [ 29.425670] pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00400000/02000000
>> [ 29.426487] pcieport 0000:0e:00.0: [22] UncorrIntErr
>> [ 29.427111] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>> [ 29.427111]
>> [ 29.428688] cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error'
>> first_error: 'Memory Address Parity Error'
>> [ 29.430173] Kernel panic - not syncing: CXL cachemem error. Invoking panic
>> [ 29.430862] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g844fd2319372 #3851
>> [ 29.431874] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>> [ 29.433031] Call Trace:
>> [ 29.433354] <TASK>
>> [ 29.433631] panic+0x2ed/0x320
>> [ 29.434010] ? __pfx_cxl_report_normal_detected+0x10/0x10
>> [ 29.434653] ? __pfx_aer_root_reset+0x10/0x10
>> [ 29.435179] cxl_do_recovery+0x304/0x310
>> [ 29.435626] aer_isr+0x3fd/0x700
>> [ 29.436027] ? __pfx_irq_thread_fn+0x10/0x10
>> [ 29.436507] irq_thread_fn+0x1f/0x60
>> [ 29.436898] irq_thread+0x102/0x1b0
>> [ 29.437293] ? __pfx_irq_thread_dtor+0x10/0x10
>> [ 29.437758] ? __pfx_irq_thread+0x10/0x10
>> [ 29.438189] kthread+0xcd/0x100
>> [ 29.438551] ? __pfx_kthread+0x10/0x10
>> [ 29.438959] ret_from_fork+0x2f/0x50
>> [ 29.439362] ? __pfx_kthread+0x10/0x10
>> [ 29.439771] ret_from_fork_asm+0x1a/0x30
>> [ 29.440221] </TASK>
>> [ 29.440738] Kernel Offset: 0x10a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>> [ 29.441812] ---[ end Kernel panic - not syncing: CXL cachemem error.
Invoking panic ]---
>>
>> Downstream switch port CE:
>> root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
>> [ 177.114442] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
>> [ 177.115602] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
>> [ 177.116973] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>> [ 177.117985] pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00004000/0000a000
>> [ 177.118809] pcieport 0000:0e:00.0: [14] CorrIntErr
>> [ 177.119521] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>> [ 177.119521]
>> [ 177.122037] cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
>
> Thanks for the hints about how to test this; it's helpful to have
> those in the email archives. Remove the timestamps and non-relevant
> call trace entries unless they add useful information. AFAICT they're
> just distractions in this case.
>

I'll remove the test logging and details from the cover sheet. I'm
unable to find how to attach using git tools. Instead of an
attachment, I can locate the log files and details on a public
github. Let me know if this is not acceptable.

>> Changes RFC->v1:
>> [Dan] Rename cxl_rch_handle_error() becomes cxl_handle_error()
>> [Dan] Add cxl_do_recovery()
>> [Jonathan] Flatten cxl_setup_parent_uport()
>> [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
>> [Jonathan] Rename cxl_dev_is_pci_type()
>> [Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
>> replace these find_cxl_port() and device_find_child().
>> [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
>> [Ming] Don't use endpoint as host to cxl_map_component_regs()
>> [Bjorn] Use "PCIe UIR/CIE" instead of "AER UI/CIE"
>> [TODO][Bjorn] Don't use Kconfig to enable/disable a CXL external interface
>>
>> Terry Bowman (15):
>>   cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service
>>     driver
>>   cxl/aer/pci: Update is_internal_error() to be callable w/o
>>     CONFIG_PCIEAER_CXL
>>   cxl/aer/pci: Refactor AER driver's existing interfaces to support CXL
>>     PCIe ports
>>   cxl/aer/pci: Add CXL PCIe port correctable error support in AER
>>     service driver
>>   cxl/aer/pci: Update AER driver to read UCE fatal status for all CXL
>>     PCIe port devices
>>   cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type
>>   cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER
>>     service driver
>
> I had to look at the patches to learn that all the above only touch
> drivers/pci, aer.h, and pci.h. Can you use the PCI subject line
> conventions (e.g., "PCI/AER: ...") to make this more obvious? Almost
> all already include "CXL", so I don't think we'd really lose any
> information.
>

Yes, I'll change the patches' headlines to use capitalized "PCI/AER".

>>   cxl/pci: Change find_cxl_ports() to be non-static
>>   cxl/pci: Map CXL PCIe downstream port RAS registers
>>   cxl/pci: Map CXL PCIe upstream port RAS registers
>>   cxl/pci: Update RAS handler interfaces to support CXL PCIe ports
>>   cxl/pci: Add error handler for CXL PCIe port RAS errors
>>   cxl/pci: Add trace logging for CXL PCIe port RAS errors
>>   cxl/aer/pci: Export pci_aer_unmask_internal_errors()
>
> Ditto here, and add something about CXL in the subject since this
> doesn't export universally.
>

Ok.

>>   cxl/pci: Enable internal CE/UCE interrupts for CXL PCIe port devices
>>
>>  drivers/cxl/core/core.h  |   3 +
>>  drivers/cxl/core/pci.c   | 172 +++++++++++++++++++++++++++++++--------
>>  drivers/cxl/core/port.c  |   4 +-
>>  drivers/cxl/core/trace.h |  47 +++++++++++
>>  drivers/cxl/cxl.h        |  14 +++-
>>  drivers/cxl/mem.c        |  30 ++++++-
>>  drivers/cxl/pci.c        |   8 ++
>>  drivers/pci/pci.h        |   5 ++
>>  drivers/pci/pcie/aer.c   | 123 ++++++++++++++++++++--------
>>  drivers/pci/pcie/err.c   | 150 ++++++++++++++++++++++++++++++++++
>>  include/linux/aer.h      |  16 ++++
>>  include/linux/pci.h      |   3 +
>>  12 files changed, 503 insertions(+), 72 deletions(-)
>>
>>
>> base-commit: f7982d85e136ba7e26b31a725c1841373f81f84a
>
> This doesn't apply cleanly on v6.12-rc1, and
> f7982d85e136ba7e26b31a725c1841373f81f84a isn't upstream yet. Where
> is it? I guess it relies on some other series that hasn't been merged
> yet?
>
> Bjorn

Hmmm, I thought I was using a 6.11-rc7 commit. I will rebase to either
6.12-rc1 or rc2.

Regards,
Terry
On Mon, Oct 14, 2024 at 12:22:08PM -0500, Terry Bowman wrote:
> On 10/10/24 14:07, Bjorn Helgaas wrote:
> > On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
> >> This is a continuation of the CXL port error handling RFC from earlier.[1]
> >> The RFC resulted in the decision to add CXL PCIe port error handling to
> >> the existing RCH downstream port handling. This patchset adds the CXL PCIe
> >> port handling and logging.
> ...

> >> Downstream switch port CE:
> >> root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
> >> [ 177.114442] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
> >> [ 177.115602] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
> >> [ 177.116973] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> >> [ 177.117985] pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00004000/0000a000
> >> [ 177.118809] pcieport 0000:0e:00.0: [14] CorrIntErr
> >> [ 177.119521] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> >> [ 177.119521]
> >> [ 177.122037] cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
> >
> > Thanks for the hints about how to test this; it's helpful to have
> > those in the email archives. Remove the timestamps and non-relevant
> > call trace entries unless they add useful information. AFAICT they're
> > just distractions in this case.
>
> I'll remove the test logging and details from the cover sheet. I'm
> unable to find how to attach using git tools. Instead of an
> attachment, I can locate the log files and details on a public
> github. Let me know if this is not acceptable.

It's fine to keep this in the cover sheet, and I'd rather have it
there, where lore will archive it reliably forever, than to have a
pointer to some other github that may eventually disappear even though
it's public today.

I just meant to remove irrelevant information like the timestamps.

Bjorn
On 10/14/24 12:29, Bjorn Helgaas wrote:
> On Mon, Oct 14, 2024 at 12:22:08PM -0500, Terry Bowman wrote:
>> On 10/10/24 14:07, Bjorn Helgaas wrote:
>>> On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
>>>> This is a continuation of the CXL port error handling RFC from earlier.[1]
>>>> The RFC resulted in the decision to add CXL PCIe port error handling to
>>>> the existing RCH downstream port handling. This patchset adds the CXL PCIe
>>>> port handling and logging.
>> ...
>
>>>> Downstream switch port CE:
>>>> root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
>>>> [ 177.114442] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
>>>> [ 177.115602] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
>>>> [ 177.116973] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>>>> [ 177.117985] pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00004000/0000a000
>>>> [ 177.118809] pcieport 0000:0e:00.0: [14] CorrIntErr
>>>> [ 177.119521] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>>>> [ 177.119521]
>>>> [ 177.122037] cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
>>>
>>> Thanks for the hints about how to test this; it's helpful to have
>>> those in the email archives. Remove the timestamps and non-relevant
>>> call trace entries unless they add useful information. AFAICT they're
>>> just distractions in this case.
>>
>> I'll remove the test logging and details from the cover sheet. I'm
>> unable to find how to attach using git tools. Instead of an
>> attachment, I can locate the log files and details on a public
>> github. Let me know if this is not acceptable.
>
> It's fine to keep this in the cover sheet, and I'd rather have it
> there, where lore will archive it reliably forever, than to have a
> pointer to some other github that may eventually disappear even though
> it's public today.
>
> I just meant to remove irrelevant information like the timestamps.
>
> Bjorn

OK, I'll clean it up and leave it here. Thanks.

Regards,
Terry
On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
> This is a continuation of the CXL port error handling RFC from earlier.[1]
> The RFC resulted in the decision to add CXL PCIe port error handling to
> the existing RCH downstream port handling. This patchset adds the CXL PCIe
> port handling and logging.
>
> The first 7 patches update the existing AER service driver to support CXL
> PCIe port protocol error handling and reporting. This includes AER service
> driver changes for adding correctable and uncorrectable error support, CXL
> specific recovery handling, and addition of CXL driver callback handlers.
>
> The following 8 patches address CXL driver support for CXL PCIe port
> protocol errors. This includes the following changes to the CXL drivers:
> mapping CXL port and downstream port RAS registers, interface updates for
> common RCH and VH, adding port specific error handlers, and protocol error
> logging.

Looks like all my comments at
https://lore.kernel.org/r/20241010190726.GA570880@bhelgaas still
apply.

URL broken across lines, distracting timestamps, patch subjects,
no clue about the base commit.
Hi Bjorn,

I added a response below.

On 10/18/24 18:22, Bjorn Helgaas wrote:
> On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
>> This is a continuation of the CXL port error handling RFC from earlier.[1]
>> The RFC resulted in the decision to add CXL PCIe port error handling to
>> the existing RCH downstream port handling. This patchset adds the CXL PCIe
>> port handling and logging.
>>
>> The first 7 patches update the existing AER service driver to support CXL
>> PCIe port protocol error handling and reporting. This includes AER service
>> driver changes for adding correctable and uncorrectable error support, CXL
>> specific recovery handling, and addition of CXL driver callback handlers.
>>
>> The following 8 patches address CXL driver support for CXL PCIe port
>> protocol errors. This includes the following changes to the CXL drivers:
>> mapping CXL port and downstream port RAS registers, interface updates for
>> common RCH and VH, adding port specific error handlers, and protocol error
>> logging.
>
> Looks like all my comments at
> https://lore.kernel.org/r/20241010190726.GA570880@bhelgaas still
> apply.
>
> URL broken across lines, distracting timestamps, patch subjects,
> no clue about the base commit.

I added changes for code reuse in pcie_do_recovery() as recommended. I
am finishing testing now and will have v2 upstreamed shortly.

Regards,
Terry
On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
> This is a continuation of the CXL port error handling RFC from earlier.[1]
> The RFC resulted in the decision to add CXL PCIe port error handling to
> the existing RCH downstream port handling. This patchset adds the CXL PCIe
> port handling and logging.
>
> The first 7 patches update the existing AER service driver to support CXL
> PCIe port protocol error handling and reporting. This includes AER service
> driver changes for adding correctable and uncorrectable error support, CXL
> specific recovery handling, and addition of CXL driver callback handlers.
>
> The following 8 patches address CXL driver support for CXL PCIe port
> protocol errors. This includes the following changes to the CXL drivers:
> mapping CXL port and downstream port RAS registers, interface updates for
> common RCH and VH, adding port specific error handlers, and protocol error
> logging.
>
> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554
> -1-terry.bowman@amd.com/
>
> Testing:
>
> Below are test results for this patchset. This is using Qemu with a root
> port (0c:00.0), upstream switch port (0d:00.0),and downstream switch port
> (0e:00.0).
>
> This was tested using aer-inject updated to support CE and UCE internal
> error injection. CXL RAS was set using a test patch (not upstreamed).

Hi Terry,
Can you share the aer-inject repo for the testing or the test patch?
Fan > > Root port UCE: > root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh > [ 27.318920] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0 > [ 27.320164] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0 > [ 27.321518] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID) > [ 27.322483] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000 > [ 27.323243] pcieport 0000:0c:00.0: [22] UncorrIntErr > [ 27.325584] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available > [ 27.325584] > [ 27.327171] cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' > first_error: 'Memory Address Parity Error' > [ 27.333277] Kernel panic - not syncing: CXL cachemem error. Invoking panic > [ 27.333872] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3857 > [ 27.334761] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 > [ 27.335716] Call Trace: > [ 27.335985] <TASK> > [ 27.336226] panic+0x2ed/0x320 > [ 27.336547] ? __pfx_cxl_report_normal_detected+0x10/0x10 > [ 27.337037] ? __pfx_aer_root_reset+0x10/0x10 > [ 27.337453] cxl_do_recovery+0x304/0x310 > [ 27.337833] aer_isr+0x3fd/0x700 > [ 27.338154] ? __pfx_irq_thread_fn+0x10/0x10 > [ 27.338572] irq_thread_fn+0x1f/0x60 > [ 27.338923] irq_thread+0x102/0x1b0 > [ 27.339267] ? __pfx_irq_thread_dtor+0x10/0x10 > [ 27.339683] ? __pfx_irq_thread+0x10/0x10 > [ 27.340059] kthread+0xcd/0x100 > [ 27.340387] ? __pfx_kthread+0x10/0x10 > [ 27.340748] ret_from_fork+0x2f/0x50 > [ 27.341100] ? 
__pfx_kthread+0x10/0x10 > [ 27.341466] ret_from_fork_asm+0x1a/0x30 > [ 27.341842] </TASK> > [ 27.342281] Kernel Offset: 0x1ba00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) > [ 27.343221] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]--- > > Root port CE: > root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh > [ 19.444339] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0 > [ 19.445530] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0 > [ 19.446750] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID) > [ 19.447742] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00004000/0000a000 > [ 19.448549] pcieport 0000:0c:00.0: [14] CorrIntErr > [ 19.449223] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available > [ 19.449223] > [ 19.451415] cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer' > > Upstream switch port UCE: > root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh > [ 45.236853] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0 > [ 45.238101] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0 > [ 45.239416] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID) > [ 45.240412] pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00400000/02000000 > [ 45.241159] pcieport 0000:0d:00.0: [22] UncorrIntErr > [ 45.242448] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available > [ 45.242448] > [ 45.244008] cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error' > first_error: 'Memory Address Parity Error' > [ 
45.249129] Kernel panic - not syncing: CXL cachemem error. Invoking panic > [ 45.249800] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3855 > [ 45.250795] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 > [ 45.251907] Call Trace: > [ 45.253284] <TASK> > [ 45.253564] panic+0x2ed/0x320 > [ 45.253909] ? __pfx_cxl_report_normal_detected+0x10/0x10 > [ 45.255455] ? __pfx_aer_root_reset+0x10/0x10 > [ 45.255915] cxl_do_recovery+0x304/0x310 > [ 45.257219] aer_isr+0x3fd/0x700 > [ 45.257572] ? __pfx_irq_thread_fn+0x10/0x10 > [ 45.258006] irq_thread_fn+0x1f/0x60 > [ 45.258383] irq_thread+0x102/0x1b0 > [ 45.258748] ? __pfx_irq_thread_dtor+0x10/0x10 > [ 45.259196] ? __pfx_irq_thread+0x10/0x10 > [ 45.259605] kthread+0xcd/0x100 > [ 45.259956] ? __pfx_kthread+0x10/0x10 > [ 45.260386] ret_from_fork+0x2f/0x50 > [ 45.260879] ? __pfx_kthread+0x10/0x10 > [ 45.261418] ret_from_fork_asm+0x1a/0x30 > [ 45.261936] </TASK> > [ 45.262451] Kernel Offset: 0xc600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) > [ 45.263467] ---[ end Kernel panic - not syncing: CXL cachemem error. 
Invoking panic ]--- > > Upstream switch port CE: > root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh > [ 37.504029] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0 > [ 37.506076] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0 > [ 37.507599] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID) > [ 37.508759] pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00004000/0000a000 > [ 37.509574] pcieport 0000:0d:00.0: [14] CorrIntErr > [ 37.510180] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available > [ 37.510180] > [ 37.512057] cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer' > > Downstream switch port UCE: > root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh > [ 29.421532] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0 > [ 29.422812] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0 > [ 29.424551] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID) > [ 29.425670] pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00400000/02000000 > [ 29.426487] pcieport 0000:0e:00.0: [22] UncorrIntErr > [ 29.427111] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available > [ 29.427111] > [ 29.428688] cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error' > first_error: 'Memory Address Parity Error' > [ 29.430173] Kernel panic - not syncing: CXL cachemem error. 
Invoking panic > [ 29.430862] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g844fd2319372 #3851 > [ 29.431874] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 > [ 29.433031] Call Trace: > [ 29.433354] <TASK> > [ 29.433631] panic+0x2ed/0x320 > [ 29.434010] ? __pfx_cxl_report_normal_detected+0x10/0x10 > [ 29.434653] ? __pfx_aer_root_reset+0x10/0x10 > [ 29.435179] cxl_do_recovery+0x304/0x310 > [ 29.435626] aer_isr+0x3fd/0x700 > [ 29.436027] ? __pfx_irq_thread_fn+0x10/0x10 > [ 29.436507] irq_thread_fn+0x1f/0x60 > [ 29.436898] irq_thread+0x102/0x1b0 > [ 29.437293] ? __pfx_irq_thread_dtor+0x10/0x10 > [ 29.437758] ? __pfx_irq_thread+0x10/0x10 > [ 29.438189] kthread+0xcd/0x100 > [ 29.438551] ? __pfx_kthread+0x10/0x10 > [ 29.438959] ret_from_fork+0x2f/0x50 > [ 29.439362] ? __pfx_kthread+0x10/0x10 > [ 29.439771] ret_from_fork_asm+0x1a/0x30 > [ 29.440221] </TASK> > [ 29.440738] Kernel Offset: 0x10a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) > [ 29.441812] ---[ end Kernel panic - not syncing: CXL cachemem error. 
Invoking panic ]--- > > Downstream switch port CE: > root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh > [ 177.114442] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0 > [ 177.115602] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0 > [ 177.116973] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID) > [ 177.117985] pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00004000/0000a000 > [ 177.118809] pcieport 0000:0e:00.0: [14] CorrIntErr > [ 177.119521] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available > [ 177.119521] > [ 177.122037] cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer' > > Changes RFC->v1: > [Dan] Rename cxl_rch_handle_error() becomes cxl_handle_error() > [Dan] Add cxl_do_recovery() > [Jonathan] Flatten cxl_setup_parent_uport() > [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs > [Jonathan] Rename cxl_dev_is_pci_type() > [Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can > replace these find_cxl_port() and device_find_child(). 
> [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport() > [Ming] Dont use endpoint as host to cxl_map_component_regs() > [Bjorn] Use "PCIe UIR/CIE" instesad of "AER UI/CIE" > [TODO][Bjorn] Dont use Kconfig to enable/disable a CXL external interface > > Terry Bowman (15): > cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service > driver > cxl/aer/pci: Update is_internal_error() to be callable w/o > CONFIG_PCIEAER_CXL > cxl/aer/pci: Refactor AER driver's existing interfaces to support CXL > PCIe ports > cxl/aer/pci: Add CXL PCIe port correctable error support in AER > service driver > cxl/aer/pci: Update AER driver to read UCE fatal status for all CXL > PCIe port devices > cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type > cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER > service driver > cxl/pci: Change find_cxl_ports() to be non-static > cxl/pci: Map CXL PCIe downstream port RAS registers > cxl/pci: Map CXL PCIe upstream port RAS registers > cxl/pci: Update RAS handler interfaces to support CXL PCIe ports > cxl/pci: Add error handler for CXL PCIe port RAS errors > cxl/pci: Add trace logging for CXL PCIe port RAS errors > cxl/aer/pci: Export pci_aer_unmask_internal_errors() > cxl/pci: Enable internal CE/UCE interrupts for CXL PCIe port devices > > drivers/cxl/core/core.h | 3 + > drivers/cxl/core/pci.c | 172 +++++++++++++++++++++++++++++++-------- > drivers/cxl/core/port.c | 4 +- > drivers/cxl/core/trace.h | 47 +++++++++++ > drivers/cxl/cxl.h | 14 +++- > drivers/cxl/mem.c | 30 ++++++- > drivers/cxl/pci.c | 8 ++ > drivers/pci/pci.h | 5 ++ > drivers/pci/pcie/aer.c | 123 ++++++++++++++++++++-------- > drivers/pci/pcie/err.c | 150 ++++++++++++++++++++++++++++++++++ > include/linux/aer.h | 16 ++++ > include/linux/pci.h | 3 + > 12 files changed, 503 insertions(+), 72 deletions(-) > > > base-commit: f7982d85e136ba7e26b31a725c1841373f81f84a > -- > 2.34.1 > -- Fan Ni
Hi Fan,

On 10/17/2024 11:34 AM, Fan Ni wrote:
> On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
>> This is a continuation of the CXL port error handling RFC from earlier.[1]
>> The RFC resulted in the decision to add CXL PCIe port error handling to
>> the existing RCH downstream port handling. This patchset adds the CXL PCIe
>> port handling and logging.
>>
>> The first 7 patches update the existing AER service driver to support CXL
>> PCIe port protocol error handling and reporting. This includes AER service
>> driver changes for adding correctable and uncorrectable error support, CXL
>> specific recovery handling, and addition of CXL driver callback handlers.
>>
>> The following 8 patches address CXL driver support for CXL PCIe port
>> protocol errors. This includes the following changes to the CXL drivers:
>> mapping CXL port and downstream port RAS registers, interface updates for
>> common RCH and VH, adding port specific error handlers, and protocol error
>> logging.
>>
>> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554
>> -1-terry.bowman@amd.com/
>>
>> Testing:
>>
>> Below are test results for this patchset. This is using Qemu with a root
>> port (0c:00.0), upstream switch port (0d:00.0),and downstream switch port
>> (0e:00.0).
>>
>> This was tested using aer-inject updated to support CE and UCE internal
>> error injection. CXL RAS was set using a test patch (not upstreamed).
>
> Hi Terry,
> Can you share the aer-inject repo for the testing or the test patch?
>
> Fan

Sure, but it's easiest to attach the patch here.

Origin was https://github.com/jderrick/aer-inject.git
Base is 81701cbb30e35a1a76c3876f55692f91bdb9751b

Regards,
Terry
From ca9277866b506723f46f3acd7b264ffa80c37276 Mon Sep 17 00:00:00 2001
From: Terry Bowman <terry.bowman@amd.com>
Date: Thu, 17 Oct 2024 12:12:58 -0500
Subject: [PATCH] aer-inject: Add internal error injection

Add corrected (CE) and uncorrected (UCE) AER internal error injection
support.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 aer.h   | 2 ++
 aer.lex | 2 ++
 aer.y   | 8 ++++----
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/aer.h b/aer.h
index a0ad152..e55a731 100644
--- a/aer.h
+++ b/aer.h
@@ -30,11 +30,13 @@ struct aer_error_inj
 #define PCI_ERR_UNC_MALF_TLP	0x00040000	/* Malformed TLP */
 #define PCI_ERR_UNC_ECRC	0x00080000	/* ECRC Error Status */
 #define PCI_ERR_UNC_UNSUP	0x00100000	/* Unsupported Request */
+#define PCI_ERR_UNC_INTERNAL	0x00400000	/* Internal error */
 #define PCI_ERR_COR_RCVR	0x00000001	/* Receiver Error Status */
 #define PCI_ERR_COR_BAD_TLP	0x00000040	/* Bad TLP Status */
 #define PCI_ERR_COR_BAD_DLLP	0x00000080	/* Bad DLLP Status */
 #define PCI_ERR_COR_REP_ROLL	0x00000100	/* REPLAY_NUM Rollover */
 #define PCI_ERR_COR_REP_TIMER	0x00001000	/* Replay Timer Timeout */
+#define PCI_ERR_COR_CINTERNAL	0x00004000	/* Internal error */
 
 extern void init_aer(struct aer_error_inj *err);
 extern void submit_aer(struct aer_error_inj *err);
diff --git a/aer.lex b/aer.lex
index 6121e4e..4fadd0e 100644
--- a/aer.lex
+++ b/aer.lex
@@ -82,11 +82,13 @@ static struct key {
 	KEYVAL(MALF_TLP, PCI_ERR_UNC_MALF_TLP),
 	KEYVAL(ECRC, PCI_ERR_UNC_ECRC),
 	KEYVAL(UNSUP, PCI_ERR_UNC_UNSUP),
+	KEYVAL(INTERNAL, PCI_ERR_UNC_INTERNAL),
 	KEYVAL(RCVR, PCI_ERR_COR_RCVR),
 	KEYVAL(BAD_TLP, PCI_ERR_COR_BAD_TLP),
 	KEYVAL(BAD_DLLP, PCI_ERR_COR_BAD_DLLP),
 	KEYVAL(REP_ROLL, PCI_ERR_COR_REP_ROLL),
 	KEYVAL(REP_TIMER, PCI_ERR_COR_REP_TIMER),
+	KEYVAL(CINTERNAL, PCI_ERR_COR_CINTERNAL),
 };
 
 static int cmp_key(const void *av, const void *bv)
diff --git a/aer.y b/aer.y
index e5ecc7d..500dc97 100644
--- a/aer.y
+++ b/aer.y
@@ -34,8 +34,8 @@ static void init(void);
 
 %token AER DOMAIN BUS DEV FN PCI_ID UNCOR_STATUS COR_STATUS HEADER_LOG
 %token <num> TRAIN DLP POISON_TLP FCP COMP_TIME COMP_ABORT UNX_COMP RX_OVER
-%token <num> MALF_TLP ECRC UNSUP
-%token <num> RCVR BAD_TLP BAD_DLLP REP_ROLL REP_TIMER
+%token <num> MALF_TLP ECRC UNSUP INTERNAL
+%token <num> RCVR BAD_TLP BAD_DLLP REP_ROLL REP_TIMER CINTERNAL
 %token <num> SYMBOL NUMBER
 %token <str> PCI_ID_STR
 
@@ -77,14 +77,14 @@ uncor_status_list: /* empty */ { $$ = 0; }
 	;
 
 uncor_status: TRAIN | DLP | POISON_TLP | FCP | COMP_TIME | COMP_ABORT
-	| UNX_COMP | RX_OVER | MALF_TLP | ECRC | UNSUP | NUMBER
+	| UNX_COMP | RX_OVER | MALF_TLP | ECRC | UNSUP | INTERNAL | NUMBER
 	;
 
 cor_status_list: /* empty */ { $$ = 0; }
 	| cor_status_list cor_status { $$ = $1 | $2; }
 	;
 
-cor_status: RCVR | BAD_TLP | BAD_DLLP | REP_ROLL | REP_TIMER | NUMBER
+cor_status: RCVR | BAD_TLP | BAD_DLLP | REP_ROLL | REP_TIMER | CINTERNAL | NUMBER
 	;
 
 %%
-- 
2.34.1
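[Editorial sketch, not part of the posted patch.] The two masks the patch adds to aer.h line up with the bit indices printed in the test logs earlier in the thread: status 00400000 is reported as "[22] UncorrIntErr" and status 00004000 as "[14] CorrIntErr". The C fragment below is a minimal illustration of that correspondence. The mask values and keyword names are taken from the patch; the helpers status_bit() and keyword_mask() are hypothetical names introduced only for this example and do not exist in aer-inject.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mask values copied from the aer.h hunk of the patch above. */
#define PCI_ERR_UNC_INTERNAL	0x00400000u	/* Uncorrectable internal error */
#define PCI_ERR_COR_CINTERNAL	0x00004000u	/* Corrected internal error */

/* Compile-time check: the masks are bits 22 and 14, as shown in the logs. */
static_assert(PCI_ERR_UNC_INTERNAL == (UINT32_C(1) << 22), "UncorrIntErr is bit 22");
static_assert(PCI_ERR_COR_CINTERNAL == (UINT32_C(1) << 14), "CorrIntErr is bit 14");

/*
 * Hypothetical helper (not in aer-inject): return the index of the
 * lowest set bit in a status mask, or -1 if no bit is set.
 */
static int status_bit(uint32_t mask)
{
	int bit;

	for (bit = 0; bit < 32; bit++)
		if (mask & (UINT32_C(1) << bit))
			return bit;
	return -1;
}

/*
 * Hypothetical helper (not in aer-inject): map the two keywords the
 * patch adds to aer.lex onto their masks. Unknown keywords return 0.
 */
static uint32_t keyword_mask(const char *name)
{
	if (strcmp(name, "INTERNAL") == 0)
		return PCI_ERR_UNC_INTERNAL;
	if (strcmp(name, "CINTERNAL") == 0)
		return PCI_ERR_COR_CINTERNAL;
	return 0;
}
```

Calling status_bit(keyword_mask("INTERNAL")) gives the bit index that the AER log prints in brackets for the uncorrectable case. With the patch applied, an injection file would presumably select these errors with the INTERNAL keyword under UNCOR_STATUS and the CINTERNAL keyword under COR_STATUS, since the grammar change adds them to uncor_status and cor_status respectively.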
On Thu, Oct 17, 2024 at 12:27:04PM -0500, Bowman, Terry wrote:
> Hi Fan,
>
> On 10/17/2024 11:34 AM, Fan Ni wrote:
> > On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
> > > This is a continuation of the CXL port error handling RFC from earlier.[1]
> > > The RFC resulted in the decision to add CXL PCIe port error handling to
> > > the existing RCH downstream port handling. This patchset adds the CXL PCIe
> > > port handling and logging.
> > >
> > > The first 7 patches update the existing AER service driver to support CXL
> > > PCIe port protocol error handling and reporting. This includes AER service
> > > driver changes for adding correctable and uncorrectable error support, CXL
> > > specific recovery handling, and addition of CXL driver callback handlers.
> > >
> > > The following 8 patches address CXL driver support for CXL PCIe port
> > > protocol errors. This includes the following changes to the CXL drivers:
> > > mapping CXL port and downstream port RAS registers, interface updates for
> > > common RCH and VH, adding port specific error handlers, and protocol error
> > > logging.
> > >
> > > [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
> > >
> > > Testing:
> > >
> > > Below are test results for this patchset. This is using Qemu with a root
> > > port (0c:00.0), upstream switch port (0d:00.0), and downstream switch port
> > > (0e:00.0).
> > >
> > > This was tested using aer-inject updated to support CE and UCE internal
> > > error injection. CXL RAS was set using a test patch (not upstreamed).
> >
> > Hi Terry,
> > Can you share the aer-inject repo for the testing or the test patch?

Hi Terry,
Could you tell me which code base you use for this patch set?
I hit a lot of issues when trying to apply it on top of "fixes" or "next"
branches.

Fan

> >
> > Fan
>
> Sure, but it's easiest to attach the patch here.
>
> Origin was https://github.com/jderrick/aer-inject.git
> Base is 81701cbb30e35a1a76c3876f55692f91bdb9751b
>
> Regards,
> Terry

> From ca9277866b506723f46f3acd7b264ffa80c37276 Mon Sep 17 00:00:00 2001
> From: Terry Bowman <terry.bowman@amd.com>
> Date: Thu, 17 Oct 2024 12:12:58 -0500
> Subject: [PATCH] aer-inject: Add internal error injection
>
> Add corrected (CE) and uncorrected (UCE) AER internal error injection
> support.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  aer.h   | 2 ++
>  aer.lex | 2 ++
>  aer.y   | 8 ++++----
>  3 files changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/aer.h b/aer.h
> index a0ad152..e55a731 100644
> --- a/aer.h
> +++ b/aer.h
> @@ -30,11 +30,13 @@ struct aer_error_inj
>  #define PCI_ERR_UNC_MALF_TLP 0x00040000 /* Malformed TLP */
>  #define PCI_ERR_UNC_ECRC 0x00080000 /* ECRC Error Status */
>  #define PCI_ERR_UNC_UNSUP 0x00100000 /* Unsupported Request */
> +#define PCI_ERR_UNC_INTERNAL 0x00400000 /* Internal error */
>  #define PCI_ERR_COR_RCVR 0x00000001 /* Receiver Error Status */
>  #define PCI_ERR_COR_BAD_TLP 0x00000040 /* Bad TLP Status */
>  #define PCI_ERR_COR_BAD_DLLP 0x00000080 /* Bad DLLP Status */
>  #define PCI_ERR_COR_REP_ROLL 0x00000100 /* REPLAY_NUM Rollover */
>  #define PCI_ERR_COR_REP_TIMER 0x00001000 /* Replay Timer Timeout */
> +#define PCI_ERR_COR_CINTERNAL 0x00004000 /* Internal error */
>
>  extern void init_aer(struct aer_error_inj *err);
>  extern void submit_aer(struct aer_error_inj *err);
> diff --git a/aer.lex b/aer.lex
> index 6121e4e..4fadd0e 100644
> --- a/aer.lex
> +++ b/aer.lex
> @@ -82,11 +82,13 @@ static struct key {
>  	KEYVAL(MALF_TLP, PCI_ERR_UNC_MALF_TLP),
>  	KEYVAL(ECRC, PCI_ERR_UNC_ECRC),
>  	KEYVAL(UNSUP, PCI_ERR_UNC_UNSUP),
> +	KEYVAL(INTERNAL, PCI_ERR_UNC_INTERNAL),
>  	KEYVAL(RCVR, PCI_ERR_COR_RCVR),
>  	KEYVAL(BAD_TLP, PCI_ERR_COR_BAD_TLP),
>  	KEYVAL(BAD_DLLP, PCI_ERR_COR_BAD_DLLP),
>  	KEYVAL(REP_ROLL, PCI_ERR_COR_REP_ROLL),
>  	KEYVAL(REP_TIMER, PCI_ERR_COR_REP_TIMER),
> +	KEYVAL(CINTERNAL, PCI_ERR_COR_CINTERNAL),
>  };
>
>  static int cmp_key(const void *av, const void *bv)
> diff --git a/aer.y b/aer.y
> index e5ecc7d..500dc97 100644
> --- a/aer.y
> +++ b/aer.y
> @@ -34,8 +34,8 @@ static void init(void);
>
>  %token AER DOMAIN BUS DEV FN PCI_ID UNCOR_STATUS COR_STATUS HEADER_LOG
>  %token <num> TRAIN DLP POISON_TLP FCP COMP_TIME COMP_ABORT UNX_COMP RX_OVER
> -%token <num> MALF_TLP ECRC UNSUP
> -%token <num> RCVR BAD_TLP BAD_DLLP REP_ROLL REP_TIMER
> +%token <num> MALF_TLP ECRC UNSUP INTERNAL
> +%token <num> RCVR BAD_TLP BAD_DLLP REP_ROLL REP_TIMER CINTERNAL
>  %token <num> SYMBOL NUMBER
>  %token <str> PCI_ID_STR
>
> @@ -77,14 +77,14 @@ uncor_status_list: /* empty */ { $$ = 0; }
>  	;
>
>  uncor_status: TRAIN | DLP | POISON_TLP | FCP | COMP_TIME | COMP_ABORT
> -	| UNX_COMP | RX_OVER | MALF_TLP | ECRC | UNSUP | NUMBER
> +	| UNX_COMP | RX_OVER | MALF_TLP | ECRC | UNSUP | INTERNAL | NUMBER
>  	;
>
>  cor_status_list: /* empty */ { $$ = 0; }
>  	| cor_status_list cor_status { $$ = $1 | $2; }
>  	;
>
> -cor_status: RCVR | BAD_TLP | BAD_DLLP | REP_ROLL | REP_TIMER | NUMBER
> +cor_status: RCVR | BAD_TLP | BAD_DLLP | REP_ROLL | REP_TIMER | CINTERNAL | NUMBER
>  	;
>
>  %%
> --
> 2.34.1
>

-- 
Fan Ni
Terry Bowman wrote:
[..]
> Testing:
>
> Below are test results for this patchset. This is using Qemu with a root
> port (0c:00.0), upstream switch port (0d:00.0), and downstream switch port
> (0e:00.0).
>
> This was tested using aer-inject updated to support CE and UCE internal
> error injection. CXL RAS was set using a test patch (not upstreamed).

Thanks for these test outputs!

>
> Root port UCE:
> root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
> [ 27.318920] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> [ 27.320164] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> [ 27.321518] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> [ 27.322483] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000
> [ 27.323243] pcieport 0000:0c:00.0: [22] UncorrIntErr
> [ 27.325584] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available

It strikes me that by this point the code knows that it is a "CXL Bus"
error and no longer a "PCIe Bus" error. Given the divergent responses
to Fatal errors based on bus, I think it would help to clarify that the
kernel is panicking due to "CXL Bus", not "PCIe Bus" errors.

> [ 27.325584]
> [ 27.327171] cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error'

...i.e. someone may not notice that this is a "cxl" reference in the
backtrace.
Hi Dan,

On 10/21/24 20:43, Dan Williams wrote:
> Terry Bowman wrote:
> [..]
>> Testing:
>>
>> Below are test results for this patchset. This is using Qemu with a root
>> port (0c:00.0), upstream switch port (0d:00.0), and downstream switch port
>> (0e:00.0).
>>
>> This was tested using aer-inject updated to support CE and UCE internal
>> error injection. CXL RAS was set using a test patch (not upstreamed).
>
> Thanks for these test outputs!
>
>>
>> Root port UCE:
>> root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
>> [ 27.318920] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>> [ 27.320164] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>> [ 27.321518] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>> [ 27.322483] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000
>> [ 27.323243] pcieport 0000:0c:00.0: [22] UncorrIntErr
>> [ 27.325584] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>
> It strikes me that by this point the code knows that it is a "CXL Bus"
> error and no longer a "PCIe Bus" error. Given the divergent responses
> to Fatal errors based on bus, I think it would help to clarify that the
> kernel is panicking due to "CXL Bus", not "PCIe Bus" errors.
>
>> [ 27.325584]
>> [ 27.327171] cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error'
>
> ...i.e. someone may not notice that this is a "cxl" reference in the
> backtrace.

Good idea. I'll add logic to print 'CXL' as the bus when the erroring
device is a CXL device.

Regards,
Terry