Hi,

Currently, in the event of AER/DPC, the PCI core will try to reset the slot
(Root Port) and its subordinate devices by invoking bridge control reset and
FLR. But in some cases, like an AER Fatal error, it might be necessary to
reset the Root Ports using the PCI host bridge drivers in a platform specific
way (as indicated by the TODO in the pcie_do_recovery() function in
drivers/pci/pcie/err.c). Otherwise, the PCI link won't be recovered
successfully.

So this series adds a new callback 'pci_host_bridge::reset_root_port' for the
host bridge drivers to reset the Root Port when a fatal error happens.

Also, this series allows the host bridge drivers to handle a PCI link down
event by resetting the Root Ports and recovering the bus. This is
accomplished with the help of the new 'pci_host_handle_link_down()' API. Host
bridge drivers are expected to call this API (preferably from a threaded IRQ
handler) with the relevant Root Port 'pci_dev' when a link down event is
detected for the port. The API will reuse the pcie_do_recovery() function to
recover the link if AER support is enabled; otherwise it will directly call
the reset_root_port() callback of the host bridge driver (if it exists).

For reference, I've modified the pcie-qcom driver to call the
pci_host_handle_link_down() API with the Root Port 'pci_dev' after receiving
the LINK_DOWN global_irq event and populated the
'pci_host_bridge::reset_root_port()' callback to reset the Root Port. Since
the Qcom PCIe controllers support only a single Root Port (slot) per
controller instance, the API is going to be invoked only once. For multi Root
Port controllers, the controller driver is expected to detect the Root Port
that received the link down event and call the pci_host_handle_link_down()
API with the 'pci_dev' of that Root Port.

Testing
-------

I've lost access to my test setup now. So Krishna (Cced) will help with
testing on the Qcom platform, and Wilfred or Niklas should be able to test it
on the Rockchip platform. For the moment, this series is compile tested only.

Changes in v6:
- Incorporated the patch: https://lore.kernel.org/all/20250524185304.26698-2-manivannan.sadhasivam@linaro.org/
- Link to v5: https://lore.kernel.org/r/20250715-pci-port-reset-v5-0-26a5d278db40@oss.qualcomm.com

Changes in v5:
* Reworked pci_host_handle_link_down() to accept a Root Port instead of
  resetting all Root Ports in the event of link down.
* Renamed 'reset_slot' to 'reset_root_port' to avoid confusion, as both terms
  were used interchangeably and the series is intended to reset the Root Port
  only.
* Added the Rockchip driver change to this series.
* Dropped the applied patches and review/tested tags due to rework.
* Rebased on top of v6.16-rc1.
Changes in v4:
- Handled link down first in the irq handler
- Updated ICC & OPP bandwidth after link up in the reset_slot() callback
- Link to v3: https://lore.kernel.org/r/20250417-pcie-reset-slot-v3-0-59a10811c962@linaro.org

Changes in v3:
- Made the pci-host-common driver a common library for host controller drivers
- Moved the reset slot code to the pci-host-common library
- Link to v2: https://lore.kernel.org/r/20250416-pcie-reset-slot-v2-0-efe76b278c10@linaro.org

Changes in v2:
- Moved calling the reset_slot() callback from pcie_do_recovery() to
  pcibios_reset_secondary_bus()
- Link to v1: https://lore.kernel.org/r/20250404-pcie-reset-slot-v1-0-98952918bf90@linaro.org

Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
---
Manivannan Sadhasivam (3):
      PCI/ERR: Add support for resetting the Root Ports in a platform specific way
      PCI: host-common: Add link down handling for Root Ports
      PCI: qcom: Add support for resetting the Root Port due to link down event

Wilfred Mallawa (1):
      PCI: dw-rockchip: Add support to reset Root Port upon link down event

 drivers/pci/controller/dwc/Kconfig            |   2 +
 drivers/pci/controller/dwc/pcie-dw-rockchip.c |  91 ++++++++++++++++++-
 drivers/pci/controller/dwc/pcie-qcom.c        | 120 ++++++++++++++++++++++++--
 drivers/pci/controller/pci-host-common.c      |  33 +++++++
 drivers/pci/controller/pci-host-common.h      |   1 +
 drivers/pci/pci.c                             |  21 +++++
 drivers/pci/pcie/err.c                        |   6 +-
 include/linux/pci.h                           |   1 +
 8 files changed, 260 insertions(+), 15 deletions(-)
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20250715-pci-port-reset-4d9519570123

Best regards,
-- 
Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
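[Illustration, not part of the series: a minimal sketch of how a controller
driver might wire up the two hooks described in the cover letter above.
pci_host_handle_link_down() and pci_host_bridge::reset_root_port are the
names introduced by this series; the foo_pcie driver, its registers, and its
helpers are hypothetical, and the exact signatures are approximated from the
cover letter's description.]

struct foo_pcie {
	void __iomem	*base;
	struct pci_dev	*root_port;	/* cached Root Port pci_dev */
};

/* Threaded IRQ handler, as the cover letter recommends. */
static irqreturn_t foo_pcie_global_irq_thread(int irq, void *data)
{
	struct foo_pcie *pcie = data;
	u32 status = readl(pcie->base + FOO_INT_STATUS);	/* hypothetical register */

	if (status & FOO_INT_LINK_DOWN) {
		/*
		 * Hand the affected Root Port to the PCI core: it recovers
		 * via pcie_do_recovery() if AER support is enabled, otherwise
		 * it falls back to the host bridge's reset_root_port()
		 * callback.
		 */
		pci_host_handle_link_down(pcie->root_port);
	}

	return IRQ_HANDLED;
}

/* Platform-specific Root Port reset, e.g. toggle PERST# and wait for link. */
static void foo_pcie_reset_root_port(struct pci_host_bridge *bridge,
				     struct pci_dev *dev)
{
	struct foo_pcie *pcie = pci_host_bridge_priv(bridge);

	foo_pcie_assert_perst(pcie);		/* hypothetical helpers */
	foo_pcie_deassert_perst(pcie);
	foo_pcie_wait_for_link(pcie);
}

/* In probe, before pci_host_probe(): */
	bridge->reset_root_port = foo_pcie_reset_root_port;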
On Tue, Jul 15, 2025 at 07:51:03PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> Hi,
> 
> Currently, in the event of AER/DPC, the PCI core will try to reset the slot
> (Root Port) and its subordinate devices by invoking bridge control reset and
> FLR. But in some cases, like an AER Fatal error, it might be necessary to
> reset the Root Ports using the PCI host bridge drivers in a platform specific
> way (as indicated by the TODO in the pcie_do_recovery() function in
> drivers/pci/pcie/err.c). Otherwise, the PCI link won't be recovered
> successfully.
> 
> So this series adds a new callback 'pci_host_bridge::reset_root_port' for the
> host bridge drivers to reset the Root Port when a fatal error happens.
> 
> Also, this series allows the host bridge drivers to handle a PCI link down
> event by resetting the Root Ports and recovering the bus. This is
> accomplished with the help of the new 'pci_host_handle_link_down()' API. Host
> bridge drivers are expected to call this API (preferably from a threaded IRQ
> handler) with the relevant Root Port 'pci_dev' when a link down event is
> detected for the port. The API will reuse the pcie_do_recovery() function to
> recover the link if AER support is enabled; otherwise it will directly call
> the reset_root_port() callback of the host bridge driver (if it exists).
> 
> For reference, I've modified the pcie-qcom driver to call the
> pci_host_handle_link_down() API with the Root Port 'pci_dev' after receiving
> the LINK_DOWN global_irq event and populated the
> 'pci_host_bridge::reset_root_port()' callback to reset the Root Port. Since
> the Qcom PCIe controllers support only a single Root Port (slot) per
> controller instance, the API is going to be invoked only once. For multi Root
> Port controllers, the controller driver is expected to detect the Root Port
> that received the link down event and call the pci_host_handle_link_down()
> API with the 'pci_dev' of that Root Port.
> 
> Testing
> -------
> 
> I've lost access to my test setup now. So Krishna (Cced) will help with
> testing on the Qcom platform, and Wilfred or Niklas should be able to test it
> on the Rockchip platform. For the moment, this series is compile tested only.

For the series:

Tested-by: Brian Norris <briannorris@chromium.org>

I've tested the whole thing on Qualcomm SC7280 Herobrine systems with
NVMe. After adding a debugfs node to control toggling PERST, I can force
the link to reset, and see it recover and resume NVMe traffic.

I've tested the first two on Pixel phones, using a non-upstream
DWC-based driver that I'm working on getting in better shape. (We've
previously supported a custom link-error API setup instead.) I'd love to
see this available upstream.

Regards,
Brian
On Thu, Aug 28, 2025 at 01:01:51PM GMT, Brian Norris wrote:
> On Tue, Jul 15, 2025 at 07:51:03PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> > Hi,
> > 
> > Currently, in the event of AER/DPC, the PCI core will try to reset the slot
> > (Root Port) and its subordinate devices by invoking bridge control reset and
> > FLR. But in some cases, like an AER Fatal error, it might be necessary to
> > reset the Root Ports using the PCI host bridge drivers in a platform specific
> > way (as indicated by the TODO in the pcie_do_recovery() function in
> > drivers/pci/pcie/err.c). Otherwise, the PCI link won't be recovered
> > successfully.
> > 
> > So this series adds a new callback 'pci_host_bridge::reset_root_port' for the
> > host bridge drivers to reset the Root Port when a fatal error happens.
> > 
> > Also, this series allows the host bridge drivers to handle a PCI link down
> > event by resetting the Root Ports and recovering the bus. This is
> > accomplished with the help of the new 'pci_host_handle_link_down()' API. Host
> > bridge drivers are expected to call this API (preferably from a threaded IRQ
> > handler) with the relevant Root Port 'pci_dev' when a link down event is
> > detected for the port. The API will reuse the pcie_do_recovery() function to
> > recover the link if AER support is enabled; otherwise it will directly call
> > the reset_root_port() callback of the host bridge driver (if it exists).
> > 
> > For reference, I've modified the pcie-qcom driver to call the
> > pci_host_handle_link_down() API with the Root Port 'pci_dev' after receiving
> > the LINK_DOWN global_irq event and populated the
> > 'pci_host_bridge::reset_root_port()' callback to reset the Root Port. Since
> > the Qcom PCIe controllers support only a single Root Port (slot) per
> > controller instance, the API is going to be invoked only once. For multi Root
> > Port controllers, the controller driver is expected to detect the Root Port
> > that received the link down event and call the pci_host_handle_link_down()
> > API with the 'pci_dev' of that Root Port.
> > 
> > Testing
> > -------
> > 
> > I've lost access to my test setup now. So Krishna (Cced) will help with
> > testing on the Qcom platform, and Wilfred or Niklas should be able to test it
> > on the Rockchip platform. For the moment, this series is compile tested only.
> 
> For the series:
> 
> Tested-by: Brian Norris <briannorris@chromium.org>
> 
> I've tested the whole thing on Qualcomm SC7280 Herobrine systems with
> NVMe. After adding a debugfs node to control toggling PERST, I can force
> the link to reset, and see it recover and resume NVMe traffic.
> 
> I've tested the first two on Pixel phones, using a non-upstream
> DWC-based driver that I'm working on getting in better shape. (We've
> previously supported a custom link-error API setup instead.) I'd love to
> see this available upstream.
> 

Thanks, Brian, for testing! I didn't get time to look into the report from
Niklas (which is the only blocking thing for this series). I'll try to dig
into it today/tomorrow.

- Mani

-- 
மணிவண்ணன் சதாசிவம்
On Tue, Jul 15, 2025 at 07:51:03PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> Testing
> -------
> 
> I've lost access to my test setup now. So Krishna (Cced) will help with
> testing on the Qcom platform, and Wilfred or Niklas should be able to test it
> on the Rockchip platform. For the moment, this series is compile tested only.

This patch series implements two things, so I tested both:

1) Testing sysfs initiated reset:

selftests before sysfs initiated reset:
# FAILED: 14 / 16 tests passed.

# echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset

[ 145.567748] pci-endpoint-test 0000:01:00.0: resetting
[ 145.638755] rockchip-dw-pcie a40000000.pcie: PCIE_CLIENT_INTR_STATUS_MISC: 0x3
[ 145.639472] rockchip-dw-pcie a40000000.pcie: LTSSM_STATUS: 0x230011
[ 145.640063] rockchip-dw-pcie a40000000.pcie: Received Link up event. Starting enumeration!
[ 145.682612] rockchip-dw-pcie a40000000.pcie: PCIe Gen.3 x4 link up
[ 145.683162] rockchip-dw-pcie a40000000.pcie: Root Port reset completed
[ 145.810852] pci-endpoint-test 0000:01:00.0: reset done

selftests after sysfs initiated reset:
# FAILED: 14 / 16 tests passed.

(Without this patch series:
# FAILED: 7 / 16 tests passed.)

So for this part:
Tested-by: Niklas Cassel <cassel@kernel.org>

2) Testing link down reset:

selftests before link down reset:
# FAILED: 14 / 16 tests passed.

## On EP side:
# echo 0 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start && \
  sleep 0.1 && echo 1 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start

[ 111.137162] rockchip-dw-pcie a40000000.pcie: PCIE_CLIENT_INTR_STATUS_MISC: 0x4
[ 111.137881] rockchip-dw-pcie a40000000.pcie: LTSSM_STATUS: 0x0
[ 111.138432] rockchip-dw-pcie a40000000.pcie: hot reset or link-down reset
[ 111.139067] pcieport 0000:00:00.0: Recovering Root Port due to Link Down
[ 111.139686] pci-endpoint-test 0000:01:00.0: AER: can't recover (no error_detected callback)
[ 111.255407] rockchip-dw-pcie a40000000.pcie: PCIe Gen.3 x4 link up
[ 111.256019] rockchip-dw-pcie a40000000.pcie: Root Port reset completed
[ 111.383401] pcieport 0000:00:00.0: Root Port has been reset
[ 111.384060] pcieport 0000:00:00.0: AER: device recovery failed
[ 111.384582] rockchip-dw-pcie a40000000.pcie: PCIE_CLIENT_INTR_STATUS_MISC: 0x3
[ 111.385218] rockchip-dw-pcie a40000000.pcie: LTSSM_STATUS: 0x230011
[ 111.385771] rockchip-dw-pcie a40000000.pcie: Received Link up event. Starting enumeration!
[ 111.390866] pcieport 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 111.391650] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01

Basically all tests timeout:
# FAILED: 1 / 16 tests passed.

Which is the same as before this patch series.

So AFAICT, this part does not seem to work as advertised.

Instead of quickly stopping and starting the link, I also tried to reboot the
EP board, which does the configfs writes and starts the link automatically on
boot, but that had the same result as quickly stopping and starting the link.

Kind regards,
Niklas
On Fri, Jul 18, 2025 at 12:28:44PM +0200, Niklas Cassel wrote:
> On Tue, Jul 15, 2025 at 07:51:03PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> 2) Testing link down reset:
> 
> selftests before link down reset:
> # FAILED: 14 / 16 tests passed.
> 
> ## On EP side:
> # echo 0 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start && \
>   sleep 0.1 && echo 1 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start
> 
> [ 111.137162] rockchip-dw-pcie a40000000.pcie: PCIE_CLIENT_INTR_STATUS_MISC: 0x4
> [ 111.137881] rockchip-dw-pcie a40000000.pcie: LTSSM_STATUS: 0x0
> [ 111.138432] rockchip-dw-pcie a40000000.pcie: hot reset or link-down reset
> [ 111.139067] pcieport 0000:00:00.0: Recovering Root Port due to Link Down
> [ 111.139686] pci-endpoint-test 0000:01:00.0: AER: can't recover (no error_detected callback)
> [ 111.255407] rockchip-dw-pcie a40000000.pcie: PCIe Gen.3 x4 link up
> [ 111.256019] rockchip-dw-pcie a40000000.pcie: Root Port reset completed
> [ 111.383401] pcieport 0000:00:00.0: Root Port has been reset
> [ 111.384060] pcieport 0000:00:00.0: AER: device recovery failed
> [ 111.384582] rockchip-dw-pcie a40000000.pcie: PCIE_CLIENT_INTR_STATUS_MISC: 0x3
> [ 111.385218] rockchip-dw-pcie a40000000.pcie: LTSSM_STATUS: 0x230011
> [ 111.385771] rockchip-dw-pcie a40000000.pcie: Received Link up event. Starting enumeration!
> [ 111.390866] pcieport 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> [ 111.391650] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
> 
> Basically all tests timeout:
> # FAILED: 1 / 16 tests passed.
> 
> Which is the same as before this patch series.

The above was with CONFIG_PCIEAER=y.

Wilfred suggested that I try without this config set.

However, doing so, I got the exact same result:
# FAILED: 1 / 16 tests passed.

For the record, the one test that passes is not actually passing either; it
is the BAR4 test, which is skipped, since BAR4 is reserved on rock5b:
ok 5 pci_ep_bar.BAR4.BAR_TEST # SKIP BAR is disabled

Kind regards,
Niklas
On Fri, Jul 18, 2025 at 12:39:50PM GMT, Niklas Cassel wrote:
> On Fri, Jul 18, 2025 at 12:28:44PM +0200, Niklas Cassel wrote:
> > On Tue, Jul 15, 2025 at 07:51:03PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> > 2) Testing link down reset:
> > 
> > selftests before link down reset:
> > # FAILED: 14 / 16 tests passed.
> > 
> > ## On EP side:
> > # echo 0 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start && \
> >   sleep 0.1 && echo 1 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start
> > 
> > [ 111.137162] rockchip-dw-pcie a40000000.pcie: PCIE_CLIENT_INTR_STATUS_MISC: 0x4
> > [ 111.137881] rockchip-dw-pcie a40000000.pcie: LTSSM_STATUS: 0x0
> > [ 111.138432] rockchip-dw-pcie a40000000.pcie: hot reset or link-down reset
> > [ 111.139067] pcieport 0000:00:00.0: Recovering Root Port due to Link Down
> > [ 111.139686] pci-endpoint-test 0000:01:00.0: AER: can't recover (no error_detected callback)
> > [ 111.255407] rockchip-dw-pcie a40000000.pcie: PCIe Gen.3 x4 link up
> > [ 111.256019] rockchip-dw-pcie a40000000.pcie: Root Port reset completed
> > [ 111.383401] pcieport 0000:00:00.0: Root Port has been reset
> > [ 111.384060] pcieport 0000:00:00.0: AER: device recovery failed
> > [ 111.384582] rockchip-dw-pcie a40000000.pcie: PCIE_CLIENT_INTR_STATUS_MISC: 0x3
> > [ 111.385218] rockchip-dw-pcie a40000000.pcie: LTSSM_STATUS: 0x230011
> > [ 111.385771] rockchip-dw-pcie a40000000.pcie: Received Link up event. Starting enumeration!
> > [ 111.390866] pcieport 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> > [ 111.391650] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
> > 
> > Basically all tests timeout:
> > # FAILED: 1 / 16 tests passed.
> > 
> > Which is the same as before this patch series.
> 
> The above was with CONFIG_PCIEAER=y.
> 

This is kind of expected, since the pci_endpoint_test driver doesn't have the
AER err_handlers defined.

> Wilfred suggested that I try without this config set.
> 
> However, doing so, I got the exact same result:
> # FAILED: 1 / 16 tests passed.
> 

Interesting. Could you please share the dmesg log, like above?

- Mani

-- 
மணிவண்ணன் சதாசிவம்
Hello Mani,

Sorry for the delayed reply.
I just came back from vacation.

On Thu, Jul 24, 2025 at 11:00:05AM +0530, Manivannan Sadhasivam wrote:
> On Fri, Jul 18, 2025 at 12:39:50PM GMT, Niklas Cassel wrote:
> > On Fri, Jul 18, 2025 at 12:28:44PM +0200, Niklas Cassel wrote:
> > > On Tue, Jul 15, 2025 at 07:51:03PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> > > 2) Testing link down reset:
> > > 
> > > selftests before link down reset:
> > > # FAILED: 14 / 16 tests passed.
> > > 
> > > ## On EP side:
> > > # echo 0 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start && \
> > >   sleep 0.1 && echo 1 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start
> > > 
> > > [ 111.137162] rockchip-dw-pcie a40000000.pcie: PCIE_CLIENT_INTR_STATUS_MISC: 0x4
> > > [ 111.137881] rockchip-dw-pcie a40000000.pcie: LTSSM_STATUS: 0x0
> > > [ 111.138432] rockchip-dw-pcie a40000000.pcie: hot reset or link-down reset
> > > [ 111.139067] pcieport 0000:00:00.0: Recovering Root Port due to Link Down
> > > [ 111.139686] pci-endpoint-test 0000:01:00.0: AER: can't recover (no error_detected callback)
> > > [ 111.255407] rockchip-dw-pcie a40000000.pcie: PCIe Gen.3 x4 link up
> > > [ 111.256019] rockchip-dw-pcie a40000000.pcie: Root Port reset completed
> > > [ 111.383401] pcieport 0000:00:00.0: Root Port has been reset
> > > [ 111.384060] pcieport 0000:00:00.0: AER: device recovery failed
> > > [ 111.384582] rockchip-dw-pcie a40000000.pcie: PCIE_CLIENT_INTR_STATUS_MISC: 0x3
> > > [ 111.385218] rockchip-dw-pcie a40000000.pcie: LTSSM_STATUS: 0x230011
> > > [ 111.385771] rockchip-dw-pcie a40000000.pcie: Received Link up event. Starting enumeration!
> > > [ 111.390866] pcieport 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> > > [ 111.391650] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
> > > 
> > > Basically all tests timeout:
> > > # FAILED: 1 / 16 tests passed.
> > > 
> > > Which is the same as before this patch series.
> > 
> > The above was with CONFIG_PCIEAER=y.
> 
> This is kind of expected, since the pci_endpoint_test driver doesn't have the
> AER err_handlers defined.

I see.
Would be nice if we could add them then, so that we can verify that this
series is working as intended.

> > Wilfred suggested that I try without this config set.
> > 
> > However, doing so, I got the exact same result:
> > # FAILED: 1 / 16 tests passed.
> 
> Interesting. Could you please share the dmesg log, like above?

It is looking exactly like the dmesg above:

[ 86.820059] rockchip-dw-pcie a40000000.pcie: PCIE_CLIENT_INTR_STATUS_MISC: 0x4
[ 86.820791] rockchip-dw-pcie a40000000.pcie: LTSSM_STATUS: 0x0
[ 86.821344] rockchip-dw-pcie a40000000.pcie: hot reset or link-down reset
[ 86.821978] pcieport 0000:00:00.0: Recovering Root Port due to Link Down
[ 87.040551] rockchip-dw-pcie a40000000.pcie: PCIe Gen.3 x4 link up
[ 87.041138] rockchip-dw-pcie a40000000.pcie: Root Port reset completed
[ 87.168378] pcieport 0000:00:00.0: Root Port has been reset
[ 87.168882] rockchip-dw-pcie a40000000.pcie: PCIE_CLIENT_INTR_STATUS_MISC: 0x3
[ 87.169519] rockchip-dw-pcie a40000000.pcie: LTSSM_STATUS: 0x230011
[ 87.272463] rockchip-dw-pcie a40000000.pcie: Received Link up event. Starting enumeration!
[ 87.277552] pcieport 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 87.278314] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01

except that we don't get the:

> [ 111.139686] pci-endpoint-test 0000:01:00.0: AER: can't recover (no error_detected callback)
> [ 111.384060] pcieport 0000:00:00.0: AER: device recovery failed

prints.

Kind regards,
Niklas
On Fri, Aug 15, 2025 at 11:07:42AM GMT, Niklas Cassel wrote:
> Hello Mani,
> 
> Sorry for the delayed reply.
> I just came back from vacation.
> 
> On Thu, Jul 24, 2025 at 11:00:05AM +0530, Manivannan Sadhasivam wrote:
> > On Fri, Jul 18, 2025 at 12:39:50PM GMT, Niklas Cassel wrote:
> > > On Fri, Jul 18, 2025 at 12:28:44PM +0200, Niklas Cassel wrote:
> > > > On Tue, Jul 15, 2025 at 07:51:03PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> > > > 2) Testing link down reset:
> > > > 
> > > > selftests before link down reset:
> > > > # FAILED: 14 / 16 tests passed.
> > > > 
> > > > ## On EP side:
> > > > # echo 0 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start && \
> > > >   sleep 0.1 && echo 1 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start
> > > > 
> > > > [ 111.137162] rockchip-dw-pcie a40000000.pcie: PCIE_CLIENT_INTR_STATUS_MISC: 0x4
> > > > [ 111.137881] rockchip-dw-pcie a40000000.pcie: LTSSM_STATUS: 0x0
> > > > [ 111.138432] rockchip-dw-pcie a40000000.pcie: hot reset or link-down reset
> > > > [ 111.139067] pcieport 0000:00:00.0: Recovering Root Port due to Link Down
> > > > [ 111.139686] pci-endpoint-test 0000:01:00.0: AER: can't recover (no error_detected callback)
> > > > [ 111.255407] rockchip-dw-pcie a40000000.pcie: PCIe Gen.3 x4 link up
> > > > [ 111.256019] rockchip-dw-pcie a40000000.pcie: Root Port reset completed
> > > > [ 111.383401] pcieport 0000:00:00.0: Root Port has been reset
> > > > [ 111.384060] pcieport 0000:00:00.0: AER: device recovery failed
> > > > [ 111.384582] rockchip-dw-pcie a40000000.pcie: PCIE_CLIENT_INTR_STATUS_MISC: 0x3
> > > > [ 111.385218] rockchip-dw-pcie a40000000.pcie: LTSSM_STATUS: 0x230011
> > > > [ 111.385771] rockchip-dw-pcie a40000000.pcie: Received Link up event. Starting enumeration!
> > > > [ 111.390866] pcieport 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> > > > [ 111.391650] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
> > > > 
> > > > Basically all tests timeout:
> > > > # FAILED: 1 / 16 tests passed.
> > > > 
> > > > Which is the same as before this patch series.
> > > 
> > > The above was with CONFIG_PCIEAER=y.
> > 
> > This is kind of expected, since the pci_endpoint_test driver doesn't have the
> > AER err_handlers defined.
> 
> I see.
> Would be nice if we could add them then, so that we can verify that this
> series is working as intended.
> 
> > > Wilfred suggested that I try without this config set.
> > > 
> > > However, doing so, I got the exact same result:
> > > # FAILED: 1 / 16 tests passed.
> > 
> > Interesting. Could you please share the dmesg log, like above?
> 
> It is looking exactly like the dmesg above:
> 
> [ 86.820059] rockchip-dw-pcie a40000000.pcie: PCIE_CLIENT_INTR_STATUS_MISC: 0x4
> [ 86.820791] rockchip-dw-pcie a40000000.pcie: LTSSM_STATUS: 0x0
> [ 86.821344] rockchip-dw-pcie a40000000.pcie: hot reset or link-down reset
> [ 86.821978] pcieport 0000:00:00.0: Recovering Root Port due to Link Down
> [ 87.040551] rockchip-dw-pcie a40000000.pcie: PCIe Gen.3 x4 link up
> [ 87.041138] rockchip-dw-pcie a40000000.pcie: Root Port reset completed
> [ 87.168378] pcieport 0000:00:00.0: Root Port has been reset
> [ 87.168882] rockchip-dw-pcie a40000000.pcie: PCIE_CLIENT_INTR_STATUS_MISC: 0x3
> [ 87.169519] rockchip-dw-pcie a40000000.pcie: LTSSM_STATUS: 0x230011
> [ 87.272463] rockchip-dw-pcie a40000000.pcie: Received Link up event. Starting enumeration!
> [ 87.277552] pcieport 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> [ 87.278314] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
> 
> except that we don't get the:
> > [ 111.139686] pci-endpoint-test 0000:01:00.0: AER: can't recover (no error_detected callback)
> > [ 111.384060] pcieport 0000:00:00.0: AER: device recovery failed
> 
> prints.

Ok, thanks for the logs. I guess what is happening here is that we are not
saving/restoring the config space of the devices under the Root Port when
link down happens. TBH, we cannot do that from the PCI core, since once link
down happens, we cannot access any devices underneath the Root Port. But if
err_handlers are available in the drivers for all devices, they could do
something smart like below:

diff --git a/drivers/misc/pci_endpoint_test.c b/drivers/misc/pci_endpoint_test.c
index c4e5e2c977be..9aabf1fe902e 100644
--- a/drivers/misc/pci_endpoint_test.c
+++ b/drivers/misc/pci_endpoint_test.c
@@ -989,6 +989,8 @@ static int pci_endpoint_test_probe(struct pci_dev *pdev,
 
 	pci_set_drvdata(pdev, test);
 
+	pci_save_state(pdev);
+
 	id = ida_alloc(&pci_endpoint_test_ida, GFP_KERNEL);
 	if (id < 0) {
 		ret = id;
@@ -1140,12 +1142,31 @@ static const struct pci_device_id pci_endpoint_test_tbl[] = {
 };
 MODULE_DEVICE_TABLE(pci, pci_endpoint_test_tbl);
 
+static pci_ers_result_t pci_endpoint_test_error_detected(struct pci_dev *pdev,
+							 pci_channel_state_t state)
+{
+	return PCI_ERS_RESULT_NEED_RESET;
+}
+
+static pci_ers_result_t pci_endpoint_test_slot_reset(struct pci_dev *pdev)
+{
+	pci_restore_state(pdev);
+
+	return PCI_ERS_RESULT_RECOVERED;
+}
+
+static const struct pci_error_handlers pci_endpoint_test_err_handler = {
+	.error_detected = pci_endpoint_test_error_detected,
+	.slot_reset = pci_endpoint_test_slot_reset,
+};
+
 static struct pci_driver pci_endpoint_test_driver = {
 	.name = DRV_MODULE_NAME,
 	.id_table = pci_endpoint_test_tbl,
 	.probe = pci_endpoint_test_probe,
 	.remove = pci_endpoint_test_remove,
 	.sriov_configure = pci_sriov_configure_simple,
+	.err_handler = &pci_endpoint_test_err_handler,
 };
 module_pci_driver(pci_endpoint_test_driver);

This essentially saves the known good config space during probe and restores
it during the slot_reset() callback. Ofc, the state would've been overwritten
if suspend/resume happens in-between, but the point I'm making is that unless
all device drivers restore their known config space, devices cannot be
resumed properly post linkdown recovery.

I can add a patch based on the above diff in the next revision if that helps.
Right now, I do not have access to my endpoint test setup, so I can't test
anything.

- Mani

-- 
மணிவண்ணன் சதாசிவம்
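[For context, a simplified view of the recovery sequence that the diff above
plugs into, paraphrased from Documentation/PCI/pci-error-recovery.rst; this is
not the literal code in drivers/pci/pcie/err.c, and the reset step reflects
this series:]

/*
 * pci_host_handle_link_down(rp)
 *   -> pcie_do_recovery(rp, pci_channel_io_frozen, ...)
 *        -> error_detected(dev, state) on each device below the Root Port;
 *           a driver without this callback yields "can't recover"
 *        -> reset the Root Port (with this series, via
 *           pci_host_bridge::reset_root_port)
 *        -> slot_reset(dev) on each device; the driver restores its saved
 *           config space here (see the diff above)
 *        -> resume(dev) on each device
 */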
Hello Mani,

On Fri, Aug 29, 2025 at 09:44:08PM +0530, Manivannan Sadhasivam wrote:
> On Fri, Aug 15, 2025 at 11:07:42AM GMT, Niklas Cassel wrote:

(snip)

> > > > > ## On EP side:
> > > > > # echo 0 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start && \
> > > > >   sleep 0.1 && echo 1 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start
> > > > > 
> > > > > Basically all tests timeout:
> > > > > # FAILED: 1 / 16 tests passed.
> > > > > 
> > > > > Which is the same as before this patch series.
> > > 
> > > This is kind of expected, since the pci_endpoint_test driver doesn't have the
> > > AER err_handlers defined.
> > 
> > I see.
> > Would be nice if we could add them then, so that we can verify that this
> > series is working as intended.

(snip)

> Ok, thanks for the logs. I guess what is happening here is that we are not
> saving/restoring the config space of the devices under the Root Port when
> link down happens. TBH, we cannot do that from the PCI core, since once link
> down happens, we cannot access any devices underneath the Root Port. But if
> err_handlers are available in the drivers for all devices, they could do
> something smart like below:
> 
> diff --git a/drivers/misc/pci_endpoint_test.c b/drivers/misc/pci_endpoint_test.c
> index c4e5e2c977be..9aabf1fe902e 100644
> --- a/drivers/misc/pci_endpoint_test.c
> +++ b/drivers/misc/pci_endpoint_test.c
> @@ -989,6 +989,8 @@ static int pci_endpoint_test_probe(struct pci_dev *pdev,
>  
>  	pci_set_drvdata(pdev, test);
>  
> +	pci_save_state(pdev);
> +
>  	id = ida_alloc(&pci_endpoint_test_ida, GFP_KERNEL);
>  	if (id < 0) {
>  		ret = id;
> @@ -1140,12 +1142,31 @@ static const struct pci_device_id pci_endpoint_test_tbl[] = {
>  };
>  MODULE_DEVICE_TABLE(pci, pci_endpoint_test_tbl);
>  
> +static pci_ers_result_t pci_endpoint_test_error_detected(struct pci_dev *pdev,
> +							 pci_channel_state_t state)
> +{
> +	return PCI_ERS_RESULT_NEED_RESET;
> +}
> +
> +static pci_ers_result_t pci_endpoint_test_slot_reset(struct pci_dev *pdev)
> +{
> +	pci_restore_state(pdev);
> +
> +	return PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +static const struct pci_error_handlers pci_endpoint_test_err_handler = {
> +	.error_detected = pci_endpoint_test_error_detected,
> +	.slot_reset = pci_endpoint_test_slot_reset,
> +};
> +
>  static struct pci_driver pci_endpoint_test_driver = {
>  	.name = DRV_MODULE_NAME,
>  	.id_table = pci_endpoint_test_tbl,
>  	.probe = pci_endpoint_test_probe,
>  	.remove = pci_endpoint_test_remove,
>  	.sriov_configure = pci_sriov_configure_simple,
> +	.err_handler = &pci_endpoint_test_err_handler,
>  };
>  module_pci_driver(pci_endpoint_test_driver);
> 
> This essentially saves the known good config space during probe and restores
> it during the slot_reset() callback. Ofc, the state would've been overwritten
> if suspend/resume happens in-between, but the point I'm making is that unless
> all device drivers restore their known config space, devices cannot be
> resumed properly post linkdown recovery.
> 
> I can add a patch based on the above diff in the next revision if that helps.
> Right now, I do not have access to my endpoint test setup, so I can't test
> anything.

I tested your patch series + your suggested change above, and after a:

## On EP side:
# echo 0 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start && \
  sleep 0.1 && echo 1 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start

instead of:
# FAILED: 1 / 16 tests passed.

I now get:
# FAILED: 7 / 16 tests passed.

Test cases 1-7 now pass (the test cases related to BARs); all other test
cases still fail:

# /pcitest
TAP version 13
1..16
# Starting 16 tests from 9 test cases.
# RUN pci_ep_bar.BAR0.BAR_TEST ...
# OK pci_ep_bar.BAR0.BAR_TEST
ok 1 pci_ep_bar.BAR0.BAR_TEST
# RUN pci_ep_bar.BAR1.BAR_TEST ...
# OK pci_ep_bar.BAR1.BAR_TEST
ok 2 pci_ep_bar.BAR1.BAR_TEST
# RUN pci_ep_bar.BAR2.BAR_TEST ...
# OK pci_ep_bar.BAR2.BAR_TEST
ok 3 pci_ep_bar.BAR2.BAR_TEST
# RUN pci_ep_bar.BAR3.BAR_TEST ...
# OK pci_ep_bar.BAR3.BAR_TEST
ok 4 pci_ep_bar.BAR3.BAR_TEST
# RUN pci_ep_bar.BAR4.BAR_TEST ...
# SKIP BAR is disabled
# OK pci_ep_bar.BAR4.BAR_TEST
ok 5 pci_ep_bar.BAR4.BAR_TEST # SKIP BAR is disabled
# RUN pci_ep_bar.BAR5.BAR_TEST ...
# OK pci_ep_bar.BAR5.BAR_TEST
ok 6 pci_ep_bar.BAR5.BAR_TEST
# RUN pci_ep_basic.CONSECUTIVE_BAR_TEST ...
# OK pci_ep_basic.CONSECUTIVE_BAR_TEST
ok 7 pci_ep_basic.CONSECUTIVE_BAR_TEST
# RUN pci_ep_basic.LEGACY_IRQ_TEST ...
# pci_endpoint_test.c:106:LEGACY_IRQ_TEST:Expected 0 (0) == ret (-110)
# pci_endpoint_test.c:106:LEGACY_IRQ_TEST:Test failed for Legacy IRQ
# LEGACY_IRQ_TEST: Test failed
# FAIL pci_ep_basic.LEGACY_IRQ_TEST
not ok 8 pci_ep_basic.LEGACY_IRQ_TEST
# RUN pci_ep_basic.MSI_TEST ...
# pci_endpoint_test.c:118:MSI_TEST:Expected 0 (0) == ret (-110)
# pci_endpoint_test.c:118:MSI_TEST:Test failed for MSI1
# pci_endpoint_test.c:118:MSI_TEST:Expected 0 (0) == ret (-110)
# pci_endpoint_test.c:118:MSI_TEST:Test failed for MSI2
# pci_endpoint_test.c:118:MSI_TEST:Expected 0 (0) == ret (-110)
# pci_endpoint_test.c:118:MSI_TEST:Test failed for MSI3
...

I think I know the reason... you save the state before the IRQs have been
allocated. Perhaps we need to save the state after enabling IRQs?

I tried this patch on top of your patch:

--- a/drivers/misc/pci_endpoint_test.c
+++ b/drivers/misc/pci_endpoint_test.c
@@ -851,6 +851,8 @@ static int pci_endpoint_test_set_irq(struct pci_endpoint_test *test,
 		return ret;
 	}
 
+	pci_save_state(pdev);
+
 	return 0;
 }

But still:
# FAILED: 7 / 16 tests passed.

So... apparently that did not help...

I tried with the following change as well (on top of my patch above):

+static pci_ers_result_t pci_endpoint_test_slot_reset(struct pci_dev *pdev)
+{
+	struct pci_endpoint_test *test = pci_get_drvdata(pdev);
+	int irq_type = test->irq_type;
+
+	pci_restore_state(pdev);
+
+	if (irq_type != PCITEST_IRQ_TYPE_UNDEFINED) {
+		pci_endpoint_test_clear_irq(test);
+		pci_endpoint_test_set_irq(test, irq_type);
+	}
+
+	return PCI_ERS_RESULT_RECOVERED;
+}

But still only:
# FAILED: 7 / 16 tests passed.

Do you have any suggestions?

Kind regards,
Niklas
On 7/15/2025 7:51 PM, Manivannan Sadhasivam via B4 Relay wrote:
> Hi,
> 
> Currently, in the event of AER/DPC, the PCI core will try to reset the slot
> (Root Port) and its subordinate devices by invoking bridge control reset and
> FLR. But in some cases, like an AER Fatal error, it might be necessary to
> reset the Root Ports using the PCI host bridge drivers in a platform specific
> way (as indicated by the TODO in the pcie_do_recovery() function in
> drivers/pci/pcie/err.c). Otherwise, the PCI link won't be recovered
> successfully.
> 
> So this series adds a new callback 'pci_host_bridge::reset_root_port' for the
> host bridge drivers to reset the Root Port when a fatal error happens.
> 
> Also, this series allows the host bridge drivers to handle a PCI link down
> event by resetting the Root Ports and recovering the bus. This is
> accomplished with the help of the new 'pci_host_handle_link_down()' API. Host
> bridge drivers are expected to call this API (preferably from a threaded IRQ
> handler) with the relevant Root Port 'pci_dev' when a link down event is
> detected for the port. The API will reuse the pcie_do_recovery() function to
> recover the link if AER support is enabled; otherwise it will directly call
> the reset_root_port() callback of the host bridge driver (if it exists).
> 
> For reference, I've modified the pcie-qcom driver to call the
> pci_host_handle_link_down() API with the Root Port 'pci_dev' after receiving
> the LINK_DOWN global_irq event and populated the
> 'pci_host_bridge::reset_root_port()' callback to reset the Root Port. Since
> the Qcom PCIe controllers support only a single Root Port (slot) per
> controller instance, the API is going to be invoked only once. For multi Root
> Port controllers, the controller driver is expected to detect the Root Port
> that received the link down event and call the pci_host_handle_link_down()
> API with the 'pci_dev' of that Root Port.
> 
> Testing
> -------
> 
> I've lost access to my test setup now. So Krishna (Cced) will help with
> testing on the Qcom platform, and Wilfred or Niklas should be able to test it
> on the Rockchip platform. For the moment, this series is compile tested only.

Tested on QCOM platform rb3gen2.

> 
> Changes in v6:
> - Incorporated the patch: https://lore.kernel.org/all/20250524185304.26698-2-manivannan.sadhasivam@linaro.org/
> - Link to v5: https://lore.kernel.org/r/20250715-pci-port-reset-v5-0-26a5d278db40@oss.qualcomm.com
> 
> Changes in v5:
> * Reworked pci_host_handle_link_down() to accept a Root Port instead of
>   resetting all Root Ports in the event of link down.
> * Renamed 'reset_slot' to 'reset_root_port' to avoid confusion, as both terms
>   were used interchangeably and the series is intended to reset the Root Port
>   only.
> * Added the Rockchip driver change to this series.
> * Dropped the applied patches and review/tested tags due to rework.
> * Rebased on top of v6.16-rc1.
> 
> Changes in v4:
> - Handled link down first in the irq handler
> - Updated ICC & OPP bandwidth after link up in the reset_slot() callback
> - Link to v3: https://lore.kernel.org/r/20250417-pcie-reset-slot-v3-0-59a10811c962@linaro.org
> 
> Changes in v3:
> - Made the pci-host-common driver a common library for host controller drivers
> - Moved the reset slot code to the pci-host-common library
> - Link to v2: https://lore.kernel.org/r/20250416-pcie-reset-slot-v2-0-efe76b278c10@linaro.org
> 
> Changes in v2:
> - Moved calling the reset_slot() callback from pcie_do_recovery() to
>   pcibios_reset_secondary_bus()
> - Link to v1: https://lore.kernel.org/r/20250404-pcie-reset-slot-v1-0-98952918bf90@linaro.org
> 
> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>

Tested-by: Krishna Chaitanya Chundru <krishna.chundru@oss.qualcomm.com>

- Krishna Chaitanya.

> ---
> Manivannan Sadhasivam (3):
>       PCI/ERR: Add support for resetting the Root Ports in a platform specific way
>       PCI: host-common: Add link down handling for Root Ports
>       PCI: qcom: Add support for resetting the Root Port due to link down event
> 
> Wilfred Mallawa (1):
>       PCI: dw-rockchip: Add support to reset Root Port upon link down event
> 
>  drivers/pci/controller/dwc/Kconfig            |   2 +
>  drivers/pci/controller/dwc/pcie-dw-rockchip.c |  91 ++++++++++++++++++-
>  drivers/pci/controller/dwc/pcie-qcom.c        | 120 ++++++++++++++++++++++++--
>  drivers/pci/controller/pci-host-common.c      |  33 +++++++
>  drivers/pci/controller/pci-host-common.h      |   1 +
>  drivers/pci/pci.c                             |  21 +++++
>  drivers/pci/pcie/err.c                        |   6 +-
>  include/linux/pci.h                           |   1 +
>  8 files changed, 260 insertions(+), 15 deletions(-)
> ---
> base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
> change-id: 20250715-pci-port-reset-4d9519570123
> 
> Best regards,
I have been testing this series on the 6.17 pre-releases, lightly patched by
the Collabora [1] and MNT Reform [2] teams. I have been testing on bare
hardware, on MNT Research's Pocket Reform product. I'm afraid I can only
offer CI-level feedback, but in case it helps:

1) The series now applies cleanly onto Collabora's rockchip-devel branch.
2) The resulting kernel boots and runs OK.
3) The resulting kernel still fails the "platform" pm_test [3] with
   "rockchip-dw-pcie a40c00000.pcie: Phy link never came up".

Of course there could be other reasons for (3); I don't know that much about
it. I'm happy to test a newer version of the series if/when it exists.

d

[1]: https://gitlab.collabora.com/hardware-enablement/rockchip-3588/linux.git#rockchip-devel
[2]: https://salsa.debian.org/bremner/collabora-rockchip-3588#reform-patches
[3]: https://www.cs.unb.ca/~bremner/blog/posts/hibernate-pocket-12/
Hello David,

On Tue, Sep 23, 2025 at 10:06:13AM -0300, David Bremner wrote:
> 
> I have been testing this series on the 6.17 pre-releases, lightly patched
> by the Collabora [1] and MNT Reform [2] teams. I have been testing on bare
> hardware, on MNT Research's Pocket Reform product. I'm afraid I can only
> offer CI-level feedback, but in case it helps:
> 
> 1) The series now applies cleanly onto Collabora's rockchip-devel branch.
> 2) The resulting kernel boots and runs OK.
> 3) The resulting kernel still fails the "platform" pm_test [3] with
>    "rockchip-dw-pcie a40c00000.pcie: Phy link never came up"
> 
> Of course there could be other reasons for (3); I don't know that much
> about it.

I don't think this driver has support for hibernate in upstream.

Did you try forward porting this patch:
https://lore.kernel.org/linux-pci/1744940759-23823-1-git-send-email-shawn.lin@rock-chips.com/

Also, have you tried the downstream driver? If it is working there, perhaps
you could try to isolate the changes that make it work there. Otherwise, I
would recommend that you reach out to the author of the patch above and ask
for support.

Kind regards,
Niklas