drivers/net/wwan/t7xx/t7xx_pci.c | 4 ++++ 1 file changed, 4 insertions(+)
If driver fails to set the device to suspend, it means that the
device is abnormal. In this case, reset the device to recover
when PCIe device is offline.
Signed-off-by: Jinjian Song <jinjian.song@fibocom.com>
---
V2:
* Add judgment, reset when device is offline
---
drivers/net/wwan/t7xx/t7xx_pci.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/net/wwan/t7xx/t7xx_pci.c b/drivers/net/wwan/t7xx/t7xx_pci.c
index e556e5bd49ab..4f89a353588b 100644
--- a/drivers/net/wwan/t7xx/t7xx_pci.c
+++ b/drivers/net/wwan/t7xx/t7xx_pci.c
@@ -427,6 +427,10 @@ static int __t7xx_pci_pm_suspend(struct pci_dev *pdev)
iowrite32(T7XX_L1_BIT(0), IREG_BASE(t7xx_dev) + ENABLE_ASPM_LOWPWR);
atomic_set(&t7xx_dev->md_pm_state, MTK_PM_RESUMED);
t7xx_pcie_mac_set_int(t7xx_dev, SAP_RGU_INT);
+ if (pci_channel_offline(pdev)) {
+ dev_err(&pdev->dev, "Device offline, reset to recover\n");
+ t7xx_reset_device(t7xx_dev, PLDR);
+ }
return ret;
}
--
2.34.1
Hello Jinjian, On 22.10.2024 11:43, Jinjian Song wrote: > If driver fails to set the device to suspend, it means that the > device is abnormal. In this case, reset the device to recover > when PCIe device is offline. Is it a reproducible or a speculative issue? Does the fix recover modem from a problematic state? Anyway we need someone more familiar with this hardware (Intel or MediaTek engineer) to Ack the change to make sure we are not going to put a system in a more complicated state. > Signed-off-by: Jinjian Song <jinjian.song@fibocom.com> > --- > V2: > * Add judgment, reset when device is offline > --- > drivers/net/wwan/t7xx/t7xx_pci.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/drivers/net/wwan/t7xx/t7xx_pci.c b/drivers/net/wwan/t7xx/t7xx_pci.c > index e556e5bd49ab..4f89a353588b 100644 > --- a/drivers/net/wwan/t7xx/t7xx_pci.c > +++ b/drivers/net/wwan/t7xx/t7xx_pci.c > @@ -427,6 +427,10 @@ static int __t7xx_pci_pm_suspend(struct pci_dev *pdev) > iowrite32(T7XX_L1_BIT(0), IREG_BASE(t7xx_dev) + ENABLE_ASPM_LOWPWR); > atomic_set(&t7xx_dev->md_pm_state, MTK_PM_RESUMED); > t7xx_pcie_mac_set_int(t7xx_dev, SAP_RGU_INT); > + if (pci_channel_offline(pdev)) { > + dev_err(&pdev->dev, "Device offline, reset to recover\n"); > + t7xx_reset_device(t7xx_dev, PLDR); > + } > return ret; > } -- Sergey
From: Sergey Ryazanov <ryazanov.s.a@gmail.com> >Hello Jinjian, > >On 22.10.2024 11:43, Jinjian Song wrote: >> If driver fails to set the device to suspend, it means that the >> device is abnormal. In this case, reset the device to recover >> when PCIe device is offline. > >Is it a reproducible or a speculative issue? Does the fix recover modem >from a problematic state? > >Anyway we need someone more familiar with this hardware (Intel or >MediaTek engineer) to Ack the change to make sure we are not going to >put a system in a more complicated state. Hi Sergey, This is a very difficult issue to replicate onece occured and fixed. The issue occured when driver and device lost the connection. I have encountered this problem twice so far: 1. During suspend/resume stress test, there was a probabilistic D3L2 time sequence issue with the BIOS, result in PCIe link down, driver read and write the register of device invalid, so suspend failed. This issue was eventually fixed in the BIOS and I was able to restore it through the reset module after reproducing the problem. 2. During idle test, the modem probabilistic hang up, result in PCIe link down, driver read and write the register of device invalid, so suspend failed. This issue was eventually fiex in device modem firmware by adjust a certain power supply voltage, and reset modem as a workround to restore when the MBIM port command timeout in userspace applycations. Hardware reset modem to recover was discussed with MTK, and they said that if we don't want to keep the on-site problem location in case of suspend failure, we can use the recover solution. Both the ocurred issues result in the PCIe link issue, driver can't read and writer the register of WWAN device, so I want to add this path to restore, hardware reset modem can recover modem, but using the pci_channle_offline() as the judgment is my inference. Thanks. >> Signed-off-by: Jinjian Song <jinjian.song@fibocom.com> >> --- >> V2: >> * Add judgment, reset when device is offline >> --- >> drivers/net/wwan/t7xx/t7xx_pci.c | 4 ++++ >> 1 file changed, 4 insertions(+) >> >> diff --git a/drivers/net/wwan/t7xx/t7xx_pci.c b/drivers/net/wwan/t7xx/t7xx_pci.c >> index e556e5bd49ab..4f89a353588b 100644 >> --- a/drivers/net/wwan/t7xx/t7xx_pci.c >> +++ b/drivers/net/wwan/t7xx/t7xx_pci.c >> @@ -427,6 +427,10 @@ static int __t7xx_pci_pm_suspend(struct pci_dev *pdev) >> iowrite32(T7XX_L1_BIT(0), IREG_BASE(t7xx_dev) + ENABLE_ASPM_LOWPWR); >> atomic_set(&t7xx_dev->md_pm_state, MTK_PM_RESUMED); >> t7xx_pcie_mac_set_int(t7xx_dev, SAP_RGU_INT); >> + if (pci_channel_offline(pdev)) { >> + dev_err(&pdev->dev, "Device offline, reset to recover\n"); >> + t7xx_reset_device(t7xx_dev, PLDR); >> + } >> return ret; >> } > >-- >Sergey > Best Regards, Jinjian.
Hi Jinjian, On 29.10.2024 05:46, Jinjian Song wrote: > From: Sergey Ryazanov <ryazanov.s.a@gmail.com> >> On 22.10.2024 11:43, Jinjian Song wrote: >>> If driver fails to set the device to suspend, it means that the >>> device is abnormal. In this case, reset the device to recover >>> when PCIe device is offline. >> >> Is it a reproducible or a speculative issue? Does the fix recover >> modem from a problematic state? >> >> Anyway we need someone more familiar with this hardware (Intel or >> MediaTek engineer) to Ack the change to make sure we are not going to >> put a system in a more complicated state. > > Hi Sergey, > > This is a very difficult issue to replicate onece occured and fixed. > > The issue occured when driver and device lost the connection. I have > encountered this problem twice so far: > 1. During suspend/resume stress test, there was a probabilistic D3L2 > time sequence issue with the BIOS, result in PCIe link down, driver > read and write the register of device invalid, so suspend failed. > This issue was eventually fixed in the BIOS and I was able to restore > it through the reset module after reproducing the problem. > > 2. During idle test, the modem probabilistic hang up, result in PCIe > link down, driver read and write the register of device invalid, so > suspend failed. This issue was eventually fiex in device modem firmware > by adjust a certain power supply voltage, and reset modem as a workround > to restore when the MBIM port command timeout in userspace applycations. > > Hardware reset modem to recover was discussed with MTK, and they said > that if we don't want to keep the on-site problem location in case of > suspend failure, we can use the recover solution. > Both the ocurred issues result in the PCIe link issue, driver can't read > and writer the register of WWAN device, so I want to add this path > to restore, hardware reset modem can recover modem, but using the > pci_channle_offline() as the judgment is my inference. Thank you for the clarification. Let me summarize what I've understood from the explanation: a) there were hardware (firmware) issues, b) issues already were solved, c) issues were not directly related to the device suspension procedure, d) you want to implement a backup plan to make the modem support robust. If got it right, then I would like to recommend to implement a generic error handling solution for the PCIe interface. You can check this document: Documentation/PCI/pci-error-recovery.rst Suddenly, I am not an expert in the PCIe link recovery procedure, so I've CCed this message to PCI subsystem maintainers. I hope they can suggest a conceptually correct way to handle these cases. >>> Signed-off-by: Jinjian Song <jinjian.song@fibocom.com> >>> --- >>> V2: >>> * Add judgment, reset when device is offline >>> --- >>> drivers/net/wwan/t7xx/t7xx_pci.c | 4 ++++ >>> 1 file changed, 4 insertions(+) >>> >>> diff --git a/drivers/net/wwan/t7xx/t7xx_pci.c b/drivers/net/wwan/ >>> t7xx/t7xx_pci.c >>> index e556e5bd49ab..4f89a353588b 100644 >>> --- a/drivers/net/wwan/t7xx/t7xx_pci.c >>> +++ b/drivers/net/wwan/t7xx/t7xx_pci.c >>> @@ -427,6 +427,10 @@ static int __t7xx_pci_pm_suspend(struct pci_dev >>> *pdev) >>> iowrite32(T7XX_L1_BIT(0), IREG_BASE(t7xx_dev) + >>> ENABLE_ASPM_LOWPWR); >>> atomic_set(&t7xx_dev->md_pm_state, MTK_PM_RESUMED); >>> t7xx_pcie_mac_set_int(t7xx_dev, SAP_RGU_INT); >>> + if (pci_channel_offline(pdev)) { >>> + dev_err(&pdev->dev, "Device offline, reset to recover\n"); >>> + t7xx_reset_device(t7xx_dev, PLDR); >>> + } >>> return ret; >>> } -- Sergey
From: Sergey Ryazanov <ryazanov.s.a@gmail.com> >Hi Jinjian, > >On 29.10.2024 05:46, Jinjian Song wrote: >> From: Sergey Ryazanov <ryazanov.s.a@gmail.com> >>> On 22.10.2024 11:43, Jinjian Song wrote: >>>> If driver fails to set the device to suspend, it means that the >>>> device is abnormal. In this case, reset the device to recover >>>> when PCIe device is offline. >>> >>> Is it a reproducible or a speculative issue? Does the fix recover >>> modem from a problematic state? >>> >>> Anyway we need someone more familiar with this hardware (Intel or >>> MediaTek engineer) to Ack the change to make sure we are not going to >>> put a system in a more complicated state. >> >> Hi Sergey, >> >> This is a very difficult issue to replicate onece occured and fixed. >> >> The issue occured when driver and device lost the connection. I have >> encountered this problem twice so far: >> 1. During suspend/resume stress test, there was a probabilistic D3L2 >> time sequence issue with the BIOS, result in PCIe link down, driver >> read and write the register of device invalid, so suspend failed. >> This issue was eventually fixed in the BIOS and I was able to restore >> it through the reset module after reproducing the problem. >> >> 2. During idle test, the modem probabilistic hang up, result in PCIe >> link down, driver read and write the register of device invalid, so >> suspend failed. This issue was eventually fiex in device modem firmware >> by adjust a certain power supply voltage, and reset modem as a workround >> to restore when the MBIM port command timeout in userspace applycations. >> >> Hardware reset modem to recover was discussed with MTK, and they said >> that if we don't want to keep the on-site problem location in case of >> suspend failure, we can use the recover solution. >> Both the ocurred issues result in the PCIe link issue, driver can't read >> and writer the register of WWAN device, so I want to add this path >> to restore, hardware reset modem can recover modem, but using the >> pci_channle_offline() as the judgment is my inference. > >Thank you for the clarification. Let me summarize what I've understood >from the explanation: >a) there were hardware (firmware) issues, >b) issues already were solved, >c) issues were not directly related to the device suspension procedure, >d) you want to implement a backup plan to make the modem support robust. > >If got it right, then I would like to recommend to implement a generic >error handling solution for the PCIe interface. You can check this >document: Documentation/PCI/pci-error-recovery.rst Hi Sergey, Yes, got it right. I want to identify the scenario and then recover by reset device, otherwise suspend failure will aways prevent the system from suspending if it occurs. >Suddenly, I am not an expert in the PCIe link recovery procedure, so >I've CCed this message to PCI subsystem maintainers. I hope they can >suggest a conceptually correct way to handle these cases. Thanks. >>>> Signed-off-by: Jinjian Song <jinjian.song@fibocom.com> >>>> --- >>>> V2: >>>> * Add judgment, reset when device is offline >>>> --- >>>> drivers/net/wwan/t7xx/t7xx_pci.c | 4 ++++ >>>> 1 file changed, 4 insertions(+) >>>> >>>> diff --git a/drivers/net/wwan/t7xx/t7xx_pci.c b/drivers/net/wwan/ >>>> t7xx/t7xx_pci.c >>>> index e556e5bd49ab..4f89a353588b 100644 >>>> --- a/drivers/net/wwan/t7xx/t7xx_pci.c >>>> +++ b/drivers/net/wwan/t7xx/t7xx_pci.c >>>> @@ -427,6 +427,10 @@ static int __t7xx_pci_pm_suspend(struct pci_dev >>>> *pdev) >>>> iowrite32(T7XX_L1_BIT(0), IREG_BASE(t7xx_dev) + >>>> ENABLE_ASPM_LOWPWR); >>>> atomic_set(&t7xx_dev->md_pm_state, MTK_PM_RESUMED); >>>> t7xx_pcie_mac_set_int(t7xx_dev, SAP_RGU_INT); >>>> + if (pci_channel_offline(pdev)) { >>>> + dev_err(&pdev->dev, "Device offline, reset to recover\n"); >>>> + t7xx_reset_device(t7xx_dev, PLDR); >>>> + } >>>> return ret; >>>> } > >-- >Sergey > Best Regards, Jinjian
On Thu, Oct 31, 2024 at 09:09:30PM +0800, Jinjian Song wrote: > From: Sergey Ryazanov <ryazanov.s.a@gmail.com> > > On 29.10.2024 05:46, Jinjian Song wrote: > > > From: Sergey Ryazanov <ryazanov.s.a@gmail.com> > > > > On 22.10.2024 11:43, Jinjian Song wrote: > > > > > If driver fails to set the device to suspend, it means that the > > > > > device is abnormal. In this case, reset the device to recover > > > > > when PCIe device is offline. > > > > > > > > Is it a reproducible or a speculative issue? Does the fix > > > > recover modem from a problematic state? > > > > > > > > Anyway we need someone more familiar with this hardware (Intel > > > > or MediaTek engineer) to Ack the change to make sure we are not > > > > going to put a system in a more complicated state. > > > > > > This is a very difficult issue to replicate onece occured and fixed. > > > > > > The issue occured when driver and device lost the connection. I have > > > encountered this problem twice so far: > > > 1. During suspend/resume stress test, there was a probabilistic D3L2 > > > time sequence issue with the BIOS, result in PCIe link down, driver > > > read and write the register of device invalid, so suspend failed. > > > This issue was eventually fixed in the BIOS and I was able to restore > > > it through the reset module after reproducing the problem. > > > > > > 2. During idle test, the modem probabilistic hang up, result in PCIe > > > link down, driver read and write the register of device invalid, so > > > suspend failed. This issue was eventually fiex in device modem firmware > > > by adjust a certain power supply voltage, and reset modem as a workround > > > to restore when the MBIM port command timeout in userspace applycations. > > > > > > Hardware reset modem to recover was discussed with MTK, and they said > > > that if we don't want to keep the on-site problem location in case of > > > suspend failure, we can use the recover solution. > > > Both the ocurred issues result in the PCIe link issue, driver can't > > > read and writer the register of WWAN device, so I want to add this > > > path > > > to restore, hardware reset modem can recover modem, but using the > > > pci_channle_offline() as the judgment is my inference. > > > > Thank you for the clarification. Let me summarize what I've understood > > from the explanation: > > a) there were hardware (firmware) issues, > > b) issues already were solved, > > c) issues were not directly related to the device suspension procedure, > > d) you want to implement a backup plan to make the modem support robust. > > > > If got it right, then I would like to recommend to implement a generic > > error handling solution for the PCIe interface. You can check this > > document: Documentation/PCI/pci-error-recovery.rst > > Yes, got it right. > I want to identify the scenario and then recover by reset device, > otherwise suspend failure will aways prevent the system from suspending > if it occurs. If a PCIe link goes down here's my understanding of what happens: - Writes to the device are silently dropped. - Reads from the device return ~0 (PCI_POSSIBLE_ERROR()). - If the device is in a slot and pciehp is enabled, a Data Link Layer State Changed interrupt will cause pciehp_unconfigure_device() to detach the driver and remove the pci_dev. - If AER is enabled, a failed access to the device will cause an AER interrupt. If the driver has registered pci_error_handlers, the driver callbacks will be called, and based on what the driver returns, the PCI core may reset the device. The pciehp and AER interrupts are *asynchronous* to link down events and to any driver access to the device, so they may be delayed an arbitrary amount of time. Both interrupt paths may lead to the device being marked as "offline". Obviously this is asynchronous with respect to the driver. > > > > > +++ b/drivers/net/wwan/t7xx/t7xx_pci.c > > > > > @@ -427,6 +427,10 @@ static int __t7xx_pci_pm_suspend(struct > > > > > pci_dev *pdev) > > > > > iowrite32(T7XX_L1_BIT(0), IREG_BASE(t7xx_dev) + > > > > > ENABLE_ASPM_LOWPWR); > > > > > atomic_set(&t7xx_dev->md_pm_state, MTK_PM_RESUMED); > > > > > t7xx_pcie_mac_set_int(t7xx_dev, SAP_RGU_INT); > > > > > + if (pci_channel_offline(pdev)) { > > > > > + dev_err(&pdev->dev, "Device offline, reset to recover\n"); > > > > > + t7xx_reset_device(t7xx_dev, PLDR); > > > > > + } This looks like an unreliable way to detect issues. It only works if AER is enabled, and the device is only marked "offline" some arbitrary length of time *after* a driver access to the device has failed. You can't reliably detect errors on writes to the device. You can only reliably detect errors on reads from the device by looking for PCI_POSSIBLE_ERROR(). Obviously ~0 might be a valid value to read from some registers, so you need device-specific knowledge to know whether ~0 is valid or indicates an error. If AER or DPC are enabled, the driver can be *notified* about read errors and some write errors via pci_error_handlers, but the notification is long after the error. Bjorn
Top-post reply. What Bjorn says is exactly right. Just one clarifying remark: When PCI error recovery was designed, it was envisioned for high-end, high-availability servers so that they could reset a failing device without forcing a reboot of the OS. Concepts like suspend/resume did not perturb even the unconscious sleep of the engineers. Thus, stepping through error recovery during suspend would be novel, untested, and perhaps prone to confusion. The error recovery procedure is trying to reset the device into a fully-powered-on, fully-functional and connected state. This might be problematic if the suspend has already walked half-way through power-down. The fact that error recovery might run a very long fraction of a second after the error is detected further complicates things. The fact that error recovery usually has the driver fully reinitializing the device, including reloading the firmware, could be a problem if the firmware is not in RAM or is sitting on a suspended block device. So it all could be an adventure. As to testing: I asked one of the guys in the lab how he did it, and he said he would brush a metal wire against the PCI pins. I winced. But I guess it's cleaner than pouring coffee on it. Later we developed a software tool to artificially inject errors. --linas On Fri, Nov 1, 2024 at 8:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Thu, Oct 31, 2024 at 09:09:30PM +0800, Jinjian Song wrote: > > From: Sergey Ryazanov <ryazanov.s.a@gmail.com> > > > On 29.10.2024 05:46, Jinjian Song wrote: > > > > From: Sergey Ryazanov <ryazanov.s.a@gmail.com> > > > > > On 22.10.2024 11:43, Jinjian Song wrote: > > > > > > If driver fails to set the device to suspend, it means that the > > > > > > device is abnormal. In this case, reset the device to recover > > > > > > when PCIe device is offline. > > > > > > > > > > Is it a reproducible or a speculative issue? Does the fix > > > > > recover modem from a problematic state? > > > > > > > > > > Anyway we need someone more familiar with this hardware (Intel > > > > > or MediaTek engineer) to Ack the change to make sure we are not > > > > > going to put a system in a more complicated state. > > > > > > > > This is a very difficult issue to replicate onece occured and fixed. > > > > > > > > The issue occured when driver and device lost the connection. I have > > > > encountered this problem twice so far: > > > > 1. During suspend/resume stress test, there was a probabilistic D3L2 > > > > time sequence issue with the BIOS, result in PCIe link down, driver > > > > read and write the register of device invalid, so suspend failed. > > > > This issue was eventually fixed in the BIOS and I was able to restore > > > > it through the reset module after reproducing the problem. > > > > > > > > 2. During idle test, the modem probabilistic hang up, result in PCIe > > > > link down, driver read and write the register of device invalid, so > > > > suspend failed. This issue was eventually fiex in device modem firmware > > > > by adjust a certain power supply voltage, and reset modem as a workround > > > > to restore when the MBIM port command timeout in userspace applycations. > > > > > > > > Hardware reset modem to recover was discussed with MTK, and they said > > > > that if we don't want to keep the on-site problem location in case of > > > > suspend failure, we can use the recover solution. > > > > Both the ocurred issues result in the PCIe link issue, driver can't > > > > read and writer the register of WWAN device, so I want to add this > > > > path > > > > to restore, hardware reset modem can recover modem, but using the > > > > pci_channle_offline() as the judgment is my inference. > > > > > > Thank you for the clarification. Let me summarize what I've understood > > > from the explanation: > > > a) there were hardware (firmware) issues, > > > b) issues already were solved, > > > c) issues were not directly related to the device suspension procedure, > > > d) you want to implement a backup plan to make the modem support robust. > > > > > > If got it right, then I would like to recommend to implement a generic > > > error handling solution for the PCIe interface. You can check this > > > document: Documentation/PCI/pci-error-recovery.rst > > > > Yes, got it right. > > I want to identify the scenario and then recover by reset device, > > otherwise suspend failure will aways prevent the system from suspending > > if it occurs. > > If a PCIe link goes down here's my understanding of what happens: > > - Writes to the device are silently dropped. > > - Reads from the device return ~0 (PCI_POSSIBLE_ERROR()). > > - If the device is in a slot and pciehp is enabled, a Data Link > Layer State Changed interrupt will cause pciehp_unconfigure_device() > to detach the driver and remove the pci_dev. > > - If AER is enabled, a failed access to the device will cause an AER > interrupt. If the driver has registered pci_error_handlers, the > driver callbacks will be called, and based on what the driver > returns, the PCI core may reset the device. > > The pciehp and AER interrupts are *asynchronous* to link down events > and to any driver access to the device, so they may be delayed an > arbitrary amount of time. > > Both interrupt paths may lead to the device being marked as "offline". > Obviously this is asynchronous with respect to the driver. > > > > > > > +++ b/drivers/net/wwan/t7xx/t7xx_pci.c > > > > > > @@ -427,6 +427,10 @@ static int __t7xx_pci_pm_suspend(struct > > > > > > pci_dev *pdev) > > > > > > iowrite32(T7XX_L1_BIT(0), IREG_BASE(t7xx_dev) + > > > > > > ENABLE_ASPM_LOWPWR); > > > > > > atomic_set(&t7xx_dev->md_pm_state, MTK_PM_RESUMED); > > > > > > t7xx_pcie_mac_set_int(t7xx_dev, SAP_RGU_INT); > > > > > > + if (pci_channel_offline(pdev)) { > > > > > > + dev_err(&pdev->dev, "Device offline, reset to recover\n"); > > > > > > + t7xx_reset_device(t7xx_dev, PLDR); > > > > > > + } > > This looks like an unreliable way to detect issues. It only works if > AER is enabled, and the device is only marked "offline" some arbitrary > length of time *after* a driver access to the device has failed. > > You can't reliably detect errors on writes to the device. > > You can only reliably detect errors on reads from the device by > looking for PCI_POSSIBLE_ERROR(). Obviously ~0 might be a valid value > to read from some registers, so you need device-specific knowledge to > know whether ~0 is valid or indicates an error. > > If AER or DPC are enabled, the driver can be *notified* about read > errors and some write errors via pci_error_handlers, but the > notification is long after the error. > > Bjorn -- Patrick: Are they laughing at us? Sponge Bob: No, Patrick, they are laughing next to us.
Hello Bjorn, Linas, many thanks for clarifying this tricky topic! Hope this information can serve a good starting point for Jinjian to develop a proper solution for that case. On 03.11.2024 03:24, Linas Vepstas wrote: > Top-post reply. > > What Bjorn says is exactly right. Just one clarifying remark: When PCI > error recovery was designed, it was envisioned for high-end, > high-availability servers so that they could reset a failing device > without forcing a reboot of the OS. Concepts like suspend/resume did > not perturb even the unconscious sleep of the engineers. Thus, > stepping through error recovery during suspend would be novel, > untested, and perhaps prone to confusion. The error recovery procedure > is trying to reset the device into a fully-powered-on, > fully-functional and connected state. This might be problematic if the > suspend has already walked half-way through power-down. The fact that > error recovery might run a very long fraction of a second after the > error is detected further complicates things. The fact that error > recovery usually has the driver fully reinitializing the device, > including reloading the firmware, could be a problem if the firmware > is not in RAM or is sitting on a suspended block device. So it all > could be an adventure. > > As to testing: I asked one of the guys in the lab how he did it, and > he said he would brush a metal wire against the PCI pins. I winced. > But I guess it's cleaner than pouring coffee on it. Later we developed > a software tool to artificially inject errors. > > On Fri, Nov 1, 2024 at 8:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote: >> On Thu, Oct 31, 2024 at 09:09:30PM +0800, Jinjian Song wrote: >>> From: Sergey Ryazanov <ryazanov.s.a@gmail.com> >>>> On 29.10.2024 05:46, Jinjian Song wrote: >>>>> From: Sergey Ryazanov <ryazanov.s.a@gmail.com> >>>>>> On 22.10.2024 11:43, Jinjian Song wrote: >>>>>>> If driver fails to set the device to suspend, it means that the >>>>>>> device is abnormal. In this case, reset the device to recover >>>>>>> when PCIe device is offline. >>>>>> >>>>>> Is it a reproducible or a speculative issue? Does the fix >>>>>> recover modem from a problematic state? >>>>>> >>>>>> Anyway we need someone more familiar with this hardware (Intel >>>>>> or MediaTek engineer) to Ack the change to make sure we are not >>>>>> going to put a system in a more complicated state. >>>>> >>>>> This is a very difficult issue to replicate onece occured and fixed. >>>>> >>>>> The issue occured when driver and device lost the connection. I have >>>>> encountered this problem twice so far: >>>>> 1. During suspend/resume stress test, there was a probabilistic D3L2 >>>>> time sequence issue with the BIOS, result in PCIe link down, driver >>>>> read and write the register of device invalid, so suspend failed. >>>>> This issue was eventually fixed in the BIOS and I was able to restore >>>>> it through the reset module after reproducing the problem. >>>>> >>>>> 2. During idle test, the modem probabilistic hang up, result in PCIe >>>>> link down, driver read and write the register of device invalid, so >>>>> suspend failed. This issue was eventually fiex in device modem firmware >>>>> by adjust a certain power supply voltage, and reset modem as a workround >>>>> to restore when the MBIM port command timeout in userspace applycations. >>>>> >>>>> Hardware reset modem to recover was discussed with MTK, and they said >>>>> that if we don't want to keep the on-site problem location in case of >>>>> suspend failure, we can use the recover solution. >>>>> Both the ocurred issues result in the PCIe link issue, driver can't >>>>> read and writer the register of WWAN device, so I want to add this >>>>> path >>>>> to restore, hardware reset modem can recover modem, but using the >>>>> pci_channle_offline() as the judgment is my inference. >>>> >>>> Thank you for the clarification. Let me summarize what I've understood >>>> from the explanation: >>>> a) there were hardware (firmware) issues, >>>> b) issues already were solved, >>>> c) issues were not directly related to the device suspension procedure, >>>> d) you want to implement a backup plan to make the modem support robust. >>>> >>>> If got it right, then I would like to recommend to implement a generic >>>> error handling solution for the PCIe interface. You can check this >>>> document: Documentation/PCI/pci-error-recovery.rst >>> >>> Yes, got it right. >>> I want to identify the scenario and then recover by reset device, >>> otherwise suspend failure will aways prevent the system from suspending >>> if it occurs. >> >> If a PCIe link goes down here's my understanding of what happens: >> >> - Writes to the device are silently dropped. >> >> - Reads from the device return ~0 (PCI_POSSIBLE_ERROR()). >> >> - If the device is in a slot and pciehp is enabled, a Data Link >> Layer State Changed interrupt will cause pciehp_unconfigure_device() >> to detach the driver and remove the pci_dev. >> >> - If AER is enabled, a failed access to the device will cause an AER >> interrupt. If the driver has registered pci_error_handlers, the >> driver callbacks will be called, and based on what the driver >> returns, the PCI core may reset the device. >> >> The pciehp and AER interrupts are *asynchronous* to link down events >> and to any driver access to the device, so they may be delayed an >> arbitrary amount of time. >> >> Both interrupt paths may lead to the device being marked as "offline". >> Obviously this is asynchronous with respect to the driver. >> >>>>>>> +++ b/drivers/net/wwan/t7xx/t7xx_pci.c >>>>>>> @@ -427,6 +427,10 @@ static int __t7xx_pci_pm_suspend(struct >>>>>>> pci_dev *pdev) >>>>>>> iowrite32(T7XX_L1_BIT(0), IREG_BASE(t7xx_dev) + >>>>>>> ENABLE_ASPM_LOWPWR); >>>>>>> atomic_set(&t7xx_dev->md_pm_state, MTK_PM_RESUMED); >>>>>>> t7xx_pcie_mac_set_int(t7xx_dev, SAP_RGU_INT); >>>>>>> + if (pci_channel_offline(pdev)) { >>>>>>> + dev_err(&pdev->dev, "Device offline, reset to recover\n"); >>>>>>> + t7xx_reset_device(t7xx_dev, PLDR); >>>>>>> + } >> >> This looks like an unreliable way to detect issues. It only works if >> AER is enabled, and the device is only marked "offline" some arbitrary >> length of time *after* a driver access to the device has failed. >> >> You can't reliably detect errors on writes to the device. >> >> You can only reliably detect errors on reads from the device by >> looking for PCI_POSSIBLE_ERROR(). Obviously ~0 might be a valid value >> to read from some registers, so you need device-specific knowledge to >> know whether ~0 is valid or indicates an error. >> >> If AER or DPC are enabled, the driver can be *notified* about read >> errors and some write errors via pci_error_handlers, but the >> notification is long after the error. -- Sergey
© 2016 - 2024 Red Hat, Inc.