[PATCH v1] PCI/AER: Avoid power state transition during system suspend

Raag Jadav posted 1 patch 1 day, 3 hours ago
drivers/pci/pcie/aer.c | 6 ++++++
1 file changed, 6 insertions(+)
[PATCH v1] PCI/AER: Avoid power state transition during system suspend
Posted by Raag Jadav 1 day, 3 hours ago
If an error is triggered while system suspend is in progress, any bus
level power state transition will result in unpredictable error handling.
Mark skip_bus_pm flag as true to avoid this.

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---

Ideally we'd want to defer recovery until system resume, but this is
good enough to prevent device suspend.

More discussion at [1].
[1] https://lore.kernel.org/r/Z-38rPeN_j7YGiEl@black.fi.intel.com

 drivers/pci/pcie/aer.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 508474e17183..5acf4efc2df3 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1108,6 +1108,12 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 
 static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
 {
+	/*
+	 * Avoid any power state transition if an error is triggered during
+	 * system suspend.
+	 */
+	dev->skip_bus_pm = true;
+
 	cxl_rch_handle_error(dev, info);
 	pci_aer_handle_error(dev, info);
 	pci_dev_put(dev);
-- 
2.34.1
Re: [PATCH v1] PCI/AER: Avoid power state transition during system suspend
Posted by Lukas Wunner 8 hours ago
On Thu, Apr 03, 2025 at 01:14:25PM +0530, Raag Jadav wrote:
> If an error is triggered while system suspend is in progress, any bus
> level power state transition will result in unpredictable error handling.
> Mark skip_bus_pm flag as true to avoid this.
[...]
> Ideally we'd want to defer recovery until system resume, but this is
> good enough to prevent device suspend.

if (system_state == SYSTEM_SUSPEND)

... tells you whether the system is suspending, so you could catch that
in the error recovery code.

Suspend to ACPI state S3 or S4 shouldn't need error recovery through reset
upon resume because devices are generally reset by BIOS on resume anyway.

Thanks,

Lukas
Re: [PATCH v1] PCI/AER: Avoid power state transition during system suspend
Posted by Raag Jadav 6 hours ago
On Fri, Apr 04, 2025 at 05:08:45AM +0200, Lukas Wunner wrote:
> On Thu, Apr 03, 2025 at 01:14:25PM +0530, Raag Jadav wrote:
> > If an error is triggered while system suspend is in progress, any bus
> > level power state transition will result in unpredictable error handling.
> > Mark skip_bus_pm flag as true to avoid this.
> [...]
> > Ideally we'd want to defer recovery until system resume, but this is
> > good enough to prevent device suspend.
> 
> if (system_state == SYSTEM_SUSPEND)
> 
> ... tells you whether the system is suspending, so you could catch that
> in the error recovery code.

Even if we catch it, what'd be the expectation with it?
Do we can simply ignore the error because of system state?

I'm assuming deferring will require a fair bit of revamp (and I'm
not sure if I'm qualified for it).

> Suspend to ACPI state S3 or S4 shouldn't need error recovery through reset
> upon resume because devices are generally reset by BIOS on resume anyway.

Thanks for your input. We have s2idle usecase as well.

So the question here is whether we should allow suspending the device
with errors at all (atleast until successful recovery). Wouldn't the
device resume be unpredictable because of it?

Raag
Re: [PATCH v1] PCI/AER: Avoid power state transition during system suspend
Posted by Rafael J. Wysocki 16 hours ago
On Thu, Apr 3, 2025 at 9:45 AM Raag Jadav <raag.jadav@intel.com> wrote:
>
> If an error is triggered while system suspend is in progress, any bus
> level power state transition will result in unpredictable error handling.
> Mark skip_bus_pm flag as true to avoid this.

This needs to be synchronized with the skip_bus_pm clearing in pci_pm_suspend().

Also, skip_bus_pm is only used in the _noirq phases, so if a driver
calls pci_set_power_state() from its ->suspend() callback, this change
won't help.

I think that you're aware of it, but the changelog should mention this
limitation.

> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> ---
>
> Ideally we'd want to defer recovery until system resume, but this is
> good enough to prevent device suspend.
>
> More discussion at [1].
> [1] https://lore.kernel.org/r/Z-38rPeN_j7YGiEl@black.fi.intel.com
>
>  drivers/pci/pcie/aer.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 508474e17183..5acf4efc2df3 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1108,6 +1108,12 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>
>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>  {
> +       /*
> +        * Avoid any power state transition if an error is triggered during
> +        * system suspend.
> +        */
> +       dev->skip_bus_pm = true;
> +
>         cxl_rch_handle_error(dev, info);
>         pci_aer_handle_error(dev, info);
>         pci_dev_put(dev);
> --
Re: [PATCH v1] PCI/AER: Avoid power state transition during system suspend
Posted by Raag Jadav 6 hours ago
On Thu, Apr 03, 2025 at 08:35:45PM +0200, Rafael J. Wysocki wrote:
> On Thu, Apr 3, 2025 at 9:45 AM Raag Jadav <raag.jadav@intel.com> wrote:
> >
> > If an error is triggered while system suspend is in progress, any bus
> > level power state transition will result in unpredictable error handling.
> > Mark skip_bus_pm flag as true to avoid this.
> 
> This needs to be synchronized with the skip_bus_pm clearing in pci_pm_suspend().

I'm wondering if we can have something like aer_in_progress flag
that we can use in PCI PM for this? ...

> Also, skip_bus_pm is only used in the _noirq phases, so if a driver
> calls pci_set_power_state() from its ->suspend() callback, this change
> won't help.

... and perhaps skip these if it is set?

Raag
Re: [PATCH v1] PCI/AER: Avoid power state transition during system suspend
Posted by Raag Jadav 17 hours ago
On Thu, Apr 03, 2025 at 01:14:25PM +0530, Raag Jadav wrote:
> If an error is triggered while system suspend is in progress, any bus
> level power state transition will result in unpredictable error handling.
> Mark skip_bus_pm flag as true to avoid this.
> 
> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> ---
> 
> Ideally we'd want to defer recovery until system resume, but this is
> good enough to prevent device suspend.
> 
> More discussion at [1].
> [1] https://lore.kernel.org/r/Z-38rPeN_j7YGiEl@black.fi.intel.com

I just realized I messed up the link, please use [2] instead.
Apologies for the inconvenience.

[2] https://lore.kernel.org/all/CAJZ5v0g-aJXfVH+Uc=9eRPuW08t-6PwzdyMXsC6FZRKYJtY03Q@mail.gmail.com

>  drivers/pci/pcie/aer.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 508474e17183..5acf4efc2df3 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1108,6 +1108,12 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  
>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>  {
> +	/*
> +	 * Avoid any power state transition if an error is triggered during
> +	 * system suspend.
> +	 */
> +	dev->skip_bus_pm = true;
> +
>  	cxl_rch_handle_error(dev, info);
>  	pci_aer_handle_error(dev, info);
>  	pci_dev_put(dev);
> -- 
> 2.34.1
>