[PATCH] PCI/sysfs: Ensure devices are powered for config reads

Brian Norris posted 1 patch 1 month, 2 weeks ago
There is a newer version of this series
[PATCH] PCI/sysfs: Ensure devices are powered for config reads
Posted by Brian Norris 1 month, 2 weeks ago
From: Brian Norris <briannorris@google.com>

max_link_speed, max_link_width, current_link_speed, current_link_width,
secondary_bus_number, and subordinate_bus_number all access config
registers, but they don't check the runtime PM state. If the device is
in D3cold, we may see -EINVAL or even bogus values.

Wrap these access in pci_config_pm_runtime_{get,put}() like most of the
rest of the similar sysfs attributes.

Fixes: 56c1af4606f0 ("PCI: Add sysfs max_link_speed/width, current_link_speed/width, etc")
Cc: stable@vger.kernel.org
Signed-off-by: Brian Norris <briannorris@google.com>
Signed-off-by: Brian Norris <briannorris@chromium.org>
---

 drivers/pci/pci-sysfs.c | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 5eea14c1f7f5..160df897dc5e 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -191,9 +191,16 @@ static ssize_t max_link_speed_show(struct device *dev,
 				   struct device_attribute *attr, char *buf)
 {
 	struct pci_dev *pdev = to_pci_dev(dev);
+	ssize_t ret;
+
+	pci_config_pm_runtime_get(pdev);
 
-	return sysfs_emit(buf, "%s\n",
-			  pci_speed_string(pcie_get_speed_cap(pdev)));
+	ret = sysfs_emit(buf, "%s\n",
+			 pci_speed_string(pcie_get_speed_cap(pdev)));
+
+	pci_config_pm_runtime_put(pdev);
+
+	return ret;
 }
 static DEVICE_ATTR_RO(max_link_speed);
 
@@ -201,8 +208,15 @@ static ssize_t max_link_width_show(struct device *dev,
 				   struct device_attribute *attr, char *buf)
 {
 	struct pci_dev *pdev = to_pci_dev(dev);
+	ssize_t ret;
+
+	pci_config_pm_runtime_get(pdev);
+
+	ret = sysfs_emit(buf, "%u\n", pcie_get_width_cap(pdev));
 
-	return sysfs_emit(buf, "%u\n", pcie_get_width_cap(pdev));
+	pci_config_pm_runtime_put(pdev);
+
+	return ret;
 }
 static DEVICE_ATTR_RO(max_link_width);
 
@@ -214,7 +228,10 @@ static ssize_t current_link_speed_show(struct device *dev,
 	int err;
 	enum pci_bus_speed speed;
 
+	pci_config_pm_runtime_get(pci_dev);
 	err = pcie_capability_read_word(pci_dev, PCI_EXP_LNKSTA, &linkstat);
+	pci_config_pm_runtime_put(pci_dev);
+
 	if (err)
 		return -EINVAL;
 
@@ -231,7 +248,10 @@ static ssize_t current_link_width_show(struct device *dev,
 	u16 linkstat;
 	int err;
 
+	pci_config_pm_runtime_get(pci_dev);
 	err = pcie_capability_read_word(pci_dev, PCI_EXP_LNKSTA, &linkstat);
+	pci_config_pm_runtime_put(pci_dev);
+
 	if (err)
 		return -EINVAL;
 
@@ -247,7 +267,10 @@ static ssize_t secondary_bus_number_show(struct device *dev,
 	u8 sec_bus;
 	int err;
 
+	pci_config_pm_runtime_get(pci_dev);
 	err = pci_read_config_byte(pci_dev, PCI_SECONDARY_BUS, &sec_bus);
+	pci_config_pm_runtime_put(pci_dev);
+
 	if (err)
 		return -EINVAL;
 
@@ -263,7 +286,10 @@ static ssize_t subordinate_bus_number_show(struct device *dev,
 	u8 sub_bus;
 	int err;
 
+	pci_config_pm_runtime_get(pci_dev);
 	err = pci_read_config_byte(pci_dev, PCI_SUBORDINATE_BUS, &sub_bus);
+	pci_config_pm_runtime_put(pci_dev);
+
 	if (err)
 		return -EINVAL;
 
-- 
2.51.0.rc1.193.gad69d77794-goog
Re: [PATCH] PCI/sysfs: Ensure devices are powered for config reads
Posted by Bjorn Helgaas 1 week, 3 days ago
[+cc Ethan, Andrey]

On Wed, Aug 20, 2025 at 10:26:08AM -0700, Brian Norris wrote:
> From: Brian Norris <briannorris@google.com>
> 
> max_link_speed, max_link_width, current_link_speed, current_link_width,
> secondary_bus_number, and subordinate_bus_number all access config
> registers, but they don't check the runtime PM state. If the device is
> in D3cold, we may see -EINVAL or even bogus values.
> 
> Wrap these access in pci_config_pm_runtime_{get,put}() like most of the
> rest of the similar sysfs attributes.

Protecting the config reads seems right to me.

If the device is in D3cold, a config read will result in a Completion
Timeout.  On most x86 platforms that's "fine" and merely results in ~0
data.  But that's merely convention, not a PCIe spec requirement.

I think it's a potential issue with PCIe controllers used on arm64 and
might result in an SError or synchronous abort from which we don't
recover well.  I'd love to hear actual experience about how reading
"current_link_speed" works on a device in D3cold in an arm64 system.

As Ethan and Andrey pointed out, we could skip max_link_speed_show()
because pcie_get_speed_cap() already uses a cached value and doesn't
do a config access.

max_link_width_show() is similar and also comes from PCI_EXP_LNKCAP
but is not currently cached, so I think we do need that one.  Worth a
comment to explain the non-obvious difference.

PCI_EXP_LNKCAP is ostensibly read-only and could conceivably be
cached, but the ASPM exit latencies can change based on the Common
Clock Configuration.

> Fixes: 56c1af4606f0 ("PCI: Add sysfs max_link_speed/width, current_link_speed/width, etc")
> Cc: stable@vger.kernel.org
> Signed-off-by: Brian Norris <briannorris@google.com>
> Signed-off-by: Brian Norris <briannorris@chromium.org>
> ---
> 
>  drivers/pci/pci-sysfs.c | 32 +++++++++++++++++++++++++++++---
>  1 file changed, 29 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index 5eea14c1f7f5..160df897dc5e 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -191,9 +191,16 @@ static ssize_t max_link_speed_show(struct device *dev,
>  				   struct device_attribute *attr, char *buf)
>  {
>  	struct pci_dev *pdev = to_pci_dev(dev);
> +	ssize_t ret;
> +
> +	pci_config_pm_runtime_get(pdev);
>  
> -	return sysfs_emit(buf, "%s\n",
> -			  pci_speed_string(pcie_get_speed_cap(pdev)));
> +	ret = sysfs_emit(buf, "%s\n",
> +			 pci_speed_string(pcie_get_speed_cap(pdev)));
> +
> +	pci_config_pm_runtime_put(pdev);
> +
> +	return ret;
>  }
>  static DEVICE_ATTR_RO(max_link_speed);
>  
> @@ -201,8 +208,15 @@ static ssize_t max_link_width_show(struct device *dev,
>  				   struct device_attribute *attr, char *buf)
>  {
>  	struct pci_dev *pdev = to_pci_dev(dev);
> +	ssize_t ret;
> +
> +	pci_config_pm_runtime_get(pdev);
> +
> +	ret = sysfs_emit(buf, "%u\n", pcie_get_width_cap(pdev));
>  
> -	return sysfs_emit(buf, "%u\n", pcie_get_width_cap(pdev));
> +	pci_config_pm_runtime_put(pdev);
> +
> +	return ret;
>  }
>  static DEVICE_ATTR_RO(max_link_width);
>  
> @@ -214,7 +228,10 @@ static ssize_t current_link_speed_show(struct device *dev,
>  	int err;
>  	enum pci_bus_speed speed;
>  
> +	pci_config_pm_runtime_get(pci_dev);
>  	err = pcie_capability_read_word(pci_dev, PCI_EXP_LNKSTA, &linkstat);
> +	pci_config_pm_runtime_put(pci_dev);
> +
>  	if (err)
>  		return -EINVAL;
>  
> @@ -231,7 +248,10 @@ static ssize_t current_link_width_show(struct device *dev,
>  	u16 linkstat;
>  	int err;
>  
> +	pci_config_pm_runtime_get(pci_dev);
>  	err = pcie_capability_read_word(pci_dev, PCI_EXP_LNKSTA, &linkstat);
> +	pci_config_pm_runtime_put(pci_dev);
> +
>  	if (err)
>  		return -EINVAL;
>  
> @@ -247,7 +267,10 @@ static ssize_t secondary_bus_number_show(struct device *dev,
>  	u8 sec_bus;
>  	int err;
>  
> +	pci_config_pm_runtime_get(pci_dev);
>  	err = pci_read_config_byte(pci_dev, PCI_SECONDARY_BUS, &sec_bus);
> +	pci_config_pm_runtime_put(pci_dev);
> +
>  	if (err)
>  		return -EINVAL;
>  
> @@ -263,7 +286,10 @@ static ssize_t subordinate_bus_number_show(struct device *dev,
>  	u8 sub_bus;
>  	int err;
>  
> +	pci_config_pm_runtime_get(pci_dev);
>  	err = pci_read_config_byte(pci_dev, PCI_SUBORDINATE_BUS, &sub_bus);
> +	pci_config_pm_runtime_put(pci_dev);
> +
>  	if (err)
>  		return -EINVAL;
>  
> -- 
> 2.51.0.rc1.193.gad69d77794-goog
>
Re: [PATCH] PCI/sysfs: Ensure devices are powered for config reads
Posted by Brian Norris 1 week, 3 days ago
Hi Bjorn,

On Tue, Sep 23, 2025 at 02:02:31PM -0500, Bjorn Helgaas wrote:
> [+cc Ethan, Andrey]
> 
> On Wed, Aug 20, 2025 at 10:26:08AM -0700, Brian Norris wrote:
> > From: Brian Norris <briannorris@google.com>
> > 
> > max_link_speed, max_link_width, current_link_speed, current_link_width,
> > secondary_bus_number, and subordinate_bus_number all access config
> > registers, but they don't check the runtime PM state. If the device is
> > in D3cold, we may see -EINVAL or even bogus values.
> > 
> > Wrap these access in pci_config_pm_runtime_{get,put}() like most of the
> > rest of the similar sysfs attributes.
> 
> Protecting the config reads seems right to me.
> 
> If the device is in D3cold, a config read will result in a Completion
> Timeout.  On most x86 platforms that's "fine" and merely results in ~0
> data.  But that's merely convention, not a PCIe spec requirement.
> 
> I think it's a potential issue with PCIe controllers used on arm64 and
> might result in an SError or synchronous abort from which we don't
> recover well.  I'd love to hear actual experience about how reading
> "current_link_speed" works on a device in D3cold in an arm64 system.

I'm working on a few such arm64 systems :) (pcie-qcom Chromebooks, and
non-upstream DWC-based Pixel phones; I have a little more knowledge of
the latter.) The answers may vary by SoC, and especially by PCIe
implementation. ARM SoCs are notoriously ... diverse.

To my knowledge, it can be several of the above on arm64 + DWC.

* pci_generic_config_read() -> pci_ops::map_bus() may return NULL, in
  which case we get PCIBIOS_DEVICE_NOT_FOUND / -EINVAL. And specifically
  on arm64 with DWC PCIe, dw_pcie_other_conf_map_bus() may see the link
  down on a suspended bridge and return NULL.

* The map_bus() check is admittedly racy, so we might still *actually*
  hit the hardware, at which point this gets even more
  implementation-defined:

  (a) if the PCIe HW is not powered (for example, if we put the link to
      L3 and fully powered off the controller to save power), we might
      not even get a completion timeout, and it depends on how the
      SoC is wired up. But I believe this tends to be SError, and a
      crash.

  (b) if the PCIe HW is powered but something else is down (e.g., link
      in L2, device in D3cold, etc.), we'll get a Completion Timeout,
      and a ~0 response. I also was under the impression a ~0 response
      is not spec-mandated, but I believe it's noted in the Synopsys
      documentation.
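
The outcomes listed above can be modeled in a few lines of plain C. This
is only a sketch under stated assumptions: the model_* names and the
MODEL_NOT_FOUND code are hypothetical stand-ins, not kernel APIs, and
case (a), an SError, is not modeled because it leaves no return value to
inspect.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the config read outcomes; not kernel code. */
enum model_state {
	MODEL_ACTIVE,		/* device powered, link up */
	MODEL_LINK_DOWN_SEEN,	/* map_bus() notices the link is down */
	MODEL_LINK_DOWN_RACED,	/* link drops after the map_bus() check */
};

#define MODEL_OK	0
#define MODEL_NOT_FOUND	(-1)	/* stands in for PCIBIOS_DEVICE_NOT_FOUND */

struct model_dev {
	enum model_state state;
	uint16_t lnksta;	/* value the hardware returns when active */
};

/* Stand-in for pci_ops::map_bus(): NULL means "refuse the access". */
static const uint16_t *model_map_bus(const struct model_dev *dev)
{
	if (dev->state == MODEL_LINK_DOWN_SEEN)
		return NULL;
	return &dev->lnksta;
}

static int model_read_lnksta(const struct model_dev *dev, uint16_t *val)
{
	const uint16_t *reg = model_map_bus(dev);

	if (!reg) {
		*val = 0xffff;
		return MODEL_NOT_FOUND;
	}

	/* Case (b): the access reaches hardware, times out, reads as ~0. */
	*val = (dev->state == MODEL_LINK_DOWN_RACED) ? 0xffff : *reg;
	return MODEL_OK;
}

/* An all-ones value may be an error response, not register contents. */
static int model_possible_error(uint16_t val)
{
	return val == 0xffff;
}
```

In this model the raced case returns success, so a caller that only
checks the error code would go on to decode 0xffff as if it were
register contents.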

NB: I'm not sure there is really great upstream support for arm64 +
D3cold yet. If they're not using ACPI (as few arm64 systems do), they
probably don't have the appropriate platform_pci_* hooks to really
manage it properly. There have been some prior attempts at adding
non-x86/ACPI hooks for this, although for different reasons:

    https://lore.kernel.org/linux-pci/a38e76d6f3a90d7c968c32cee97604f3c41cbccf.camel@mediatek.com/
    [PATCH] PCI:PM: Support platforms that do not implement ACPI

That submission stalled because it didn't really have the whole picture
(in that case, the wwan/modem driver in question).

> As Ethan and Andrey pointed out, we could skip max_link_speed_show()
> because pcie_get_speed_cap() already uses a cached value and doesn't
> do a config access.

Ack, I'll drop that part of the change.

> max_link_width_show() is similar and also comes from PCI_EXP_LNKCAP
> but is not currently cached, so I think we do need that one.  Worth a
> comment to explain the non-obvious difference.

Sure, I'll add a comment for max_link_width.

> PCI_EXP_LNKCAP is ostensibly read-only and could conceivably be
> cached, but the ASPM exit latencies can change based on the Common
> Clock Configuration.

I'll plan not to add additional caching, unless excess wakeups become a
problem.

Brian
Re: [PATCH] PCI/sysfs: Ensure devices are powered for config reads
Posted by Bjorn Helgaas 1 week, 3 days ago
On Tue, Sep 23, 2025 at 04:07:29PM -0700, Brian Norris wrote:
> On Tue, Sep 23, 2025 at 02:02:31PM -0500, Bjorn Helgaas wrote:
> > On Wed, Aug 20, 2025 at 10:26:08AM -0700, Brian Norris wrote:
> > > From: Brian Norris <briannorris@google.com>
> > > 
> > > max_link_speed, max_link_width, current_link_speed, current_link_width,
> > > secondary_bus_number, and subordinate_bus_number all access config
> > > registers, but they don't check the runtime PM state. If the device is
> > > in D3cold, we may see -EINVAL or even bogus values.
> > > 
> > > Wrap these access in pci_config_pm_runtime_{get,put}() like most of the
> > > rest of the similar sysfs attributes.
> > 
> > Protecting the config reads seems right to me.
> > 
> > If the device is in D3cold, a config read will result in a Completion
> > Timeout.  On most x86 platforms that's "fine" and merely results in ~0
> > data.  But that's merely convention, not a PCIe spec requirement.
> > 
> > I think it's a potential issue with PCIe controllers used on arm64 and
> > might result in an SError or synchronous abort from which we don't
> > recover well.  I'd love to hear actual experience about how reading
> > "current_link_speed" works on a device in D3cold in an arm64 system.
> 
> I'm working on a few such arm64 systems :) (pcie-qcom Chromebooks, and
> non-upstream DWC-based Pixel phones; I have a little more knowledge of
> the latter.) The answers may vary by SoC, and especially by PCIe
> implementation. ARM SoCs are notoriously ... diverse.
> 
> To my knowledge, it can be several of the above on arm64 + DWC.
> 
> * pci_generic_config_read() -> pci_ops::map_bus() may return NULL, in
>   which case we get PCIBIOS_DEVICE_NOT_FOUND / -EINVAL. And specifically
>   on arm64 with DWC PCIe, dw_pcie_other_conf_map_bus() may see the link
>   down on a suspended bridge and return NULL.
> 
> * The map_bus() check is admittedly racy, so we might still *actually*
>   hit the hardware, at which point this gets even more
>   implementation-defined:
> 
>   (a) if the PCIe HW is not powered (for example, if we put the link to
>       L3 and fully powered off the controller to save power), we might
>       not even get a completion timeout, and it depends on how the
>       SoC is wired up. But I believe this tends to be SError, and a
>       crash.
> 
>   (b) if the PCIe HW is powered but something else is down (e.g., link
>       in L2, device in D3cold, etc.), we'll get a Completion Timeout,
>       and a ~0 response. I also was under the impression a ~0 response
>       is not spec-mandated, but I believe it's noted in the Synopsys
>       documentation.

The ~0 response is not required by the PCIe spec, although there's at
least one implementation note to the effect that a Root Complex
intended for use with software that depends on ~0 data when a config
request fails with Unsupported Request must synthesize that value
(this one is from PCIe r7.0, sec 2.3.2).
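
To see why a ~0 response matters for these attributes, here is a minimal
sketch of decoding the Link Status register. The mask values match
PCI_EXP_LNKSTA_CLS and PCI_EXP_LNKSTA_NLW in pci_regs.h; the lnksta_*
helper names are made up for illustration.

```c
#include <stdint.h>

/* Field layout of the Link Status register (values as in pci_regs.h). */
#define LNKSTA_CLS		0x000f	/* Current Link Speed */
#define LNKSTA_NLW		0x03f0	/* Negotiated Link Width */
#define LNKSTA_NLW_SHIFT	4

/* Raw speed code; the kernel maps this to a string via a lookup table. */
static unsigned int lnksta_speed_code(uint16_t lnksta)
{
	return lnksta & LNKSTA_CLS;
}

/* Negotiated width in lanes. */
static unsigned int lnksta_width(uint16_t lnksta)
{
	return (lnksta & LNKSTA_NLW) >> LNKSTA_NLW_SHIFT;
}
```

With these helpers an all-ones read decodes to speed code 15 and width
63, neither of which describes a real link.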

> NB: I'm not sure there is really great upstream support for arm64 +
> D3cold yet. If they're not using ACPI (as few arm64 systems do), they
> probably don't have the appropriate platform_pci_* hooks to really
> manage it properly. There have been some prior attempts at adding
> non-x86/ACPI hooks for this, although for different reasons:
> 
>     https://lore.kernel.org/linux-pci/a38e76d6f3a90d7c968c32cee97604f3c41cbccf.camel@mediatek.com/
>     [PATCH] PCI:PM: Support platforms that do not implement ACPI
> 
> That submission stalled because it didn't really have the whole picture
> (in that case, the wwan/modem driver in question).
> 
> > As Ethan and Andrey pointed out, we could skip max_link_speed_show()
> > because pcie_get_speed_cap() already uses a cached value and doesn't
> > do a config access.
> 
> Ack, I'll drop that part of the change.
> 
> > max_link_width_show() is similar and also comes from PCI_EXP_LNKCAP
> > but is not currently cached, so I think we do need that one.  Worth a
> > comment to explain the non-obvious difference.
> 
> Sure, I'll add a comment for max_link_width.
> 
> > PCI_EXP_LNKCAP is ostensibly read-only and could conceivably be
> > cached, but the ASPM exit latencies can change based on the Common
> > Clock Configuration.
> 
> I'll plan not to add additional caching, unless excess wakeups becomes a
> problem.

Perfect, thanks, I'll watch for this.

Bjorn
Re: [PATCH] PCI/sysfs: Ensure devices are powered for config reads
Posted by Andrey Ryabinin 1 week, 3 days ago
On Fri, Aug 22, 2025 at 10:10 PM Brian Norris <briannorris@chromium.org> wrote:
>
> From: Brian Norris <briannorris@google.com>
>
> max_link_speed, max_link_width, current_link_speed, current_link_width,
> secondary_bus_number, and subordinate_bus_number all access config
> registers, but they don't check the runtime PM state. If the device is
> in D3cold, we may see -EINVAL or even bogus values.

I've hit this bug as well, except in my case the device was behind a
suspended PCI bridge, which seems to block config space accesses.

>
> Wrap these access in pci_config_pm_runtime_{get,put}() like most of the
                       accesses

> rest of the similar sysfs attributes.
>
> Fixes: 56c1af4606f0 ("PCI: Add sysfs max_link_speed/width, current_link_speed/width, etc")
> Cc: stable@vger.kernel.org
> Signed-off-by: Brian Norris <briannorris@google.com>
> Signed-off-by: Brian Norris <briannorris@chromium.org>
> ---
>
>  drivers/pci/pci-sysfs.c | 32 +++++++++++++++++++++++++++++---
>  1 file changed, 29 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index 5eea14c1f7f5..160df897dc5e 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -191,9 +191,16 @@ static ssize_t max_link_speed_show(struct device *dev,
>                                    struct device_attribute *attr, char *buf)
>  {
>         struct pci_dev *pdev = to_pci_dev(dev);
> +       ssize_t ret;
> +
> +       pci_config_pm_runtime_get(pdev);
>
> -       return sysfs_emit(buf, "%s\n",
> -                         pci_speed_string(pcie_get_speed_cap(pdev)));
> +       ret = sysfs_emit(buf, "%s\n",
> +                        pci_speed_string(pcie_get_speed_cap(pdev)));

pci_speed_string() & pcie_get_speed_cap() don't access config space,
so no need to change this one.

> +
> +       pci_config_pm_runtime_put(pdev);
> +
> +       return ret;
Re: [PATCH] PCI/sysfs: Ensure devices are powered for config reads
Posted by Ethan Zhao 1 month, 2 weeks ago

On 8/21/2025 1:26 AM, Brian Norris wrote:
> From: Brian Norris <briannorris@google.com>
> 
> max_link_speed, max_link_width, current_link_speed, current_link_width,
> secondary_bus_number, and subordinate_bus_number all access config
> registers, but they don't check the runtime PM state. If the device is
> in D3cold, we may see -EINVAL or even bogus values.

My understanding is that if your device is in D3cold, returning -EINVAL
is the right behavior.

>
> Wrap these access in pci_config_pm_runtime_{get,put}() like most of the
> rest of the similar sysfs attributes.
> 
> Fixes: 56c1af4606f0 ("PCI: Add sysfs max_link_speed/width, current_link_speed/width, etc")
> Cc: stable@vger.kernel.org
> Signed-off-by: Brian Norris <briannorris@google.com>
> Signed-off-by: Brian Norris <briannorris@chromium.org>
> ---
> 
>   drivers/pci/pci-sysfs.c | 32 +++++++++++++++++++++++++++++---
>   1 file changed, 29 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index 5eea14c1f7f5..160df897dc5e 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -191,9 +191,16 @@ static ssize_t max_link_speed_show(struct device *dev,
>   				   struct device_attribute *attr, char *buf)
>   {
>   	struct pci_dev *pdev = to_pci_dev(dev);
> +	ssize_t ret;
> +
> +	pci_config_pm_runtime_get(pdev);

This function could change the power state of the device, which is a
complex process and goes beyond the meaning of max_link_speed_show().
Given the semantics of these functions (max_link_speed_show(),
max_link_width_show(), current_link_speed_show(), ...), this cannot be
done!

Thanks,
Ethan

>
> -	return sysfs_emit(buf, "%s\n",
> -			  pci_speed_string(pcie_get_speed_cap(pdev)));
> +	ret = sysfs_emit(buf, "%s\n",
> +			 pci_speed_string(pcie_get_speed_cap(pdev)));
> +
> +	pci_config_pm_runtime_put(pdev);
> +
> +	return ret;
>   }
>   static DEVICE_ATTR_RO(max_link_speed);
>   
> @@ -201,8 +208,15 @@ static ssize_t max_link_width_show(struct device *dev,
>   				   struct device_attribute *attr, char *buf)
>   {
>   	struct pci_dev *pdev = to_pci_dev(dev);
> +	ssize_t ret;
> +
> +	pci_config_pm_runtime_get(pdev);
> +
> +	ret = sysfs_emit(buf, "%u\n", pcie_get_width_cap(pdev));
>   
> -	return sysfs_emit(buf, "%u\n", pcie_get_width_cap(pdev));
> +	pci_config_pm_runtime_put(pdev);
> +
> +	return ret;
>   }
>   static DEVICE_ATTR_RO(max_link_width);
>   
> @@ -214,7 +228,10 @@ static ssize_t current_link_speed_show(struct device *dev,
>   	int err;
>   	enum pci_bus_speed speed;
>   
> +	pci_config_pm_runtime_get(pci_dev);
>   	err = pcie_capability_read_word(pci_dev, PCI_EXP_LNKSTA, &linkstat);
> +	pci_config_pm_runtime_put(pci_dev);
> +
>   	if (err)
>   		return -EINVAL;
>   
> @@ -231,7 +248,10 @@ static ssize_t current_link_width_show(struct device *dev,
>   	u16 linkstat;
>   	int err;
>   
> +	pci_config_pm_runtime_get(pci_dev);
>   	err = pcie_capability_read_word(pci_dev, PCI_EXP_LNKSTA, &linkstat);
> +	pci_config_pm_runtime_put(pci_dev);
> +
>   	if (err)
>   		return -EINVAL;
>   
> @@ -247,7 +267,10 @@ static ssize_t secondary_bus_number_show(struct device *dev,
>   	u8 sec_bus;
>   	int err;
>   
> +	pci_config_pm_runtime_get(pci_dev);
>   	err = pci_read_config_byte(pci_dev, PCI_SECONDARY_BUS, &sec_bus);
> +	pci_config_pm_runtime_put(pci_dev);
> +
>   	if (err)
>   		return -EINVAL;
>   
> @@ -263,7 +286,10 @@ static ssize_t subordinate_bus_number_show(struct device *dev,
>   	u8 sub_bus;
>   	int err;
>   
> +	pci_config_pm_runtime_get(pci_dev);
>   	err = pci_read_config_byte(pci_dev, PCI_SUBORDINATE_BUS, &sub_bus);
> +	pci_config_pm_runtime_put(pci_dev);
> +
>   	if (err)
>   		return -EINVAL;
>
Re: [PATCH] PCI/sysfs: Ensure devices are powered for config reads
Posted by Brian Norris 1 month, 1 week ago
On Thu, Aug 21, 2025 at 08:54:52AM +0800, Ethan Zhao wrote:
> On 8/21/2025 1:26 AM, Brian Norris wrote:
> > From: Brian Norris <briannorris@google.com>
> > 
> > max_link_speed, max_link_width, current_link_speed, current_link_width,
> > secondary_bus_number, and subordinate_bus_number all access config
> > registers, but they don't check the runtime PM state. If the device is
> > in D3cold, we may see -EINVAL or even bogus values.
> My understanding, if your device is in D3cold, returning of -EINVAL is
> the right behavior.

That's not the guaranteed result though. Some hosts don't properly
return PCIBIOS_DEVICE_NOT_FOUND, for one. But also, it's racy -- because
we don't even try to hold a pm_runtime reference, the device could
possibly enter D3cold while we're in the middle of reading from it. If
you're lucky, that'll get you a completion timeout and an all-1's
result, and we'll return a garbage result.

So if we want to purposely not resume the device and retain "I can't
give you what you asked for" behavior, we'd at least need a
pm_runtime_get_noresume() or similar.
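
The race and the noresume alternative can be illustrated with a toy
usage-count model. It is a sketch only: the model_* functions are
hypothetical and merely imitate the idea behind runtime PM reference
counting, not the kernel's actual implementation.

```c
#include <stdbool.h>

struct model_pm {
	int usage_count;	/* references held by readers */
	bool suspended;
};

/* Like a "get" that also resumes: the caller may touch the device. */
static void model_get_sync(struct model_pm *pm)
{
	pm->usage_count++;
	pm->suspended = false;
}

/* Like a "noresume" get: pins the current state, wakes nothing. */
static void model_get_noresume(struct model_pm *pm)
{
	pm->usage_count++;
}

static void model_put(struct model_pm *pm)
{
	pm->usage_count--;
}

/* Autosuspend attempt: only succeeds when nobody holds a reference. */
static bool model_try_suspend(struct model_pm *pm)
{
	if (pm->usage_count > 0)
		return false;
	pm->suspended = true;
	return true;
}

/* Reader that refuses to wake the device but cannot be raced. */
static int model_read_if_active(struct model_pm *pm, int reg, int *val)
{
	int ret = -1;	/* "can't give you what you asked for" */

	model_get_noresume(pm);
	if (!pm->suspended) {
		*val = reg;
		ret = 0;
	}
	model_put(pm);
	return ret;
}
```

Either style of "get" closes the window in which the device could
suspend mid-read; they differ only in whether a suspended device is
woken or the read is refused.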

> > Wrap these access in pci_config_pm_runtime_{get,put}() like most of the
> > rest of the similar sysfs attributes.
> > 
> > Fixes: 56c1af4606f0 ("PCI: Add sysfs max_link_speed/width, current_link_speed/width, etc")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Brian Norris <briannorris@google.com>
> > Signed-off-by: Brian Norris <briannorris@chromium.org>
> > ---
> > 
> >   drivers/pci/pci-sysfs.c | 32 +++++++++++++++++++++++++++++---
> >   1 file changed, 29 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> > index 5eea14c1f7f5..160df897dc5e 100644
> > --- a/drivers/pci/pci-sysfs.c
> > +++ b/drivers/pci/pci-sysfs.c
> > @@ -191,9 +191,16 @@ static ssize_t max_link_speed_show(struct device *dev,
> >   				   struct device_attribute *attr, char *buf)
> >   {
> >   	struct pci_dev *pdev = to_pci_dev(dev);
> > +	ssize_t ret;
> > +
> > +	pci_config_pm_runtime_get(pdev);
> This function would potentially change the power state of device,
> that would be a complex process, beyond the meaning of
> max_link_speed_show(), given the semantics of these functions (
> max_link_speed_show()/max_link_width_show()/current_link_speed_show()/
> ....),
> this cannot be done !

What makes this different than the 'config' attribute (i.e., "read
config register")? Why shouldn't that just return -EINVAL? I don't
really buy your reasoning -- "it's a complex process" is not a reason
not to do something. The user asked for the link speed; why not give it?
If the user wanted to know if the device was powered, they could check
the 'power_state' attribute instead.
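
From userspace, a failing show() callback surfaces as an errno on
read(). A minimal sketch of a reader, assuming nothing beyond POSIX;
read_attr() is an illustrative helper, not an existing API:

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Read one sysfs attribute into buf (NUL-terminated, newline stripped).
 * Returns 0 on success or a negative errno, e.g. -EINVAL when the
 * kernel's show() callback failed.
 */
static int read_attr(const char *path, char *buf, size_t len)
{
	ssize_t n;
	int fd, err;

	if (len == 0)
		return -EINVAL;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	n = read(fd, buf, len - 1);
	err = errno;	/* save before close() can overwrite it */
	close(fd);
	if (n < 0)
		return -err;

	buf[n] = '\0';
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

Pointed at a device's current_link_speed and power_state files under
/sys/bus/pci/devices/, this lets a tool tell "read failed" apart from
"read returned a value", although a returned value is only trustworthy
if the kernel kept the device powered during the read.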

(Side note: these attributes don't show up anywhere in Documentation/,
so it's also a bit hard to declare "best" semantics for them.)

To flip this question around a bit: if I have a system that aggressively
suspends devices when there's no recent activity, how am I supposed to
check what the link speed is? Probabilistically hammer the file while
hoping some other activity wakes the device, so I can find the small
windows of time where it's RPM_ACTIVE? Disable runtime_pm for the device
while I check?

Brian
Re: [PATCH] PCI/sysfs: Ensure devices are powered for config reads
Posted by Ethan Zhao 1 month, 1 week ago

On 8/21/2025 10:56 AM, Brian Norris wrote:
> On Thu, Aug 21, 2025 at 08:54:52AM +0800, Ethan Zhao wrote:
>> On 8/21/2025 1:26 AM, Brian Norris wrote:
>>> From: Brian Norris <briannorris@google.com>
>>>
>>> max_link_speed, max_link_width, current_link_speed, current_link_width,
>>> secondary_bus_number, and subordinate_bus_number all access config
>>> registers, but they don't check the runtime PM state. If the device is
>>> in D3cold, we may see -EINVAL or even bogus values.
>> My understanding, if your device is in D3cold, returning of -EINVAL is
>> the right behavior.
> 
> That's not the guaranteed result though. Some hosts don't properly
> return PCIBIOS_DEVICE_NOT_FOUND, for one. But also, it's racy -- because
> we don't even try to hold a pm_runtime reference, the device could
> possibly enter D3cold while we're in the middle of reading from it. If
> you're lucky, that'll get you a completion timeout and an all-1's
> result, and we'll return a garbage result.
> 
> So if we want to purposely not resume the device and retain "I can't
> give you what you asked for" behavior, we'd at least need a
> pm_runtime_get_noresume() or similar.

I understand you just want a stable result for these caps, but meanwhile
you don't want the side effect either.

>
>>> Wrap these access in pci_config_pm_runtime_{get,put}() like most of the
>>> rest of the similar sysfs attributes.
>>>
>>> Fixes: 56c1af4606f0 ("PCI: Add sysfs max_link_speed/width, current_link_speed/width, etc")
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Brian Norris <briannorris@google.com>
>>> Signed-off-by: Brian Norris <briannorris@chromium.org>
>>> ---
>>>
>>>    drivers/pci/pci-sysfs.c | 32 +++++++++++++++++++++++++++++---
>>>    1 file changed, 29 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
>>> index 5eea14c1f7f5..160df897dc5e 100644
>>> --- a/drivers/pci/pci-sysfs.c
>>> +++ b/drivers/pci/pci-sysfs.c
>>> @@ -191,9 +191,16 @@ static ssize_t max_link_speed_show(struct device *dev,
>>>    				   struct device_attribute *attr, char *buf)
>>>    {
>>>    	struct pci_dev *pdev = to_pci_dev(dev);
>>> +	ssize_t ret;
>>> +
>>> +	pci_config_pm_runtime_get(pdev);
>> This function would potentially change the power state of device,
>> that would be a complex process, beyond the meaning of
>> max_link_speed_show(), given the semantics of these functions (
>> max_link_speed_show()/max_link_width_show()/current_link_speed_show()/
>> ....),
>> this cannot be done !
> 
> What makes this different than the 'config' attribute (i.e., "read
> config register")? Why shouldn't that just return -EINVAL? I don't
> really buy your reasoning -- "it's a complex process" is not a reason

It is a reason to know there is a side effect to be taken into account.

> not to do something. The user asked for the link speed; why not give it?
> If the user wanted to know if the device was powered, they could check
> the 'power_state' attribute instead.
> 
> (Side note: these attributes don't show up anywhere in Documentation/,
> so it's also a bit hard to declare "best" semantics for them.)
> 
> To flip this question around a bit: if I have a system that aggressively
> suspends devices when there's no recent activity, how am I supposed to
> check what the link speed is? Probabilistically hammer the file while
> hoping some other activity wakes the device, so I can find the small
> windows of time where it's RPM_ACTIVE? Disable runtime_pm for the device
> while I check?

I have no objection to holding a PM reference via
pci_config_pm_runtime_get() and then writing some data to PCIe config
space.

For the link speed etc. capabilities (as opposed to status), how about
creating a cached version of these caps, so there is no need to change
the device's power state?

If there is an aggressive power saving requirement, polling these caps
could lead to wakeup/power-on bugs.


Thanks,
Ethan

> 
> Brian
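
The caching idea could look roughly like the following toy model. It is
a sketch under assumptions: the model_* names are hypothetical, and it
only makes sense for fields that never change after enumeration (Bjorn
notes elsewhere in the thread that parts of PCI_EXP_LNKCAP can change).

```c
#include <stdbool.h>

struct model_dev {
	bool powered;
	unsigned int hw_width_cap;	/* what the register holds */
	unsigned int cached_width_cap;	/* filled in once, at probe time */
	unsigned int wakeups;		/* how often a read powered us up */
};

/* Touching hardware forces the device on, which is the side effect. */
static unsigned int model_read_hw(struct model_dev *dev)
{
	if (!dev->powered) {
		dev->powered = true;
		dev->wakeups++;
	}
	return dev->hw_width_cap;
}

/* Capture the capability while the device is known to be powered. */
static void model_probe(struct model_dev *dev)
{
	dev->cached_width_cap = model_read_hw(dev);
}

/* Cached path: no hardware access, so no wakeup. */
static unsigned int model_width_cap_cached(const struct model_dev *dev)
{
	return dev->cached_width_cap;
}

/* Uncached path: correct, but wakes a suspended device. */
static unsigned int model_width_cap_uncached(struct model_dev *dev)
{
	return model_read_hw(dev);
}
```

The trade-off is the one debated in this thread: the cached path never
wakes the device, while the uncached path always reflects the hardware
at the cost of a possible resume.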
Re: [PATCH] PCI/sysfs: Ensure devices are powered for config reads
Posted by Brian Norris 1 month, 1 week ago
Hi Ethan,

Note: I'm having a hard time reading your emails sometimes, because you
aren't really adding in appropriate newlines that separate your reply
from quoted text. So your own sentences just run together with parts of
my sentences at times. I've tried to resolve this as best I can.

On Thu, Aug 21, 2025 at 08:41:28PM +0800, Ethan Zhao wrote:
> 
> 
> On 8/21/2025 10:56 AM, Brian Norris wrote:
> > On Thu, Aug 21, 2025 at 08:54:52AM +0800, Ethan Zhao wrote:
> > > On 8/21/2025 1:26 AM, Brian Norris wrote:
> > > > From: Brian Norris <briannorris@google.com>
> > > > 
> > > > max_link_speed, max_link_width, current_link_speed, current_link_width,
> > > > secondary_bus_number, and subordinate_bus_number all access config
> > > > registers, but they don't check the runtime PM state. If the device is
> > > > in D3cold, we may see -EINVAL or even bogus values.
> > > My understanding, if your device is in D3cold, returning of -EINVAL is
> > > the right behavior.
> > 
> > That's not the guaranteed result though. Some hosts don't properly
> > return PCIBIOS_DEVICE_NOT_FOUND, for one. But also, it's racy -- because
> > we don't even try to hold a pm_runtime reference, the device could
> > possibly enter D3cold while we're in the middle of reading from it. If
> > you're lucky, that'll get you a completion timeout and an all-1's
> > result, and we'll return a garbage result.
> > 
> > So if we want to purposely not resume the device and retain "I can't
> > give you what you asked for" behavior, we'd at least need a
> > pm_runtime_get_noresume() or similar.
> I understand you just want the stable result of these caps,

Yes, I'd like a valid result, not EINVAL. Why would I check this file if
I didn't want the result?

> meanwhile
> you don't want the side effect either.

Personally, I think the side effect is completely fine. Or, it's just as
fine as it is for the 'config' attribute or for 'resource_N_size'
attributes that already do the same.

> > > > Wrap these access in pci_config_pm_runtime_{get,put}() like most of the
> > > > rest of the similar sysfs attributes.
> > > > 
> > > > Fixes: 56c1af4606f0 ("PCI: Add sysfs max_link_speed/width, current_link_speed/width, etc")
> > > > Cc: stable@vger.kernel.org
> > > > Signed-off-by: Brian Norris <briannorris@google.com>
> > > > Signed-off-by: Brian Norris <briannorris@chromium.org>
> > > > ---
> > > > 
> > > >    drivers/pci/pci-sysfs.c | 32 +++++++++++++++++++++++++++++---
> > > >    1 file changed, 29 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> > > > index 5eea14c1f7f5..160df897dc5e 100644
> > > > --- a/drivers/pci/pci-sysfs.c
> > > > +++ b/drivers/pci/pci-sysfs.c
> > > > @@ -191,9 +191,16 @@ static ssize_t max_link_speed_show(struct device *dev,
> > > >    				   struct device_attribute *attr, char *buf)
> > > >    {
> > > >    	struct pci_dev *pdev = to_pci_dev(dev);
> > > > +	ssize_t ret;
> > > > +
> > > > +	pci_config_pm_runtime_get(pdev);
> > > This function would potentially change the power state of device,
> > > that would be a complex process, beyond the meaning of
> > > max_link_speed_show(), given the semantics of these functions (
> > > max_link_speed_show()/max_link_width_show()/current_link_speed_show()/
> > > ....),
> > > this cannot be done !
> > 
> > What makes this different than the 'config' attribute (i.e., "read
> > config register")? Why shouldn't that just return -EINVAL? I don't
> > really buy your reasoning -- "it's a complex process" is not a reason
> It is a reason to know there is side effect to be taken into account.

OK, agreed, there's a side effect. I don't think you've convinced me the
side effect is bad though.

> > not
> > to do something. The user asked for the link speed; why not give it?
> > If the user wanted to know if the device was powered, they could check
> > the 'power_state' attribute instead.
> > 
> > (Side note: these attributes don't show up anywhere in Documentation/,
> > so it's also a bit hard to declare "best" semantics for them.)
> > 
> > To flip this question around a bit: if I have a system that aggressively
> > suspends devices when there's no recent activity, how am I supposed to
> > check what the link speed is? Probabilistically hammer the file while
> > hoping some other activity wakes the device, so I can find the small
> > windows of time where it's RPM_ACTIVE? Disable runtime_pm for the device
> > while I check?
> Hold a PM reference by pci_config_pm_runtime_get() and then write some
> data to the PCIe config space, no objection.
> 
> To know about the linkspeed etc capabilities/not status, how about
> creating a cached version of these caps, no need to change their
> power state.

For static values like the "max" attributes, maybe that's fine.

But Linux is not always the one changing the link speed. I've seen PCI
devices that autonomously request link-speed changes, and AFAICT, the
only way we'd know in host software is to go reread the config
registers. So caching just produces cache invalidation problems.

> If there is aggressive power saving requirement, and the polling
> of these caps will make up wakeup/poweron bugs.

If you're worried about wakeup frequency, I think that's a matter of
user space / system administration to decide -- if it doesn't want to
potentially wake up the link, it shouldn't be poking at config-based
sysfs attributes.
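
For illustration only, here is a minimal user-space sketch of what a reader of these attributes has to cope with. It assumes the attribute is an ordinary sysfs text file (for example a device's current_link_speed) and that a failed read surfaces as an errno such as EINVAL, as described above. The helper name read_attr() and its calling convention are made up for this example and are not part of any existing tool.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read one sysfs-style attribute file into buf.
 * The result is NUL-terminated and a single trailing newline is stripped.
 * Returns the string length on success, or a negative errno value
 * (for example -EINVAL or -ENOENT) when open() or read() fails.
 */
ssize_t read_attr(const char *path, char *buf, size_t len)
{
	ssize_t n;
	int fd;

	assert(path && buf);
	if (len == 0)
		return -EINVAL;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	n = read(fd, buf, len - 1);
	if (n < 0) {
		int err = errno;	/* save errno before close() can change it */

		close(fd);
		return -err;
	}
	close(fd);

	buf[n] = '\0';
	if (n > 0 && buf[n - 1] == '\n')
		buf[--n] = '\0';

	return n;
}
```

A caller would pass the attribute path and a small buffer, then treat a negative return as "no answer available" rather than as a link speed. Without the patch, such a caller has no way to turn that error into a value other than retrying and hoping the device happens to be awake.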

Brian
Re: [PATCH] PCI/sysfs: Ensure devices are powered for config reads
Posted by Ethan Zhao 1 month, 1 week ago

On 8/21/2025 11:28 PM, Brian Norris wrote:
> Hi Ethan,
> 
> Note: I'm having a hard time reading your emails sometimes, because you
> aren't really adding in appropriate newlines that separate your reply
> from quoted text. So your own sentences just run together with parts of
> my sentences at times. I've tried to resolve this as best I can.
> 
> On Thu, Aug 21, 2025 at 08:41:28PM +0800, Ethan Zhao wrote:
>>
>>
>> On 8/21/2025 10:56 AM, Brian Norris wrote:
>>> On Thu, Aug 21, 2025 at 08:54:52AM +0800, Ethan Zhao wrote:
>>>> On 8/21/2025 1:26 AM, Brian Norris wrote:
>>>>> From: Brian Norris <briannorris@google.com>
>>>>>
>>>>> max_link_speed, max_link_width, current_link_speed, current_link_width,
>>>>> secondary_bus_number, and subordinate_bus_number all access config
>>>>> registers, but they don't check the runtime PM state. If the device is
>>>>> in D3cold, we may see -EINVAL or even bogus values.
>>>> My understanding, if your device is in D3cold, returning of -EINVAL is
>>>> the right behavior.
>>>
>>> That's not the guaranteed result though. Some hosts don't properly
>>> return PCIBIOS_DEVICE_NOT_FOUND, for one. But also, it's racy -- because
>>> we don't even try to hold a pm_runtime reference, the device could
>>> possibly enter D3cold while we're in the middle of reading from it. If
>>> you're lucky, that'll get you a completion timeout and an all-1's
>>> result, and we'll return a garbage result.
>>>
>>> So if we want to purposely not resume the device and retain "I can't
>>> give you what you asked for" behavior, we'd at least need a
>>> pm_runtime_get_noresume() or similar.
>> I understand you just want the stable result of these caps,
> 
> Yes, I'd like a valid result, not EINVAL. Why would I check this file if
> I didn't want the result?
> 
>> meanwhile
>> you don't want the side effect either.
> 
> Personally, I think side effect is completely fine. Or, it's just as
> fine as it is for the 'config' attribute or for 'resource_N_size'
> attributes that already do the same.
> 
>>>>> Wrap these access in pci_config_pm_runtime_{get,put}() like most of the
>>>>> rest of the similar sysfs attributes.
>>>>>
>>>>> Fixes: 56c1af4606f0 ("PCI: Add sysfs max_link_speed/width, current_link_speed/width, etc")
>>>>> Cc: stable@vger.kernel.org
>>>>> Signed-off-by: Brian Norris <briannorris@google.com>
>>>>> Signed-off-by: Brian Norris <briannorris@chromium.org>
>>>>> ---
>>>>>
>>>>>     drivers/pci/pci-sysfs.c | 32 +++++++++++++++++++++++++++++---
>>>>>     1 file changed, 29 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
>>>>> index 5eea14c1f7f5..160df897dc5e 100644
>>>>> --- a/drivers/pci/pci-sysfs.c
>>>>> +++ b/drivers/pci/pci-sysfs.c
>>>>> @@ -191,9 +191,16 @@ static ssize_t max_link_speed_show(struct device *dev,
>>>>>     				   struct device_attribute *attr, char *buf)
>>>>>     {
>>>>>     	struct pci_dev *pdev = to_pci_dev(dev);
>>>>> +	ssize_t ret;
>>>>> +
>>>>> +	pci_config_pm_runtime_get(pdev);
>>>> This function would potentially change the power state of device,
>>>> that would be a complex process, beyond the meaning of
>>>> max_link_speed_show(), given the semantics of these functions (
>>>> max_link_speed_show()/max_link_width_show()/current_link_speed_show()/
>>>> ....),
>>>> this cannot be done !
>>>
>>> What makes this different than the 'config' attribute (i.e., "read
>>> config register")? Why shouldn't that just return -EINVAL? I don't
>>> really buy your reasoning -- "it's a complex process" is not a reason
>> It is a reason to know there is side effect to be taken into account.
> 
> OK, agreed, there's a side effect. I don't think you've convinced me the
> side effect is bad though.
> 
>>> not
>>> to do something. The user asked for the link speed; why not give it?
>>> If the user wanted to know if the device was powered, they could check
>>> the 'power_state' attribute instead.
>>>
>>> (Side note: these attributes don't show up anywhere in Documentation/,
>>> so it's also a bit hard to declare "best" semantics for them.)
>>>
>>> To flip this question around a bit: if I have a system that aggressively
>>> suspends devices when there's no recent activity, how am I supposed to
>>> check what the link speed is? Probabilistically hammer the file while
>>> hoping some other activity wakes the device, so I can find the small
>>> windows of time where it's RPM_ACTIVE? Disable runtime_pm for the device
>>> while I check?
>> Hold a PM reference by pci_config_pm_runtime_get() and then write some
>> data to the PCIe config space, no objection.
>>
>> To know about the linkspeed etc capabilities/not status, how about
>> creating a cached version of these caps, no need to change their
>> power state.
> 
> For static values like the "max" attributes, maybe that's fine.
> 
> But Linux is not always the one changing the link speed. I've seen PCI
> devices that autonomously request link-speed changes, and AFAICT, the
> only way we'd know in host software is to go reread the config
> registers. So caching just produces cache invalidation problems.
Maybe you meant the link-speed status, that would be volatile based on
link retraining.
Here we are talking about some non-volatile capabilities value no
invalidation needed to their cached variables.
>
>> If there is aggressive power saving requirement, and the polling
>> of these caps will make up wakeup/poweron bugs.
> 
> If you're worried about wakeup frequency, I think that's a matter of
> user space / system administration to decide -- if it doesn't want to
> potentially wake up the link, it shouldn't be poking at config-based
> sysfs attributes.

IMHO, sysfs interface is part of KABI, you change its behavior, you
definitely would break some running binaries. There is an alternative
way to avoid re-cooking binaries or waking up administrator to modify
their configuration/script in the deep night. You already got it.

Thanks,
Ethan
> 
> Brian
Re: [PATCH] PCI/sysfs: Ensure devices are powered for config reads
Posted by Brian Norris 3 weeks, 2 days ago
On Fri, Aug 22, 2025 at 09:11:25AM +0800, Ethan Zhao wrote:
> On 8/21/2025 11:28 PM, Brian Norris wrote:
> > On Thu, Aug 21, 2025 at 08:41:28PM +0800, Ethan Zhao wrote:
> > > Hold a PM reference by pci_config_pm_runtime_get() and then write some
> > > data to the PCIe config space, no objection.
> > > 
> > > To know about the linkspeed etc capabilities/not status, how about
> > > creating a cached version of these caps, no need to change their
> > > power state.
> > 
> > For static values like the "max" attributes, maybe that's fine.
> > 
> > But Linux is not always the one changing the link speed. I've seen PCI
> > devices that autonomously request link-speed changes, and AFAICT, the
> > only way we'd know in host software is to go reread the config
> > registers. So caching just produces cache invalidation problems.
> Maybe you meant the link-speed status, that would be volatile based on
> link retraining.

Yes.

> Here we are talking about some non-volatile capabilities value no
> invalidation needed to their cached variables.

I missed the "not status" part a few lines up.

So yes, I agree it's possible to make some of these (but not all) use a
cache. I could perhaps give that a shot, if it's acknowledged that the
non-cacheable attributes are worth fixing.
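
As a rough sketch of that split, the following is a small user-space model, not kernel code. Every name in it (struct model_dev, model_get(), model_put(), and so on) is hypothetical. It assumes that a config read from an unpowered device returns all ones, and it simplifies runtime PM so that dropping the last reference suspends the device immediately. In the kernel, the get/put pair would be pci_config_pm_runtime_{get,put}() as in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a device; this is not a kernel structure. */
struct model_dev {
	bool powered;		/* false models D3cold */
	int usage;		/* models the runtime PM usage count */
	uint16_t hw_max_speed;	/* static capability held by the "hardware" */
	uint16_t hw_cur_speed;	/* status the device may change on its own */
	uint16_t cached_max;	/* copy of the capability taken at probe */
};

/* Assumption: a config read from an unpowered device yields all ones. */
uint16_t model_cfg_read(const struct model_dev *d, uint16_t reg_val)
{
	return d->powered ? reg_val : 0xffff;
}

/* Take a reference and resume the device. */
void model_get(struct model_dev *d)
{
	d->usage++;
	d->powered = true;
}

/* Drop a reference; the model suspends as soon as the count hits zero. */
void model_put(struct model_dev *d)
{
	assert(d->usage > 0);
	if (--d->usage == 0)
		d->powered = false;
}

/* Static capability: read once while powered, then serve from the cache. */
void model_probe(struct model_dev *d)
{
	model_get(d);
	d->cached_max = model_cfg_read(d, d->hw_max_speed);
	model_put(d);
}

uint16_t model_max_speed_show(const struct model_dev *d)
{
	return d->cached_max;	/* no wakeup needed */
}

/* Volatile status: must be read live, so hold a reference around it. */
uint16_t model_cur_speed_show(struct model_dev *d)
{
	uint16_t v;

	model_get(d);
	v = model_cfg_read(d, d->hw_cur_speed);
	model_put(d);

	return v;
}

/* The unguarded variant returns garbage once the device is suspended. */
uint16_t model_cur_speed_show_unguarded(const struct model_dev *d)
{
	return model_cfg_read(d, d->hw_cur_speed);
}
```

In this model the "max" attribute never wakes the device, while the "current" attribute still needs the get/put pair to return anything meaningful. If the device changes hw_cur_speed by itself, only the live read notices, which is the cache invalidation problem mentioned earlier.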

>
> > > If there is aggressive power saving requirement, and the polling
> > > of these caps will make up wakeup/poweron bugs.
> > 
> > If you're worried about wakeup frequency, I think that's a matter of
> > user space / system administration to decide -- if it doesn't want to
> > potentially wake up the link, it shouldn't be poking at config-based
> IMHO, sysfs interface is part of KABI, you change its behavior, you
> definitely would break some running binaries. There is an alternative
> way to avoid re-cooking binaries or waking up administrator to modify
> their configuration/script in the deep night. You already got it.

That's not how KABI works. Just because there's a potentially-observable
difference doesn't mean we're "breaking" the ABI. You'd have to
demonstrate an actual use case that is breaking. I don't see how it's
"broken" to wake up a device when the API is asking for a value that can
only be retrieved while awake. Sure, it's potentially a small change in
power consumption, but that can apply to almost any kind of change.

My claim is that this is a currently broken area, and that it is
impossible to use these interfaces on a system that aggressively enters
D3cold. If a system observes any difference from this change, then it
was broken before. Bugfixes are not inherently KABI breakages just
because they can be observed.

Brian