The current reset process saves the device's config space state before
reset and restores it afterward. However errors may occur unexpectedly and
it may then be impossible to save config space because the device may be
inaccessible (e.g. DPC) or config space may be corrupted. This results in
saving corrupted values that get written back to the device during state
restoration.
Since commit a2f1e22390ac ("PCI/ERR: Ensure error recoverability at all times"),
we now save the state of device at enumeration. On every restore we should
either use the enumeration saved state or driver's intentional saved state,
never a state saved at the unpredictable time of an error recovery reset.
Suggested-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
drivers/pci/pci.c | 32 +++++++++++++++-----------------
1 file changed, 15 insertions(+), 17 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 3090c727b76f..2242b97e7d46 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5012,7 +5012,7 @@ void pci_dev_unlock(struct pci_dev *dev)
}
EXPORT_SYMBOL_GPL(pci_dev_unlock);
-static void pci_dev_save_and_disable(struct pci_dev *dev)
+static void pci_dev_disable(struct pci_dev *dev)
{
const struct pci_error_handlers *err_handler =
dev->driver ? dev->driver->err_handler : NULL;
@@ -5029,13 +5029,11 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
pci_warn(dev, "resetting");
/*
- * Wake-up device prior to save. PM registers default to D0 after
- * reset and a simple register restore doesn't reliably return
- * to a non-D0 state anyway.
+ * PM registers default to D0 after reset and a simple register
+ * restore doesn't reliably return to a non-D0 state.
*/
pci_set_power_state(dev, PCI_D0);
- pci_save_state(dev);
/*
* Disable the device by clearing the Command register, except for
* INTx-disable which is set. This not only disables MMIO and I/O port
@@ -5199,7 +5197,7 @@ int pci_reset_function(struct pci_dev *dev)
pci_dev_lock(bridge);
pci_dev_lock(dev);
- pci_dev_save_and_disable(dev);
+ pci_dev_disable(dev);
rc = __pci_reset_function_locked(dev);
@@ -5241,7 +5239,7 @@ int pci_reset_function_locked(struct pci_dev *dev)
if (!pci_reset_supported(dev))
return -ENOTTY;
- pci_dev_save_and_disable(dev);
+ pci_dev_disable(dev);
rc = __pci_reset_function_locked(dev);
@@ -5267,7 +5265,7 @@ int pci_try_reset_function(struct pci_dev *dev)
if (!pci_dev_trylock(dev))
return -EAGAIN;
- pci_dev_save_and_disable(dev);
+ pci_dev_disable(dev);
rc = __pci_reset_function_locked(dev);
pci_dev_restore(dev);
pci_dev_unlock(dev);
@@ -5441,17 +5439,17 @@ static int pci_slot_trylock(struct pci_slot *slot)
}
/*
- * Save and disable devices from the top of the tree down while holding
+ * Disable devices from the top of the tree down while holding
* the @dev mutex lock for the entire tree.
*/
-static void pci_bus_save_and_disable_locked(struct pci_bus *bus)
+static void pci_bus_disable_locked(struct pci_bus *bus)
{
struct pci_dev *dev;
list_for_each_entry(dev, &bus->devices, bus_list) {
- pci_dev_save_and_disable(dev);
+ pci_dev_disable(dev);
if (dev->subordinate)
- pci_bus_save_and_disable_locked(dev->subordinate);
+ pci_bus_disable_locked(dev->subordinate);
}
}
@@ -5477,16 +5475,16 @@ static void pci_bus_restore_locked(struct pci_bus *bus)
* Save and disable devices from the top of the tree down while holding
* the @dev mutex lock for the entire tree.
*/
-static void pci_slot_save_and_disable_locked(struct pci_slot *slot)
+static void pci_slot_disable_locked(struct pci_slot *slot)
{
struct pci_dev *dev;
list_for_each_entry(dev, &slot->bus->devices, bus_list) {
if (!dev->slot || dev->slot != slot)
continue;
- pci_dev_save_and_disable(dev);
+ pci_dev_disable(dev);
if (dev->subordinate)
- pci_bus_save_and_disable_locked(dev->subordinate);
+ pci_bus_disable_locked(dev->subordinate);
}
}
@@ -5566,7 +5564,7 @@ static int __pci_reset_slot(struct pci_slot *slot)
return rc;
if (pci_slot_trylock(slot)) {
- pci_slot_save_and_disable_locked(slot);
+ pci_slot_disable_locked(slot);
might_sleep();
rc = pci_reset_hotplug_slot(slot->hotplug, PCI_RESET_DO_RESET);
pci_slot_restore_locked(slot);
@@ -5660,7 +5658,7 @@ int __pci_reset_bus(struct pci_bus *bus)
return rc;
if (pci_bus_trylock(bus)) {
- pci_bus_save_and_disable_locked(bus);
+ pci_bus_disable_locked(bus);
might_sleep();
rc = pci_bridge_secondary_bus_reset(bus->self);
pci_bus_restore_locked(bus);
--
2.43.0
On Tue, Feb 17, 2026 at 10:22:51AM -0800, Farhan Ali wrote:
> The current reset process saves the device's config space state before
> reset and restores it afterward. However errors may occur unexpectedly and
> it may then be impossible to save config space because the device may be
> inaccessible (e.g. DPC) or config space may be corrupted. This results in
> saving corrupted values that get written back to the device during state
> restoration.
>
> Since commit a2f1e22390ac ("PCI/ERR: Ensure error recoverability at all times"),
> we now save the state of device at enumeration. On every restore we should
> either use the enumeration saved state or driver's intentional saved state,
> never a state saved at the unpredictable time of an error recovery reset.
The vfio driver calls pci_try_reset_function after pci_enable_device,
but before calling pci_store_saved_state. Won't this change, then, mean
that the PCI Command register will get restored to the wrong state with
the resources disabled?
On 2/17/2026 11:11 AM, Keith Busch wrote:
> On Tue, Feb 17, 2026 at 10:22:51AM -0800, Farhan Ali wrote:
>> The current reset process saves the device's config space state before
>> reset and restores it afterward. However errors may occur unexpectedly and
>> it may then be impossible to save config space because the device may be
>> inaccessible (e.g. DPC) or config space may be corrupted. This results in
>> saving corrupted values that get written back to the device during state
>> restoration.
>>
>> Since commit a2f1e22390ac ("PCI/ERR: Ensure error recoverability at all times"),
>> we now save the state of device at enumeration. On every restore we should
>> either use the enumeration saved state or driver's intentional saved state,
>> never a state saved at the unpredictable time of an error recovery reset.
> The vfio driver calls pci_try_reset_function after pci_enable_device,
> but before calling pci_store_saved_state. Won't this change, then, mean
> that the PCI Command register will get restored to the wrong state with
> the resources disabled?
Yes I think you are right, with this change the PCI Command register
gets restored to state at enumeration. So we will lose the updated state
after pci_clear_master() and pci_enable_device(). I think we can update
the vfio driver to call pci_save_state() after pci_enable_device()?
Thanks
Farhan
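The ordering problem discussed above can be modelled in user space. This is a minimal sketch, not kernel code: struct model_dev and the model_*() helpers are hypothetical names that track only the Command register, and only the bit values are taken from the PCI spec. It assumes the reset clears the Command register and that restore simply writes back whatever the saved-state buffer holds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Command register bits as defined by the PCI spec. */
#define CMD_IO     0x0001u
#define CMD_MEMORY 0x0002u

/* Hypothetical model: one live register plus one saved copy. */
struct model_dev {
	uint16_t command;       /* live Command register */
	uint16_t saved_command; /* contents of the saved-state buffer */
};

/* Enumeration: device is disabled, and that state is what gets saved. */
void model_enumerate(struct model_dev *d)
{
	assert(d != NULL);
	d->command = 0;
	d->saved_command = d->command;
}

/* Stand-in for pci_enable_device(): turn on I/O and memory decoding. */
void model_enable(struct model_dev *d)
{
	d->command |= CMD_IO | CMD_MEMORY;
}

/* Stand-in for an explicit pci_save_state() call by the driver. */
void model_save(struct model_dev *d)
{
	d->saved_command = d->command;
}

/*
 * Reset followed by restore. With save_before_reset == true this models
 * the existing behaviour (save at reset time); with false it models the
 * patch (restore whatever was saved earlier).
 */
void model_reset_and_restore(struct model_dev *d, bool save_before_reset)
{
	if (save_before_reset)
		model_save(d);
	d->command = 0;                /* assumed effect of the reset */
	d->command = d->saved_command; /* restore */
}
```

In this model, enumerate, enable, then reset without a save leaves the Command register at 0, which is the regression Keith describes. Calling model_save() after model_enable() keeps decoding enabled across the reset, which corresponds to the vfio change suggested above.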
On Tue, Feb 17, 2026 at 11:55:43AM -0800, Farhan Ali wrote:
>
> Yes I think you are right, with this change the PCI Command register gets
> restored to state at enumeration. So we will lose the updated state after
> pci_clear_master() and pci_enable_device(). I think we can update the vfio
> driver to call pci_save_state() after pci_enable_device()?
Either that, or move the pci_enable_device() call to after the function reset.
On Wed, Feb 18, 2026 at 12:02:01PM -0700, Keith Busch wrote:
> On Tue, Feb 17, 2026 at 11:55:43AM -0800, Farhan Ali wrote:
> >
> > Yes I think you are right, with this change the PCI Command
> > register gets restored to state at enumeration. So we will lose
> > the updated state after pci_clear_master() and
> > pci_enable_device(). I think we can update the vfio driver to call
> > pci_save_state() after pci_enable_device()?
>
> Either that, or move the pci_enable_device() call to after the
> function reset.
I kind of like the latter idea because it seems a little simpler for
the rule of thumb to be that a reset done by the PCI core returns the
device to the same state as when the driver first probed the device.
Drivers would generally not use pci_save_state() at all, and they
could share some initialization logic between probe and post-reset
recovery.
But I would really like to have Lukas's take on this. Clearly some
drivers would have to be adapted if we stop saving config space in the
PCI core reset path. We can take care of that for upstream drivers,
but it seems risky for out-of-tree drivers.
Bjorn
On 2/18/2026 11:35 AM, Bjorn Helgaas wrote:
> On Wed, Feb 18, 2026 at 12:02:01PM -0700, Keith Busch wrote:
>> On Tue, Feb 17, 2026 at 11:55:43AM -0800, Farhan Ali wrote:
>>> Yes I think you are right, with this change the PCI Command
>>> register gets restored to state at enumeration. So we will lose
>>> the updated state after pci_clear_master() and
>>> pci_enable_device(). I think we can update the vfio driver to call
>>> pci_save_state() after pci_enable_device()?
>> Either that, or move the pci_enable_device() call to after the
>> function reset.
> I kind of like the latter idea because it seems a little simpler for
> the rule of thumb to be that a reset done by the PCI core returns the
> device to the same state as when the driver first probed the device.
> Drivers would generally not use pci_save_state() at all, and they
> could share some initialization logic between probe and post-reset
> recovery.
I think the vfio-pci driver was intentionally doing the
pci_enable_device() before doing the reset. As per commit 9a92c5091a42
("vfio-pci: Enable device before attempting reset") it was done to
handle devices using PM reset, that were getting incorrectly identified
not supporting PM reset due to current state of the device not being D0.
It looks like pci_pm_reset() still returns -EINVAL if current power
state is not D0. So I think we can't move pci_enable_device() after
reset. Unless we want to update pci_pm_reset() to not use cached value
of current_state and read it directly from register?
Thanks
Farhan
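To illustrate the difference between checking the cached power state and reading it from the register, here is a rough user-space sketch. It is not the kernel's pci_pm_reset(): struct pm_model_dev and both pm_model_*() functions are hypothetical, the numeric state values mirror the kernel's pci_power_t ordering as an assumption, and the only spec-derived detail is that the PowerState field occupies bits 1:0 of PMCSR.

```c
#include <errno.h>
#include <stdint.h>

/* PowerState field of the PM Control/Status register (bits 1:0). */
#define PM_CTRL_STATE_MASK 0x0003u

/* Assumed state numbering for the model. */
enum pm_model_state {
	PM_MODEL_D0 = 0,
	PM_MODEL_D1 = 1,
	PM_MODEL_D2 = 2,
	PM_MODEL_D3HOT = 3,
	PM_MODEL_UNKNOWN = 5,
};

/* Hypothetical device: a cached state and the raw PMCSR value. */
struct pm_model_dev {
	int cached_state; /* what software last recorded */
	uint16_t pmcsr;   /* what a config read would return */
};

/* Precondition as described in the thread: trust the cached value. */
int pm_model_check_cached(const struct pm_model_dev *d)
{
	return d->cached_state == PM_MODEL_D0 ? 0 : -EINVAL;
}

/* Alternative: decode the state from the register itself. */
int pm_model_check_register(const struct pm_model_dev *d)
{
	/* All ones usually means the read failed, e.g. device unpowered. */
	if (d->pmcsr == 0xFFFFu)
		return -EINVAL;

	return (d->pmcsr & PM_CTRL_STATE_MASK) == PM_MODEL_D0 ? 0 : -EINVAL;
}
```

A device handed over with an unknown cached state but actually sitting in D0 fails the cached check and passes the register check. A device whose config space reads back all ones fails both, which matches the D3cold concern raised later in the thread.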
>
> But I would really like to have Lukas's take on this. Clearly some
> drivers would have to be adapted if we stop saving config space in the
> PCI core reset path. We can take care of that for upstream drivers,
> but it seems risky for out-of-tree drivers.
>
> Bjorn
On Wed, Feb 18, 2026 at 01:48:57PM -0800, Farhan Ali wrote:
> On 2/18/2026 11:35 AM, Bjorn Helgaas wrote:
> > On Wed, Feb 18, 2026 at 12:02:01PM -0700, Keith Busch wrote:
> > > On Tue, Feb 17, 2026 at 11:55:43AM -0800, Farhan Ali wrote:
> > > > Yes I think you are right, with this change the PCI Command
> > > > register gets restored to state at enumeration. So we will
> > > > lose the updated state after pci_clear_master() and
> > > > pci_enable_device(). I think we can update the vfio driver to
> > > > call pci_save_state() after pci_enable_device()?
> > >
> > > Either that, or move the pci_enable_device() call to after the
> > > function reset.
> >
> > I kind of like the latter idea because it seems a little simpler
> > for the rule of thumb to be that a reset done by the PCI core
> > returns the device to the same state as when the driver first
> > probed the device. Drivers would generally not use
> > pci_save_state() at all, and they could share some initialization
> > logic between probe and post-reset recovery.
>
> I think the vfio-pci driver was intentionally doing the
> pci_enable_device() before doing the reset. As per commit
> 9a92c5091a42 ("vfio-pci: Enable device before attempting reset") it
> was done to handle devices using PM reset, that were getting
> incorrectly identified not supporting PM reset due to current state
> of the device not being D0. It looks like pci_pm_reset() still
> returns -EINVAL if current power state is not D0. So I think we
> can't move pci_enable_device() after reset. Unless we want to update
> pci_pm_reset() to not use cached value of current_state and read it
> directly from register?
Devices are generally disabled at .probe() time, so that will be the
default saved state. But every driver will expect the device to be
enabled after the reset. Skipping the save state at reset time seems
like it would need a lot of work first and maybe it wouldn't ever be
practical. It wasn't really thought out; I was just hoping we could
simplify the save-state model and maybe unify driver reset and error
recovery paths. I think we need to drop this patch at least for now.
9a92c5091a42 ("vfio-pci: Enable device before attempting reset") was
mostly done to make pci_pm_reset() work, which requires the device to
be in D0. The main purpose of pci_enable_device() is to make device
BARs accessible; it *does* also put the device in D0 because BARs are
only accessible in D0, but pci_pm_reset() itself doesn't need the
BARs.
Other reset methods, e.g., FLR, don't seem to require the device to be
in D0, so I'm not sure why pci_pm_reset() requires that. I think the
critical piece is the D3->D0 transition, and maybe we could arrange
for that to happen even if the device is already in D1/D2/D3hot or
even D3cold.
On 2/18/2026 4:20 PM, Bjorn Helgaas wrote:
> On Wed, Feb 18, 2026 at 01:48:57PM -0800, Farhan Ali wrote:
>> On 2/18/2026 11:35 AM, Bjorn Helgaas wrote:
>>> On Wed, Feb 18, 2026 at 12:02:01PM -0700, Keith Busch wrote:
>>>> On Tue, Feb 17, 2026 at 11:55:43AM -0800, Farhan Ali wrote:
>>>>> Yes I think you are right, with this change the PCI Command
>>>>> register gets restored to state at enumeration. So we will
>>>>> lose the updated state after pci_clear_master() and
>>>>> pci_enable_device(). I think we can update the vfio driver to
>>>>> call pci_save_state() after pci_enable_device()?
>>>> Either that, or move the pci_enable_device() call to after the
>>>> function reset.
>>> I kind of like the latter idea because it seems a little simpler
>>> for the rule of thumb to be that a reset done by the PCI core
>>> returns the device to the same state as when the driver first
>>> probed the device. Drivers would generally not use
>>> pci_save_state() at all, and they could share some initialization
>>> logic between probe and post-reset recovery.
>> I think the vfio-pci driver was intentionally doing the
>> pci_enable_device() before doing the reset. As per commit
>> 9a92c5091a42 ("vfio-pci: Enable device before attempting reset") it
>> was done to handle devices using PM reset, that were getting
>> incorrectly identified not supporting PM reset due to current state
>> of the device not being D0. It looks like pci_pm_reset() still
>> returns -EINVAL if current power state is not D0. So I think we
>> can't move pci_enable_device() after reset. Unless we want to update
>> pci_pm_reset() to not use cached value of current_state and read it
>> directly from register?
> Devices are generally disabled at .probe() time, so that will be the
> default saved state. But every driver will expect the device to be
> enabled after the reset. Skipping the save state at reset time seems
> like it would need a lot of work first and maybe it wouldn't ever be
> practical. It wasn't really thought out; I was just hoping we could
> simplify the save-state model and maybe unify driver reset and error
> recovery paths. I think we need to drop this patch at least for now.
Yeah, I agree this patch might be too disruptive for drivers. In that
case would my previous version [1] to at least prevent saving state in
case of an error be acceptable? Or is there another approach we should
consider?
[1] https://lore.kernel.org/all/20260122194437.1903-4-alifm@linux.ibm.com/
>
> 9a92c5091a42 ("vfio-pci: Enable device before attempting reset") was
> mostly done to make pci_pm_reset() work, which requires the device to
> be in D0. The main purpose of pci_enable_device() is to make device
> BARs accessible; it *does* also put the device in D0 because BARs are
> only accessible in D0, but pci_pm_reset() itself doesn't need the
> BARs.
>
> Other reset methods, e.g., FLR, don't seem to require the device to be
> in D0, so I'm not sure why pci_pm_reset() requires that. I think the
> critical piece is the D3->D0 transition, and maybe we could arrange
> for that to happen even if the device is already in D1/D2/D3hot or
> even D3cold.
Looking at the PCI spec (v6.1) I didn't see any requirement for the
device to be in D0 state to perform a power state change. So I think we
should be able to transition from D1/D2/D3hot to D0. But IIUC if a
device is in D3cold, then won't register reads/writes fail till power is
available to the device?
Thanks
Farhan
On Thu, 19 Feb 2026 10:06:05 -0800
Farhan Ali <alifm@linux.ibm.com> wrote:
> On 2/18/2026 4:20 PM, Bjorn Helgaas wrote:
> > On Wed, Feb 18, 2026 at 01:48:57PM -0800, Farhan Ali wrote:
> >> On 2/18/2026 11:35 AM, Bjorn Helgaas wrote:
> >>> On Wed, Feb 18, 2026 at 12:02:01PM -0700, Keith Busch wrote:
> >>>> On Tue, Feb 17, 2026 at 11:55:43AM -0800, Farhan Ali wrote:
> >>>>> Yes I think you are right, with this change the PCI Command
> >>>>> register gets restored to state at enumeration. So we will
> >>>>> lose the updated state after pci_clear_master() and
> >>>>> pci_enable_device(). I think we can update the vfio driver to
> >>>>> call pci_save_state() after pci_enable_device()?
> >>>> Either that, or move the pci_enable_device() call to after the
> >>>> function reset.
> >>> I kind of like the latter idea because it seems a little simpler
> >>> for the rule of thumb to be that a reset done by the PCI core
> >>> returns the device to the same state as when the driver first
> >>> probed the device. Drivers would generally not use
> >>> pci_save_state() at all, and they could share some initialization
> >>> logic between probe and post-reset recovery.
> >> I think the vfio-pci driver was intentionally doing the
> >> pci_enable_device() before doing the reset. As per commit
> >> 9a92c5091a42 ("vfio-pci: Enable device before attempting reset") it
> >> was done to handle devices using PM reset, that were getting
> >> incorrectly identified not supporting PM reset due to current state
> >> of the device not being D0. It looks like pci_pm_reset() still
> >> returns -EINVAL if current power state is not D0. So I think we
> >> can't move pci_enable_device() after reset. Unless we want to update
> >> pci_pm_reset() to not use cached value of current_state and read it
> >> directly from register?
> > Devices are generally disabled at .probe() time, so that will be the
> > default saved state. But every driver will expect the device to be
> > enabled after the reset. Skipping the save state at reset time seems
> > like it would need a lot of work first and maybe it wouldn't ever be
> > practical. It wasn't really thought out; I was just hoping we could
> > simplify the save-state model and maybe unify driver reset and error
> > recovery paths. I think we need to drop this patch at least for now.
>
> Yeah, I agree this patch might be too disruptive for drivers. In that
> case would my previous version [1] to at least prevent saving state in
> case of an error be acceptable? Or is there another approach we should
> consider?
>
> [1] https://lore.kernel.org/all/20260122194437.1903-4-alifm@linux.ibm.com/
>
> >
> > 9a92c5091a42 ("vfio-pci: Enable device before attempting reset") was
> > mostly done to make pci_pm_reset() work, which requires the device to
> > be in D0. The main purpose of pci_enable_device() is to make device
> > BARs accessible; it *does* also put the device in D0 because BARs are
> > only accessible in D0, but pci_pm_reset() itself doesn't need the
> > BARs.
> >
> > Other reset methods, e.g., FLR, don't seem to require the device to be
> > in D0, so I'm not sure why pci_pm_reset() requires that. I think the
> > critical piece is the D3->D0 transition, and maybe we could arrange
> > for that to happen even if the device is already in D1/D2/D3hot or
> > even D3cold.
>
> Looking at the PCI spec (v6.1) I didn't see any requirement for the
> device to be in D0 state to perform a power state change. So I think we
> should be able to transition from D1/D2/D3hot to D0. But IIUC if a
> device is in D3cold, then won't register reads/writes fail till power is
> available to the device?
Yes, config space could be inaccessible in D3cold. IIRC, 9a92c5091a42
was specifically addressing that devices are typically provided to the
driver in the PCI_UNKNOWN state and at the time vfio-pci wasn't
changing that in the .probe function, like most drivers would, so we
needed to adjust the ordering of enabling the device versus calling
reset function.
Now that we've gained PM management in vfio-pci, that's no longer an
issue, but pci_pm_reset() does still require the device to arrive in
D0. Accepting devices arriving in D3cold or D3hot (with NoSoftReset-)
might avoid a power state bounce in some circumstances, but would not
have solved the original 9a92c5091a42 scenario where the device was in
PCI_UNKNOWN power state.
Sorry I missed my opportunity to reply to the suggestion for this
approach in the previous revision. I'm not sure if anything
specifically breaks with this approach to restore the initial device
state, but it's certainly not the contract I currently expect as a
user of the reset-function interfaces. I think that contract is
"reset the internal state of the device while saving and restoring
current config space". If we stray from that, what's the expectation
for things like resizable BARs? I don't think we want to reprovision
resources as a result of reset.
Here we seem to be worried about a specific, testable scenario where
config space might be inaccessible after an error, and we are applying the
workaround regardless of whether that specific scenario is present.
I don't see that a "test if config space is accessible and stuff the
original save state into the buffer rather than creating an invalid
save state" should be so complex as to require this simplification and
associated risk. Thanks,
Alex
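The "test accessibility, otherwise reuse the original saved state" idea could look roughly like the following user-space sketch. Everything here is hypothetical (struct cfg_model_dev, the cfg_model_*() helpers, the 16-dword size); it is not the kernel's pci_save_state(). The accessibility test assumes the common convention that a failed config read returns all ones, so a Vendor ID of 0xFFFF is treated as "device not reachable".

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CFG_WORDS 16 /* model only the first 64 bytes of config space */

/* Hypothetical device model. */
struct cfg_model_dev {
	uint32_t live[CFG_WORDS];       /* current register contents */
	uint32_t enum_state[CFG_WORDS]; /* state captured at enumeration */
	uint32_t saved[CFG_WORDS];      /* buffer used by restore */
	bool accessible;                /* false models DPC or similar */
};

/* A config read; an unreachable device returns all ones. */
uint32_t cfg_model_read(const struct cfg_model_dev *d, unsigned int i)
{
	return d->accessible ? d->live[i] : 0xFFFFFFFFu;
}

/* Vendor ID is the low 16 bits of dword 0; 0xFFFF is never valid. */
bool cfg_model_is_accessible(const struct cfg_model_dev *d)
{
	return (cfg_model_read(d, 0) & 0xFFFFu) != 0xFFFFu;
}

/*
 * Save before reset. If the device cannot be read, fall back to the
 * enumeration state instead of capturing all-ones garbage.
 */
void cfg_model_save_for_reset(struct cfg_model_dev *d)
{
	unsigned int i;

	if (!cfg_model_is_accessible(d)) {
		memcpy(d->saved, d->enum_state, sizeof(d->saved));
		return;
	}

	for (i = 0; i < CFG_WORDS; i++)
		d->saved[i] = cfg_model_read(d, i);
}
```

With this shape, the normal path still saves and restores the current config space, so the contract described above is unchanged; only the inaccessible case differs.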