The current reset process saves the device's config space state before
reset and restores it afterward. However, when a device is in an error
state before reset, config space reads may return error values instead of
valid data. This results in saving corrupted values that get written back
to the device during state restoration.
Avoid saving the config space state when the device is in an error
state. When restoring without a valid saved state, restore only what
can be reconstructed from kernel data (such as the BARs) or what does
not depend on the saved state.
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
drivers/pci/pci.c | 25 ++++++++++++++++++++++---
drivers/pci/pcie/aer.c | 3 +++
drivers/pci/pcie/dpc.c | 3 +++
drivers/pci/pcie/ptm.c | 3 +++
drivers/pci/tph.c | 3 +++
drivers/pci/vc.c | 3 +++
6 files changed, 37 insertions(+), 3 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b0f4d98036cd..a3d93d1baee7 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1720,6 +1720,9 @@ static void pci_restore_pcie_state(struct pci_dev *dev)
struct pci_cap_saved_state *save_state;
u16 *cap;
+ if (!dev->state_saved)
+ return;
+
/*
* Restore max latencies (in the LTR capability) before enabling
* LTR itself in PCI_EXP_DEVCTL2.
@@ -1775,6 +1778,9 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
struct pci_cap_saved_state *save_state;
u16 *cap;
+ if (!dev->state_saved)
+ return;
+
save_state = pci_find_saved_cap(dev, PCI_CAP_ID_PCIX);
pos = pci_find_capability(dev, PCI_CAP_ID_PCIX);
if (!save_state || !pos)
@@ -1792,6 +1798,14 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
int pci_save_state(struct pci_dev *dev)
{
int i;
+ u32 val;
+
+ pci_read_config_dword(dev, PCI_COMMAND, &val);
+ if (PCI_POSSIBLE_ERROR(val)) {
+ pci_warn(dev, "Device config space inaccessible, will only be partially restored\n");
+ return -EIO;
+ }
+
/* XXX: 100% dword access ok here? */
for (i = 0; i < 16; i++) {
pci_read_config_dword(dev, i * 4, &dev->saved_config_space[i]);
@@ -1854,6 +1868,14 @@ static void pci_restore_config_space_range(struct pci_dev *pdev,
static void pci_restore_config_space(struct pci_dev *pdev)
{
+ if (!pdev->state_saved) {
+ pci_warn(pdev, "No saved config space, restoring BARs\n");
+ pci_restore_bars(pdev);
+ pci_write_config_word(pdev, PCI_COMMAND,
+ PCI_COMMAND_MEMORY | PCI_COMMAND_IO);
+ return;
+ }
+
if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
pci_restore_config_space_range(pdev, 10, 15, 0, false);
/* Restore BARs before the command register. */
@@ -1906,9 +1928,6 @@ static void pci_restore_rebar_state(struct pci_dev *pdev)
*/
void pci_restore_state(struct pci_dev *dev)
{
- if (!dev->state_saved)
- return;
-
pci_restore_pcie_state(dev);
pci_restore_pasid_state(dev);
pci_restore_pri_state(dev);
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e286c197d716..e6023ffbf94d 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -361,6 +361,9 @@ void pci_restore_aer_state(struct pci_dev *dev)
if (!aer)
return;
+ if (!dev->state_saved)
+ return;
+
save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_ERR);
if (!save_state)
return;
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index fc18349614d7..e0d7034c2cd8 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -67,6 +67,9 @@ void pci_restore_dpc_state(struct pci_dev *dev)
if (!pci_is_pcie(dev))
return;
+ if (!dev->state_saved)
+ return;
+
save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_DPC);
if (!save_state)
return;
diff --git a/drivers/pci/pcie/ptm.c b/drivers/pci/pcie/ptm.c
index 65e4b008be00..78613000acfb 100644
--- a/drivers/pci/pcie/ptm.c
+++ b/drivers/pci/pcie/ptm.c
@@ -112,6 +112,9 @@ void pci_restore_ptm_state(struct pci_dev *dev)
if (!ptm)
return;
+ if (!dev->state_saved)
+ return;
+
save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_PTM);
if (!save_state)
return;
diff --git a/drivers/pci/tph.c b/drivers/pci/tph.c
index cc64f93709a4..c0fa1b9a7a78 100644
--- a/drivers/pci/tph.c
+++ b/drivers/pci/tph.c
@@ -435,6 +435,9 @@ void pci_restore_tph_state(struct pci_dev *pdev)
if (!pdev->tph_enabled)
return;
+ if (!pdev->state_saved)
+ return;
+
save_state = pci_find_saved_ext_cap(pdev, PCI_EXT_CAP_ID_TPH);
if (!save_state)
return;
diff --git a/drivers/pci/vc.c b/drivers/pci/vc.c
index a4ff7f5f66dd..00609c7e032a 100644
--- a/drivers/pci/vc.c
+++ b/drivers/pci/vc.c
@@ -391,6 +391,9 @@ void pci_restore_vc_state(struct pci_dev *dev)
{
int i;
+ if (!dev->state_saved)
+ return;
+
for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
int pos;
struct pci_cap_saved_state *save_state;
--
2.43.0
On Wed, Sep 24, 2025 at 10:16:19AM -0700, Farhan Ali wrote:
> @@ -1792,6 +1798,14 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
> int pci_save_state(struct pci_dev *dev)
> {
> int i;
> + u32 val;
> +
> + pci_read_config_dword(dev, PCI_COMMAND, &val);
> + if (PCI_POSSIBLE_ERROR(val)) {
> + pci_warn(dev, "Device config space inaccessible, will only be partially restored\n");
> + return -EIO;
Should it set `dev->state_saved` to `false`, to be on the safe side?
Not sure whether we run a risk of restoring an old, outdated state otherwise.
> + }
> +
> /* XXX: 100% dword access ok here? */
> for (i = 0; i < 16; i++) {
> pci_read_config_dword(dev, i * 4, &dev->saved_config_space[i]);
> @@ -1854,6 +1868,14 @@ static void pci_restore_config_space_range(struct pci_dev *pdev,
>
> static void pci_restore_config_space(struct pci_dev *pdev)
> {
> + if (!pdev->state_saved) {
> + pci_warn(pdev, "No saved config space, restoring BARs\n");
> + pci_restore_bars(pdev);
> + pci_write_config_word(pdev, PCI_COMMAND,
> + PCI_COMMAND_MEMORY | PCI_COMMAND_IO);
Is this really something that ought to be universally enabled? I thought this
depends on whether attached resources are IO and/or MEM?
int pci_enable_resources(struct pci_dev *dev, int mask)
{
...
pci_dev_for_each_resource(dev, r, i) {
...
if (r->flags & IORESOURCE_IO)
cmd |= PCI_COMMAND_IO;
if (r->flags & IORESOURCE_MEM)
cmd |= PCI_COMMAND_MEMORY;
}
...
}
Also IIRC, especially on s390, we never have IO resources?
int zpci_setup_bus_resources(struct zpci_dev *zdev)
{
...
for (i = 0; i < PCI_STD_NUM_BARS; i++) {
...
/* only MMIO is supported */
flags = IORESOURCE_MEM;
if (zdev->bars[i].val & 8)
flags |= IORESOURCE_PREFETCH;
if (zdev->bars[i].val & 4)
flags |= IORESOURCE_MEM_64;
...
}
...
}
So I guess this would have to have some form of the same logic as in
`pci_enable_resources()`, after restoring the BARs.
Or am I missing something?
> + return;
> + }
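For illustration, a rough and untested sketch of what I mean; the helper
name pci_restore_cmd_from_resources() is made up and nothing in the tree:
static void pci_restore_cmd_from_resources(struct pci_dev *pdev)
{
	struct resource *r;
	u16 cmd = 0;
	int i;

	pci_restore_bars(pdev);

	/* mirror pci_enable_resources(): only enable decodes that are
	 * actually backed by a resource */
	pci_dev_for_each_resource(pdev, r, i) {
		if (!r->flags)
			continue;
		if (r->flags & IORESOURCE_IO)
			cmd |= PCI_COMMAND_IO;
		if (r->flags & IORESOURCE_MEM)
			cmd |= PCI_COMMAND_MEMORY;
	}

	pci_write_config_word(pdev, PCI_COMMAND, cmd);
}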
--
Best Regards, Benjamin Block / Linux on IBM Z Kernel Development
IBM Deutschland Research & Development GmbH / https://www.ibm.com/privacy
Vors. Aufs.-R.: Wolfgang Wendt / Geschäftsführung: David Faller
Sitz der Ges.: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
On 10/1/2025 8:15 AM, Benjamin Block wrote:
> On Wed, Sep 24, 2025 at 10:16:19AM -0700, Farhan Ali wrote:
>> @@ -1792,6 +1798,14 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
>> int pci_save_state(struct pci_dev *dev)
>> {
>> int i;
>> + u32 val;
>> +
>> + pci_read_config_dword(dev, PCI_COMMAND, &val);
>> + if (PCI_POSSIBLE_ERROR(val)) {
>> + pci_warn(dev, "Device config space inaccessible, will only be partially restored\n");
>> + return -EIO;
> Should it set `dev->state_saved` to `false`, to be on the save side?
> Not sure whether we run a risk of restoring an old, outdated state otherwise.
AFAIU if the state_saved flag was set to true then any state that we
have saved should be valid and should be okay to be restored from. We
just want to avoid saving any invalid data.
>
>> + }
>> +
>> /* XXX: 100% dword access ok here? */
>> for (i = 0; i < 16; i++) {
>> pci_read_config_dword(dev, i * 4, &dev->saved_config_space[i]);
>> @@ -1854,6 +1868,14 @@ static void pci_restore_config_space_range(struct pci_dev *pdev,
>>
>> static void pci_restore_config_space(struct pci_dev *pdev)
>> {
>> + if (!pdev->state_saved) {
>> + pci_warn(pdev, "No saved config space, restoring BARs\n");
>> + pci_restore_bars(pdev);
>> + pci_write_config_word(pdev, PCI_COMMAND,
>> + PCI_COMMAND_MEMORY | PCI_COMMAND_IO);
> Is this really something that ought to be universally enabled? I thought this
> depends on whether attached resources are IO and/or MEM?
>
> int pci_enable_resources(struct pci_dev *dev, int mask)
> {
> ...
> pci_dev_for_each_resource(dev, r, i) {
> ...
> if (r->flags & IORESOURCE_IO)
> cmd |= PCI_COMMAND_IO;
> if (r->flags & IORESOURCE_MEM)
> cmd |= PCI_COMMAND_MEMORY;
> }
> ...
> }
>
> Also IIRC, especially on s390, we never have IO resources?
>
> int zpci_setup_bus_resources(struct zpci_dev *zdev)
> {
> ...
> for (i = 0; i < PCI_STD_NUM_BARS; i++) {
> ...
> /* only MMIO is supported */
> flags = IORESOURCE_MEM;
> if (zdev->bars[i].val & 8)
> flags |= IORESOURCE_PREFETCH;
> if (zdev->bars[i].val & 4)
> flags |= IORESOURCE_MEM_64;
> ...
> }
> ...
> }
>
> So I guess this would have to have some form of the same logic as in
> `pci_enable_resources()`, after restoring the BARs.
>
> Or am I missing something?
As per my understanding of the spec, setting both I/O Space and Memory
Space should be safe. The spec also mentions if a function doesn't
support IO/Memory space access it could hardwire the bit to zero. We
could add the logic to iterate through all the resources and set the
bits accordingly, but since this is a best-effort restoration it should
be fine?
Also I didn't see any issues testing on s390x with the NVMe, RoCE and
NETD devices, but I could have missed something.
Thanks
Farhan
>
>> + return;
>> + }
On Wed, Oct 01, 2025 at 10:12:03AM -0700, Farhan Ali wrote:
> AFAIU if the state_saved flag was set to true then any state that we have
> saved should be valid and should be okay to be restored from. We just want
> to avoid saving any invalid data.
The state_saved flag is used by the PCI core to detect whether a driver
has called pci_save_state() in one of its suspend callbacks. If it did,
the PCI core assumes that the driver has taken on the responsibility to
put the device into a low power state. The PCI core will thus not put
the device into a low power state itself and it won't (again) call
pci_save_state().
Hence state_saved is cleared before the driver suspend callbacks are
invoked and it is checked afterwards.
Clearing the state_saved flag in pci_restore_state() merely serves the
purpose of ensuring that the flag is cleared ahead of the next suspend
and resume cycle.
It is a fallacy to think that state_saved indicates validity of the
saved state.
Unfortunately pci_restore_state() was amended by c82f63e411f1 to
bail out if state_saved is false. This has arguably caused more
problems than it solved, so I have prepared this development branch
which essentially reverts the commit and undoes most of the awful
workarounds that it necessitated:
https://github.com/l1k/linux/commits/aer_reset_v1
I intend to submit this after the merge window has closed.
The motivation of c82f63e411f1 was to prevent restoring state if
pci_save_state() hasn't been called before. I am solving that by
calling pci_save_state() on device addition, hence error
recoverability is ensured at all times.
I believe this also makes patch [01/10] in your series unnecessary.
A lot of drivers call pci_save_state() in their probe hook and
that continues to be correct if they modified Config Space
vis-a-vis what was saved on device addition.
Thanks,
Lukas
On 10/4/2025 7:54 AM, Lukas Wunner wrote:
> On Wed, Oct 01, 2025 at 10:12:03AM -0700, Farhan Ali wrote:
>> AFAIU if the state_saved flag was set to true then any state that we have
>> saved should be valid and should be okay to be restored from. We just want
>> to avoid saving any invalid data.
> The state_saved flag is used by the PCI core to detect whether a driver
> has called pci_save_state() in one of its suspend callbacks. If it did,
> the PCI core assumes that the driver has taken on the responsibility to
> put the device into a low power state. The PCI core will thus not put
> the device into a low power state itself and it won't (again) call
> pci_save_state().
>
> Hence state_saved is cleared before the driver suspend callbacks are
> invoked and it is checked afterwards.
>
> Clearing the state_saved flag in pci_restore_state() merely serves the
> purpose of ensuring that the flag is cleared ahead of the next suspend
> and resume cycle.
>
> It is a fallacy to think that state_saved indicates validity of the
> saved state.
Hi Lukas,
Thanks for the detailed explanation, this was very helpful for me.
> Unfortunately pci_restore_state() was amended by c82f63e411f1 to
> bail out if state_saved is false. This has arguably caused more
> problems than it solved, so I have prepared this development branch
> which essentially reverts the commit and undoes most of the awful
> workarounds that it necessitated:
>
> https://github.com/l1k/linux/commits/aer_reset_v1
>
> I intend to submit this after the merge window has closed.
>
> The motivation of c82f63e411f1 was to prevent restoring state if
> pci_save_state() hasn't been called before. I am solving that by
> calling pci_save_state() on device addition, hence error
> recoverability is ensured at all times.
>
> I believe this also makes patch [01/10] in your series unnecessary.
I tested your patches + patches 2-10 of this series. It unfortunately
didn't completely help with the s390x use case. We still need the check
in pci_save_state() from this patch to make sure we are not saving error
values, which can be written back to the device in pci_restore_state().
As part of the error recovery userspace can use the VFIO_DEVICE_RESET to
reset the device (pci_try_reset_function()). The function call for this
is:
pci_dev_save_and_disable <https://elixir.bootlin.com/linux/v6.17.1/C/ident/pci_dev_save_and_disable>();
__pci_reset_function_locked <https://elixir.bootlin.com/linux/v6.17.1/C/ident/__pci_reset_function_locked>();
pci_dev_restore <https://elixir.bootlin.com/linux/v6.17.1/C/ident/pci_dev_restore>();
So we can end up overwriting the initial saved state (added by you in
pci_bus_add_device()). Do we need to update the pci_dev_save_and_disable()
not to save the state?
Thanks
Farhan
>
> A lot of drivers call pci_save_state() in their probe hook and
> that continues to be correct if they modified Config Space
> vis-a-vis what was saved on device addition.
>
> Thanks,
>
> Lukas
On Mon, Oct 06, 2025 at 10:54:51AM -0700, Farhan Ali wrote:
> On 10/4/2025 7:54 AM, Lukas Wunner wrote:
> > I believe this also makes patch [01/10] in your series unnecessary.
>
> I tested your patches + patches 2-10 of this series. It unfortunately didn't
> completely help with the s390x use case. We still need the check to in
> pci_save_state() from this patch to make sure we are not saving error
> values, which can be written back to the device in pci_restore_state().
What's the caller of pci_save_state() that needs this?
Can you move the check for PCI_POSSIBLE_ERROR() to the caller?
I think plenty of other callers don't need this, so it adds
extra overhead for them and down the road it'll be difficult
to untangle which caller needs it and which doesn't.
> As part of the error recovery userspace can use the VFIO_DEVICE_RESET to
> reset the device (pci_try_reset_function()). The function call for this is:
>
> pci_dev_save_and_disable <https://elixir.bootlin.com/linux/v6.17.1/C/ident/pci_dev_save_and_disable>();
>
> __pci_reset_function_locked <https://elixir.bootlin.com/linux/v6.17.1/C/ident/__pci_reset_function_locked>();
>
> pci_dev_restore <https://elixir.bootlin.com/linux/v6.17.1/C/ident/pci_dev_restore>();
>
> So we can end up overwriting the initial saved state (added by you in
> pci_bus_add_device()). Do we need to update the pci_dev_save_and_disable()
> not to save the state?
The state saved on device addition is just the initial state and
it is fine if later on it gets updated (which is a nicer term than
"overwritten"). E.g. when portdrv.c instantiates port services
and drivers are bound to them, various registers in Config Space
are changed, hence pcie_portdrv_probe() calls pci_save_state()
again.
However we can discuss whether pci_save_state() is still needed
in pci_dev_save_and_disable().
Thanks,
Lukas
On 10/6/2025 12:26 PM, Lukas Wunner wrote:
> On Mon, Oct 06, 2025 at 10:54:51AM -0700, Farhan Ali wrote:
>> On 10/4/2025 7:54 AM, Lukas Wunner wrote:
>>> I believe this also makes patch [01/10] in your series unnecessary.
>> I tested your patches + patches 2-10 of this series. It unfortunately didn't
>> completely help with the s390x use case. We still need the check to in
>> pci_save_state() from this patch to make sure we are not saving error
>> values, which can be written back to the device in pci_restore_state().
> What's the caller of pci_save_state() that needs this?
>
> Can you move the check for PCI_POSSIBLE_ERROR() to the caller?
> I think plenty of other callers don't need this, so it adds
> extra overhead for them and down the road it'll be difficult
> to untangle which caller needs it and which doesn't.
The caller would be pci_dev_save_and_disable(). Are you suggesting
moving the PCI_POSSIBLE_ERROR() check prior to calling pci_save_state()?
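(For illustration only, something along these lines, with the existing
reset_prepare and disable steps of pci_dev_save_and_disable() elided;
this is a sketch, not the actual function body:)
static void pci_dev_save_and_disable(struct pci_dev *dev)
{
	u32 val;

	/* ... notify the driver via ->reset_prepare() as today ... */

	/* hypothetical: skip saving state if config space reads fail */
	pci_read_config_dword(dev, PCI_COMMAND, &val);
	if (!PCI_POSSIBLE_ERROR(val))
		pci_save_state(dev);
	else
		pci_warn(dev, "Config space inaccessible, not saving state\n");

	/* ... disable the device as today ... */
}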
>
>> As part of the error recovery userspace can use the VFIO_DEVICE_RESET to
>> reset the device (pci_try_reset_function()). The function call for this is:
>>
>> pci_dev_save_and_disable <https://elixir.bootlin.com/linux/v6.17.1/C/ident/pci_dev_save_and_disable>();
>>
>> __pci_reset_function_locked <https://elixir.bootlin.com/linux/v6.17.1/C/ident/__pci_reset_function_locked>();
>>
>> pci_dev_restore
>> <https://elixir.bootlin.com/linux/v6.17.1/C/ident/pci_dev_restore>();
>>
>> So we can end up overwriting the initial saved state (added by you in
>> pci_bus_add_device()). Do we need to update the pci_dev_save_and_disable()
>> not to save the state?
> The state saved on device addition is just the initial state and
> it is fine if later on it gets updated (which is a nicer term than
> "overwritten"). E.g. when portdrv.c instantiates port services
> and drivers are bound to them, various registers in Config Space
> are changed, hence pcie_portdrv_probe() calls pci_save_state()
> again.
>
> However we can discuss whether pci_save_state() is still needed
> in pci_dev_save_and_disable().
The commit 8dd7f8036c12 ("PCI: add support for function level reset")
introduced the logic of saving/restoring the device state after an FLR.
My assumption is it was done to save the most recent state of the device
(as the state could be updated by drivers). So I think it would still
make sense to save the device state in pci_dev_save_and_disable() if the
Config Space is still accessible?
Thanks
Farhan
On Mon, Oct 06, 2025 at 02:35:49PM -0700, Farhan Ali wrote:
> On 10/6/2025 12:26 PM, Lukas Wunner wrote:
> > On Mon, Oct 06, 2025 at 10:54:51AM -0700, Farhan Ali wrote:
> > > On 10/4/2025 7:54 AM, Lukas Wunner wrote:
> > > > I believe this also makes patch [01/10] in your series unnecessary.
> > > I tested your patches + patches 2-10 of this series. It unfortunately didn't
> > > completely help with the s390x use case. We still need the check to in
> > > pci_save_state() from this patch to make sure we are not saving error
> > > values, which can be written back to the device in pci_restore_state().
> > What's the caller of pci_save_state() that needs this?
> >
> > Can you move the check for PCI_POSSIBLE_ERROR() to the caller?
> > I think plenty of other callers don't need this, so it adds
> > extra overhead for them and down the road it'll be difficult
> > to untangle which caller needs it and which doesn't.
>
> The caller would be pci_dev_save_and_disable(). Are you suggesting moving
> the PCI_POSSIBLE_ERROR() prior to calling pci_save_state()?
I'm not sure yet. Let's back up a little: I'm missing an
architectural description how you're intending to do error
recovery in the VM. If I understand correctly, you're
informing the VM of the error via the ->error_detected() callback.
You're saying you need to check for accessibility of the device
prior to resetting it from the VM, does that mean you're attempting
a reset from the ->error_detected() callback?
According to Documentation/PCI/pci-error-recovery.rst, the device
isn't supposed to be considered accessible in ->error_detected().
The first callback which allows access is ->mmio_enabled().
I also don't quite understand why the VM needs to perform a reset.
Why can't you just let the VM tell the host that a reset is needed
(PCI_ERS_RESULT_NEED_RESET) and then the host resets the device on
behalf of the VM?
Furthermore, I'm thinking that you should be using pci_channel_offline()
to detect accessibility of the device, rather than reading from
Config Space and checking for PCI_POSSIBLE_ERROR().
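(Sketch only, assuming the check stays at the top of pci_save_state();
pci_channel_offline() keys off dev->error_state instead of reading
Config Space:)
	if (pci_channel_offline(dev)) {
		pci_warn(dev, "Device in error state, not saving config space\n");
		return -EIO;
	}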
> > The state saved on device addition is just the initial state and
> > it is fine if later on it gets updated (which is a nicer term than
> > "overwritten"). E.g. when portdrv.c instantiates port services
> > and drivers are bound to them, various registers in Config Space
> > are changed, hence pcie_portdrv_probe() calls pci_save_state()
> > again.
> >
> > However we can discuss whether pci_save_state() is still needed
> > in pci_dev_save_and_disable().
>
> The commit 8dd7f8036c12 ("PCI: add support for function level reset")
> introduced the logic of saving/restoring the device state after an FLR. My
> assumption is it was done to save the most recent state of the device (as
> the state could be updated by drivers). So I think it would still make sense
> to save the device state in pci_dev_save_and_disable() if the Config Space
> is still accessible?
Yes, right now we can't assume that drivers call pci_save_state()
in their probe hook if they modified Config Space. They may rely
on the state being saved prior to reset or a D3hot/D3cold transition.
So we need to keep the pci_dev_save_and_disable() call for now.
Generally the expectation is that Config Space is accessible when
performing a reset with pci_try_reset_function(). Since that's
apparently not guaranteed for your use case, I'm wondering if you
might be using the function in a context it's not supposed to be used.
Thanks,
Lukas
On 10/8/2025 6:34 AM, Lukas Wunner wrote:
> On Mon, Oct 06, 2025 at 02:35:49PM -0700, Farhan Ali wrote:
>> On 10/6/2025 12:26 PM, Lukas Wunner wrote:
>>> On Mon, Oct 06, 2025 at 10:54:51AM -0700, Farhan Ali wrote:
>>>> On 10/4/2025 7:54 AM, Lukas Wunner wrote:
>>>>> I believe this also makes patch [01/10] in your series unnecessary.
>>>> I tested your patches + patches 2-10 of this series. It unfortunately didn't
>>>> completely help with the s390x use case. We still need the check to in
>>>> pci_save_state() from this patch to make sure we are not saving error
>>>> values, which can be written back to the device in pci_restore_state().
>>> What's the caller of pci_save_state() that needs this?
>>>
>>> Can you move the check for PCI_POSSIBLE_ERROR() to the caller?
>>> I think plenty of other callers don't need this, so it adds
>>> extra overhead for them and down the road it'll be difficult
>>> to untangle which caller needs it and which doesn't.
>> The caller would be pci_dev_save_and_disable(). Are you suggesting moving
>> the PCI_POSSIBLE_ERROR() prior to calling pci_save_state()?
> I'm not sure yet. Let's back up a little: I'm missing an
> architectural description how you're intending to do error
> recovery in the VM. If I understand correctly, you're
> informing the VM of the error via the ->error_detected() callback.
>
> You're saying you need to check for accessibility of the device
> prior to resetting it from the VM, does that mean you're attempting
> a reset from the ->error_detected() callback?
>
> According to Documentation/PCI/pci-error-recovery.rst, the device
> isn't supposed to be considered accessible in ->error_detected().
> The first callback which allows access is ->mmio_enabled().
>
> I also don't quite understand why the VM needs to perform a reset.
> Why can't you just let the VM tell the host that a reset is needed
> (PCI_ERS_RESULT_NEED_RESET) and then the host resets the device on
> behalf of the VM?
The ->error_detected() callback is used to inform userspace of an error.
In the case of a VM, using QEMU as a userspace, once notified of an
error QEMU will inject an error into the guest in s390x architecture
specific way [1] (probably should have linked the QEMU series in the
cover letter). Once notified of the error VM's device driver will drive
the recovery action. The recovery action require a reset of the device
and on s390x PCI devices are reset using architecture specific
instructions (zpci_device_hot_reset()). QEMU will intercept any low
level recovery instructions from the VM and then perform a reset of
device on behalf of the VM [2]. QEMU performs a reset by invoking the
VFIO_DEVICE_RESET ioctl, which attempts to reset the device
using pci_try_reset_function().
Once a device is in an error state, MMIO to the device is blocked and so
any PCI reads to the Config Space will return -1. Since
pci_try_reset_function() will try to save the state of device's Config
Space with pci_dev_save_and_disable(), it will end up saving -1 as the
state. Later when we try to restore the state after a reset, we end up
corrupting device registers which can send the device into an error
state again. I was trying to avoid this with the patch.
Hopefully, this answers your questions.
[1] QEMU series
https://lore.kernel.org/all/20250925174852.1302-1-alifm@linux.ibm.com/
[2] v1 patch series discussion on some nuances of reset mechanism
https://lore.kernel.org/all/20250814145743.204ca19a.alex.williamson@redhat.com/
>
> Furthermore, I'm thinking that you should be using pci_channel_offline()
> to detect accessibility of the device, rather than reading from
> Config Space and checking for PCI_POSSIBLE_ERROR().
>
>>> The state saved on device addition is just the initial state and
>>> it is fine if later on it gets updated (which is a nicer term than
>>> "overwritten"). E.g. when portdrv.c instantiates port services
>>> and drivers are bound to them, various registers in Config Space
>>> are changed, hence pcie_portdrv_probe() calls pci_save_state()
>>> again.
>>>
>>> However we can discuss whether pci_save_state() is still needed
>>> in pci_dev_save_and_disable().
>> The commit 8dd7f8036c12 ("PCI: add support for function level reset")
>> introduced the logic of saving/restoring the device state after an FLR. My
>> assumption is it was done to save the most recent state of the device (as
>> the state could be updated by drivers). So I think it would still make sense
>> to save the device state in pci_dev_save_and_disable() if the Config Space
>> is still accessible?
> Yes, right now we can't assume that drivers call pci_save_state()
> in their probe hook if they modified Config Space. They may rely
> on the state being saved prior to reset or a D3hot/D3cold transition.
> So we need to keep the pci_dev_save_and_disable() call for now.
>
> Generally the expectation is that Config Space is accessible when
> performing a reset with pci_try_reset_function(). Since that's
> apparently not guaranteed for your use case, I'm wondering if you
> might be using the function in a context it's not supposed to be used.
I am open to suggestions on how we can do this.
Thanks
Farhan
On Wed, Oct 08, 2025 at 10:56:35AM -0700, Farhan Ali wrote:
> On 10/8/2025 6:34 AM, Lukas Wunner wrote:
> > I'm not sure yet. Let's back up a little: I'm missing an
> > architectural description how you're intending to do error
> > recovery in the VM. If I understand correctly, you're
> > informing the VM of the error via the ->error_detected() callback.
> >
> > You're saying you need to check for accessibility of the device
> > prior to resetting it from the VM, does that mean you're attempting
> > a reset from the ->error_detected() callback?
> >
> > According to Documentation/PCI/pci-error-recovery.rst, the device
> > isn't supposed to be considered accessible in ->error_detected().
> > The first callback which allows access is ->mmio_enabled().
> >
>
> The ->error_detected() callback is used to inform userspace of an error. In
> the case of a VM, using QEMU as a userspace, once notified of an error QEMU
> will inject an error into the guest in s390x architecture specific way [1]
> (probably should have linked the QEMU series in the cover letter). Once
> notified of the error VM's device driver will drive the recovery action. The
> recovery action require a reset of the device and on s390x PCI devices are
> reset using architecture specific instructions (zpci_device_hot_reset()).
According to Documentation/PCI/pci-error-recovery.rst:
"STEP 1: Notification
--------------------
Platform calls the error_detected() callback on every instance of
every driver affected by the error.
At this point, the device might not be accessible anymore, [...]
it gives the driver a chance to cleanup, waiting for pending stuff
(timers, whatever, etc...) to complete; it can take semaphores,
schedule, etc... everything but touch the device."
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
And yet you're touching the device by trying to reset it.
The code you're introducing in patch [01/10] only becomes necessary
because you're not following the above-quoted protocol. If you
follow the protocol, patch [01/10] becomes unnecessary.
> > I also don't quite understand why the VM needs to perform a reset.
> > Why can't you just let the VM tell the host that a reset is needed
> > (PCI_ERS_RESULT_NEED_RESET) and then the host resets the device on
> > behalf of the VM?
Could you answer this question please?
Thanks,
Lukas
On Wed, 2025-10-08 at 20:14 +0200, Lukas Wunner wrote:
> On Wed, Oct 08, 2025 at 10:56:35AM -0700, Farhan Ali wrote:
> > On 10/8/2025 6:34 AM, Lukas Wunner wrote:
> > > I'm not sure yet. Let's back up a little: I'm missing an
> > > architectural description how you're intending to do error
> > > recovery in the VM. If I understand correctly, you're
> > > informing the VM of the error via the ->error_detected() callback.
> > >
> > > You're saying you need to check for accessibility of the device
> > > prior to resetting it from the VM, does that mean you're attempting
> > > a reset from the ->error_detected() callback?
> > >
> > > According to Documentation/PCI/pci-error-recovery.rst, the device
> > > isn't supposed to be considered accessible in ->error_detected().
> > > The first callback which allows access is ->mmio_enabled().
> > >
> >
> > The ->error_detected() callback is used to inform userspace of an error. In
> > the case of a VM, using QEMU as a userspace, once notified of an error QEMU
> > will inject an error into the guest in s390x architecture specific way [1]
> > (probably should have linked the QEMU series in the cover letter). Once
> > notified of the error VM's device driver will drive the recovery action. The
> > recovery action require a reset of the device and on s390x PCI devices are
> > reset using architecture specific instructions (zpci_device_hot_reset()).
>
> According to Documentation/PCI/pci-error-recovery.rst:
>
> "STEP 1: Notification
> --------------------
> Platform calls the error_detected() callback on every instance of
> every driver affected by the error.
> At this point, the device might not be accessible anymore, [...]
> it gives the driver a chance to cleanup, waiting for pending stuff
> (timers, whatever, etc...) to complete; it can take semaphores,
> schedule, etc... everything but touch the device."
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> And yet you're touching the device by trying to reset it.
>
> The code you're introducing in patch [01/10] only becomes necessary
> because you're not following the above-quoted protocol. If you
> follow the protocol, patch [01/10] becomes unnecessary.
>
I agree with your point above that error_detected() should not touch the
device. My understanding of Farhan's series though is that it follows
that rule. As I understand it error_detected() is only used to inject
the s390 specific PCI error event into the VM using the information
stored in patch 7. As before vfio-pci returns
PCI_ERS_RESULT_CAN_RECOVER from error_detected() but then with patch 7
the pass-through case is detected and this gets turned into
PCI_ERS_RESULT_RECOVERED and the rest of the s390 recovery code gets
skipped. And yeah, writing it down I'm not super happy with this part,
maybe it would be better to have an explicit
PCI_ERS_RESULT_LEAVE_AS_IS.
Either way this leaves the PCI device in the error state just like for
the host the platform leaves the device in the error state. Up until
this point even if the VM/QEMU tried to do a reset already it would get
blocked on at least the zdev->state_lock until the recovery code is
done. Only after that would the VM run its recovery code and with that
drive the reset.
> > > I also don't quite understand why the VM needs to perform a reset.
> > > Why can't you just let the VM tell the host that a reset is needed
> > > (PCI_ERS_RESULT_NEED_RESET) and then the host resets the device on
> > > behalf of the VM?
The reason is that we want the behavior from the VM's perspective to
follow s390's PCI error event handling architecture. In this model
however there is no mechanism to synchronously ask the OS "An error
occurred, do you want the device reset?" or to tell it that we as
hypervisor already unblocked MMIO/DMA or performed a reset. So instead
our idea was that we just do the error_detected() part in the host's
recovery code and then leave the device as is, driving the rest from
the guest.
Thanks,
Niklas
On Thu, Oct 09, 2025 at 11:12:03AM +0200, Niklas Schnelle wrote:
> On Wed, 2025-10-08 at 20:14 +0200, Lukas Wunner wrote:
> > And yet you're touching the device by trying to reset it.
> >
> > The code you're introducing in patch [01/10] only becomes necessary
> > because you're not following the above-quoted protocol. If you
> > follow the protocol, patch [01/10] becomes unnecessary.
>
> I agree with your point above error_detected() should not touch the
> device. My understanding of Farhan's series though is that it follows
> that rule. As I understand it error_detected() is only used to inject
> the s390 specific PCI error event into the VM using the information
> stored in patch 7. As before vfio-pci returns
> PCI_ERS_RESULT_CAN_RECOVER from error_detected() but then with patch 7
> the pass-through case is detected and this gets turned into
> PCI_ERS_RESULT_RECOVERED and the rest of the s390 recovery code gets
> skipped. And yeah, writing it down I'm not super happy with this part,
> maybe it would be better to have an explicit
> PCI_ERS_RESULT_LEAVE_AS_IS.
Thanks, that's the high-level overview I was looking for.
It would be good to include something like this at least
in the cover letter or additionally in the commit messages
so that it's easier for reviewers to connect the dots.
I was expecting paravirtualized error handling, i.e. the
VM is aware it's virtualized and vfio essentially proxies
the pci_ers_result return value of the driver (e.g. nvme)
back to the host, thereby allowing the host to drive error
recovery normally. I'm not sure if there are technical
reasons preventing such an approach.
If you do want to stick with your alternative approach,
maybe doing the error handling in the ->mmio_enabled() phase
instead of ->error_detected() would make more sense.
In that phase you're allowed to access the device,
you can also attempt a local reset and return
PCI_ERS_RESULT_RECOVERED on success.
You'd have to return PCI_ERS_RESULT_CAN_RECOVER though
from the ->error_detected() callback in order to progress
to the ->mmio_enabled() step.
Does that make sense?
Thanks,
Lukas
On Sun, 2025-10-12 at 08:34 +0200, Lukas Wunner wrote:
> On Thu, Oct 09, 2025 at 11:12:03AM +0200, Niklas Schnelle wrote:
> > On Wed, 2025-10-08 at 20:14 +0200, Lukas Wunner wrote:
> > > And yet you're touching the device by trying to reset it.
> > >
> > > The code you're introducing in patch [01/10] only becomes necessary
> > > because you're not following the above-quoted protocol. If you
> > > follow the protocol, patch [01/10] becomes unnecessary.
> >
> > I agree with your point above error_detected() should not touch the
> > device. My understanding of Farhan's series though is that it follows
> > that rule. As I understand it error_detected() is only used to inject
> > the s390 specific PCI error event into the VM using the information
> > stored in patch 7. As before vfio-pci returns
> > PCI_ERS_RESULT_CAN_RECOVER from error_detected() but then with patch 7
> > the pass-through case is detected and this gets turned into
> > PCI_ERS_RESULT_RECOVERED and the rest of the s390 recovery code gets
> > skipped. And yeah, writing it down I'm not super happy with this part,
> > maybe it would be better to have an explicit
> > PCI_ERS_RESULT_LEAVE_AS_IS.
>
> Thanks, that's the high-level overview I was looking for.
>
> It would be good to include something like this at least
> in the cover letter or additionally in the commit messages
> so that it's easier for reviewers to connect the dots.
>
> I was expecting paravirtualized error handling, i.e. the
> VM is aware it's virtualized and vfio essentially proxies
> the pci_ers_result return value of the driver (e.g. nvme)
> back to the host, thereby allowing the host to drive error
> recovery normally. I'm not sure if there are technical
> reasons preventing such an approach.
It does sound technically feasible but sticking to the already
architected error reporting and recovery has clear advantages. For one
it will work with existing Linux versions without guest changes and it
also has precedent with it working already in the z/VM hypervisor for
years. I agree that there is some level of mismatch with Linux'
recovery support but I don't think that outweighs having a clean
virtualization support where the host and guest use the same interface.
>
> If you do want to stick with your alternative approach,
> maybe doing the error handling in the ->mmio_enabled() phase
> instead of ->error_detected() would make more sense.
> In that phase you're allowed to access the device,
> you can also attempt a local reset and return
> PCI_ERS_RESULT_RECOVERED on success.
>
> You'd have to return PCI_ERS_RESULT_CAN_RECOVER though
> from the ->error_detected() callback in order to progress
> to the ->mmio_enabled() step.
>
> Does that make sense?
>
> Thanks,
>
> Lukas
The problem with using ->mmio_enabled() is twofold. For one we
sometimes have to do a reset instead of clearing the error state, for
example if the device was not only put in the error state but also
disabled, or if the guest driver wants it, so we would also have to use
->slot_reset() and could end up with two resets. Second and more
importantly this would break the guest's assumption that the device will
be in the error state with MMIO and DMA blocked when it gets an error
event. On the other hand, that's exactly the state it is in if we
report the error in the ->error_detected() callback and then skip the
rest of recovery so it can be done in the guest, likely with the exact
same Linux code. I'd assume this should be similar if QEMU/KVM wanted
to virtualize AER+DPC except that there MMIO remains accessible?
Here's an idea. Could it be an option to detect the pass-through in the
vfio-pci driver's ->error_detected() callback, possibly with feedback
from QEMU (@Alex?), and then return PCI_ERS_RESULT_RECOVERED from there
skipping the rest of recovery?
The skipping would be in-line with the below section of the
documentation i.e. "no further intervention":
- PCI_ERS_RESULT_RECOVERED
Driver returns this if it thinks the device is usable despite
the error and does not need further intervention.
It's just that in this case the device really remains with MMIO and DMA
blocked, usable only in the sense that the vfio-pci + guest VM combo
knows how to use a device with MMIO and DMA blocked with the guest
recovery.
Thanks,
Niklas
On Tue, Oct 14, 2025 at 02:07:57PM +0200, Niklas Schnelle wrote:
> On Sun, 2025-10-12 at 08:34 +0200, Lukas Wunner wrote:
> > If you do want to stick with your alternative approach,
> > maybe doing the error handling in the ->mmio_enabled() phase
> > instead of ->error_detected() would make more sense.
> > In that phase you're allowed to access the device,
> > you can also attempt a local reset and return
> > PCI_ERS_RESULT_RECOVERED on success.
> >
> > You'd have to return PCI_ERS_RESULT_CAN_RECOVER though
> > from the ->error_detected() callback in order to progress
> > to the ->mmio_enabled() step.
>
> The problem with using ->mmio_enabled() is two fold. For one we
> sometimes have to do a reset instead of clearing the error state, for
> example if the device was not only put in the error state but also
> disabled, or if the guest driver wants it,
Well in that case you could reset the device in the ->mmio_enabled() step
from the guest using the vfio reset ioctl.
> Second and more
> importantly this would break the guests assumption that the device will
> be in the error state with MMIO and DMA blocked when it gets an error
> event. On the other hand, that's exactly the state it is in if we
> report the error in the ->error_detected() callback
At the risk of continuously talking past each other:
How about this, the host notifies the guest of the error in the
->error_detected() callback. The guest notifies the driver and
collects the result (whether a reset is requested or not), then
returns PCI_ERS_RESULT_CAN_RECOVER to the host.
The host re-enables I/O to the device, invokes the ->mmio_enabled()
callback. The guest then resets the device based on the result it
collected earlier or invokes the driver's ->mmio_enabled() callback.
If the driver returns PCI_ERS_RESULT_NEED_RESET from the
->mmio_enabled() callback, you can likewise reset the device from
the guest using the ioctl method.
My concern is that by insisting that you handle device recovery
completely in the ->error_detected() phase, you're not complying
with the protocol specified in Documentation/PCI/pci-error-recovery.rst
and as a result, you have to amend the reset code in the PCI core
because it assumes that all arches adhere to the protocol.
In my view, that suggests that the approach needs to be reworked
to comply with the protocol. Then the workarounds for performing
a reset while I/O is blocked become unnecessary.
Thanks,
Lukas
On Sun, 2025-10-19 at 16:34 +0200, Lukas Wunner wrote:
> On Tue, Oct 14, 2025 at 02:07:57PM +0200, Niklas Schnelle wrote:
> > On Sun, 2025-10-12 at 08:34 +0200, Lukas Wunner wrote:
> > > If you do want to stick with your alternative approach,
> > > maybe doing the error handling in the ->mmio_enabled() phase
> > > instead of ->error_detected() would make more sense.
> > > In that phase you're allowed to access the device,
> > > you can also attempt a local reset and return
> > > PCI_ERS_RESULT_RECOVERED on success.
> > >
> > > You'd have to return PCI_ERS_RESULT_CAN_RECOVER though
> > > from the ->error_detected() callback in order to progress
> > > to the ->mmio_enabled() step.
> >
> > The problem with using ->mmio_enabled() is two fold. For one we
> > sometimes have to do a reset instead of clearing the error state, for
> > example if the device was not only put in the error state but also
> > disabled, or if the guest driver wants it,
>
> Well in that case you could reset the device in the ->mmio_enabled() step
> from the guest using the vfio reset ioctl.
>
> > Second and more
> > importantly this would break the guests assumption that the device will
> > be in the error state with MMIO and DMA blocked when it gets an error
> > event. On the other hand, that's exactly the state it is in if we
> > report the error in the ->error_detected() callback
>
> At the risk of continuously talking past each other:
>
> How about this, the host notifies the guest of the error in the
> ->error_detected() callback. The guest notifies the driver and
> collects the result (whether a reset is requested or not), then
> returns PCI_ERS_RESULT_CAN_RECOVER to the host.
>
> The host re-enables I/O to the device, invokes the ->mmio_enabled()
> callback. The guest then resets the device based on the result it
> collected earlier or invokes the driver's ->mmio_enabled() callback.
>
> If the driver returns PCI_ERS_RESULT_NEED_RESET from the
> ->mmio_enabled() callback, you can likewise reset the device from
> the guest using the ioctl method.
>
> My concern is that by insisting that you handle device recovery
> completely in the ->error_detected() phase, you're not complying
> with the protocol specified in Documentation/PCI/pci-error-recovery.rst
> and as a result, you have to amend the reset code in the PCI core
> because it assumes that all arches adhere to the protocol.
> In my view, that suggests that the approach needs to be reworked
> to comply with the protocol. Then the workarounds for performing
> a reset while I/O is blocked become unnecessary.
>
> Thanks,
>
> Lukas
Yeah I think we're talking past each other a bit. In my mind we're
really not doing the recovery in ->error_detected() at all. Within that
callback we only do the notify, and then do nothing in the rest of
recovery. Only after will the guest do recovery though I do see your
point that leaving the device in the error state kind of means that
recovery is still ongoing even if we're not in the recovery handler
anymore. But then any driver could also just return
PCI_ERS_RESULT_RECOVERED in error_detected() and land us in the same
situation.
And at least on s390 there is also the case that the device actually
enters the error state before we get the error event so we could
already race into a situation where the guest does a reset and the
device is already in the error state but we haven't seen the event yet.
And this seems inherent to hardware blocking I/O on error which by its
nature can happen at any time.
But let's put that aside, say we want to implement your model where we
do check with the guest and its device driver. How would that work?
Somehow error_detected() would have to wait for the guest to proceed
into recovery and since the guest could just not do that we'd have to
have some kind of timeout. Also we can't allow the guest to choose
PCI_ERS_RESULT_RECOVERED because otherwise we'd again be in the
situation where recovery is completed without unblocking I/O. And if we
want to stick to the architecture QEMU/KVM will have to kind of have a
mode where after being informed of ongoing recovery for a device they
intercept attempts to reset / firmware calls for reset and turn that
into the correct return. And somehow also deal with the timeout because
e.g. old Linux guests won't do recovery but there is also no
architected way for a guest to say that it does recovery.
To me this seems worse than accepting that even with PCI error
recovery, resets can encounter PCI devices with blocked I/O, which is
already true now if e.g. user-space happens to drive a reset racing
with hardware I/O blocking.
Thanks,
Niklas
On Mon, Oct 20, 2025 at 10:59:48AM +0200, Niklas Schnelle wrote:
> Yeah I think we're talking past each other a bit. In my mind we're
> really not doing the recovery in ->error_detected() at all. Within that
> callback we only do the notify, and then do nothing in the rest of
> recovery. Only after will the guest do recovery though I do see your
> point that leaving the device in the error state kind of means that
> recovery is still ongoing even if we're not in the recovery handler
> anymore. But then any driver could also just return
> PCI_ERS_RESULT_RECOVERED in error_detected() and land us in the same
> situation.
That would be a bug in the driver. The point of the pci_error_handlers
is to attempt recovery of the device in concert with the driver.
If the driver "fakes" a recovered device towards the PCI core and then
attempts recovery behind the PCI core's back, it gets to keep the
pieces...
> But let's put that aside, say we want to implement your model where we
> do check with the guest and its device driver. How would that work,
> somehow error_detected() would have to wait for the guest to proceed
> into recovery and since the guest could just not do that we'd have to
> have some kind of timeout.
Right, a timeout seems reasonable.
> Also we can't allow the guest to choose
> PCI_ERS_RESULT_RECOVERED because otherwise we'd again be in the
> situation where recovery is completed without unblocking I/O.
The guest should only return that if the device has really recovered.
On an architecture which blocks I/O upon an error, by definition the
device cannot already be recovered in the ->error_detected() stage.
> And if we
> want to stick to the architecture QEMU/KVM will have to kind of have a
> mode where after being informed of ongoing recovery for a device they
> intercept attempts to reset / firmware calls for reset and turn that
> into the correct return. And somehow also deal with the timeout because
> e.g. old Linux guests won't do recovery but there is also no
> architected way for a guest to say that it does recovery.
I guess there are gaps in qemu with regards to error recovery, but
I think the solution is to add the missing functionality, not try
to work around the gaps.
Thanks,
Lukas
On 10/14/2025 5:07 AM, Niklas Schnelle wrote:
> On Sun, 2025-10-12 at 08:34 +0200, Lukas Wunner wrote:
>> On Thu, Oct 09, 2025 at 11:12:03AM +0200, Niklas Schnelle wrote:
>>> On Wed, 2025-10-08 at 20:14 +0200, Lukas Wunner wrote:
>>>> And yet you're touching the device by trying to reset it.
>>>>
>>>> The code you're introducing in patch [01/10] only becomes necessary
>>>> because you're not following the above-quoted protocol. If you
>>>> follow the protocol, patch [01/10] becomes unnecessary.
>>> I agree with your point above error_detected() should not touch the
>>> device. My understanding of Farhan's series though is that it follows
>>> that rule. As I understand it error_detected() is only used to inject
>>> the s390 specific PCI error event into the VM using the information
>>> stored in patch 7. As before vfio-pci returns
>>> PCI_ERS_RESULT_CAN_RECOVER from error_detected() but then with patch 7
>>> the pass-through case is detected and this gets turned into
>>> PCI_ERS_RESULT_RECOVERED and the rest of the s390 recovery code gets
>>> skipped. And yeah, writing it down I'm not super happy with this part,
>>> maybe it would be better to have an explicit
>>> PCI_ERS_RESULT_LEAVE_AS_IS.
>> Thanks, that's the high-level overview I was looking for.
>>
>> It would be good to include something like this at least
>> in the cover letter or additionally in the commit messages
>> so that it's easier for reviewers to connect the dots.
>>
>> I was expecting paravirtualized error handling, i.e. the
>> VM is aware it's virtualized and vfio essentially proxies
>> the pci_ers_result return value of the driver (e.g. nvme)
>> back to the host, thereby allowing the host to drive error
>> recovery normally. I'm not sure if there are technical
>> reasons preventing such an approach.
> It does sound technically feasible but sticking to the already
> architected error reporting and recovery has clear advantages. For one
> it will work with existing Linux versions without guest changes and it
> also has precedent with it working already in the z/VM hypervisor for
> years. I agree that there is some level of mismatch with Linux'
> recovery support but I don't think that outweighs having a clean
> virtualization support where the host and guest use the same interface.
>
>> If you do want to stick with your alternative approach,
>> maybe doing the error handling in the ->mmio_enabled() phase
>> instead of ->error_detected() would make more sense.
>> In that phase you're allowed to access the device,
>> you can also attempt a local reset and return
>> PCI_ERS_RESULT_RECOVERED on success.
>>
>> You'd have to return PCI_ERS_RESULT_CAN_RECOVER though
>> from the ->error_detected() callback in order to progress
>> to the ->mmio_enabled() step.
>>
>> Does that make sense?
>>
>> Thanks,
>>
>> Lukas
> The problem with using ->mmio_enabled() is two fold. For one we
> sometimes have to do a reset instead of clearing the error state, for
> example if the device was not only put in the error state but also
> disabled, or if the guest driver wants it, so we would also have to use
> ->slot_reset() and could end up with two resets. Second and more
> importantly this would break the guests assumption that the device will
> be in the error state with MMIO and DMA blocked when it gets an error
> event. On the other hand, that's exactly the state it is in if we
> report the error in the ->error_detected() callback and then skip the
> rest of recovery so it can be done in the guest, likely with the exact
> same Linux code.
> I'd assume this should be similar if QEMU/KVM wanted
> to virtualize AER+DPC except that there MMIO remains accessible?
>
> Here's an idea. Could it be an option to detect the pass-through in the
> vfio-pci driver's ->error_detected() callback, possibly with feedback
> from QEMU (@Alex?), and then return PCI_ERS_RESULT_RECOVERED from there
> skipping the rest of recovery?
>
> The skipping would be in-line with the below section of the
> documentation i.e. "no further intervention":
>
> - PCI_ERS_RESULT_RECOVERED
> Driver returns this if it thinks the device is usable despite
> the error and does not need further intervention.
>
> It's just that in this case the device really remains with MMIO and DMA
> blocked, usable only in the sense that the vfio-pci + guest VM combo
> knows how to use a device with MMIO and DMA blocked with the guest
> recovery.
>
> Thanks,
> Niklas
Hi Lukas,
Hope this helps to clarify why we still need patch [01/10] (or at least
the check in pci_save_state() to see if the device responds with an
error value or not if we move forward with your patch series "PCI:
Universal error recoverability of devices"). We can discuss if that
check needs to be moved somewhere else if there is concern with
overhead in pci_save_state().
Discussing with Niklas (off mailing list), we were thinking if it makes
sense if vfio_pci_core_aer_err_detected() returned
PCI_ERS_RESULT_RECOVERED if it doesn't need any further intervention
from platform recovery, to align closer to the pci-error-recovery
documentation? One proposal would be to have a flag in struct
vfio_pci_core_device (e.g. vdev->mediated_recovery), which can be used
to return PCI_ERS_RESULT_RECOVERED in vfio_pci_core_aer_err_detected()
if the flag was set. The flag could be set by userspace using
VFIO_DEVICE_FEATURE_SET for the device feature
VFIO_DEVICE_FEATURE_ZPCI_ERROR (would like to hear Alex's thoughts on
this proposal).
Thanks
Farhan
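(A rough sketch of that idea, for illustration only; "mediated_recovery"
is a hypothetical field rather than an existing vfio_pci_core_device
member, and the eventfd signalling is reduced to a comment:)
static pci_ers_result_t
vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
			       pci_channel_state_t state)
{
	struct vfio_pci_core_device *vdev = dev_get_drvdata(&pdev->dev);

	/* ... signal the error eventfd to notify userspace (QEMU) ... */

	/* hypothetical flag set by userspace via VFIO_DEVICE_FEATURE_SET */
	if (vdev->mediated_recovery)
		return PCI_ERS_RESULT_RECOVERED; /* guest drives recovery */

	return PCI_ERS_RESULT_CAN_RECOVER;
}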
On 10/8/2025 11:14 AM, Lukas Wunner wrote:
> On Wed, Oct 08, 2025 at 10:56:35AM -0700, Farhan Ali wrote:
>> On 10/8/2025 6:34 AM, Lukas Wunner wrote:
>>> I'm not sure yet. Let's back up a little: I'm missing an
>>> architectural description how you're intending to do error
>>> recovery in the VM. If I understand correctly, you're
>>> informing the VM of the error via the ->error_detected() callback.
>>>
>>> You're saying you need to check for accessibility of the device
>>> prior to resetting it from the VM, does that mean you're attempting
>>> a reset from the ->error_detected() callback?
>>>
>>> According to Documentation/PCI/pci-error-recovery.rst, the device
>>> isn't supposed to be considered accessible in ->error_detected().
>>> The first callback which allows access is ->mmio_enabled().
>>>
>> The ->error_detected() callback is used to inform userspace of an error. In
>> the case of a VM, using QEMU as a userspace, once notified of an error QEMU
>> will inject an error into the guest in s390x architecture specific way [1]
>> (probably should have linked the QEMU series in the cover letter). Once
>> notified of the error VM's device driver will drive the recovery action. The
>> recovery action require a reset of the device and on s390x PCI devices are
>> reset using architecture specific instructions (zpci_device_hot_reset()).
> According to Documentation/PCI/pci-error-recovery.rst:
>
> "STEP 1: Notification
> --------------------
> Platform calls the error_detected() callback on every instance of
> every driver affected by the error.
> At this point, the device might not be accessible anymore, [...]
> it gives the driver a chance to cleanup, waiting for pending stuff
> (timers, whatever, etc...) to complete; it can take semaphores,
> schedule, etc... everything but touch the device."
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> And yet you're touching the device by trying to reset it.
>
> The code you're introducing in patch [01/10] only becomes necessary
> because you're not following the above-quoted protocol. If you
> follow the protocol, patch [01/10] becomes unnecessary.
>
>>> I also don't quite understand why the VM needs to perform a reset.
>>> Why can't you just let the VM tell the host that a reset is needed
>>> (PCI_ERS_RESULT_NEED_RESET) and then the host resets the device on
>>> behalf of the VM?
> Could you answer this question please?
>
>
The reset is not performed by the VM, reset is still done by the host.
My approach for a VM to let the host know that reset was needed, was to
intercept any reset instructions for the PCI device in QEMU. QEMU would
then drive a reset via VFIO_DEVICE_RESET. Maybe I am missing something,
but based on what we have today in vfio driver, we don't have a
mechanism for userspace to reset a device other than VFIO_DEVICE_RESET
and VFIO_PCI_DEVICE_HOT_RESET ioctls.
On Wed, Oct 08, 2025 at 02:55:56PM -0700, Farhan Ali wrote:
> > > On 10/8/2025 6:34 AM, Lukas Wunner wrote:
> > > > I also don't quite understand why the VM needs to perform a reset.
> > > > Why can't you just let the VM tell the host that a reset is needed
> > > > (PCI_ERS_RESULT_NEED_RESET) and then the host resets the device on
> > > > behalf of the VM?
> The reset is not performed by the VM, reset is still done by the host. My
> approach for a VM to let the host know that reset was needed, was to
> intercept any reset instructions for the PCI device in QEMU. QEMU would then
> drive a reset via VFIO_DEVICE_RESET. Maybe I am missing something, but based
> on what we have today in vfio driver, we don't have a mechanism for
> userspace to reset a device other than VFIO_DEVICE_RESET and
> VFIO_PCI_DEVICE_HOT_RESET ioctls.

The ask is for the host to notify the VM of the ->error_detected() event
and the VM then responding with one of the "enum pci_ers_result" values.

If the VM returns PCI_ERS_RESULT_NEED_RESET from ->error_detected(),
the host knows that it shall reset the device.

There is no need for the VM to initiate a reset through an ioctl.
Just integrate with the existing error handling flow described in
Documentation/PCI/pci-error-recovery.rst and thereby use the existing
mechanism to reset the device.

Thanks,

Lukas
On 10/8/2025 9:52 PM, Lukas Wunner wrote:
> On Wed, Oct 08, 2025 at 02:55:56PM -0700, Farhan Ali wrote:
>>>> On 10/8/2025 6:34 AM, Lukas Wunner wrote:
>>>>> I also don't quite understand why the VM needs to perform a reset.
>>>>> Why can't you just let the VM tell the host that a reset is needed
>>>>> (PCI_ERS_RESULT_NEED_RESET) and then the host resets the device on
>>>>> behalf of the VM?
>> The reset is not performed by the VM, reset is still done by the host. My
>> approach for a VM to let the host know that reset was needed, was to
>> intercept any reset instructions for the PCI device in QEMU. QEMU would then
>> drive a reset via VFIO_DEVICE_RESET. Maybe I am missing something, but based
>> on what we have today in vfio driver, we don't have a mechanism for
>> userspace to reset a device other than VFIO_DEVICE_RESET and
>> VFIO_PCI_DEVICE_HOT_RESET ioctls.
> The ask is for the host to notify the VM of the ->error_detected() event
> and the VM then responding with one of the "enum pci_ers_result" values.

Maybe there is some confusion here. Could you clarify what you mean by the
VM responding with "enum pci_ers_result" values? Is it a device driver (for
example an NVMe driver) running in the VM that should do that? Or is it
something else you are suggesting?

Let me try to clarify what I am trying to do with this patch series. For
passthrough devices to a VM, the driver bound to the device on the host is
vfio-pci. The vfio-pci driver does support the error_detected() callback
(vfio_pci_core_aer_err_detected()), and on a PCI error the s390x recovery
code on the host will call the vfio-pci error_detected() callback. The
vfio-pci error_detected() callback will notify userspace/QEMU via an
eventfd, and return PCI_ERS_RESULT_CAN_RECOVER. At this point the s390x
error recovery on the host will skip any further action (see patch 7) and
let userspace drive the error recovery.

Once userspace/QEMU is notified, it then injects this error into the VM so
device drivers in the VM can take recovery actions. For example, for a
passthrough NVMe device it is the VM's NVMe driver that accesses the
device. At this point the VM's NVMe driver's error_detected() will drive
the recovery by returning PCI_ERS_RESULT_NEED_RESET, and the s390x error
recovery in the VM's OS will try to do a reset. Resets are privileged
operations and so the VM will need intervention from QEMU to perform the
reset. QEMU will invoke the ioctls to now notify the host that the VM is
requesting a reset of the device. The vfio-pci driver on the host will
then perform the reset on the device.

Thanks
Farhan
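For reference, the guest-side piece of this flow is just the normal
pci_error_handlers contract. A minimal, hypothetical sketch of a guest
driver requesting the reset (the demo_* names are made up and stand in
for e.g. the NVMe driver's handlers) could look like:

#include <linux/pci.h>

static pci_ers_result_t demo_error_detected(struct pci_dev *pdev,
					    pci_channel_state_t state)
{
	/* Stop I/O here, but do not touch the device. */
	if (state == pci_channel_io_perm_failure)
		return PCI_ERS_RESULT_DISCONNECT;

	/* Ask the platform (here: the VM's OS, then QEMU/the host) to reset. */
	return PCI_ERS_RESULT_NEED_RESET;
}

static pci_ers_result_t demo_slot_reset(struct pci_dev *pdev)
{
	/* The device has been reset; re-enable it and restore its state. */
	if (pci_enable_device(pdev))
		return PCI_ERS_RESULT_DISCONNECT;
	pci_restore_state(pdev);
	return PCI_ERS_RESULT_RECOVERED;
}

static void demo_resume(struct pci_dev *pdev)
{
	/* Restart I/O. */
}

static const struct pci_error_handlers demo_err_handlers = {
	.error_detected	= demo_error_detected,
	.slot_reset	= demo_slot_reset,
	.resume		= demo_resume,
};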
On Thu, Oct 09, 2025 at 10:02:12AM -0700, Farhan Ali wrote:
> On 10/8/2025 9:52 PM, Lukas Wunner wrote:
> > On Wed, Oct 08, 2025 at 02:55:56PM -0700, Farhan Ali wrote:
> > > > > On 10/8/2025 6:34 AM, Lukas Wunner wrote:
> > > > > > I also don't quite understand why the VM needs to perform a reset.
> > > > > > Why can't you just let the VM tell the host that a reset is needed
> > > > > > (PCI_ERS_RESULT_NEED_RESET) and then the host resets the device on
> > > > > > behalf of the VM?
> > > The reset is not performed by the VM, reset is still done by the host. My
> > > approach for a VM to let the host know that reset was needed, was to
> > > intercept any reset instructions for the PCI device in QEMU. QEMU would
> > > then drive a reset via VFIO_DEVICE_RESET. Maybe I am missing something,
> > > but based on what we have today in vfio driver, we don't have a mechanism
> > > for userspace to reset a device other than VFIO_DEVICE_RESET and
> > > VFIO_PCI_DEVICE_HOT_RESET ioctls.
> > The ask is for the host to notify the VM of the ->error_detected() event
> > and the VM then responding with one of the "enum pci_ers_result" values.
>
> Maybe there is some confusion here. Could you clarify what you mean by the
> VM responding with "enum pci_ers_result" values? Is it a device driver (for
> example an NVMe driver) running in the VM that should do that? Or is it
> something else you are suggesting?

My expectation was that the host notifies the VM of the error, the kernel
in the VM notifies the nvme driver of the error, the nvme driver returns a
pci_ers_result return value from its pci_error_handlers which the VM
passes back to the host, and the host drives error recovery normally.

I was missing the high-level architectural overview that Niklas
subsequently provided. You should provide it as part of your series
because otherwise it's difficult for reviewers to understand what the
individual patches are trying to achieve as a whole.

Thanks,

Lukas
On Wed, Oct 01, 2025 at 10:12:03AM -0700, Farhan Ali wrote:
>
> On 10/1/2025 8:15 AM, Benjamin Block wrote:
> > On Wed, Sep 24, 2025 at 10:16:19AM -0700, Farhan Ali wrote:
> >> @@ -1792,6 +1798,14 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
> >> int pci_save_state(struct pci_dev *dev)
> >> {
> >> int i;
> >> + u32 val;
> >> +
> >> + pci_read_config_dword(dev, PCI_COMMAND, &val);
> >> + if (PCI_POSSIBLE_ERROR(val)) {
> >> + pci_warn(dev, "Device config space inaccessible, will only be partially restored\n");
> >> + return -EIO;
> >
> > Should it set `dev->state_saved` to `false`, to be on the safe side?
> > Not sure whether we run a risk of restoring an old, outdated state otherwise.
>
> AFAIU if the state_saved flag was set to true then any state that we
> have saved should be valid and should be okay to be restored from. We
> just want to avoid saving any invalid data.
Hmm, so I dug a bit more, and I see
void pci_restore_state(struct pci_dev *dev) {}
has `dev->state_saved = false` at the end, so I guess if a device is put into
suspend, and then later woken again, the flag gets reset every time.
And then there is also code like this:
static pci_ers_result_t e1000_io_slot_reset(struct pci_dev *pdev)
{
        ...
        err = pci_enable_device_mem(pdev);
        if (err) {
                ...
        } else {
                pdev->state_saved = true;
                pci_restore_state(pdev);
I don't know..
But I see Alex suggested this before.
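For reference, a minimal variant of the posted check that also invalidates
previously saved state, as suggested above, could look like the fragment
below. This is illustrative only, not what the posted patch does:

int pci_save_state(struct pci_dev *dev)
{
	int i;
	u32 val;

	pci_read_config_dword(dev, PCI_COMMAND, &val);
	if (PCI_POSSIBLE_ERROR(val)) {
		/*
		 * Suggested variant: drop any previously saved state so a
		 * later restore cannot write back stale data.
		 */
		dev->state_saved = false;
		pci_warn(dev, "Device config space inaccessible, will only be partially restored\n");
		return -EIO;
	}
	...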
> >> + }
> >> +
> >> /* XXX: 100% dword access ok here? */
> >> for (i = 0; i < 16; i++) {
> >> pci_read_config_dword(dev, i * 4, &dev->saved_config_space[i]);
> >> @@ -1854,6 +1868,14 @@ static void pci_restore_config_space_range(struct pci_dev *pdev,
> >>
> >> static void pci_restore_config_space(struct pci_dev *pdev)
> >> {
> >> + if (!pdev->state_saved) {
> >> + pci_warn(pdev, "No saved config space, restoring BARs\n");
> >> + pci_restore_bars(pdev);
> >> + pci_write_config_word(pdev, PCI_COMMAND,
> >> + PCI_COMMAND_MEMORY | PCI_COMMAND_IO);
> >
> > Is this really something that ought to be universally enabled? I thought this
> > depends on whether attached resources are IO and/or MEM?
> >
> > int pci_enable_resources(struct pci_dev *dev, int mask)
> > {
> > ...
> > pci_dev_for_each_resource(dev, r, i) {
> > ...
> > if (r->flags & IORESOURCE_IO)
> > cmd |= PCI_COMMAND_IO;
> > if (r->flags & IORESOURCE_MEM)
> > cmd |= PCI_COMMAND_MEMORY;
> > }
> > ...
> > }
> >
> > Also IIRC, especially on s390, we never have IO resources?
> >
> > int zpci_setup_bus_resources(struct zpci_dev *zdev)
> > {
> > ...
> > for (i = 0; i < PCI_STD_NUM_BARS; i++) {
> > ...
> > /* only MMIO is supported */
> > flags = IORESOURCE_MEM;
> > if (zdev->bars[i].val & 8)
> > flags |= IORESOURCE_PREFETCH;
> > if (zdev->bars[i].val & 4)
> > flags |= IORESOURCE_MEM_64;
> > ...
> > }
> > ...
> > }
> >
> > So I guess this would have to have some form of the same logic as in
> > `pci_enable_resources()`, after restoring the BARs.
> >
> > Or am I missing something?
>
> As per my understanding of the spec, setting both I/O Space and Memory
> Space should be safe. The spec also mentions that if a function doesn't
> support I/O or Memory Space accesses it can hardwire the bit to zero. We
> could add the logic to iterate through all the resources and set the
> bits accordingly, but since in this case we are trying a best-effort
> restoration, it should be fine?
>
> Also I didn't see any issues testing on s390x with the NVMe, RoCE and
> NETD devices, but I could have missed something.
Well, just taking a coarse look at how some other PCI device drivers use the
Command register (this being non-s390 specific after all): some of them base
decisions on whether either or both of these flags are set in config space.
Now that this sets both flags, that might have surprising side-effects.
On the other hand, iterating over the resources might not even be enough for
some device drivers, since they base their decision on whether to enable
either of them on other knowledge.
So I don't know. Enabling both just in case might be a good compromise.
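For comparison, a hedged sketch of the resource-based alternative,
mirroring the pci_enable_resources() logic quoted above; the helper name
is made up and this is not part of the posted patch:

/*
 * Hypothetical helper: derive the command register bits from the
 * device's resources after the BARs have been restored, instead of
 * unconditionally setting both PCI_COMMAND_IO and PCI_COMMAND_MEMORY.
 */
static void pci_restore_command_bits(struct pci_dev *pdev)
{
	struct resource *r;
	u16 cmd = 0;
	int i;

	pci_dev_for_each_resource(pdev, r, i) {
		if (!r->flags)
			continue;
		if (r->flags & IORESOURCE_IO)
			cmd |= PCI_COMMAND_IO;
		if (r->flags & IORESOURCE_MEM)
			cmd |= PCI_COMMAND_MEMORY;
	}

	pci_write_config_word(pdev, PCI_COMMAND, cmd);
}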
--
Best Regards, Benjamin Block / Linux on IBM Z Kernel Development
IBM Deutschland Research & Development GmbH / https://www.ibm.com/privacy
Vors. Aufs.-R.: Wolfgang Wendt / Geschäftsführung: David Faller
Sitz der Ges.: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294