[PATCH v1 0/6] Error recovery for vfio-pci devices on s390x

Farhan Ali posted 6 patches 1 month, 3 weeks ago
There is a newer version of this series
[PATCH v1 0/6] Error recovery for vfio-pci devices on s390x
Posted by Farhan Ali 1 month, 3 weeks ago
Hi,

This Linux kernel patch series introduces support for error recovery for
passthrough PCI devices on System Z (s390x). 

Background
----------
For PCI devices on s390x, an operating system receives platform-specific
error events from firmware rather than through AER. Today, for
passthrough/userspace devices, we don't attempt any error recovery
and ignore all error events for the devices. The passthrough/userspace devices are
managed by the vfio-pci driver. The driver does register error handling
callbacks (error_detected) and, on an error, triggers an eventfd to userspace.
But we still need a mechanism to notify userspace (QEMU/guest/userspace drivers) about
the error event.
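
For readers less familiar with the eventfd side, here is a rough userspace
sketch of waiting on such a notification. It is not code from this series.
It assumes the eventfd was already registered with the device's error IRQ
(for vfio-pci this is normally done through VFIO_DEVICE_SET_IRQS with
VFIO_PCI_ERR_IRQ_INDEX); the helper name wait_for_error_event is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for the error eventfd to fire.
 * Returns the eventfd counter (number of signals since the last read),
 * 0 on timeout, or a negative errno value on failure.
 */
static int64_t wait_for_error_event(int efd, int timeout_ms)
{
	struct pollfd pfd = { .fd = efd, .events = POLLIN };
	uint64_t count;
	ssize_t n;
	int rc;

	assert(efd >= 0);

	rc = poll(&pfd, 1, timeout_ms);
	if (rc < 0)
		return -errno;
	if (rc == 0)
		return 0;	/* timed out, nothing signalled */

	/* An eventfd read always transfers exactly 8 bytes. */
	n = read(efd, &count, sizeof(count));
	if (n < 0)
		return -errno;
	if (n != (ssize_t)sizeof(count))
		return -EIO;

	return (int64_t)count;
}
```

The counter only says that something happened; it carries no information
about which error occurred.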

Proposal
--------
We can expose this error information (currently only the PCI Error Code) via a
device-specific memory region for s390 vfio-pci devices. Userspace can then read
the memory region to obtain the error information and take appropriate action,
such as driving a device reset. The memory region provides some flexibility to
expose more information in the future if required.
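
To make the userspace side concrete, here is a minimal sketch of reading
such a region. It is an illustration under assumptions, not the interface
defined by these patches: the struct layout (a single 16-bit code) and the
names zpci_err_record_sketch and read_err_record are hypothetical. In real
use the file offset would come from the "offset" field that
VFIO_DEVICE_GET_REGION_INFO returns for the region.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record layout; the real one is defined by the series. */
struct zpci_err_record_sketch {
	uint16_t pec;	/* PCI Error Code, in host byte order */
};

/*
 * Hypothetical helper: read one error record from the device fd at the
 * given region offset. Returns 0 on success or a negative errno value.
 */
static int read_err_record(int device_fd, off_t region_offset,
			   struct zpci_err_record_sketch *rec)
{
	ssize_t n;

	assert(rec != NULL);

	n = pread(device_fd, rec, sizeof(*rec), region_offset);
	if (n < 0)
		return -errno;
	if ((size_t)n != sizeof(*rec))
		return -EIO;	/* short read: treat as an I/O error */

	return 0;
}
```

After a successful read, userspace would inspect rec.pec and decide
whether to reset the device.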

I would appreciate some feedback on this approach.

Thanks
Farhan

Farhan Ali (6):
  s390/pci: Restore airq unconditionally for the zPCI device
  s390/pci: Update the logic for detecting passthrough device
  s390/pci: Store PCI error information for passthrough devices
  vfio-pci/zdev: Setup a zpci memory region for error information
  vfio-pci/zdev: Perform platform specific function reset for zPCI
  vfio: Allow error notification and recovery for ISM device

 arch/s390/include/asm/pci.h       |  29 +++++++
 arch/s390/pci/pci.c               |   2 +
 arch/s390/pci/pci_event.c         | 107 ++++++++++++++-----------
 arch/s390/pci/pci_irq.c           |   3 +-
 drivers/vfio/pci/vfio_pci_core.c  |  22 +++++-
 drivers/vfio/pci/vfio_pci_intrs.c |   2 +-
 drivers/vfio/pci/vfio_pci_priv.h  |   8 ++
 drivers/vfio/pci/vfio_pci_zdev.c  | 126 +++++++++++++++++++++++++++++-
 include/uapi/linux/vfio.h         |   2 +
 include/uapi/linux/vfio_zdev.h    |   5 ++
 10 files changed, 253 insertions(+), 53 deletions(-)

-- 
2.43.0
Re: [PATCH v1 0/6] Error recovery for vfio-pci devices on s390x
Posted by Farhan Ali 1 month, 3 weeks ago
Also posted a QEMU series utilizing these kernel patches
https://lore.kernel.org/qemu-devel/20250813174152.1238-1-alifm@linux.ibm.com/

Thanks
Farhan

On 8/13/2025 10:08 AM, Farhan Ali wrote:
> Hi,
>
> This Linux kernel patch series introduces support for error recovery for
> passthrough PCI devices on System Z (s390x).
>
> Background
> ----------
> For PCI devices on s390x, an operating system receives platform-specific
> error events from firmware rather than through AER. Today, for
> passthrough/userspace devices, we don't attempt any error recovery
> and ignore all error events for the devices. The passthrough/userspace devices are
> managed by the vfio-pci driver. The driver does register error handling
> callbacks (error_detected) and, on an error, triggers an eventfd to userspace.
> But we still need a mechanism to notify userspace (QEMU/guest/userspace drivers) about
> the error event.
>
> Proposal
> --------
> We can expose this error information (currently only the PCI Error Code) via a
> device-specific memory region for s390 vfio-pci devices. Userspace can then read
> the memory region to obtain the error information and take appropriate action,
> such as driving a device reset. The memory region provides some flexibility to
> expose more information in the future if required.
>
> I would appreciate some feedback on this approach.
>
> Thanks
> Farhan
>
> Farhan Ali (6):
>    s390/pci: Restore airq unconditionally for the zPCI device
>    s390/pci: Update the logic for detecting passthrough device
>    s390/pci: Store PCI error information for passthrough devices
>    vfio-pci/zdev: Setup a zpci memory region for error information
>    vfio-pci/zdev: Perform platform specific function reset for zPCI
>    vfio: Allow error notification and recovery for ISM device
>
>   arch/s390/include/asm/pci.h       |  29 +++++++
>   arch/s390/pci/pci.c               |   2 +
>   arch/s390/pci/pci_event.c         | 107 ++++++++++++++-----------
>   arch/s390/pci/pci_irq.c           |   3 +-
>   drivers/vfio/pci/vfio_pci_core.c  |  22 +++++-
>   drivers/vfio/pci/vfio_pci_intrs.c |   2 +-
>   drivers/vfio/pci/vfio_pci_priv.h  |   8 ++
>   drivers/vfio/pci/vfio_pci_zdev.c  | 126 +++++++++++++++++++++++++++++-
>   include/uapi/linux/vfio.h         |   2 +
>   include/uapi/linux/vfio_zdev.h    |   5 ++
>   10 files changed, 253 insertions(+), 53 deletions(-)
>