Some PCIe devices trigger PCI bus errors when accesses are made to
unassigned regions within their PCI configuration space. On certain
platforms, this can lead to host system hangs or reboots.
The current vfio-pci driver allows guests to access unassigned regions
in the PCI configuration space. Therefore, when such a device is passed
through to a guest, the guest can induce a host system hang or reboot
through crafted configuration space accesses, posing a threat to
system availability.
This patch series introduces:
1. Support for blocking guest accesses to unassigned
PCI configuration space, and the ability to bypass this access control
for specific devices. The patch introduces three module parameters:
block_pci_unassigned_write:
Blocks write accesses to unassigned config space regions.
block_pci_unassigned_read:
Blocks read accesses to unassigned config space regions.
uaccess_allow_ids:
Specifies the devices for which the above access control is bypassed.
The value is a comma-separated list of device IDs in
<vendor_id>:<device_id> format.
Example usage:
To block guest write accesses to unassigned config regions for all
passed through devices except for the device with vendor ID 0x1234 and
device ID 0x5678:
block_pci_unassigned_write=1 uaccess_allow_ids=1234:5678
2. Auditing support for config space accesses to unassigned regions.
When enabled, this logs such accesses for all passthrough devices.
This feature is controlled via a new Kconfig option:
CONFIG_VFIO_PCI_UNASSIGNED_ACCESS_AUDIT
A new audit event type, AUDIT_VFIO, has been introduced to support
this, allowing administrators to monitor and investigate suspicious
behavior by guests.
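
The parameter semantics described in item 1 can be modelled with a small userspace sketch. It is written in Python for illustration only and is not the patch's kernel code; the names `parse_allow_ids` and `should_block` are hypothetical, and the parsing rules (hex fields, empty entries ignored) are assumptions based on the example above.

```python
from typing import Set, Tuple

DeviceId = Tuple[int, int]  # (vendor_id, device_id)


def parse_allow_ids(value: str) -> Set[DeviceId]:
    """Parse a comma-separated list of <vendor_id>:<device_id> hex pairs."""
    ids: Set[DeviceId] = set()
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # assumption: tolerate empty entries such as a trailing comma
        vendor, device = entry.split(":")
        ids.add((int(vendor, 16), int(device, 16)))
    return ids


def should_block(is_write: bool, dev: DeviceId, block_write: bool,
                 block_read: bool, allow: Set[DeviceId]) -> bool:
    """Decide whether an access to an unassigned region is blocked."""
    if dev in allow:
        return False  # allow-listed devices bypass the access control
    return block_write if is_write else block_read


# Mirrors the example usage: block writes, but exempt device 1234:5678.
allow = parse_allow_ids("1234:5678")
print(should_block(True, (0x1234, 0x5678), True, False, allow))
print(should_block(True, (0x8086, 0x1234), True, False, allow))
```

In this model the allow list takes precedence over both blocking flags, which matches the description of `uaccess_allow_ids` as a bypass.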
This proposal is intended to harden VFIO passthrough in environments
where guests are untrusted or system reliability is critical.
Any feedback and comments are greatly appreciated.
Chathura Rajapaksha (2):
block accesses to unassigned PCI config regions
audit accesses to unassigned PCI config regions
drivers/vfio/pci/Kconfig | 12 +++
drivers/vfio/pci/vfio_pci_config.c | 164 ++++++++++++++++++++++++++++-
include/uapi/linux/audit.h | 1 +
3 files changed, 176 insertions(+), 1 deletion(-)
base-commit: f1a3944c860b0615d0513110d8cf62bb94adbb41
--
2.34.1
On Sat, Apr 26, 2025 at 09:22:47PM +0000, Chathura Rajapaksha wrote:
> Some PCIe devices trigger PCI bus errors when accesses are made to
> unassigned regions within their PCI configuration space. On certain
> platforms, this can lead to host system hangs or reboots.

Do you have an example of this? What do you mean by bus error?

I would expect the device to return some constant like 0, or to return
an error TLP. The host bridge should convert the error TLP to
0xFFFFFFFF like all other read error conversions.

Is it a device problem or host bridge problem you are facing?

> 1. Support for blocking guest accesses to unassigned
> PCI configuration space, and the ability to bypass this access control
> for specific devices. The patch introduces three module parameters:
>
> block_pci_unassigned_write:
> Blocks write accesses to unassigned config space regions.
>
> block_pci_unassigned_read:
> Blocks read accesses to unassigned config space regions.
>
> uaccess_allow_ids:
> Specifies the devices for which the above access control is bypassed.
> The value is a comma-separated list of device IDs in
> <vendor_id>:<device_id> format.
>
> Example usage:
> To block guest write accesses to unassigned config regions for all
> passed through devices except for the device with vendor ID 0x1234 and
> device ID 0x5678:
>
> block_pci_unassigned_write=1 uaccess_allow_ids=1234:5678

No module parameters please.

At worst the kernel should maintain a quirks list to control this,
maybe with a sysfs to update it.

Jason
On Mon, 28 Apr 2025 10:24:55 -0300 Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Sat, Apr 26, 2025 at 09:22:47PM +0000, Chathura Rajapaksha wrote:
> > Some PCIe devices trigger PCI bus errors when accesses are made to
> > unassigned regions within their PCI configuration space. On certain
> > platforms, this can lead to host system hangs or reboots.
>
> Do you have an example of this? What do you mean by bus error?
>
> I would expect the device to return some constant like 0, or to return
> an error TLP. The host bridge should convert the error TLP to
> 0xFFFFFFFF like all other read error conversions.
>
> Is it a device problem or host bridge problem you are facing?

Or system problem. Is it the access itself that generates a problem or
is it what the device does as a result of the access? If the latter,
does this only remove a config space fuzzing attack vector against that
behavior or do we expect the device cannot generate the same behavior
via MMIO or IO register accesses?

We've previously leaned in the direction that we depend on hardware to
contain errors. We cannot trap every access to the device or else we'd
severely limit the devices available to use and the performance of
those devices to the point that device assignment isn't worthwhile.

PCI config space is a slow path, it's already trapped, and it's
theoretically architected that we could restrict and audit much of it,
though some devices do rely on access to unarchitected config space.
But even within the architected space there are device specific
capabilities with undocumented protocols, exposing unknown features of
devices. Does this incrementally make things better in general, or is
this largely masking a poorly behaved device/system?

> > 1. Support for blocking guest accesses to unassigned
> > PCI configuration space, and the ability to bypass this access control
> > for specific devices. The patch introduces three module parameters:
> >
> > block_pci_unassigned_write:
> > Blocks write accesses to unassigned config space regions.
> >
> > block_pci_unassigned_read:
> > Blocks read accesses to unassigned config space regions.
> >
> > uaccess_allow_ids:
> > Specifies the devices for which the above access control is bypassed.
> > The value is a comma-separated list of device IDs in
> > <vendor_id>:<device_id> format.
> >
> > Example usage:
> > To block guest write accesses to unassigned config regions for all
> > passed through devices except for the device with vendor ID 0x1234 and
> > device ID 0x5678:
> >
> > block_pci_unassigned_write=1 uaccess_allow_ids=1234:5678
>
> No module parameters please.
>
> At worst the kernel should maintain a quirks list to control this,
> maybe with a sysfs to update it.

No module parameters might be difficult if we end up managing this as a
default policy selection, but certainly agree that if we get into
device specific behaviors we probably want those quirks automatically
deployed by the kernel. Thanks,

Alex
On Mon, Apr 28, 2025 at 02:25:58PM -0600, Alex Williamson wrote:
> PCI config space is a slow path, it's already trapped, and it's
> theoretically architected that we could restrict and audit much of it,
> though some devices do rely on access to unarchitected config space.
> But even within the architected space there are device specific
> capabilities with undocumented protocols, exposing unknown features of
> devices. Does this incrementally make things better in general, or is
> this largely masking a poorly behaved device/system?

I think there would be merit in having a qemu option to secure the
config space. We talked about this before about presenting a
prescribed virtualized config space.

But we still have the issue that userspace with access to VFIO could
crash the machine, on these uncontained platforms, which is not great.

It would be nice if the kernel could discover this, but it doesn't
seem possible. There is so much in the SOC design and FW
implementation that has to be done correctly for errors to be properly
containable.

Jason
Hi Jason and Alex,

Thank you for the comments, and apologies for the delayed response.

On Mon, Apr 28, 2025 at 9:24 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > Some PCIe devices trigger PCI bus errors when accesses are made to
> > unassigned regions within their PCI configuration space. On certain
> > platforms, this can lead to host system hangs or reboots.
>
> Do you have an example of this? What do you mean by bus error?

By PCI bus error, I was referring to AER-reported uncorrectable errors.
For example:
pcieport 0000:c0:01.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
pcieport 0000:c0:01.1: device [1022:1483] error status/mask=00004000/07a10000
pcieport 0000:c0:01.1: [14] CmpltTO (First)

In another case, with a different device on a separate system, we
observed an uncorrectable machine check exception:
mce: [Hardware Error]: CPU 10: Machine Check Exception: 5 Bank 6: fb80000000000e0b

> I would expect the device to return some constant like 0, or to return
> an error TLP. The host bridge should convert the error TLP to
> 0xFFFFFFFF like all other read error conversions.
>
> Is it a device problem or host bridge problem you are facing?

We haven’t been able to confirm definitively whether it’s a device or
host bridge issue due to limited visibility into platform internals.
However, we suspect it’s device-specific, as the same device triggered
similar failures across two different systems when writing to a
specific unassigned region in its config space. That said, we haven’t
exhaustively tested across devices and platforms. If you have
suggestions for identifying whether this stems from the device or host
bridge, we’d appreciate the guidance.

On Mon, Apr 28, 2025 at 4:26 PM Alex Williamson <alex.williamson@redhat.com> wrote:
> Or system problem. Is it the access itself that generates a problem or
> is it what the device does as a result of the access? If the latter,
> does this only remove a config space fuzzing attack vector against that
> behavior or do we expect the device cannot generate the same behavior
> via MMIO or IO register accesses?

Unfortunately, we can't say for certain whether the fault lies in the
access itself or in the device's response. The cloud environments we
tested on don’t provide sufficient low-level hardware insight to
determine that. Please let me know if you have any pointers on how to
determine this at the kernel level.

This patch specifically aims to eliminate the config space fuzzing
vector. We have not investigated whether similar behavior can be
triggered through MMIO or IO register accesses.

> > No module parameters please.
> >
> > At worst the kernel should maintain a quirks list to control this,
> > maybe with a sysfs to update it.
>
> No module parameters might be difficult if we end up managing this as a
> default policy selection, but certainly agree that if we get into
> device specific behaviors we probably want those quirks automatically
> deployed by the kernel. Thanks,

We used module parameters to give the flexibility to block unassigned
config space accesses on specific devices, especially in cases where
new problematic devices might emerge.

Is it feasible to support such use cases using a quirk-based mechanism?
For example, could we implement a quirk table that’s updateable via
sysfs, as you suggested?

Thank you for your time, and again, apologies for the delayed response.

Thanks,
Chathura
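
The `error status/mask=00004000/07a10000` pair in the AER log quoted in the message above can be decoded by hand. The sketch below is a userspace illustration in Python, not kernel code. The bit positions follow the PCIe AER Uncorrectable Error Status register layout, while the function name `decode_uncorrectable` and the descriptive strings are choices made for this example.

```python
# Bit positions in the PCIe AER Uncorrectable Error Status register.
# The descriptive strings are labels chosen for this sketch.
UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_uncorrectable(status: int, mask: int):
    """Return a (bit, name, masked) tuple for each bit set in status."""
    decoded = []
    for bit in range(32):
        if status & (1 << bit):
            name = UNCOR_BITS.get(bit, "unknown/reserved")
            decoded.append((bit, name, bool(mask & (1 << bit))))
    return decoded


# Values taken from the log line quoted in the thread.
for bit, name, masked in decode_uncorrectable(0x00004000, 0x07A10000):
    print(f"[{bit}] {name} (masked={masked})")
```

Only bit 14 is set in the status word and it is not set in the mask, which agrees with the `[14] CmpltTO` line in the log.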
On Fri, May 16, 2025 at 06:17:54PM +0000, Chathura Rajapaksha wrote:
> Hi Jason and Alex,
>
> Thank you for the comments, and apologies for the delayed response.
>
> On Mon, Apr 28, 2025 at 9:24 AM
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > Some PCIe devices trigger PCI bus errors when accesses are made to
> > > unassigned regions within their PCI configuration space. On certain
> > > platforms, this can lead to host system hangs or reboots.
> >
> > Do you have an example of this? What do you mean by bus error?
>
> By PCI bus error, I was referring to AER-reported uncorrectable errors.
> For example:
> pcieport 0000:c0:01.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
> pcieport 0000:c0:01.1: device [1022:1483] error status/mask=00004000/07a10000
> pcieport 0000:c0:01.1: [14] CmpltTO (First)

That sure looks like a device bug. You should not ever get time out
for a config space read.

> In another case, with a different device on a separate system, we
> observed an uncorrectable machine check exception:
> mce: [Hardware Error]: CPU 10: Machine Check Exception: 5 Bank 6: fb80000000000e0b

FW turning AER into a MCE is not suitable to use as a virtualization
host, IMHO. It is not possible to contain PCIe errors when they are
turned into MCE.

> Is it feasible to support such use cases using a quirk-based mechanism?
> For example, could we implement a quirk table that’s updateable via
> sysfs, as you suggested?

Dynamically updateable might be overkill, I think you have one
defective device. Have you talked to the supplier to see if it can be
corrected?

I think Alex is right to worry, if the device got this wrong, what
other mistakes have been made? Supporting virtualization is more than
just making a PCI device and using VFIO. You need to robustly design
HW to have full containment as well, including managing errors.

Alternatively you could handle this in qemu by sanitizing the config
space..

Jason
On Fri, May 16, 2025 at 2:35 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > By PCI bus error, I was referring to AER-reported uncorrectable errors.
> > For example:
> > pcieport 0000:c0:01.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
> > pcieport 0000:c0:01.1: device [1022:1483] error status/mask=00004000/07a10000
> > pcieport 0000:c0:01.1: [14] CmpltTO (First)
>
> That sure looks like a device bug. You should not ever get time out
> for a config space read.

Just to clarify, the above error was triggered by a write to the
configuration space. In fact, all the errors we have observed so far
were triggered by writes to unassigned PCI config space regions.

> Dynamically updateable might be overkill, I think you have one
> defective device. Have you talked to the supplier to see if it can be
> corrected?

So far, we have seen this issue on five PCIe devices across GPU and
storage classes from two different vendors. Therefore, we suspect the
problem is not limited to a single device, vendor, or class of devices.
We reported the issue to both vendors over two months ago. But we have
not gained any insights into the root cause of the issue from either
vendor so far.

> Alternatively you could handle this in qemu by sanitizing the config
> space..

While it's possible to address this issue for QEMU-KVM guests by
modifying QEMU, PCIe devices can also be assigned directly to
user-space applications such as DPDK via VFIO. We thought addressing
this at the VFIO driver level would help mitigate the issue in a
broader context beyond virtualized environments.

Thanks,
Chathura
On Sat, May 17, 2025 at 05:14:59PM +0000, Chathura Rajapaksha wrote:
> On Fri, May 16, 2025 at 2:35 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > By PCI bus error, I was referring to AER-reported uncorrectable errors.
> > > For example:
> > > pcieport 0000:c0:01.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
> > > pcieport 0000:c0:01.1: device [1022:1483] error status/mask=00004000/07a10000
> > > pcieport 0000:c0:01.1: [14] CmpltTO (First)
> >
> > That sure looks like a device bug. You should not ever get time out
> > for a config space read.
>
> Just to clarify, the above error was triggered by a write to the
> configuration space. In fact, all the errors we have observed so far
> were triggered by writes to unassigned PCI config space regions.

Yuk, devices really shouldn't refuse to respond to writes or reads :(

> So far, we have seen this issue on five PCIe devices across GPU and
> storage classes from two different vendors.

Ugh, that's awful.

> > Alternatively you could handle this in qemu by sanitizing the config
> > space..
>
> While it's possible to address this issue for QEMU-KVM guests by
> modifying QEMU, PCIe devices can also be assigned directly to
> user-space applications such as DPDK via VFIO. We thought addressing
> this at the VFIO driver level would help mitigate the issue in a
> broader context beyond virtualized environments.

VFIO can probably already trigger command timeouts if it tries hard
enough, as long as it is a contained AER I don't see that the kernel
*needs* to prevent it..

For virtualization I really do expect that any serious user will be
tightly controlling the config space and maybe this finding just
supports that qemu needs to be enhanced to have more configurability
here.

It certainly is easier to add an option to qemu to make it block any
address not in a cap chain than to add a bunch of PCI ID tables and
detection to the kernel..

Jason
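
The idea of blocking "any address not in a cap chain" can be illustrated with a small Python model that walks the standard capability list in a 256-byte config space snapshot. This is a sketch under stated assumptions, not QEMU or vfio code: it ignores PCIe extended capabilities, the function names are hypothetical, the per-capability lengths are supplied by the caller, and the 4-byte fallback for unknown capability IDs is an assumption made for the example.

```python
from typing import Dict, List, Tuple

PCI_STATUS = 0x06            # Status register offset
PCI_STATUS_CAP_LIST = 0x10   # "capabilities list present" bit
PCI_CAPABILITY_LIST = 0x34   # offset of the first capability pointer
PCI_STD_HEADER_SIZE = 0x40   # standard header occupies 0x00-0x3f


def assigned_ranges(cfg: bytes, cap_len: Dict[int, int],
                    default_len: int = 4) -> List[Tuple[int, int]]:
    """Return sorted [start, end) ranges for the header and each capability."""
    ranges = [(0, PCI_STD_HEADER_SIZE)]
    status = cfg[PCI_STATUS] | (cfg[PCI_STATUS + 1] << 8)
    if not status & PCI_STATUS_CAP_LIST:
        return ranges
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC   # low two pointer bits are reserved
    seen = set()                            # guards against looping chains
    while pos >= PCI_STD_HEADER_SIZE and pos not in seen:
        seen.add(pos)
        cap_id = cfg[pos]
        # Assumption: unknown capability IDs get a minimal default length.
        length = cap_len.get(cap_id, default_len)
        ranges.append((pos, min(pos + length, 256)))
        pos = cfg[pos + 1] & 0xFC           # follow the next pointer
    return sorted(ranges)


def is_assigned(offset: int, size: int, ranges) -> bool:
    """True if the whole access falls inside a single assigned range."""
    return any(start <= offset and offset + size <= end
               for start, end in ranges)


# Example snapshot: Power Management at 0x40 (8 bytes), MSI-X at 0x50 (12 bytes).
cfg = bytearray(256)
cfg[PCI_STATUS] = PCI_STATUS_CAP_LIST
cfg[PCI_CAPABILITY_LIST] = 0x40
cfg[0x40], cfg[0x41] = 0x01, 0x50
cfg[0x50], cfg[0x51] = 0x11, 0x00
ranges = assigned_ranges(bytes(cfg), {0x01: 8, 0x11: 12})
print(is_assigned(0x44, 4, ranges), is_assigned(0x48, 4, ranges))
```

A real filter would also need correct lengths for variable-size capabilities and a policy for devices that rely on unarchitected config space, which the thread notes some devices do.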