vfio/pci: Block and audit accesses to unassigned config regions

[RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned config regions

Posted by Chathura Rajapaksha 9 months, 2 weeks ago

Some PCIe devices trigger PCI bus errors when accesses are made to
unassigned regions within their PCI configuration space. On certain
platforms, this can lead to host system hangs or reboots.

The current vfio-pci driver allows guests to access unassigned regions
in the PCI configuration space. Therefore, when such a device is passed
through to a guest, the guest can induce a host system hang or reboot
through crafted configuration space accesses, posing a threat to
system availability.

This patch series introduces:
1. Support for blocking guest accesses to unassigned
   PCI configuration space, and the ability to bypass this access control
   for specific devices. The patch introduces three module parameters:

   block_pci_unassigned_write:
   Blocks write accesses to unassigned config space regions.

   block_pci_unassigned_read:
   Blocks read accesses to unassigned config space regions.

   uaccess_allow_ids:
   Specifies the devices for which the above access control is bypassed.
   The value is a comma-separated list of device IDs in
   <vendor_id>:<device_id> format.

   Example usage:
   To block guest write accesses to unassigned config regions for all
   passed through devices except for the device with vendor ID 0x1234 and
   device ID 0x5678:

   block_pci_unassigned_write=1 uaccess_allow_ids=1234:5678

2. Auditing support for config space accesses to unassigned regions.
   When enabled, this logs such accesses for all passthrough devices.
   This feature is controlled via a new Kconfig option:

     CONFIG_VFIO_PCI_UNASSIGNED_ACCESS_AUDIT

   A new audit event type, AUDIT_VFIO, has been introduced to support
   this, allowing administrators to monitor and investigate suspicious
   behavior by guests.
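
As an illustration of the policy described in item 1, the following Python sketch models how the three parameters could combine. It is not the patch's kernel code: the function names `parse_allow_ids` and `should_block` are hypothetical, and the sketch assumes the IDs are hexadecimal, as in the `1234:5678` example. Only the parameter names and the `<vendor_id>:<device_id>` format come from the description above.

```python
def parse_allow_ids(value):
    """Parse a comma-separated list of <vendor_id>:<device_id> hex pairs."""
    ids = set()
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty entries such as a trailing comma
        vendor, sep, device = entry.partition(":")
        if not sep:
            raise ValueError(f"expected <vendor_id>:<device_id>, got {entry!r}")
        ids.add((int(vendor, 16), int(device, 16)))
    return ids


def should_block(is_write, vendor, device, block_write, block_read, allow_ids):
    """Decide whether an access to an unassigned config region is blocked."""
    if (vendor, device) in allow_ids:
        return False  # allow-listed devices bypass the access control
    return block_write if is_write else block_read
```

With `block_pci_unassigned_write=1 uaccess_allow_ids=1234:5678`, this model blocks unassigned writes for every device except 1234:5678 and leaves reads untouched.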

This proposal is intended to harden VFIO passthrough in environments
where guests are untrusted or system reliability is critical.

Any feedback and comments are greatly appreciated.

Chathura Rajapaksha (2):
  block accesses to unassigned PCI config regions
  audit accesses to unassigned PCI config regions

 drivers/vfio/pci/Kconfig           |  12 +++
 drivers/vfio/pci/vfio_pci_config.c | 164 ++++++++++++++++++++++++++++-
 include/uapi/linux/audit.h         |   1 +
 3 files changed, 176 insertions(+), 1 deletion(-)


base-commit: f1a3944c860b0615d0513110d8cf62bb94adbb41
-- 
2.34.1

Re: [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned config regions

Posted by Jason Gunthorpe 9 months, 2 weeks ago

On Sat, Apr 26, 2025 at 09:22:47PM +0000, Chathura Rajapaksha wrote:
> Some PCIe devices trigger PCI bus errors when accesses are made to
> unassigned regions within their PCI configuration space. On certain
> platforms, this can lead to host system hangs or reboots.

Do you have an example of this? What do you mean by bus error?

I would expect the device to return some constant like 0, or to return
an error TLP. The host bridge should convert the error TLP to
0xFFFFFFFF like all other read error conversions.

Is it a device problem or host bridge problem you are facing?

> 1. Support for blocking guest accesses to unassigned
>    PCI configuration space, and the ability to bypass this access control
>    for specific devices. The patch introduces three module parameters:
> 
>    block_pci_unassigned_write:
>    Blocks write accesses to unassigned config space regions.
> 
>    block_pci_unassigned_read:
>    Blocks read accesses to unassigned config space regions.
> 
>    uaccess_allow_ids:
>    Specifies the devices for which the above access control is bypassed.
>    The value is a comma-separated list of device IDs in
>    <vendor_id>:<device_id> format.
> 
>    Example usage:
>    To block guest write accesses to unassigned config regions for all
>    passed through devices except for the device with vendor ID 0x1234 and
>    device ID 0x5678:
> 
>    block_pci_unassigned_write=1 uaccess_allow_ids=1234:5678

No module parameters please.

At worst the kernel should maintain a quirks list to control this,
maybe with a sysfs to update it.

Jason

Re: [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned config regions

Posted by Alex Williamson 9 months, 2 weeks ago

On Mon, 28 Apr 2025 10:24:55 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Sat, Apr 26, 2025 at 09:22:47PM +0000, Chathura Rajapaksha wrote:
> > Some PCIe devices trigger PCI bus errors when accesses are made to
> > unassigned regions within their PCI configuration space. On certain
> > platforms, this can lead to host system hangs or reboots.  
> 
> Do you have an example of this? What do you mean by bus error?
> 
> I would expect the device to return some constant like 0, or to return
> an error TLP. The host bridge should convert the error TLP to
> 0xFFFFFFFF like all other read error conversions.
> 
> Is it a device problem or host bridge problem you are facing?

Or system problem.  Is it the access itself that generates a problem or
is it what the device does as a result of the access?  If the latter,
does this only remove a config space fuzzing attack vector against that
behavior or do we expect the device cannot generate the same behavior
via MMIO or IO register accesses?

We've previously leaned in the direction that we depend on hardware to
contain errors.  We cannot trap every access to the device or else we'd
severely limit the devices available to use and the performance of
those devices to the point that device assignment isn't worthwhile.

PCI config space is a slow path, it's already trapped, and it's
theoretically architected that we could restrict and audit much of it,
though some devices do rely on access to unarchitected config space.
But even within the architected space there are device specific
capabilities with undocumented protocols, exposing unknown features of
devices.  Does this incrementally make things better in general, or is
this largely masking a poorly behaved device/system?

> > 1. Support for blocking guest accesses to unassigned
> >    PCI configuration space, and the ability to bypass this access control
> >    for specific devices. The patch introduces three module parameters:
> > 
> >    block_pci_unassigned_write:
> >    Blocks write accesses to unassigned config space regions.
> > 
> >    block_pci_unassigned_read:
> >    Blocks read accesses to unassigned config space regions.
> > 
> >    uaccess_allow_ids:
> >    Specifies the devices for which the above access control is bypassed.
> >    The value is a comma-separated list of device IDs in
> >    <vendor_id>:<device_id> format.
> > 
> >    Example usage:
> >    To block guest write accesses to unassigned config regions for all
> >    passed through devices except for the device with vendor ID 0x1234 and
> >    device ID 0x5678:
> > 
> >    block_pci_unassigned_write=1 uaccess_allow_ids=1234:5678  
> 
> No module parameters please.
> 
> At worst the kernel should maintain a quirks list to control this,
> maybe with a sysfs to update it.

No module parameters might be difficult if we end up managing this as a
default policy selection, but certainly agree that if we get into
device specific behaviors we probably want those quirks automatically
deployed by the kernel.  Thanks,

Alex

Re: [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned config regions

Posted by Jason Gunthorpe 9 months, 2 weeks ago

On Mon, Apr 28, 2025 at 02:25:58PM -0600, Alex Williamson wrote:

> PCI config space is a slow path, it's already trapped, and it's
> theoretically architected that we could restrict and audit much of it,
> though some devices do rely on access to unarchitected config space.
> But even within the architected space there are device specific
> capabilities with undocumented protocols, exposing unknown features of
> devices.  Does this incrementally make things better in general, or is
> this largely masking a poorly behaved device/system?

I think there would be merit in having a qemu option to secure the
config space.

We talked before about presenting a prescribed virtualized
config space.

But we still have the issue that userspace with access to VFIO could
crash the machine on these uncontained platforms, which is not great.

It would be nice if the kernel could discover this, but it doesn't
seem possible. There is so much in the SOC design and FW
implementation that has to be done correctly for errors to be properly
containable.

Jason

Re: [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned config regions

Posted by Chathura Rajapaksha 9 months ago

Hi Jason and Alex,

Thank you for the comments, and apologies for the delayed response.

On Mon, Apr 28, 2025 at 9:24 AM
Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > Some PCIe devices trigger PCI bus errors when accesses are made to
> > unassigned regions within their PCI configuration space. On certain
> > platforms, this can lead to host system hangs or reboots.
>
> Do you have an example of this? What do you mean by bus error?

By PCI bus error, I was referring to AER-reported uncorrectable errors.
For example:
pcieport 0000:c0:01.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
pcieport 0000:c0:01.1:   device [1022:1483] error status/mask=00004000/07a10000
pcieport 0000:c0:01.1:    [14] CmpltTO                (First)
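
For readers unfamiliar with the `error status/mask` pair in the log above, here is a minimal Python sketch that decodes it. The bit table is a partial list of the AER Uncorrectable Error Status bit positions defined by the PCIe specification; the function name `decode_aer_uncorrectable` is made up for illustration and is not a kernel or library API.

```python
# Partial table of PCIe AER Uncorrectable Error Status bits (bit -> meaning).
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
}


def decode_aer_uncorrectable(status, mask):
    """Return (bit, name, masked) for every bit set in the status value."""
    reported = []
    for bit in range(32):
        if status & (1 << bit):
            name = AER_UNCOR_BITS.get(bit, "not in this partial table")
            masked = bool(mask & (1 << bit))
            reported.append((bit, name, masked))
    return reported
```

Applied to `00004000/07a10000`, the only status bit set is bit 14 (Completion Timeout), and it is not masked, which matches the `[14] CmpltTO` line.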

In another case, with a different device on a separate system, we
observed an uncorrectable machine check exception:
mce: [Hardware Error]: CPU 10: Machine Check Exception: 5 Bank 6: fb80000000000e0b
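
The top bits of the machine check status value can be read the same way. The Python sketch below decodes only the architectural flag bits 63 to 57 of an MCi_STATUS value, which use the same positions in the Intel and AMD documentation. The remaining bits are bank and model specific and are deliberately left undecoded, since the platform is not identified here. The function name `decode_mci_status` is hypothetical.

```python
# Architectural flag bits at the top of an MCi_STATUS value.
MCI_STATUS_BITS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MCi_MISC holds valid data
    "ADDRV": 58,  # MCi_ADDR holds a valid address
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status):
    """Return a dict mapping each flag name to True/False."""
    return {name: bool((status >> bit) & 1)
            for name, bit in MCI_STATUS_BITS.items()}
```

For `fb80000000000e0b` this reports VAL, OVER, UC, EN, MISCV and PCC set and ADDRV clear, i.e. an uncorrected error with processor context marked corrupt.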

> I would expect the device to return some constant like 0, or to return
> an error TLP. The host bridge should convert the error TLP to
> 0xFFFFFFFF like all other read error conversions.
>
> Is it a device problem or host bridge problem you are facing?

We haven’t been able to confirm definitively whether it’s a device or
host bridge issue due to limited visibility into platform internals.
However, we suspect it’s device-specific, as the same device triggered
similar failures across two different systems when writing to a
specific unassigned region in its config space. That said, we haven’t
exhaustively tested across devices and platforms.

If you have suggestions for identifying whether this stems from the
device or host bridge, we’d appreciate the guidance.

On Mon, Apr 28, 2025 at 4:26 PM
Alex Williamson <alex.williamson@redhat.com> wrote:
> Or system problem.  Is it the access itself that generates a problem or
> is it what the device does as a result of the access?  If the latter,
> does this only remove a config space fuzzing attack vector against that
> behavior or do we expect the device cannot generate the same behavior
> via MMIO or IO register accesses?

Unfortunately, we can't say for certain whether the fault lies in the
access itself or in the device's response. The cloud environments we
tested on don’t provide sufficient low-level hardware insight to
determine that. Please let me know if you have any pointers on how to
determine this at the kernel level.

This patch specifically aims to eliminate the config space fuzzing
vector. We have not investigated whether similar behavior can be
triggered through MMIO or IO register accesses.

> > No module parameters please.
> >
> > At worst the kernel should maintain a quirks list to control this,
> > maybe with a sysfs to update it.
>
> No module parameters might be difficult if we end up managing this as a
> default policy selection, but certainly agree that if we get into
> device specific behaviors we probably want those quirks automatically
> deployed by the kernel.  Thanks,

We used module parameters to give the flexibility to block unassigned
config space accesses on specific devices, especially in cases where new
problematic devices might emerge.

Is it feasible to support such use cases using a quirk-based mechanism?
For example, could we implement a quirk table that’s updateable via
sysfs, as you suggested?

Thank you for your time, and again, apologies for the delayed response.

Thanks,
Chathura

Re: [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned config regions

Posted by Jason Gunthorpe 9 months ago

On Fri, May 16, 2025 at 06:17:54PM +0000, Chathura Rajapaksha wrote:
> Hi Jason and Alex,
> 
> Thank you for the comments, and apologies for the delayed response.
> 
> On Mon, Apr 28, 2025 at 9:24 AM
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > Some PCIe devices trigger PCI bus errors when accesses are made to
> > > unassigned regions within their PCI configuration space. On certain
> > > platforms, this can lead to host system hangs or reboots.
> >
> > Do you have an example of this? What do you mean by bus error?
> 
> By PCI bus error, I was referring to AER-reported uncorrectable errors.
> For example:
> pcieport 0000:c0:01.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
> pcieport 0000:c0:01.1:   device [1022:1483] error status/mask=00004000/07a10000
> pcieport 0000:c0:01.1:    [14] CmpltTO                (First)

That sure looks like a device bug. You should never get a timeout
for a config space read.

> In another case, with a different device on a separate system, we
> observed an uncorrectable machine check exception:
> mce: [Hardware Error]: CPU 10: Machine Check Exception: 5 Bank 6: fb80000000000e0b

A platform whose FW turns AER into an MCE is not suitable for use as a
virtualization host, IMHO. It is not possible to contain PCIe errors
when they are turned into MCEs.

> Is it feasible to support such use cases using a quirk-based mechanism?
> For example, could we implement a quirk table that’s updateable via
> sysfs, as you suggested?

Dynamically updateable might be overkill, I think you have one
defective device. Have you talked to the supplier to see if it can be
corrected?

I think Alex is right to worry, if the device got this wrong, what
other mistakes have been made? Supporting virtualization is more than
just making a PCI device and using VFIO. You need to robustly design
HW to have full containment as well, including managing errors.

Alternatively you could handle this in qemu by sanitizing the config
space..

Jason

Re: [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned config regions

Posted by Chathura Rajapaksha 8 months, 4 weeks ago

On Fri, May 16, 2025 at 2:35 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > By PCI bus error, I was referring to AER-reported uncorrectable errors.
> > For example:
> > pcieport 0000:c0:01.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
> > pcieport 0000:c0:01.1:   device [1022:1483] error status/mask=00004000/07a10000
> > pcieport 0000:c0:01.1:    [14] CmpltTO                (First)
>
> That sure looks like a device bug. You should never get a timeout
> for a config space read.

Just to clarify, the above error was triggered by a write to the
configuration space. In fact, all the errors we have observed so far
were triggered by writes to unassigned PCI config space regions.

> Dynamically updateable might be overkill, I think you have one
> defective device. Have you talked to the supplier to see if it can be
> corrected?

So far, we have seen this issue on five PCIe devices across GPU and
storage classes from two different vendors. Therefore, we suspect the
problem is not limited to a single device, vendor, or class of devices.
We reported the issue to both vendors over two months ago, but neither
has given us any insight into the root cause so far.

> Alternatively you could handle this in qemu by sanitizing the config
> space..

While it's possible to address this issue for QEMU-KVM guests by
modifying QEMU, PCIe devices can also be assigned directly to
user-space applications such as DPDK via VFIO. We thought addressing
this at the VFIO driver level would help mitigate the issue in a
broader context beyond virtualized environments.

Thanks,
Chathura

Re: [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned config regions

Posted by Jason Gunthorpe 8 months, 2 weeks ago

On Sat, May 17, 2025 at 05:14:59PM +0000, Chathura Rajapaksha wrote:
> On Fri, May 16, 2025 at 2:35 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > By PCI bus error, I was referring to AER-reported uncorrectable errors.
> > > For example:
> > > pcieport 0000:c0:01.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
> > > pcieport 0000:c0:01.1:   device [1022:1483] error status/mask=00004000/07a10000
> > > pcieport 0000:c0:01.1:    [14] CmpltTO                (First)
> >
> > That sure looks like a device bug. You should never get a timeout
> > for a config space read.
> 
> Just to clarify, the above error was triggered by a write to the
> configuration space. In fact, all the errors we have observed so far
> were triggered by writes to unassigned PCI config space regions.

Yuk, devices really shouldn't refuse to respond to writes or reads :(

> So far, we have seen this issue on five PCIe devices across GPU and
> storage classes from two different vendors. 

Ugh, that's awful.

> > Alternatively you could handle this in qemu by sanitizing the config
> > space..
> 
> While it's possible to address this issue for QEMU-KVM guests by
> modifying QEMU, PCIe devices can also be assigned directly to
> user-space applications such as DPDK via VFIO. We thought addressing
> this at the VFIO driver level would help mitigate the issue in a
> broader context beyond virtualized environments.

VFIO can probably already trigger command timeouts if it tries hard
enough, as long as it is a contained AER I don't see that the kernel
*needs* to prevent it.. 

For virtualization I really do expect that any serious user will be
tightly controlling the config space and maybe this finding just
supports that qemu needs to be enhanced to have more configurability
here.

It certainly is easier to add an option to qemu to make it block
any address not in a cap chain than to add a bunch of PCI ID tables
and detection to the kernel..
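
To make "any address not in a cap chain" concrete, here is a rough Python sketch, under stated assumptions, of how a per-byte "assigned" map could be derived from a 256-byte snapshot of conventional config space. It is not QEMU or kernel code. The header offsets (status at 0x06, capability list bit 0x10, capabilities pointer at 0x34, 64-byte standard header) are the standard PCI ones; `cap_len` is a hypothetical caller-supplied function giving the size of each capability, and extended config space (0x100 and above) is not covered.

```python
PCI_STD_HEADER_SIZE = 64     # standard header occupies 0x00-0x3f
PCI_STATUS = 0x06
PCI_STATUS_CAP_LIST = 0x10   # status bit: capability list present
PCI_CAPABILITY_LIST = 0x34   # offset of the first capability pointer
CONFIG_SPACE_SIZE = 256      # conventional config space only


def walk_cap_chain(cfg):
    """Return [(offset, cap_id)] for each capability in the chain."""
    caps = []
    status = cfg[PCI_STATUS] | (cfg[PCI_STATUS + 1] << 8)
    if not status & PCI_STATUS_CAP_LIST:
        return caps
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()  # guards against a looping chain on a broken device
    while pos >= PCI_STD_HEADER_SIZE and pos not in seen:
        seen.add(pos)
        caps.append((pos, cfg[pos]))
        pos = cfg[pos + 1] & 0xFC  # next pointer, low two bits reserved
    return caps


def build_assigned_map(cfg, cap_len):
    """Mark the header and every capability byte as assigned."""
    assigned = [False] * CONFIG_SPACE_SIZE
    for off in range(PCI_STD_HEADER_SIZE):
        assigned[off] = True
    for pos, cap_id in walk_cap_chain(cfg):
        end = min(pos + cap_len(cap_id), CONFIG_SPACE_SIZE)
        for off in range(pos, end):
            assigned[off] = True
    return assigned
```

A filter built on such a map would refuse, or silently drop, any guest access whose offset maps to `False`. The hard part in practice is `cap_len`, since capability sizes depend on the capability ID and sometimes on its contents.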

Jason