RE: [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in d3hot state

RE: [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in d3hot state
Posted by Duan, Zhenzhong 2 days, 17 hours ago

>-----Original Message-----
>From: Alex Williamson <alex.williamson@redhat.com>
>Subject: Re: [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in
>d3hot state
>
>On Wed, 19 Feb 2025 18:58:58 +0100
>Eric Auger <eric.auger@redhat.com> wrote:
>
>> Since kernel commit:
>> 2b2c651baf1c ("vfio/pci: Invalidate mmaps and block the access
>> in D3hot power state")
>> any attempt to do an mmap access to a BAR when the device is in d3hot
>> state will generate a fault.
>>
>> On system_powerdown, if the VFIO device is translated by an IOMMU,
>> the device is moved to D3hot state and then the vIOMMU gets disabled
>> by the guest. As a result of this later operation, the address space is
>> swapped from translated to untranslated. When re-enabling the aliased
>> regions, the RAM regions are dma-mapped again and this causes DMA_MAP
>> faults when attempting the operation on BARs.
>>
>> To avoid doing the remap on those BARs, we compute whether the
>> device is in D3hot state and if so, skip the DMA MAP.
>
>Thinking on this some more, QEMU PCI code already manages the device
>BARs appearing in the address space based on the memory enable bit in
>the command register.  Should we do the same for PM state?
>
>IOW, the device going into low power state should remove the BARs from
>the AddressSpace and waking the device should re-add them.  The BAR DMA
>mapping should then always be consistent, whereas here nothing would
>remap the BARs when the device is woken.

If the BARs should be disabled before the D3hot transition, isn't it the guest's responsibility to do that itself?
Just as is done for FLR, which calls pci_dev_save_and_disable().

Thanks
Zhenzhong

>
>I imagine we'd need an interface to register the PM capability with the
>core QEMU PCI code, where address space updates are performed relative
>to both memory enable and power status.  There might be a way to
>implement this just for vfio-pci devices by toggling the enable state
>of the BAR mmaps relative to PM state, but doing it at the PCI core
>level seems like it'd provide behavior more true to physical hardware.
>Thanks,
>
>Alex
Re: [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in d3hot state
Posted by Alex Williamson 2 days, 16 hours ago
On Thu, 20 Feb 2025 04:24:13 +0000
"Duan, Zhenzhong" <zhenzhong.duan@intel.com> wrote:

> >-----Original Message-----
> >From: Alex Williamson <alex.williamson@redhat.com>
> >Subject: Re: [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in
> >d3hot state
> >
> >On Wed, 19 Feb 2025 18:58:58 +0100
> >Eric Auger <eric.auger@redhat.com> wrote:
> >  
> >> Since kernel commit:
> >> 2b2c651baf1c ("vfio/pci: Invalidate mmaps and block the access
> >> in D3hot power state")
> >> any attempt to do an mmap access to a BAR when the device is in d3hot
> >> state will generate a fault.
> >>
> >> On system_powerdown, if the VFIO device is translated by an IOMMU,
> >> the device is moved to D3hot state and then the vIOMMU gets disabled
> >> by the guest. As a result of this later operation, the address space is
> >> swapped from translated to untranslated. When re-enabling the aliased
> >> regions, the RAM regions are dma-mapped again and this causes DMA_MAP
> >> faults when attempting the operation on BARs.
> >>
> >> To avoid doing the remap on those BARs, we compute whether the
> >> device is in D3hot state and if so, skip the DMA MAP.  
> >
> >Thinking on this some more, QEMU PCI code already manages the device
> >BARs appearing in the address space based on the memory enable bit in
> >the command register.  Should we do the same for PM state?
> >
> >IOW, the device going into low power state should remove the BARs from
> >the AddressSpace and waking the device should re-add them.  The BAR DMA
> >mapping should then always be consistent, whereas here nothing would
> >remap the BARs when the device is woken.  
> 
> If the BARs should be disabled before the D3hot transition, isn't it the guest's responsibility to do that itself?
> Just as is done for FLR, which calls pci_dev_save_and_disable().

Nothing requires the guest to clear memory and IO from the command
register before entering a low power state, nor are we going to get
very far arguing that it's the guest's fault for triggering an error in
the hypervisor.  The PCI spec indicates that memory and IO BARs are only
accessible when the device is in the D0 power state.  On bare metal
accessing the BAR for a device in a low power state would generate an
unsupported request.  Therefore why should QEMU map BARs of devices in
low power states into the address space?  Thanks,

Alex
RE: [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in d3hot state
Posted by Duan, Zhenzhong 2 days, 13 hours ago

>-----Original Message-----
>From: Alex Williamson <alex.williamson@redhat.com>
>Subject: Re: [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in
>d3hot state
>
>On Thu, 20 Feb 2025 04:24:13 +0000
>"Duan, Zhenzhong" <zhenzhong.duan@intel.com> wrote:
>
>> >-----Original Message-----
>> >From: Alex Williamson <alex.williamson@redhat.com>
>> >Subject: Re: [RFC 0/2] hw/vfio/pci: Prevent BARs from being dma mapped in
>> >d3hot state
>> >
>> >On Wed, 19 Feb 2025 18:58:58 +0100
>> >Eric Auger <eric.auger@redhat.com> wrote:
>> >
>> >> Since kernel commit:
>> >> 2b2c651baf1c ("vfio/pci: Invalidate mmaps and block the access
>> >> in D3hot power state")
>> >> any attempt to do an mmap access to a BAR when the device is in d3hot
>> >> state will generate a fault.
>> >>
>> >> On system_powerdown, if the VFIO device is translated by an IOMMU,
>> >> the device is moved to D3hot state and then the vIOMMU gets disabled
>> >> by the guest. As a result of this later operation, the address space is
>> >> swapped from translated to untranslated. When re-enabling the aliased
>> >> regions, the RAM regions are dma-mapped again and this causes DMA_MAP
>> >> faults when attempting the operation on BARs.
>> >>
>> >> To avoid doing the remap on those BARs, we compute whether the
>> >> device is in D3hot state and if so, skip the DMA MAP.
>> >
>> >Thinking on this some more, QEMU PCI code already manages the device
>> >BARs appearing in the address space based on the memory enable bit in
>> >the command register.  Should we do the same for PM state?
>> >
>> >IOW, the device going into low power state should remove the BARs from
>> >the AddressSpace and waking the device should re-add them.  The BAR DMA
>> >mapping should then always be consistent, whereas here nothing would
>> >remap the BARs when the device is woken.
>>
>> If the BARs should be disabled before the D3hot transition, isn't it the guest's responsibility to do that itself?
>> Just as is done for FLR, which calls pci_dev_save_and_disable().
>
>Nothing requires the guest to clear memory and IO from the command
>register before entering a low power state, nor are we going to get
>very far arguing that it's the guest's fault for triggering an error in
>the hypervisor.  The PCI spec indicates that memory and IO BARs are only
>accessible when the device is in the D0 power state.  On bare metal
>accessing the BAR for a device in a low power state would generate an
>unsupported request.

Understood. Yes, it makes sense to remove the BARs from the AddressSpace while the device is in D3hot.

> Therefore why should QEMU map BARs of devices in
>low power states into the address space?
It should not.

Thanks
Zhenzhong
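
For illustration only, the rule the thread converges on (a BAR is visible in the address space only while its enable bit is set in the command register and the device is in D0) can be sketched as a small standalone predicate. This is not QEMU code: the helper name bar_should_be_mapped() and the PCI_PM_STATE_D0 macro are hypothetical, and the register bit values are redefined locally so that the sketch compiles on its own. How such a check would be wired into the PCI core or vfio-pci is left open, as in the discussion above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Bit values follow the PCI command register and the PM capability's
 * control/status register (PMCSR). They are redefined here so that the
 * sketch is self-contained.
 */
#define PCI_COMMAND_IO          0x1     /* I/O space enable */
#define PCI_COMMAND_MEMORY      0x2     /* memory space enable */
#define PCI_PM_CTRL_STATE_MASK  0x0003  /* PMCSR power state field, bits 1:0 */
#define PCI_PM_STATE_D0         0x0     /* hypothetical name for the D0 encoding */

/*
 * Hypothetical helper: decide whether a BAR should currently appear in
 * the address space. The BAR is mapped only if the matching enable bit
 * is set in the command register AND the device is in the D0 state.
 */
bool bar_should_be_mapped(uint16_t cmd, uint16_t pmcsr, bool is_io_bar)
{
    bool in_d0 = (pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_PM_STATE_D0;
    uint16_t enable = is_io_bar ? PCI_COMMAND_IO : PCI_COMMAND_MEMORY;

    return in_d0 && (cmd & enable) != 0;
}
```

With a check of this shape evaluated on every write to either register, a guest that enters D3hot without clearing the memory enable bit would still have its BARs removed, and waking the device would re-add them.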