[v1] PCI: Implement basic PCI PM capability backing

[PATCH 0/5] PCI: Implement basic PCI PM capability backing

Posted by Alex Williamson 11 months, 3 weeks ago

Eric recently identified an issue[1] where during graceful shutdown
of a VM in a vIOMMU configuration, the guest driver places the device
into the D3 power state, the vIOMMU is then disabled, triggering an
AddressSpace update.  The device BARs are still mapped into the AS,
but the vfio host driver refuses to DMA map the MMIO space due to the
device power state.

The proposed solution in [1] was to skip mappings based on the
device power state.  Here we take a different approach.  The PCI spec
defines that devices in D1/2/3 power state should respond only to
configuration and message requests and all other requests should be
handled as an Unsupported Request.  In other words, the memory and
IO BARs are not accessible except when the device is in the D0 power
state.

To emulate this behavior, we can factor the device power state into
the mapping state of the device BARs.  Therefore the BAR is marked
as unmapped if either the respective command register enable bit is
clear or the device is not in the D0 power state.
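
As a rough illustration of the rule above, the mapping decision can be
sketched in standalone C.  This is not the code from the series:
bar_is_mapped() and its parameters are hypothetical names chosen for this
example.  The bit values follow the standard PCI register layout (Command
register I/O and Memory enable bits, PMCSR PowerState field in bits 1:0,
where 0 means D0).

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard PCI Command register enable bits. */
#define PCI_COMMAND_IO          0x1
#define PCI_COMMAND_MEMORY      0x2
/* PMCSR PowerState field, bits 1:0 (0 = D0, 3 = D3hot). */
#define PCI_PM_CTRL_STATE_MASK  0x0003

/*
 * Hypothetical helper: a BAR counts as mapped only when its enable bit
 * in the Command register is set AND the device is in D0.
 */
static bool bar_is_mapped(uint16_t command, uint16_t pmcsr, bool is_io_bar)
{
    uint16_t enable = is_io_bar ? PCI_COMMAND_IO : PCI_COMMAND_MEMORY;
    bool in_d0 = (pmcsr & PCI_PM_CTRL_STATE_MASK) == 0;

    return (command & enable) != 0 && in_d0;
}
```

With memory decode enabled, a memory BAR is reported as mapped while the
PowerState field reads 0 and as unmapped as soon as it reads 3 (D3hot),
without the guest touching the Command register.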

In order to implement this, the PowerState field of the PMCSR
register becomes writable, which allows the device to appear in
lower power states.  This also therefore implements D3 support
(insofar as the BAR behavior) for all devices implementing the PM
capability.  The PCI spec requires D3 support.
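
To make the wmask change concrete, here is a minimal sketch of how a
write mask gates guest writes to a 16-bit config register.  The helper
config_write16() is a hypothetical name for this example and only models
plain read/write bits; it is not the series' implementation.

```c
#include <stdint.h>

/* PMCSR PowerState field, bits 1:0. */
#define PCI_PM_CTRL_STATE_MASK  0x0003

/*
 * Hypothetical helper: bits set in wmask take the guest-written value,
 * all other bits keep their previous contents.
 */
static uint16_t config_write16(uint16_t old, uint16_t val, uint16_t wmask)
{
    return (uint16_t)((old & ~wmask) | (val & wmask));
}
```

With a wmask of 0 the guest's attempt to write PowerState = 3 leaves the
register unchanged, so the device always appears in D0.  Once the mask
includes PCI_PM_CTRL_STATE_MASK, the same write lands and the device
appears in D3hot.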

An aspect that needs attention here is whether this change in the
wmask and PMCSR bits becomes a problem for migration, and how we
might solve it.  For a guest migrating old->new, the device would
always be in the D0 power state, but the register becomes writable.
In the opposite direction, is it possible that a device could
migrate in a low power state and be stuck there since the bits are
read-only in old QEMU?  Do we need an option for this behavior and a
machine state bump, or are there alternatives?
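
The new->old concern can be modeled with a small standalone sketch.
Everything here is hypothetical (struct pm_model and the helper names are
invented for the example), and it assumes that incoming migration data
overwrites the register without being filtered by wmask.  Whether the
real load path behaves that way is part of the open question above.

```c
#include <stdbool.h>
#include <stdint.h>

/* PMCSR PowerState field, bits 1:0 (0 = D0, 3 = D3hot). */
#define PCI_PM_CTRL_STATE_MASK  0x0003

/* Hypothetical model of one device's PMCSR on a given QEMU version. */
struct pm_model {
    uint16_t pmcsr;  /* current register value */
    uint16_t wmask;  /* bits the guest may write */
};

/* ASSUMPTION: migrated state is loaded as-is, bypassing wmask. */
static void load_migrated_pmcsr(struct pm_model *dev, uint16_t incoming)
{
    dev->pmcsr = incoming;
}

/* Guest writes are gated by wmask. */
static void guest_write_pmcsr(struct pm_model *dev, uint16_t val)
{
    dev->pmcsr = (uint16_t)((dev->pmcsr & ~dev->wmask) | (val & dev->wmask));
}

static bool in_d0(const struct pm_model *dev)
{
    return (dev->pmcsr & PCI_PM_CTRL_STATE_MASK) == 0;
}
```

Under that assumption, a destination with wmask = 0 that loads
PowerState = 3 cannot be returned to D0 by a guest write, while a
destination whose wmask covers the PowerState field can.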

Thanks,
Alex

[1]https://lore.kernel.org/all/20250219175941.135390-1-eric.auger@redhat.com/

Alex Williamson (5):
  hw/pci: Basic support for PCI power management
  pci: Use PCI PM capability initializer
  vfio/pci: Delete local pm_cap
  pcie, virtio: Remove redundant pm_cap
  hw/vfio/pci: Re-order pre-reset

 hw/net/e1000e.c                 |  3 +-
 hw/net/eepro100.c               |  4 +-
 hw/net/igb.c                    |  3 +-
 hw/nvme/ctrl.c                  |  3 +-
 hw/pci-bridge/pcie_pci_bridge.c |  3 +-
 hw/pci/pci.c                    | 83 ++++++++++++++++++++++++++++++++-
 hw/pci/trace-events             |  2 +
 hw/vfio/pci.c                   | 29 ++++++------
 hw/vfio/pci.h                   |  1 -
 hw/virtio/virtio-pci.c          | 11 ++---
 include/hw/pci/pci.h            |  3 ++
 include/hw/pci/pci_device.h     |  3 ++
 include/hw/pci/pcie.h           |  2 -
 13 files changed, 112 insertions(+), 38 deletions(-)

-- 
2.48.1

Re: [PATCH 0/5] PCI: Implement basic PCI PM capability backing

Posted by Cédric Le Goater 11 months, 2 weeks ago

> An aspect that needs attention here is whether this change in the
> wmask and PMCSR bits becomes a problem for migration, and how we
> might solve it.  For a guest migrating old->new, the device would
> always be in the D0 power state, but the register becomes writable.
> In the opposite direction, is it possible that a device could
> migrate in a low power state and be stuck there since the bits are
> read-only in old QEMU?  Do we need an option for this behavior and a
> machine state bump, or are there alternatives?

Should we introduce a migration blocker when a PCI device is in a low
power state?


Thanks,

C.

Re: [PATCH 0/5] PCI: Implement basic PCI PM capability backing

Posted by Alex Williamson 11 months, 2 weeks ago

On Mon, 24 Feb 2025 09:14:19 +0100
Cédric Le Goater <clg@redhat.com> wrote:

> > An aspect that needs attention here is whether this change in the
> > wmask and PMCSR bits becomes a problem for migration, and how we
> > might solve it.  For a guest migrating old->new, the device would
> > always be in the D0 power state, but the register becomes writable.
> > In the opposite direction, is it possible that a device could
> > migrate in a low power state and be stuck there since the bits are
> > read-only in old QEMU?  Do we need an option for this behavior and a
> > machine state bump, or are there alternatives?  
> 
> Should we introduce a migration blocker when a PCI device is in a low
> power state?

Blocking relative to the power state of a device seems relatively
non-intuitive for a user to debug.  Logically there's also an
opportunity that any device could support migration while in D3 if it
indicates a soft reset is performed on D3->D0 transition, regardless of
underlying VMM support for the device to migrate.  So that doesn't
really feel like the right approach to me.

FWIW, the emulated igb device will enter D3 when idle and bound to
vfio-pci in the guest, so we should be able to test migration in
various states with purely emulated devices.  Thanks,

Alex

RE: [PATCH 0/5] PCI: Implement basic PCI PM capability backing

Posted by Duan, Zhenzhong 11 months, 2 weeks ago


>-----Original Message-----
>From: Alex Williamson <alex.williamson@redhat.com>
>Subject: [PATCH 0/5] PCI: Implement basic PCI PM capability backing
>
>Eric recently identified an issue[1] where during graceful shutdown
>of a VM in a vIOMMU configuration, the guest driver places the device
>into the D3 power state, the vIOMMU is then disabled, triggering an
>AddressSpace update.  The device BARs are still mapped into the AS,
>but the vfio host driver refuses to DMA map the MMIO space due to the
>device power state.
>
>The proposed solution in [1] was to skip mappings based on the
>device power state.  Here we take a different approach.  The PCI spec
>defines that devices in D1/2/3 power state should respond only to
>configuration and message requests and all other requests should be
>handled as an Unsupported Request.  In other words, the memory and
>IO BARs are not accessible except when the device is in the D0 power
>state.
>
>To emulate this behavior, we can factor the device power state into
>the mapping state of the device BARs.  Therefore the BAR is marked
>as unmapped if either the respective command register enable bit is
>clear or the device is not in the D0 power state.
>
>In order to implement this, the PowerState field of the PMCSR
>register becomes writable, which allows the device to appear in
>lower power states.  This also therefore implements D3 support
>(insofar as the BAR behavior) for all devices implementing the PM
>capability.  The PCI spec requires D3 support.
>
>An aspect that needs attention here is whether this change in the
>wmask and PMCSR bits becomes a problem for migration, and how we
>might solve it.  For a guest migrating old->new, the device would
>always be in the D0 power state, but the register becomes writable.
>In the opposite direction, is it possible that a device could
>migrate in a low power state and be stuck there since the bits are
>read-only in old QEMU?  Do we need an option for this behavior and a
>machine state bump, or are there alternatives?
>
>Thanks,
>Alex
>
>[1]https://lore.kernel.org/all/20250219175941.135390-1-eric.auger@redhat.com/
>
>Alex Williamson (5):
>  hw/pci: Basic support for PCI power management
>  pci: Use PCI PM capability initializer
>  vfio/pci: Delete local pm_cap
>  pcie, virtio: Remove redundant pm_cap
>  hw/vfio/pci: Re-order pre-reset

For the whole series,

Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

Thanks
Zhenzhong

Re: [PATCH 0/5] PCI: Implement basic PCI PM capability backing

Posted by Michael S. Tsirkin 11 months, 3 weeks ago

On Thu, Feb 20, 2025 at 03:48:53PM -0700, Alex Williamson wrote:
> Eric recently identified an issue[1] where during graceful shutdown
> of a VM in a vIOMMU configuration, the guest driver places the device
> into the D3 power state, the vIOMMU is then disabled, triggering an
> AddressSpace update.  The device BARs are still mapped into the AS,
> but the vfio host driver refuses to DMA map the MMIO space due to the
> device power state.
> 
> The proposed solution in [1] was to skip mappings based on the
> device power state.  Here we take a different approach.  The PCI spec
> defines that devices in D1/2/3 power state should respond only to
> configuration and message requests and all other requests should be
> handled as an Unsupported Request.  In other words, the memory and
> IO BARs are not accessible except when the device is in the D0 power
> state.
> 
> To emulate this behavior, we can factor the device power state into
> the mapping state of the device BARs.  Therefore the BAR is marked
> as unmapped if either the respective command register enable bit is
> clear or the device is not in the D0 power state.
> 
> In order to implement this, the PowerState field of the PMCSR
> register becomes writable, which allows the device to appear in
> lower power states.  This also therefore implements D3 support
> (insofar as the BAR behavior) for all devices implementing the PM
> capability.  The PCI spec requires D3 support.
> 
> An aspect that needs attention here is whether this change in the
> wmask and PMCSR bits becomes a problem for migration, and how we
> might solve it.  For a guest migrating old->new, the device would
> always be in the D0 power state, but the register becomes writable.
> In the opposite direction, is it possible that a device could
> migrate in a low power state and be stuck there since the bits are
> read-only in old QEMU?  Do we need an option for this behavior and a
> machine state bump, or are there alternatives?
> 
> Thanks,
> Alex
> 
> [1]https://lore.kernel.org/all/20250219175941.135390-1-eric.auger@redhat.com/


PCI bits:

Reviewed-by: Michael S. Tsirkin <mst@redhat.com>

feel free to merge.

> Alex Williamson (5):
>   hw/pci: Basic support for PCI power management
>   pci: Use PCI PM capability initializer
>   vfio/pci: Delete local pm_cap
>   pcie, virtio: Remove redundant pm_cap
>   hw/vfio/pci: Re-order pre-reset
> 
>  hw/net/e1000e.c                 |  3 +-
>  hw/net/eepro100.c               |  4 +-
>  hw/net/igb.c                    |  3 +-
>  hw/nvme/ctrl.c                  |  3 +-
>  hw/pci-bridge/pcie_pci_bridge.c |  3 +-
>  hw/pci/pci.c                    | 83 ++++++++++++++++++++++++++++++++-
>  hw/pci/trace-events             |  2 +
>  hw/vfio/pci.c                   | 29 ++++++------
>  hw/vfio/pci.h                   |  1 -
>  hw/virtio/virtio-pci.c          | 11 ++---
>  include/hw/pci/pci.h            |  3 ++
>  include/hw/pci/pci_device.h     |  3 ++
>  include/hw/pci/pcie.h           |  2 -
>  13 files changed, 112 insertions(+), 38 deletions(-)
> 
> -- 
> 2.48.1

Re: [PATCH 0/5] PCI: Implement basic PCI PM capability backing

Posted by Cédric Le Goater 11 months, 2 weeks ago

On 2/20/25 23:54, Michael S. Tsirkin wrote:
> On Thu, Feb 20, 2025 at 03:48:53PM -0700, Alex Williamson wrote:
>> Eric recently identified an issue[1] where during graceful shutdown
>> of a VM in a vIOMMU configuration, the guest driver places the device
>> into the D3 power state, the vIOMMU is then disabled, triggering an
>> AddressSpace update.  The device BARs are still mapped into the AS,
>> but the vfio host driver refuses to DMA map the MMIO space due to the
>> device power state.
>>
>> The proposed solution in [1] was to skip mappings based on the
>> device power state.  Here we take a different approach.  The PCI spec
>> defines that devices in D1/2/3 power state should respond only to
>> configuration and message requests and all other requests should be
>> handled as an Unsupported Request.  In other words, the memory and
>> IO BARs are not accessible except when the device is in the D0 power
>> state.
>>
>> To emulate this behavior, we can factor the device power state into
>> the mapping state of the device BARs.  Therefore the BAR is marked
>> as unmapped if either the respective command register enable bit is
>> clear or the device is not in the D0 power state.
>>
>> In order to implement this, the PowerState field of the PMCSR
>> register becomes writable, which allows the device to appear in
>> lower power states.  This also therefore implements D3 support
>> (insofar as the BAR behavior) for all devices implementing the PM
>> capability.  The PCI spec requires D3 support.
>>
>> An aspect that needs attention here is whether this change in the
>> wmask and PMCSR bits becomes a problem for migration, and how we
>> might solve it.  For a guest migrating old->new, the device would
>> always be in the D0 power state, but the register becomes writable.
>> In the opposite direction, is it possible that a device could
>> migrate in a low power state and be stuck there since the bits are
>> read-only in old QEMU?  Do we need an option for this behavior and a
>> machine state bump, or are there alternatives?
>>
>> Thanks,
>> Alex
>>
>> [1]https://lore.kernel.org/all/20250219175941.135390-1-eric.auger@redhat.com/
> 
> 
> PCI bits:
> 
> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
> 
> feel free to merge.

Applied to vfio-next.

Thanks,

C.