v2:

Eric noted in v1 that one of the drivers had a redundant wmask setting, since pci_pm_init() enabled writes to the power state field. This was added because vfio-pci was not setting wmask for this capability but allows writes to the PM state field through to the device. For vfio-pci, the QEMU-emulated config space is rather secondary to the config space accessed through vfio.

It turns out, therefore, that vfio-pci is nearly unique in not already managing the wmask of the PM capability state, and if we embrace that it's the pci_pm_init() caller's responsibility to manage the remaining contents and write access of the capability, then I think we also solve the question of migration compatibility. The new infrastructure here does not change whether any fields were previously writable; it only effects a mapping change based on the value found there.

This requires only a slight change to patch 1/, removing the setting of the wmask, but the commit log is also updated and comments added. I also made the bad-transition trace a little more obvious given Eric's comments. Patch 2/ is also updated so that vfio-pci effects the wmask change locally. The couple of drivers that don't currently update wmask simply don't get the new behavior of unmapping BARs when not in D0.

Incorporated reviews for the unmodified patches. Please re-review and report any noted issues. Thanks,

Alex

v1:

https://lore.kernel.org/all/20250220224918.2520417-1-alex.williamson@redhat.com/

Eric recently identified an issue[1] where, during graceful shutdown of a VM in a vIOMMU configuration, the guest driver places the device into the D3 power state, and the vIOMMU is then disabled, triggering an AddressSpace update. The device BARs are still mapped into the AS, but the vfio host driver refuses to DMA map the MMIO space due to the device power state.

The proposed solution in [1] was to skip mappings based on the device power state. Here we take a different approach.
The PCI spec defines that devices in the D1/D2/D3 power states should respond only to configuration and message requests, and all other requests should be handled as an Unsupported Request. In other words, the memory and I/O BARs are not accessible except when the device is in the D0 power state.

To emulate this behavior, we can factor the device power state into the mapping state of the device BARs. A BAR is therefore marked as unmapped if either the respective command register enable bit is clear or the device is not in the D0 power state.

In order to implement this, the PowerState field of the PMCSR register becomes writable, which allows the device to appear in lower power states. This therefore also implements D3 support (insofar as the BAR behavior goes) for all devices implementing the PM capability. The PCI spec requires D3 support.

An aspect that needs attention here is whether this change in the wmask and PMCSR bits becomes a problem for migration, and how we might solve it. For a guest migrating old->new, the device would always be in the D0 power state, but the register becomes writable. In the opposite direction, is it possible that a device could migrate in a low power state and be stuck there, since the bits are read-only in old QEMU? Do we need an option for this behavior and a machine state bump, or are there alternatives?
Thanks,
Alex

[1] https://lore.kernel.org/all/20250219175941.135390-1-eric.auger@redhat.com/

Alex Williamson (5):
  hw/pci: Basic support for PCI power management
  pci: Use PCI PM capability initializer
  vfio/pci: Delete local pm_cap
  pcie, virtio: Remove redundant pm_cap
  hw/vfio/pci: Re-order pre-reset

 hw/net/e1000e.c                 |  3 +-
 hw/net/eepro100.c               |  4 +-
 hw/net/igb.c                    |  3 +-
 hw/nvme/ctrl.c                  |  3 +-
 hw/pci-bridge/pcie_pci_bridge.c |  3 +-
 hw/pci/pci.c                    | 93 ++++++++++++++++++++++++++++++++-
 hw/pci/trace-events             |  2 +
 hw/vfio/pci.c                   | 34 ++++++------
 hw/vfio/pci.h                   |  1 -
 hw/virtio/virtio-pci.c          | 11 ++--
 include/hw/pci/pci.h            |  3 ++
 include/hw/pci/pci_device.h     |  3 ++
 include/hw/pci/pcie.h           |  2 -
 13 files changed, 127 insertions(+), 38 deletions(-)

-- 
2.48.1
Hello Michael,

Could you please re-ack (or not) v2 ?

Thanks

C.

On 2/25/25 22:52, Alex Williamson wrote:
> [...]
On Tue, Mar 04, 2025 at 01:18:45PM +0100, Cédric Le Goater wrote:
> Hello Michael,
>
> Could you please re-ack (or not) v2 ?
>
> Thanks
>
> C.

pci things:

Reviewed-by: Michael S. Tsirkin <mst@redhat.com>

> On 2/25/25 22:52, Alex Williamson wrote:
> > [...]
On 2/25/25 22:52, Alex Williamson wrote:
> [...]

Applied to vfio-next.

Thanks,

C.