This RFC introduces a mechanism to specify Guest Physical Addresses
(GPAs) for PCI BARs, allowing explicit placement of guest MMIO BAR
addresses to match host physical addresses for assigned devices.
On some platforms, P2P DMA is performed between devices within the same
IOMMU group. The PCI fabric ACS is configured to permit direct P2P
without going through the host bridge in order to achieve the required
performance.
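The following is a minimal C model of the routing decision behind this. It is an illustrative sketch, not code from the series. With P2P Request Redirect (RR, bit 2 of the PCIe ACS Control register) clear on a switch downstream port, a memory request whose address falls in a sibling port's window is routed straight to that sibling, untranslated. With RR set, the request goes upstream toward the root port and IOMMU. The `struct dsp_window` type, the function name, and the window values in the test are hypothetical. Because the switch compares the raw DMA address against windows programmed on the host, a guest-issued address reaches the peer directly only if it equals the peer's host physical BAR address.

```c
#include <assert.h>
#include <stdint.h>

/* ACS Control register bit (PCIe ACS Extended Capability). */
#define ACS_CTRL_RR 0x0004u /* P2P Request Redirect Enable */

/* Hypothetical: memory window decoded by one switch downstream port. */
struct dsp_window {
    uint64_t base;
    uint64_t limit; /* inclusive */
};

/*
 * Where does a memory request from the device below downstream port `src`
 * go? Returns the index of the sibling port it is routed to directly, or -1
 * if it is forwarded upstream (toward the root port and the IOMMU).
 */
static int route_p2p_request(const struct dsp_window *ports, int nports,
                             int src, uint16_t src_acs_ctrl, uint64_t addr)
{
    assert(src >= 0 && src < nports);
    if (src_acs_ctrl & ACS_CTRL_RR) {
        return -1; /* Request Redirect: every P2P request goes upstream */
    }
    for (int i = 0; i < nports; i++) {
        if (i != src && addr >= ports[i].base && addr <= ports[i].limit) {
            return i; /* routed by address alone, no translation */
        }
    }
    return -1; /* not a peer address: ordinary upstream DMA */
}
```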
To support this multi-device IOMMU group P2P scenario in virtualization,
the VM may need to use the same MMIO BAR addresses as the host physical
address layout.
This series implements a per-device PCI property, "fixed-bars", which
allows users to specify fixed BAR addresses. The property is generic
and is available on any PCI-capable machine. It is a comma-separated
list of BAR assignments:
barN@<addr>[,barM@<addr>]*
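The fixed-bars grammar above can be parsed in a few lines of C. This is a sketch that uses only the C standard library, not the series' code (QEMU would normally use its own string helpers). It assumes BAR indices 0 to 5 and rejects duplicate BARs, junk after an address, and trailing commas. The names `struct fixed_bars` and `parse_fixed_bars()` are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PCI_NUM_BARS 6

/* Hypothetical container for parsed fixed-bars assignments. */
struct fixed_bars {
    bool set[PCI_NUM_BARS];
    uint64_t addr[PCI_NUM_BARS];
};

/* Parse "barN@<addr>[,barM@<addr>]*". Returns 0 on success, -1 on error. */
static int parse_fixed_bars(const char *spec, struct fixed_bars *out)
{
    const char *p = spec;

    memset(out, 0, sizeof(*out));
    for (;;) {
        /* Expect "barN@" with N in 0..5, followed by a digit. */
        if (strncmp(p, "bar", 3) != 0 || p[3] < '0' || p[3] > '5' ||
            p[4] != '@' || !isdigit((unsigned char)p[5])) {
            return -1;
        }
        int n = p[3] - '0';
        if (out->set[n]) {
            return -1; /* same BAR listed twice */
        }

        char *end;
        errno = 0;
        /* Base 0: accepts 0x-prefixed hex and decimal (and 0-prefixed octal). */
        unsigned long long addr = strtoull(p + 5, &end, 0);
        if (errno == ERANGE) {
            return -1;
        }
        out->set[n] = true;
        out->addr[n] = addr;

        if (*end == '\0') {
            return 0;
        }
        if (*end != ',') {
            return -1; /* junk after the address */
        }
        p = end + 1;
    }
}
```

A 64-bit BAR occupies two BAR slots (bar2 also consumes BAR3), so a fuller implementation would also reject an entry for the upper half of a 64-bit BAR. That check needs the device's BAR types and is left out here.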
The virt machine builds on this with two additional machine properties:
pci-pre-enum
When enabled, QEMU performs PCI enumeration and resource assignment
before handing control to firmware (e.g. EDK2). This includes
programming 64-bit prefetchable BARs according to fixed-bars
assignments, and programming bridge prefetchable windows.
A "pci-enum-done" device-tree property is set so firmware preserves
the configuration.
pcie-mmio-window
Defines the MMIO64 window for PCIe devices. When using fixed-bars,
this allows the aperture to be resized or repositioned so all
assigned BARs fall within a valid address range.
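The bridge prefetchable windows that pci-pre-enum programs can be derived from the BARs already placed beneath each bridge. The following is a minimal sketch. It assumes all children are 64-bit prefetchable BARs with known base and size. The window is the smallest range covering them, with base and limit at the 1 MiB granularity of the PCI-to-PCI bridge Prefetchable Memory Base/Limit registers (config offsets 0x24/0x26, upper 32 bits at 0x28/0x2C). The struct and function names are hypothetical, and the BAR sizes in the test are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BRIDGE_PREF_GRAN (1ULL << 20) /* 1 MiB window granularity */

/* Hypothetical: an already-placed child BAR (size > 0). */
struct placed_bar {
    uint64_t base;
    uint64_t size;
};

/* Smallest 1 MiB-granular window covering all children; false if none. */
static bool bridge_pref_window(const struct placed_bar *kids, int n,
                               uint64_t *base, uint64_t *limit)
{
    if (n <= 0) {
        return false; /* nothing below: leave the window disabled */
    }
    uint64_t lo = UINT64_MAX, hi = 0;
    for (int i = 0; i < n; i++) {
        uint64_t last = kids[i].base + kids[i].size - 1;
        if (kids[i].base < lo) lo = kids[i].base;
        if (last > hi) hi = last;
    }
    *base = lo & ~(BRIDGE_PREF_GRAN - 1); /* round down to 1 MiB */
    *limit = hi | (BRIDGE_PREF_GRAN - 1); /* round up to ...FFFFF */
    return true;
}

/* Register values: bits 15:4 of 0x24/0x26 hold address bits 31:20. */
static void encode_pref_regs(uint64_t base, uint64_t limit,
                             uint16_t *base_lo, uint16_t *limit_lo,
                             uint32_t *base_hi, uint32_t *limit_hi)
{
    /* Bits 3:0 are read-only (they advertise 64-bit support). */
    *base_lo = (uint16_t)((base >> 16) & 0xfff0);
    *limit_lo = (uint16_t)((limit >> 16) & 0xfff0);
    *base_hi = (uint32_t)(base >> 32);   /* offset 0x28 */
    *limit_hi = (uint32_t)(limit >> 32); /* offset 0x2C */
}
```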
Why QEMU programs PCI resources rather than EDK2:
To support fixed BAR placement, QEMU performs PCI bus enumeration and
resource assignment prior to firmware execution. EDK2 already provides
a PCD-controlled mechanism (PcdPciDisableBusEnumeration) that allows
the platform to skip PCI enumeration and resource allocation. This
series leverages that mechanism so that, when enabled, firmware runs in
a discovery-only mode and preserves the configuration established by
QEMU.
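The bus-number half of that enumeration is a depth-first walk. Each bridge gets the next free bus as its secondary bus, and its subordinate bus is the highest number handed out anywhere beneath it. This is a minimal sketch over a hypothetical `struct bridge` tree (endpoints omitted), not the series' pci-enumerate.c.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CHILD_BRIDGES 8

/* Hypothetical bridge node; only bridges matter for bus numbering. */
struct bridge {
    struct bridge *children[MAX_CHILD_BRIDGES];
    int nchildren;
    uint8_t primary, secondary, subordinate;
};

/*
 * Depth-first bus numbering. `primary` is the bus the bridge sits on and
 * `*next_bus` the next unused bus number. Returns 0, or -1 if buses run out.
 */
static int assign_buses(struct bridge *br, unsigned primary, unsigned *next_bus)
{
    if (*next_bus > 255) {
        return -1;
    }
    br->primary = (uint8_t)primary;
    br->secondary = (uint8_t)*next_bus;
    *next_bus += 1;

    for (int i = 0; i < br->nchildren; i++) {
        if (assign_buses(br->children[i], br->secondary, next_bus) < 0) {
            return -1;
        }
    }
    /* Everything numbered below this bridge lies in secondary..next_bus-1. */
    br->subordinate = (uint8_t)(*next_bus - 1);
    return 0;
}
```

For a root port on bus 0 leading to a switch with two downstream ports, this yields secondary buses 1 (root port), 2 (upstream port), 3 and 4 (downstream ports), with subordinate 4 on the root port and upstream port.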
When pci-pre-enum is enabled, QEMU runs PCI enumeration and resource
allocation, prioritizing fixed BARs specified via fixed-bars. If
allocation fails due to alignment, overlap, or address space constraints,
QEMU terminates with an error. Otherwise, all BARs and bridge windows are
fully programmed before firmware execution.
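The failure conditions listed above (alignment, overlap, address space) reduce to simple checks. The following is a sketch under stated assumptions. Each fixed BAR's size is known; in practice it comes from the device, since fixed-bars carries only addresses. A BAR must be naturally aligned to its power-of-two size. The MMIO64 window is described as base plus size, which is one reading of pcie-mmio-window; the example value in this letter does not disambiguate it. The type, enum, and function names are hypothetical, and the BAR sizes in the test are invented.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical: address from fixed-bars, size from the device. */
struct fixed_bar {
    uint64_t addr;
    uint64_t size;
};

/* Hypothetical description of the MMIO64 window as base + size. */
struct mmio_window {
    uint64_t base;
    uint64_t size;
};

enum fixed_bar_err { FB_OK, FB_BAD_SIZE, FB_MISALIGNED, FB_OUTSIDE, FB_OVERLAP };

static int cmp_by_addr(const void *a, const void *b)
{
    const struct fixed_bar *x = a, *y = b;
    return (x->addr > y->addr) - (x->addr < y->addr);
}

/* Validate fixed BARs. Note: sorts `bars` in place by address. */
static enum fixed_bar_err check_fixed_bars(struct fixed_bar *bars, size_t n,
                                           struct mmio_window win)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t sz = bars[i].size;
        if (sz == 0 || (sz & (sz - 1)) != 0) {
            return FB_BAD_SIZE; /* BAR sizes are powers of two */
        }
        if (bars[i].addr & (sz - 1)) {
            return FB_MISALIGNED; /* BARs are naturally aligned */
        }
        /* addr >= base and addr + sz <= base + size, without overflow */
        if (bars[i].addr < win.base || sz > win.size ||
            bars[i].addr - win.base > win.size - sz) {
            return FB_OUTSIDE;
        }
    }
    qsort(bars, n, sizeof(*bars), cmp_by_addr);
    for (size_t i = 0; i + 1 < n; i++) {
        if (bars[i + 1].addr - bars[i].addr < bars[i].size) {
            return FB_OVERLAP; /* next BAR starts inside this one */
        }
    }
    return FB_OK;
}
```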
There is certainly room for improvement, but this RFC aims to gather
feedback on the overall approach chosen to address this problem.
We use the virt machine in this series as the concrete example
consuming the fixed-BAR model. Other machines may require their own
machine-specific mechanism (such as pcie-mmio-window) if they want to
adopt the same approach.
Example usage:
-machine virt,...,pcie-mmio-window=0x400000000000:0x400000000000,pci-pre-enum=on \
-device vfio-pci,host=0009:06:00.0,id=dev0 \
-set device.dev0.fixed-bars=bar2@0x6b8000000000,bar4@0x6c8000000000
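Since the goal is to mirror host physical addresses, the fixed-bars value could be derived from the host's view of the device rather than typed by hand. The following is a sketch that assumes a Linux host. Each of the first six lines of /sys/bus/pci/devices/<BDF>/resource holds start, end, and flags in hex. The flag bits used below are Linux's IORESOURCE_MEM (0x200), IORESOURCE_PREFETCH (0x2000), and IORESOURCE_MEM_64 (0x100000). The helper name is hypothetical. The sample text in the test is made up to match the example above; it was not captured from real hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define IORESOURCE_MEM      0x00000200ULL
#define IORESOURCE_PREFETCH 0x00002000ULL
#define IORESOURCE_MEM_64   0x00100000ULL

/*
 * Build a fixed-bars value listing every assigned 64-bit prefetchable BAR
 * found in the text of a sysfs "resource" file. Returns 0 or -1.
 */
static int fixed_bars_from_sysfs(const char *resource_text, char *out,
                                 size_t outlen)
{
    const unsigned long long want =
        IORESOURCE_MEM | IORESOURCE_PREFETCH | IORESOURCE_MEM_64;
    const char *line = resource_text;
    size_t used = 0;

    out[0] = '\0';
    for (int bar = 0; bar < 6 && line && *line; bar++) {
        unsigned long long start, end, flags;
        if (sscanf(line, "%llx %llx %llx", &start, &end, &flags) != 3) {
            return -1;
        }
        /* end == 0 marks an unimplemented BAR or the upper half of a 64-bit one. */
        if (end != 0 && (flags & want) == want) {
            int n = snprintf(out + used, outlen - used, "%sbar%d@0x%llx",
                             used ? "," : "", bar, start);
            if (n < 0 || (size_t)n >= outlen - used) {
                return -1; /* output buffer too small */
            }
            used += (size_t)n;
        }
        line = strchr(line, '\n');
        if (line) {
            line++;
        }
    }
    return 0;
}
```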
Testing:
This series was tested on NVIDIA GB300 platforms with a recent Linux
kernel. GPUDirect P2P between a GPU and a CX8 NIC requires a PCIe
topology in the VM that mirrors bare metal (e.g. both devices under the
same switch and ACS tuned for the minimal P2P paths needed for GPUDirect
RDMA).
TODO:
- The fixed BAR allocator handles 64-bit prefetchable BARs and related
bridge prefetch windows only. Programming PIO, 32-bit MMIO, and
64-bit non-prefetchable BARs, and sizing bridge windows for those
resource types, is left for follow-up patches.
- SR-IOV virtual functions are not included when sizing bridge prefetch
apertures and may require additional work.
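- Add the ACPI _DSM function that tells the guest OS to preserve the
firmware's PCI boot configuration (function 5 in the PCI Firmware
Specification), so the fixed BARs are not reassigned.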
A git branch with this series applied is available at:
https://github.com/tdavenvidia/upstream-qemu/commits/upstream_May_08_26/
The related EDK2 change is available at:
https://github.com/tdavenvidia/edk2/commits/upstream_May_08_26/
Tushar Dave (8):
hw/pci: add fixed-bars property to allow fixed BAR addresses
hw/pci: enumerate PCI bus and program bridge bus numbers
hw/pci: introduce allocator for fixed BAR placement
hw/pci: pack remaining BARs and update bridge windows
hw/pci: allocate remaining BARs for buses without fixed BARs
hw/pci: finalize bridge prefetch windows after BAR allocation
hw/arm/virt: add pcie-mmio-window machine property
hw/arm/virt: add pci-pre-enum machine property
hw/arm/virt.c | 157 ++++-
hw/pci/meson.build | 2 +
hw/pci/pci-enumerate.c | 144 +++++
hw/pci/pci-enumerate.h | 15 +
hw/pci/pci-resource.c | 1099 +++++++++++++++++++++++++++++++++++
hw/pci/pci-resource.h | 82 +++
hw/pci/pci.c | 108 ++++
include/hw/arm/virt.h | 3 +
include/hw/pci/pci_device.h | 10 +
9 files changed, 1615 insertions(+), 5 deletions(-)
create mode 100644 hw/pci/pci-enumerate.c
create mode 100644 hw/pci/pci-enumerate.h
create mode 100644 hw/pci/pci-resource.c
create mode 100644 hw/pci/pci-resource.h
--
2.34.1
On Fri, May 08, 2026 at 01:37:09PM -0500, Tushar Dave wrote:
> This RFC introduces a mechanism to specify Guest Physical Addresses
> (GPAs) for PCI BARs, allowing explicit placement of guest MMIO BAR
> addresses to match host physical addresses for assigned devices.
>
> On some platforms, P2P DMA is performed between devices within the same
> IOMMU group. The PCI fabric ACS is configured to permit direct P2P
> without going through the host bridge in order to achieve the required
> performance.

Pass this info to guest firmware, let it set bars any way it wants?
On 5/11/2026 4:09 AM, Michael S. Tsirkin wrote:
> On Fri, May 08, 2026 at 01:37:09PM -0500, Tushar Dave wrote:
>> This RFC introduces a mechanism to specify Guest Physical Addresses
>> (GPAs) for PCI BARs, allowing explicit placement of guest MMIO BAR
>> addresses to match host physical addresses for assigned devices.
>>
>> On some platforms, P2P DMA is performed between devices within the same
>> IOMMU group. The PCI fabric ACS is configured to permit direct P2P
>> without going through the host bridge in order to achieve the required
>> performance.
>
> Pass this info to guest firmware, let it set bars any way it wants?

We are using firmware, relying on the existing EDK2-supported mode
enabled by PcdPciDisableBusEnumeration, where firmware is expected
to preserve the PCI topology and BAR programming established by
the hypervisor.

In our case, the hypervisor is QEMU, which performs PCI enumeration
and resource assignment before handing control to firmware. EDK2
then explicitly refrains from re-enumerating or reallocating PCI
BARs, as this is already a supported firmware behavior.

-Tushar
On Mon, May 11, 2026 at 01:10:43PM -0500, Tushar Dave wrote:
>
> On 5/11/2026 4:09 AM, Michael S. Tsirkin wrote:
> > On Fri, May 08, 2026 at 01:37:09PM -0500, Tushar Dave wrote:
> >> This RFC introduces a mechanism to specify Guest Physical Addresses
> >> (GPAs) for PCI BARs, allowing explicit placement of guest MMIO BAR
> >> addresses to match host physical addresses for assigned devices.
> >>
> >> On some platforms, P2P DMA is performed between devices within the same
> >> IOMMU group. The PCI fabric ACS is configured to permit direct P2P
> >> without going through the host bridge in order to achieve the required
> >> performance.
> >
> > Pass this info to guest firmware, let it set bars any way it wants?
>
> We are using firmware, relying on the existing EDK2-supported mode
> enabled by PcdPciDisableBusEnumeration, where firmware is expected
> to preserve the PCI topology and BAR programming established by
> the hypervisor.
>
> In our case, the hypervisor is QEMU, which performs PCI enumeration
> and resource assignment before handing control to firmware. EDK2
> then explicitly refrains from re-enumerating or reallocating PCI
> BARs, as this is already a supported firmware behavior.
>
> -Tushar

I see no advantage in performing pci enumeration in qemu when
firmware is already doing an adequate job of it.

If you want firmware to map specific devices at specific addresses,
pass that info along to it.

--
MST
On Fri, 8 May 2026 at 19:37, Tushar Dave <tdave@nvidia.com> wrote:
>
> This RFC introduces a mechanism to specify Guest Physical Addresses
> (GPAs) for PCI BARs, allowing explicit placement of guest MMIO BAR
> addresses to match host physical addresses for assigned devices.
>
> On some platforms, P2P DMA is performed between devices within the same
> IOMMU group. The PCI fabric ACS is configured to permit direct P2P
> without going through the host bridge in order to achieve the required
> performance.
>
> To support this multi-device IOMMU group P2P scenario in virtualization,
> the VM may need to use the same MMIO BAR addresses as the host physical
> address layout.

This feels like something's wrong in the design. A VM doesn't
necessarily have the same memory layout as the host: the VM
hardware is all about making that possible.

> Why QEMU programs PCI resources rather than EDK2:
>
> To support fixed BAR placement, QEMU performs PCI bus enumeration and
> resource assignment prior to firmware execution. EDK2 already provides
> a PCD-controlled mechanism (PcdPciDisableBusEnumeration) that allows
> the platform to skip PCI enumeration and resource allocation. This
> series leverages that mechanism so that, when enabled, firmware runs in
> a discovery-only mode and preserves the configuration established by
> QEMU.

I'm definitely not enthusiastic about having QEMU do PCI bus
enumeration. This isn't the way the hardware does it, and it's
a lot of code that's duplicating what the guest already has
(there's over a thousand lines of code in this patchset).

> We use the virt machine in this series as the concrete example
> consuming the fixed-BAR model. Other machines may require their own
> machine-specific mechanism (such as pcie-mmio-window) if they want to
> adopt the same approach.
>
> Example usage:
>
> -machine virt,...,pcie-mmio-window=0x400000000000:0x400000000000,pci-pre-enum=on \
> -device vfio-pci,host=0009:06:00.0,id=dev0 \
> -set device.dev0.fixed-bars=bar2@0x6b8000000000,bar4@0x6c8000000000

...and you end up with enormous command lines like this full
of magic numbers relating to address space layout.

I think it would be better to find a way of doing this that
doesn't have the "VM address space layout has to match the
host layout" restriction.

-- PMM
Hello Tushar,

On Fri, 8 May 2026, at 20:37, Tushar Dave via groups.io wrote:
> This RFC introduces a mechanism to specify Guest Physical Addresses
> (GPAs) for PCI BARs, allowing explicit placement of guest MMIO BAR
> addresses to match host physical addresses for assigned devices.
>
> On some platforms, P2P DMA is performed between devices within the same
> IOMMU group. The PCI fabric ACS is configured to permit direct P2P
> without going through the host bridge in order to achieve the required
> performance.
>
> To support this multi-device IOMMU group P2P scenario in virtualization,
> the VM may need to use the same MMIO BAR addresses as the host physical
> address layout.
>

Did you consider implementing this using Enhanced Allocation (EA)? If so,
could you explain why it is not suitable here?

Also, I think I understand what the intent is here, but could you describe
the topology in a bit more detail? These are assigned physical PCIe endpoints
behind an emulated host bridge, right? And the BAR needs to reside at an
a priori fixed address so that another PCIe endpoint behind the same emulated
host bridge can DMA straight into it?

Doing PCIe enumeration at yet another level is not a feasible approach imo,
having UEFI and Linux play nice together is already a bit of a challenge.

Is there any way this could be handled by having special rules for inbound
translation in the host bridge driver/implementation?
On 5/11/2026 6:43 AM, Ard Biesheuvel wrote:
> Hello Tushar,
>
> On Fri, 8 May 2026, at 20:37, Tushar Dave via groups.io wrote:
>> This RFC introduces a mechanism to specify Guest Physical Addresses
>> (GPAs) for PCI BARs, allowing explicit placement of guest MMIO BAR
>> addresses to match host physical addresses for assigned devices.
>>
>> On some platforms, P2P DMA is performed between devices within the same
>> IOMMU group. The PCI fabric ACS is configured to permit direct P2P
>> without going through the host bridge in order to achieve the required
>> performance.
>>
>> To support this multi-device IOMMU group P2P scenario in virtualization,
>> the VM may need to use the same MMIO BAR addresses as the host physical
>> address layout.
>>
>
> Did you consider implementing this using Enhanced Allocation (EA)? If so,
> could you explain why it is not suitable here?
I have not evaluated EA for this design. When I looked at EDK2, I
chose PcdPciDisableBusEnumeration because it cleanly preserves fixed
BAR programming established by the hypervisor, at the cost of QEMU
performing PCI bus number and resource assignment.
I did a quick search and do not see EA support in EDK2. Any pointers
to EA being used in a similar fashion to achieve fixed BAR placement
would be appreciated.
>
> Also, I think I understand what the intent is here, but could you describe
> the topology in a bit more detail? These are assigned physical PCIe endpoints
> behind an emulated host bridge, right? And the BAR needs to reside at an
> a priori fixed address so that another PCIe endpoint behind the same emulated
> host bridge can DMA straight into it?
Yes, that is all correct.
-[0000:00]-+-00.0 Host bridge
           +-01.0 Root Port
           \-[0000:02]
               +-00.0 Switch Upstream Port
               +-01.0 Switch Downstream Port A
               |   \-[0000:04] Device A
               +-02.0 Switch Downstream Port B
                   \-[0000:05] Device B
>
> Doing PCIe enumeration at yet another level is not a feasible approach imo,
> having UEFI and Linux play nice together is already a bit of a challenge.
I agree, but to clarify: in this case QEMU performs PCI topology
initialization and resource assignment prior to firmware execution,
and EDK2 then skips full PCI bus re-enumeration. Linux sees a fully
enumerated bus from firmware just as it does today. There is no
duplicated enumeration step between firmware and Linux when we use
EDK2 with PcdPciDisableBusEnumeration.
>
> Is there any way this could be handled by having special rules for inbound
> translation in the host bridge driver/implementation?
Not that I can think of.
Thanks.
-Tushar
On Tue, 12 May 2026 12:25:45 -0500
Tushar Dave <tdave@nvidia.com> wrote:

> On 5/11/2026 6:43 AM, Ard Biesheuvel wrote:
> > Hello Tushar,
> >
> > On Fri, 8 May 2026, at 20:37, Tushar Dave via groups.io wrote:
> >> This RFC introduces a mechanism to specify Guest Physical Addresses
> >> (GPAs) for PCI BARs, allowing explicit placement of guest MMIO BAR
> >> addresses to match host physical addresses for assigned devices.
> >>
> >> On some platforms, P2P DMA is performed between devices within the same
> >> IOMMU group. The PCI fabric ACS is configured to permit direct P2P
> >> without going through the host bridge in order to achieve the required
> >> performance.
> >>
> >> To support this multi-device IOMMU group P2P scenario in virtualization,
> >> the VM may need to use the same MMIO BAR addresses as the host physical
> >> address layout.
> >>
> >
> > Did you consider implementing this using Enhanced Allocation (EA)? If so,
> > could you explain why it is not suitable here?
>
> I have not evaluated EA for this design. When I looked at EDK2, I
> chose PcdPciDisableBusEnumeration because it cleanly preserves fixed
> BAR programming established by the hypervisor, at the cost of QEMU
> performing PCI bus number and resource assignment.
>
> I did a quick search and do not see EA support in EDK2. Any pointers
> to EA being used in a similar fashion to achieve fixed BAR placement
> would be appreciated.

EA wasn't on my radar either, but I did some research and chatted with
Tushar and I think it could work. I'll sketch out a rough idea of what
it might look like.

EA describes BAR equivalents (fixed base address, size, and type) in a
separate capability while the corresponding device BAR registers appear
unimplemented. Linux already consumes endpoint EA capabilities and
marks the resulting resources IORESOURCE_PCI_FIXED. EDK2 doesn't know
about EA (cap 0x14 isn't defined anywhere in MdePkg, and PciBusDxe
never consults it afaict), but that turns out to be useful here rather
than a problem.

Starting at the QEMU device, for a vfio-pci device we'd need to
virtualize the real BARs as unimplemented and surface that information
via a synthesized EA capability instead. It's debatable whether this
is a generic PCI mechanism or vfio-pci specific, whether HPA is
automatically used as the base address for vfio-pci devices or
user-specified, and the capability offset in config space. None of
those fundamentally change the shape of the flow.

For the absolute bare-minimum level of support (EA device on the root
complex, EA resources don't overlap the VM address space or MMIO range,
EDK2 firmware, Linux guest booted with pci=nocrs) I think this actually
works with just adding the EA capability above. Let's walk through
those constraints and how we relax them.

At the firmware level we lean on the real BAR registers being
unimplemented for EA devices, so EDK2 allocates no MMIO or IO resources
for them. Only bus numbers get assigned if the EA device sits in a PCI
hierarchy. That's exactly what we want: EDK2 doing conventional bus
assignment but staying out of the EA resource flow entirely.

Instead of firmware EA enlightenment we lean on the guest OS. Linux
reads endpoint EA today, but the bridge aperture sizing path ignores
those fixed resources. As Tushar's series demonstrates, generically
handling mixed "fixed-BAR" and programmable-BAR devices in one
hierarchy is hard. An incremental Linux enhancement that greatly
simplifies the problem space would be to program bridge apertures only
for hierarchies consisting entirely of fixed resources. The math
becomes trivial (window spans min..max of fixed children, aligned to
bridge granularity), and there's no regression risk: these hierarchies
currently fail silently. The sizer ignores fixed children and the
fixed-claim walk-up finds no containing parent. This enhancement,
plus the homogeneous-hierarchy constraint, removes the root-complex
constraint and lets us mirror the bare-metal topologies we need.

Resource ranges are a bit messier. The extent of the EA device ranges
could be determined in QEMU and the VM address map adjusted to prevent
overlap. Tushar already has a similar user-specified machine option in
this series. That range also needs to reach the guest as a CRS (to
avoid pci=nocrs) but needs to stay distinct from the DT range passed to
EDK2 for programmable BAR devices so EDK2 won't place a programmable
BAR or bridge window into the EA region. So long as we keep EA and
programmable devices in separate hierarchies, EDK2 only needs the
programmable range via DT and we can add the EA range as additional CRS
ranges visible only to the guest.

In practice, EDK2 programs all the programmable devices and the EA
devices live entirely in the additional CRS. A possibly cleaner
alternative is additional PXB host bridges for the EA devices, each
with its own CRS. That sidesteps the DT/CRS split entirely since the
EA PXB has nothing for EDK2 to allocate anyway.

If we agree that homogeneous hierarchies (no mixing of EA and
programmable BARs) is a reasonable constraint, and possibly extend that
to homogeneous per host bridge to simplify the CRS mapping, we have the
following work items:

* Extend Linux EA support to program bridge apertures for subordinate
homogeneous EA hierarchies.

* Develop options to virtualize programmable BARs as EA for vfio-pci
devices, if not generically for the benefit of testing.

* Implement a way to poke holes in the VM address space and plumb
through to account for addresses used by EA devices.

* Provide those same ranges to the guest via CRS (but not via DT to
EDK2), or alternatively expose them through additional PXB host
bridges.

Does that shape roughly seem accurate? Are there additional gaps I've
missed? Thanks,

Alex
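For reference, an EA capability (ID 0x14) carries one entry per BAR equivalent. The following is a minimal sketch of encoding and decoding one 64-bit prefetchable entry, following the field layout that Linux parses in pci_ea_read(). DW0 holds Entry Size (bits 2:0, the count of DWORDs after DW0), BEI (bits 7:4), Primary and Secondary Properties (bits 15:8 and 23:16, where 0x01 means prefetchable memory), Writable (bit 30), and Enable (bit 31). Base and MaxOffset follow; bit 1 of each flags a 64-bit field, and bits 1:0 of MaxOffset are implied as 1s. The upper halves come last. The helper names are hypothetical, the size must be a multiple of 4, and the sizes in the test are invented.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_CAP_ID_EA        0x14
#define EA_PROP_MEM_PREFETCH 0x01u
#define EA_FIELD_IS_64       0x2u
#define EA_ENABLE            (1u << 31)

/* Encode a 64-bit prefetchable EA entry into 5 dwords; returns 5. */
static int ea_encode_entry(uint32_t dw[5], unsigned bei, uint64_t base,
                           uint64_t size)
{
    uint64_t max_offset = size - 1; /* bits 1:0 are implied as 1s */
    const uint32_t entry_size = 4;  /* dwords that follow DW0 */

    dw[0] = EA_ENABLE | (EA_PROP_MEM_PREFETCH << 16) |
            (EA_PROP_MEM_PREFETCH << 8) | ((bei & 0xfu) << 4) | entry_size;
    dw[1] = (uint32_t)(base & 0xfffffffcu) | EA_FIELD_IS_64;       /* Base[31:2] */
    dw[2] = (uint32_t)(max_offset & 0xfffffffcu) | EA_FIELD_IS_64; /* MaxOffset[31:2] */
    dw[3] = (uint32_t)(base >> 32);                                 /* Base[63:32] */
    dw[4] = (uint32_t)(max_offset >> 32);                           /* MaxOffset[63:32] */
    return 5;
}

/* Decode start/end (inclusive) the way a guest OS would read them back. */
static void ea_decode_entry(const uint32_t dw[5], uint64_t *start,
                            uint64_t *end)
{
    int i = 3; /* upper dwords start after DW0, Base, MaxOffset */

    *start = dw[1] & 0xfffffffcu;
    if (dw[1] & EA_FIELD_IS_64) {
        *start |= (uint64_t)dw[i++] << 32;
    }
    *end = *start + (dw[2] | 0x3u);
    if (dw[2] & EA_FIELD_IS_64) {
        *end += (uint64_t)dw[i] << 32;
    }
}
```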
On Tue, May 12, 2026 at 05:06:50PM -0600, Alex Williamson wrote:
> [...]
>
> If we agree that homogeneous hierarchies (no mixing of EA and
> programmable BARs) is a reasonable constraint, and possibly extend that
> to homogeneous per host bridge to simplify the CRS mapping, we have the
> following work items:
>
> * Extend Linux EA support to program bridge apertures for subordinate
> homogeneous EA hierarchies.
>
> * Develop options to virtualize programmable BARs as EA for vfio-pci
> devices, if not generically for the benefit of testing.
>
> * Implement a way to poke holes in the VM address space and plumb
> through to account for addresses used by EA devices.
>
> * Provide those same ranges to the guest via CRS (but not via DT to
> EDK2), or alternatively expose them through additional PXB host
> bridges.
>
> Does that shape roughly seem accurate? Are there additional gaps I've
> missed? Thanks,
>
> Alex

just one question why not do it in firmware so windows
is thinkably also handled?
On Tue, May 12, 2026, at 5:12 PM, Michael S. Tsirkin wrote:
> On Tue, May 12, 2026 at 05:06:50PM -0600, Alex Williamson wrote:
>> If we agree that homogeneous hierarchies (no mixing of EA and
>> programmable BARs) is a reasonable constraint, and possibly extend that
>> to homogeneous per host bridge to simplify the CRS mapping, we have the
>> following work items:
>>
>> * Extend Linux EA support to program bridge apertures for subordinate
>> homogeneous EA hierarchies.
>>
>> * Develop options to virtualize programmable BARs as EA for vfio-pci
>> devices, if not generically for the benefit of testing.
>>
>> * Implement a way to poke holes in the VM address space and plumb
>> through to account for addresses used by EA devices.
>>
>> * Provide those same ranges to the guest via CRS (but not via DT to
>> EDK2), or alternatively expose them through additional PXB host
>> bridges.
>>
>> Does that shape roughly seem accurate? Are there additional gaps I've
>> missed? Thanks,
>
> just one question why not do it in firmware so windows
> is thinkably also handled?

I suppose someone could chime in if they have a similar requirement
for Windows guests. Otherwise, the incremental effort to extend
Linux EA support seems smaller, though I also don't know what, if
any support Windows has for EA to bother. Regardless, improving
Linux EA support might help elsewhere and doesn't preclude edk2
support in the future. Thanks,

Alex
On Wed, 13 May 2026, at 01:57, Alex Williamson wrote:
> On Tue, May 12, 2026, at 5:12 PM, Michael S. Tsirkin wrote:
>> On Tue, May 12, 2026 at 05:06:50PM -0600, Alex Williamson wrote:
>>> If we agree that homogeneous hierarchies (no mixing of EA and
>>> programmable BARs) is a reasonable constraint, and possibly extend
>>> that to homogeneous per host bridge to simplify the CRS mapping, we
>>> have the following work items:
>>>
>>> * Extend Linux EA support to program bridge apertures for
>>> subordinate homogeneous EA hierarchies.
>>>
>>> * Develop options to virtualize programmable BARs as EA for vfio-pci
>>> devices, if not generically for the benefit of testing.
>>>
>>> * Implement a way to poke holes in the VM address space and plumb
>>> through to account for addresses used by EA devices.
>>>
>>> * Provide those same ranges to the guest via CRS (but not via DT to
>>> EDK2), or alternatively expose them through additional PXB host
>>> bridges.
>>>
>>> Does that shape roughly seem accurate? Are there additional gaps
>>> I've missed? Thanks,
>>
>> just one question why not do it in firmware so windows is thinkably
>> also handled?
>
> I suppose someone could chime in if they have a similar requirement
> for Windows guests. Otherwise, the incremental effort to extend Linux
> EA support seems smaller, though I also don't know what, if any
> support Windows has for EA to bother. Regardless, improving Linux EA
> support might help elsewhere and doesn't preclude edk2 support in the
> future. Thanks,
>

If EA is too much of a hassle to implement, another avenue that you
might explore is EFI_INCOMPATIBLE_PCI_DEVICE_SUPPORT_PROTOCOL in edk2,
which can be implemented by the platform to inform the PCI core about
non-PCI compliant devices that have special requirements.

While it is supposed to support this use case too, the PCI resource
allocation code in EDK2 currently does not correctly support fixed
resources that are reported by this protocol, but getting that fixed
(and implementing the protocol in your firmware) might be a shorter
path to getting this hardware supported under any OS (assuming EFI
boot) than EA.
On 5/13/2026 9:25 AM, Ard Biesheuvel wrote:
>
> On Wed, 13 May 2026, at 01:57, Alex Williamson wrote:
>> On Tue, May 12, 2026, at 5:12 PM, Michael S. Tsirkin wrote:
>>> On Tue, May 12, 2026 at 05:06:50PM -0600, Alex Williamson wrote:
>>>> If we agree that homogeneous hierarchies (no mixing of EA and
>>>> programmable BARs) is a reasonable constraint, and possibly extend
>>>> that to homogeneous per host bridge to simplify the CRS mapping, we
>>>> have the following work items:
>>>>
>>>> * Extend Linux EA support to program bridge apertures for
>>>> subordinate homogeneous EA hierarchies.
>>>>
>>>> * Develop options to virtualize programmable BARs as EA for vfio-pci
>>>> devices, if not generically for the benefit of testing.
>>>>
>>>> * Implement a way to poke holes in the VM address space and plumb
>>>> through to account for addresses used by EA devices.
>>>>
>>>> * Provide those same ranges to the guest via CRS (but not via DT to
>>>> EDK2), or alternatively expose them through additional PXB host
>>>> bridges.
>>>>
>>>> Does that shape roughly seem accurate? Are there additional gaps
>>>> I've missed? Thanks,
>>>
>>> just one question why not do it in firmware so windows is thinkably
>>> also handled?
>>
>> I suppose someone could chime in if they have a similar requirement
>> for Windows guests. Otherwise, the incremental effort to extend Linux
>> EA support seems smaller, though I also don't know what, if any
>> support Windows has for EA to bother. Regardless, improving Linux EA
>> support might help elsewhere and doesn't preclude edk2 support in the
>> future. Thanks,
>>
>
> If EA is too much of a hassle to implement, another avenue that you
> might explore is EFI_INCOMPATIBLE_PCI_DEVICE_SUPPORT_PROTOCOL in edk2,
> which can be implemented by the platform to inform the PCI core about
> non-PCI compliant devices that have special requirements.
>
> While it is supposed to support this use case too, the PCI resource
> allocation code in EDK2 currently does not correctly support fixed
> resources that are reported by this protocol, but getting that fixed
> (and implementing the protocol in your firmware) might be a shorter
> path to getting this hardware supported under any OS (assuming EFI
> boot) than EA.

Thanks for all the input.

It seems that EFI_INCOMPATIBLE_PCI_DEVICE_SUPPORT_PROTOCOL is the path
forward to explore for this design. At a surface level, this looks
feasible, and I'll spend some time researching it and putting together
a PoC before coming back with more details.

-Tushar
Hi,

> If EA is too much of a hassle to implement, another avenue that you
> might explore is EFI_INCOMPATIBLE_PCI_DEVICE_SUPPORT_PROTOCOL in edk2,
> which can be implemented by the platform to inform the PCI core about
> non-PCI compliant devices that have special requirements.

And OVMF already has a driver for that protocol, the pcie bridge
window size hints from qemu are propagated to the edk2 pci core
that way.

take care,
  Gerd
On Tue, May 12, 2026 at 05:57:19PM -0600, Alex Williamson wrote:
> On Tue, May 12, 2026, at 5:12 PM, Michael S. Tsirkin wrote:
> > On Tue, May 12, 2026 at 05:06:50PM -0600, Alex Williamson wrote:
> >> If we agree that homogeneous hierarchies (no mixing of EA and
> >> programmable BARs) is a reasonable constraint, and possibly extend that
> >> to homogeneous per host bridge to simplify the CRS mapping, we have the
> >> following work items:
> >>
> >> * Extend Linux EA support to program bridge apertures for subordinate
> >> homogeneous EA hierarchies.
> >>
> >> * Develop options to virtualize programmable BARs as EA for vfio-pci
> >> devices, if not generically for the benefit of testing.
> >>
> >> * Implement a way to poke holes in the VM address space and plumb
> >> through to account for addresses used by EA devices.
> >>
> >> * Provide those same ranges to the guest via CRS (but not via DT to
> >> EDK2), or alternatively expose them through additional PXB host
> >> bridges.
> >>
> >> Does that shape roughly seem accurate? Are there additional gaps I've
> >> missed? Thanks,
> >
> > just one question why not do it in firmware so windows
> > is thinkably also handled?
>
> I suppose someone could chime in if they have a similar requirement
> for Windows guests. Otherwise, the incremental effort to extend
> Linux EA support seems smaller, though I also don't know what, if
> any support Windows has for EA to bother. Regardless, improving
> Linux EA support might help elsewhere and doesn't preclude edk2
> support in the future. Thanks,

I think there are specific already deployed distros that need to work
under qemu though - so I would discount anything that needs kernel
changes to work

Jason