[RFC] hw/arm/virt: Provide DT binding generation for PCI eXpander Bridges

Jonathan Cameron via posted 1 patch 1 year ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20230421165037.2506-1-Jonathan.Cameron@huawei.com
Maintainers: Peter Maydell <peter.maydell@linaro.org>, "Michael S. Tsirkin" <mst@redhat.com>, Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
hw/arm/virt.c        | 214 +++++++++++++++++++++++++++++++++++++++++++
hw/pci/pci.c         |  21 +++++
include/hw/pci/pci.h |   1 +
3 files changed, 236 insertions(+)
[RFC] hw/arm/virt: Provide DT binding generation for PCI eXpander Bridges
Posted by Jonathan Cameron via 1 year ago
This patch and the problem it is trying to resolve will form part of a talk
I will be giving next week at Linaro Connect. https://sched.co/1K85p

So, in the spirit of giving people almost no warning... and so I can
say the discussion has kicked off, here is the simplest solution I could
come up with. If we can agree on an approach to the question of sizing the
memory windows by Thursday, even better (I'll update the slides!) ;)

This is a precursor for CXL on arm-virt (for which DT support was
requested). CXL QEMU emulation uses pxb-cxl which inherits from the
slightly simpler pxb-pcie. ACPI support for these additional host bridges
has been available for some time but relies on an interesting dance with
EDK2:
* During initial board creation the PXB buses are largely ignored
  and the ACPI tables + DT passed to EDK2 only expose the root bus, on
  which the part of each PXB instance that exists as a normal PCI EP
  may be discovered.
* EDK2 then carries out full PCI bus enumeration, including the buses
  below the PXB host bridges.
* Finally EDK2 calls back into QEMU to rebuild the tables via
  virt_acpi_build_update(), at which point the correct DSDT for the
  PXB host bridges is generated and an adjusted DSDT section is generated
  for the primary host bridge (holes are bunched in the _CRS).

For device tree bindings there is no such dance with the firmware and as
such we need QEMU to directly present device tree entries for the primary
PCIe bus and the PXB instances.

This cannot be done fully either at initial PCIe instantiation (the PXB
instances are not yet instantiated at that point) or solely in
virt_machine_done() as it is for ACPI (the virtio-mem hotplug routines rely
on the presence of the primary PCIe bus). Thus the proposed solution is to
set things up initially without worrying about the PXBs and to update the
DT later, with additional checks that we don't end up with overlapping bus
numbers.

It 'might' be possible to use hotplug handlers to update the DT as we
go along, but that would be rather challenging as each additional PXB
introduction would sometimes require an update of the DT for all other
instances.

Note: The linux,pci-domain DT element is more restrictive than the
equivalent in ACPI, so for now just put each PXB in its own domain. I'll
look to address relaxing that on the kernel side, but then this code won't
initially work with older kernels - so this lack of alignment with ACPI
systems may have to persist (you end up with a single segment for ACPI
systems, and multiple for DT).

Another question is how to allocate the resources. A PXB instance uses
a section of the ECAM space (which is fine as that is defined by bus
numbers) and also part of each of the PCI windows. When EDK2 is involved,
the allocation of the windows is done after enumeration of connected
PCI devices / switches etc with heuristics adding extra space for
hotplug.  Thus the windows configured for the bridges can be easily
established and written into the ACPI _CRS entries.

I've considered two options:
1) (More or less) enumerate the full PCI topology in QEMU, just don't
   actually write the registers for routing. Apply similar heuristics
   (note that EDK2 and the kernel may have different ones and they
   have changed over time) to establish how much memory is needed
   in each window. This is a complex solution that may still provide
   allocations that aren't big enough for the kernel to then
   enumerate the PCI topology using its algorithms.
2) Use the one thing we definitely control for PXB instances, the base
   bus number.  Based on how many buses are available to each PXB
   instance allocate a proportion of the available memory windows. Again,
   it's more than possible that this won't leave enough room to provide
   all of the necessary space to accommodate BARs, but is fairly simple to
   implement and fairly simple to understand when the allocation doesn't
   work. This option is what the current code does (a rough sketch of
   the arithmetic follows below).
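
As an aside (not part of the patch), a minimal sketch of the option 2
arithmetic, assuming a window is split purely in proportion to the bus
numbers each host bridge owns; the names here are illustrative only:

/*
 * Illustrative sketch only: split a window proportionally to the bus
 * numbers owned by each host bridge.  E.g. with a 256 bus ECAM and a
 * PXB at bus 0x30 owning buses 0x30-0xff, that PXB gets
 * (0x100 - 0x30) / 0x100 of the MMIO window, and the main bridge keeps
 * the fraction matching buses 0-0x2f.
 */
#include <stdint.h>

typedef struct {
    uint64_t base;
    uint64_t size;
} Window;

static Window split_window(uint64_t win_base, uint64_t win_size,
                           unsigned total_buses, /* e.g. 256 */
                           unsigned first_bus,   /* first bus of this bridge */
                           unsigned end_bus)     /* first bus of next bridge */
{
    uint64_t per_bus = win_size / total_buses;
    Window w = {
        .base = win_base + per_bus * first_bus,
        .size = per_bus * (end_bus - first_bus),
    };
    return w;
}

The same split is applied to the 32-bit MMIO, 64-bit MMIO and ECAM
regions; the patch below does exactly this via mmio_per_bus,
mmio_high_per_bus and PCIE_MMCFG_SIZE_MIN.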

Note that either way any regressions should be limited to systems using
DT with PXB instances, which would be odd given they don't currently work.

RFC being posted to get feedback on the general approach we should take.
One option would be to ignore pxb-pcie and only deal with pxb-cxl on
the basis that this removes the risk of regressions because we don't support
CXL on arm-virt yet.  However, code will end up much the same so we
can discuss this without the added fun that CXL will bring.

Other more minor nasties:
* EDK2 breaks if we pass it a DT that includes the PXB instances.
* Need to look closer at the interrupt mapping - or potentially make this
  MSI/MSI-X only. The equivalent is broken on x86 / q35 :)

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 hw/arm/virt.c        | 214 +++++++++++++++++++++++++++++++++++++++++++
 hw/pci/pci.c         |  21 +++++
 include/hw/pci/pci.h |   1 +
 3 files changed, 236 insertions(+)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index ac626b3bef..4648437bba 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -55,6 +55,8 @@
 #include "qemu/bitops.h"
 #include "qemu/error-report.h"
 #include "qemu/module.h"
+#include "hw/pci/pci_bridge.h"
+#include "hw/pci/pci_bus.h"
 #include "hw/pci-host/gpex.h"
 #include "hw/virtio/virtio-pci.h"
 #include "hw/core/sysbus-fdt.h"
@@ -1648,6 +1650,210 @@ static void virt_build_smbios(VirtMachineState *vms)
     }
 }
 
+static void virt_add_pxb_dt(VirtMachineState *vms, PCIBus *pxb_bus,
+                            int max_bus_num, hwaddr mmio_per_bus,
+                            hwaddr mmio_high_per_bus,
+                            int pci_domain,
+                            Error **errp)
+{
+    void *fdt = MACHINE(vms)->fdt;
+    int ecam_id = VIRT_ECAM_ID(vms->highmem_ecam);
+    char *nodename_pxb;
+    uint8_t bus_num = pci_bus_num(pxb_bus);
+    hwaddr base_mmio_pxb = vms->memmap[VIRT_PCIE_MMIO].base +
+        mmio_per_bus * bus_num;
+    hwaddr size_mmio_pxb;
+    hwaddr base_mmio_high_pxb = vms->memmap[VIRT_HIGH_PCIE_MMIO].base +
+        mmio_high_per_bus * bus_num;
+    hwaddr size_mmio_high_pxb;
+    hwaddr base_ecam_pxb, size_ecam_pxb;
+    int buses;
+
+    buses = pci_bus_count_buses(pxb_bus);
+    if (bus_num + buses >= max_bus_num) {
+        error_setg(errp,
+                   "Insufficient bus numbers (req %d > avail: %d) available for PXB topology",
+                   buses, max_bus_num - bus_num - 1);
+        return;
+    }
+
+    base_ecam_pxb = vms->memmap[ecam_id].base;
+    base_ecam_pxb += bus_num * PCIE_MMCFG_SIZE_MIN;
+    size_ecam_pxb = PCIE_MMCFG_SIZE_MIN * (max_bus_num - bus_num);
+    size_mmio_pxb = mmio_per_bus * (max_bus_num - bus_num);
+    size_mmio_high_pxb = mmio_high_per_bus * (max_bus_num - bus_num);
+    nodename_pxb = g_strdup_printf("/pcie@%" PRIx64, base_mmio_pxb);
+    qemu_fdt_add_subnode(fdt, nodename_pxb);
+    qemu_fdt_setprop_cell(fdt, nodename_pxb, "linux,pci-domain", pci_domain);
+    qemu_fdt_setprop_string(fdt, nodename_pxb, "device_type", "pci");
+    qemu_fdt_setprop_cell(fdt, nodename_pxb, "#address-cells", 3);
+    qemu_fdt_setprop_cell(fdt, nodename_pxb, "#size-cells", 2);
+    qemu_fdt_setprop_string(fdt, nodename_pxb,
+                            "compatible", "pci-host-ecam-generic");
+    /* I'm not sure what this should be. */
+    if (vms->msi_phandle) {
+        qemu_fdt_setprop_cells(fdt, nodename_pxb, "msi-map",
+                               0, vms->msi_phandle, 0, 0x10000);
+    }
+    qemu_fdt_setprop_cells(fdt, nodename_pxb, "bus-range",
+                           bus_num, max_bus_num - 1);
+    qemu_fdt_setprop_sized_cells(fdt, nodename_pxb, "reg", 2, base_ecam_pxb,
+                                 2, size_ecam_pxb);
+
+    if (vms->highmem_mmio) {
+        qemu_fdt_setprop_sized_cells(fdt, nodename_pxb, "ranges",
+                                     1, FDT_PCI_RANGE_IOPORT, 2, 0,
+                                     /* No PIO space for PXB */
+                                     2, 0, 2, 0,
+                                     1, FDT_PCI_RANGE_MMIO,
+                                     2, base_mmio_pxb,
+                                     2, base_mmio_pxb,
+                                     2, size_mmio_pxb,
+                                     1, FDT_PCI_RANGE_MMIO_64BIT,
+                                     2, base_mmio_high_pxb,
+                                     2, base_mmio_high_pxb,
+                                     2, size_mmio_high_pxb);
+    } else {
+        qemu_fdt_setprop_sized_cells(fdt, nodename_pxb, "ranges",
+                                     1, FDT_PCI_RANGE_IOPORT, 2, 0,
+                                     2, 0, 2, 0,
+                                     1, FDT_PCI_RANGE_MMIO,
+                                     2, base_mmio_pxb,
+                                     2, base_mmio_pxb,
+                                     2, size_mmio_pxb);
+    }
+}
+
+static void virt_update_pci_dt(VirtMachineState *vms, Error **errp)
+{
+    void *fdt = MACHINE(vms)->fdt;
+    char *nodename = vms->pciehb_nodename;
+    hwaddr base_mmio = vms->memmap[VIRT_PCIE_MMIO].base;
+    hwaddr base_mmio_high = vms->memmap[VIRT_HIGH_PCIE_MMIO].base;
+    hwaddr size_mmio, size_mmio_high;
+    hwaddr base_pio = vms->memmap[VIRT_PCIE_PIO].base;
+    hwaddr size_pio = vms->memmap[VIRT_PCIE_PIO].size;
+    int ecam_id = VIRT_ECAM_ID(vms->highmem_ecam);
+    hwaddr base_ecam = vms->memmap[ecam_id].base;
+    int total_bus_nr = vms->memmap[ecam_id].size / PCIE_MMCFG_SIZE_MIN;
+    hwaddr mmio_per_bus, mmio_high_per_bus, size_ecam;
+    int pci_domain = 1; /* Main bus is 0 */
+    int buses, max_bus_nr;
+    PCIBus *bus;
+
+    /*
+     * EDK2 ACPI flow for PXB instances breaks if we modify the DT to incorporate
+     * them. So for now only enable DT for PXBs if acpi=off
+     */
+    if (virt_is_acpi_enabled(vms)) {
+        return;
+    }
+
+    /*
+     * Only support PCI Expander Bridges if highmem_ecam available.
+     * Hard to squeeze them into the smaller ecam.
+     */
+    if (!vms->highmem_ecam) {
+        return;
+    }
+
+    /* First update DT for the main PCI bus */
+    bus = vms->bus;
+    max_bus_nr = total_bus_nr;
+    /* Find the max bus nr as smallest of the PXB buses */
+    QLIST_FOREACH(bus, &bus->child, sibling) {
+        uint8_t bus_num;
+
+        if (!pci_bus_is_root(bus)) {
+            continue;
+        }
+        bus_num = pci_bus_num(bus);
+        if (bus_num < max_bus_nr) {
+            max_bus_nr = bus_num;
+        }
+    }
+
+    buses = pci_bus_count_buses(vms->bus);
+    if (buses >= max_bus_nr) {
+        error_setg(errp,
+                   "Insufficient bus numbers (req %d > avail: %d) available for primary PCIe toplogy",
+                   buses, max_bus_nr - 1);
+        return;
+    }
+
+    /*
+     * For other resources options are:
+     * 1) Divide them up based on bus number ranges.
+     * 2) Discover what is needed by doing part of bus enumeration.
+     *    This is complex, and we may not want that complexity in QEMU.
+     */
+
+    size_ecam = max_bus_nr * PCIE_MMCFG_SIZE_MIN;
+    if (size_ecam > vms->memmap[ecam_id].size) {
+        size_ecam = vms->memmap[ecam_id].size;
+    }
+
+    mmio_per_bus = vms->memmap[VIRT_PCIE_MMIO].size / total_bus_nr;
+    size_mmio = mmio_per_bus * max_bus_nr;
+    mmio_high_per_bus = vms->memmap[VIRT_HIGH_PCIE_MMIO].size / total_bus_nr;
+    size_mmio_high = mmio_high_per_bus * max_bus_nr;
+    qemu_fdt_setprop_cells(fdt, nodename, "bus-range", 0, max_bus_nr - 1);
+
+    qemu_fdt_setprop_sized_cells(fdt, nodename, "reg", 2, base_ecam,
+                                 2, size_ecam);
+
+    /*
+     * Leave all of the PIO space for the main GPEX - avoids issues with
+     * regions that are too small to map in the OS.
+     */
+    if (vms->highmem_mmio) {
+        qemu_fdt_setprop_sized_cells(fdt, nodename, "ranges",
+                                     1, FDT_PCI_RANGE_IOPORT, 2, 0,
+                                     2, base_pio, 2, size_pio,
+                                     1, FDT_PCI_RANGE_MMIO, 2, base_mmio,
+                                     2, base_mmio, 2, size_mmio,
+                                     1, FDT_PCI_RANGE_MMIO_64BIT,
+                                     2, base_mmio_high,
+                                     2, base_mmio_high, 2, size_mmio_high);
+    } else {
+        qemu_fdt_setprop_sized_cells(fdt, nodename, "ranges",
+                                     1, FDT_PCI_RANGE_IOPORT, 2, 0,
+                                     2, base_pio, 2, size_pio,
+                                     1, FDT_PCI_RANGE_MMIO, 2, base_mmio,
+                                     2, base_mmio, 2, size_mmio);
+    }
+
+    /* Now add the PCI Expander Bridges (PXB) */
+    bus = vms->bus;
+    QLIST_FOREACH(bus, &bus->child, sibling) {
+        uint16_t max_bus_num = 0x100;
+        PCIBus *bus_start = vms->bus;
+        int bus_num = pci_bus_num(bus);
+
+        if (!pci_bus_is_root(bus)) {
+            continue;
+        }
+
+        /* Find the minimum PXB bus number greater than this one */
+        QLIST_FOREACH(bus_start, &bus_start->child, sibling) {
+            uint8_t this_bus_num;
+
+            if (!pci_bus_is_root(bus_start)) {
+                continue;
+            }
+
+            this_bus_num = pci_bus_num(bus_start);
+            if (this_bus_num > bus_num &&
+                this_bus_num < max_bus_num) {
+                max_bus_num = this_bus_num;
+            }
+        }
+
+        virt_add_pxb_dt(vms, bus, max_bus_num, mmio_per_bus, mmio_high_per_bus,
+                        pci_domain++, errp);
+    }
+}
+
 static
 void virt_machine_done(Notifier *notifier, void *data)
 {
@@ -1658,6 +1864,14 @@ void virt_machine_done(Notifier *notifier, void *data)
     struct arm_boot_info *info = &vms->bootinfo;
     AddressSpace *as = arm_boot_address_space(cpu, info);
 
+    /*
+     * If PCI Expander Bridges (pxb-pcie) have been added, the adjustments to
+     * the main PCIe DT entries and the creation of entries for the PXB host
+     * bridges may only be made after all PCI devices are present, as only
+     * at that time can resource requirements (bus numbers etc) be known.
+     */
+    virt_update_pci_dt(vms, &error_fatal);
+
     /*
      * If the user provided a dtb, we assume the dynamic sysbus nodes
      * already are integrated there. This corresponds to a use case where
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index def5000e7b..47fd581fe4 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -552,6 +552,27 @@ void pci_root_bus_cleanup(PCIBus *bus)
     qbus_unrealize(BUS(bus));
 }
 
+int pci_bus_count_buses(PCIBus *bus)
+{
+    int buses = 0;
+    int devfn;
+
+    for (devfn = 0; devfn < 256; devfn++) {
+        PCIDevice *d;
+
+        if (!bus->devices[devfn]) {
+            continue;
+        }
+        d = bus->devices[devfn];
+        if (!object_dynamic_cast(OBJECT(d), TYPE_PCI_BRIDGE)) {
+            continue;
+        }
+        buses += pci_bus_count_buses(&PCI_BRIDGE(d)->sec_bus);
+        buses++;
+    }
+    return buses;
+}
+
 void pci_bus_irqs(PCIBus *bus, pci_set_irq_fn set_irq,
                   void *irq_opaque, int nirq)
 {
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index d5a40cd058..565a968c14 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -282,6 +282,7 @@ PCIBus *pci_root_bus_new(DeviceState *parent, const char *name,
                          MemoryRegion *address_space_io,
                          uint8_t devfn_min, const char *typename);
 void pci_root_bus_cleanup(PCIBus *bus);
+int pci_bus_count_buses(PCIBus *bus);
 void pci_bus_irqs(PCIBus *bus, pci_set_irq_fn set_irq,
                   void *irq_opaque, int nirq);
 void pci_bus_map_irqs(PCIBus *bus, pci_map_irq_fn map_irq);
-- 
2.37.2
Re: [RFC] hw/arm/virt: Provide DT binding generation for PCI eXpander Bridges
Posted by Peter Maydell 1 year ago
On Fri, 21 Apr 2023 at 17:50, Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> This patch and the problem it is trying to resolve will form part of a talk
> I will be giving next week at Linaro Connect. https://sched.co/1K85p
>
> So in the spirit of giving people almost no warning... Plus being able to
> say the discussion has kicked off here is the simplest solution I could
> come up with. If we can agree on an approach to the sizing of memory
> windows question by Thursday even better (I'll update the slides!) ;)
>
> This is a precursor for CXL on arm-virt (for which DT support was
> requested). CXL QEMU emulation uses pxb-cxl which inherits from the
> slightly simpler pxb-pcie. ACPI support for these additional host bridges
> has been available for some time but relies an interesting dance with
> EDK2:
> * During initial board creation the PXB buses are largely ignored
>   and ACPI tables + DT passed to EDK2 only expose the root bus on
>   which the section of the PXB instance that exists as a normal PCI EP
>   may be discovered.
> * EDK2 then carries out full PCI bus enumeration, including the buses
>   below the PXB host bridges.
> * Finally EDK2 calls back into QEMU to rebuild the tables via
>   virt_acpi_build_update(), at which point the correct DSDT for the
>   PXB host bridges is generated and an adjust DSDT section is generated
>   for the primary host bridge (holes are bunched in the _CRS).
>
> For device tree bindings there is no such dance with the firmware and as
> such we need QEMU to directly present device tree entries for the primary
> PCIe bus and the PXB instances.

So, not knowing anything about CXL, my naive question is:
this is PCI, why can't we handle it the way we handle everything
else PCI, i.e. present the PCI controller in the DTB and it's
the guest kernel's job to probe, enumerate, etc whatever it wants ?

thanks
-- PMM
Re: [RFC] hw/arm/virt: Provide DT binding generation for PCI eXpander Bridges
Posted by Jonathan Cameron via 1 year ago
On Mon, 24 Apr 2023 10:28:30 +0100
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Fri, 21 Apr 2023 at 17:50, Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> >
> > This patch and the problem it is trying to resolve will form part of a talk
> > I will be giving next week at Linaro Connect. https://sched.co/1K85p
> >
> > So in the spirit of giving people almost no warning... Plus being able to
> > say the discussion has kicked off here is the simplest solution I could
> > come up with. If we can agree on an approach to the sizing of memory
> > windows question by Thursday even better (I'll update the slides!) ;)
> >
> > This is a precursor for CXL on arm-virt (for which DT support was
> > requested). CXL QEMU emulation uses pxb-cxl which inherits from the
> > slightly simpler pxb-pcie. ACPI support for these additional host bridges
> > has been available for some time but relies an interesting dance with
> > EDK2:
> > * During initial board creation the PXB buses are largely ignored
> >   and ACPI tables + DT passed to EDK2 only expose the root bus on
> >   which the section of the PXB instance that exists as a normal PCI EP
> >   may be discovered.
> > * EDK2 then carries out full PCI bus enumeration, including the buses
> >   below the PXB host bridges.
> > * Finally EDK2 calls back into QEMU to rebuild the tables via
> >   virt_acpi_build_update(), at which point the correct DSDT for the
> >   PXB host bridges is generated and an adjust DSDT section is generated
> >   for the primary host bridge (holes are bunched in the _CRS).
> >
> > For device tree bindings there is no such dance with the firmware and as
> > such we need QEMU to directly present device tree entries for the primary
> > PCIe bus and the PXB instances.  
> 
> So, not knowing anything about CXL, my naive question is:
> this is PCI, why can't we handle it the way we handle everything
> else PCI, i.e. present the PCI controller in the DTB and it's
> the guest kernel's job to probe, enumerate, etc whatever it wants ?

Absolutely the kernel will still do the enumeration.  But there's a problem
- it won't always work, unless there is 'enough room'.

If the aim is to do it in a similar fashion to how PCI Expander Bridges (PXB)
are handled today with ACPI (we could of course look at doing something
different), then the situation is as follows.

We have one set of memory windows for assigning PCI bars into etc. In the model
used for PXB, that set of resources is shared between different host bridges
including the main one (each one has a separate DT description).

So for virt, VIRT_PCIE_MMIO, VIRT_PCIE_PIO, VIRT_PCIE_ECAM, VIRT_HIGH_PCIE_ECAM
and VIRT_HIGH_PCIE_MMIO are split up between the host bridges.
Each host bridge has its own DT description.

For ACPI, how to split up the available memory windows depends on the requirements
of the PCIe devices below the host bridges.  For ACPI we 'know' the answer
as to what is required at the point of creating the description because EDK2
did the work for us.  For DT we currently don't.  What to do about that is the
question this RFC tries to address.

In theory we could change the kernel to support enumerating PXB instances, but
that's a very different model from today where they are just 'standard'
host bridges, using the generic bindings for ACPI (and after this patch for DT).

I'll add an example in a reply (I'm on the wrong machine and stuck in a meeting today).

Jonathan

> 
> thanks
> -- PMM
Re: [RFC] hw/arm/virt: Provide DT binding generation for PCI eXpander Bridges
Posted by Peter Maydell 1 year ago
On Mon, 24 Apr 2023 at 16:41, Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Mon, 24 Apr 2023 10:28:30 +0100
> Peter Maydell <peter.maydell@linaro.org> wrote:
> > So, not knowing anything about CXL, my naive question is:
> > this is PCI, why can't we handle it the way we handle everything
> > else PCI, i.e. present the PCI controller in the DTB and it's
> > the guest kernel's job to probe, enumerate, etc whatever it wants ?
>
> Absolutely the kernel will still do the enumeration.  But there's a problem
> - it won't always work, unless there is 'enough room'.
>
> If the aim is to do it in a similar fashion to how PCI Expander Bridges (PXB)
> are handled today with ACPI (we could look at doing something different of course)
>
> We have one set of memory windows for assigning PCI bars into etc. In the model
> used for PXB, that set of resources is shared between different host bridges
> including the main one (each one has separate DT description).
>
> So for virt, VIRT_PCIE_MMIO, VIRT_PCIE_IO, VIRT_PCIE_ECAM, VIRT_HIGH_PCIE_ECAM
> and VIRT_HIGH_PCIE_MMIO are split up between the host bridges.
> Each host bridge has it's own DT description.
>
> For ACPI, how to split up available memory windows up depends on the requirements
> of the PCIe devices below the host bridges.  For ACPI we 'know' the answer
> as to what is required at the point of creating the description because EDK2
> did the work for us.  For DT we currently don't.  What to do about that is the
> question this RFC tries to address.
>
> In theory we could change the kernel to support enumerating PXB instances, but
> that's a very different model from today where they are just 'standard'
> host bridges, using the generic bindings for ACPI (and after this patch for DT).

On the other hand, having QEMU enumerate PCI devices is *also* a
very different model from today, where we assume that the guest
code is the one that is going to deal with enumerating PCI devices.
To my mind one of the major advantages of PCI is exactly that it
is entirely probeable and discoverable, so that there is no need
for the dtb to include a lot of information that the kernel can
find out for itself by directly asking the hardware...

thanks
-- PMM
Re: [RFC] hw/arm/virt: Provide DT binding generation for PCI eXpander Bridges
Posted by Jonathan Cameron via 1 year ago
On Mon, 24 Apr 2023 16:45:48 +0100
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Mon, 24 Apr 2023 at 16:41, Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> >
> > On Mon, 24 Apr 2023 10:28:30 +0100
> > Peter Maydell <peter.maydell@linaro.org> wrote:  
> > > So, not knowing anything about CXL, my naive question is:
> > > this is PCI, why can't we handle it the way we handle everything
> > > else PCI, i.e. present the PCI controller in the DTB and it's
> > > the guest kernel's job to probe, enumerate, etc whatever it wants ?  
> >
> > Absolutely the kernel will still do the enumeration.  But there's a problem
> > - it won't always work, unless there is 'enough room'.
> >
> > If the aim is to do it in a similar fashion to how PCI Expander Bridges (PXB)
> > are handled today with ACPI (we could look at doing something different of course)
> >
> > We have one set of memory windows for assigning PCI bars into etc. In the model
> > used for PXB, that set of resources is shared between different host bridges
> > including the main one (each one has separate DT description).
> >
> > So for virt, VIRT_PCIE_MMIO, VIRT_PCIE_IO, VIRT_PCIE_ECAM, VIRT_HIGH_PCIE_ECAM
> > and VIRT_HIGH_PCIE_MMIO are split up between the host bridges.
> > Each host bridge has it's own DT description.
> >
> > For ACPI, how to split up available memory windows up depends on the requirements
> > of the PCIe devices below the host bridges.  For ACPI we 'know' the answer
> > as to what is required at the point of creating the description because EDK2
> > did the work for us.  For DT we currently don't.  What to do about that is the
> > question this RFC tries to address.
> >
> > In theory we could change the kernel to support enumerating PXB instances, but
> > that's a very different model from today where they are just 'standard'
> > host bridges, using the generic bindings for ACPI (and after this patch for DT).  
> 
> On the other hand, having QEMU enumerate PCI devices is *also* a
> very different model from today, where we assume that the guest
> code is the one that is going to deal with enumerating PCI devices.
> To my mind one of the major advantages of PCI is exactly that it
> is entirely probeable and discoverable, so that there is no need
> for the dtb to include a lot of information that the kernel can
> find out for itself by directly asking the hardware...

I absolutely agree that QEMU enumerating PCI seems a silly level of complexity
to introduce. So the easy route is to just use the bus numbers to partition
resources. We have those available without any complexity. It's not the
same as how it's done with ACPI, but then neither are the alternatives
(though maybe they are closer).  Note the current proposed algorithm may be
too simplistic (perhaps some alignment forcing should adjust the division
of the resources to start on round number addresses); a rough sketch of
what that might look like is below.
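
Purely as an illustration of that idea (this is not in the patch; the
helper name and the 2 MiB alignment are made up, while QEMU_ALIGN_UP and
MiB are the existing helpers from qemu/osdep.h and qemu/units.h):

#include "qemu/osdep.h"
#include "qemu/units.h"
#include "exec/hwaddr.h"

/*
 * Sketch only: round each per-bridge window start up to a boundary so
 * the firmware-described ranges begin on round addresses.  The sizes
 * would then need shrinking by the amount each base moved so that
 * neighbouring windows still don't overlap.
 */
#define PXB_WINDOW_ALIGN (2 * MiB)

static hwaddr pxb_window_base(hwaddr mmio_base, hwaddr mmio_per_bus,
                              uint8_t bus_num)
{
    return QEMU_ALIGN_UP(mmio_base + mmio_per_bus * bus_num,
                         PXB_WINDOW_ALIGN);
}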

A more complex route would be to push the whole problem to the kernel.
I doubt we'll get any agreement to add this to the generic host
bridge DT bindings given it's just to support QEMU specific host bridges
which the kernel has previously had no knowledge of beyond being generic
PCIe host bridges. So we'd need QEMU to detect the presence of PXB
instances, then use a new DT compatible for the main host bridge
and kick off a new type of host bridge discovery.
We'd probably also need to modify EDK2 to recognize this, which will give
us compatibility problems with existing ACPI usage (EDK2 reads the DT
bindings to kick off its PCIe enumeration, so this gets fiddly).
We may also have problems with the kernel having to do late discovery of
host bridges (not tried it, but it sounds 'interesting').

Anyhow, I promised some illustrative examples.

For reference, here's one of my test configurations
(slightly pathological as it exercises some of the heuristics in enumeration)

-+-[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
 |           +-01.0-[01]--
 |           +-02.0  Red Hat, Inc. Virtio block device
 |           +-03.0  Red Hat, Inc. Virtio network device
 |           +-04.0  Red Hat, Inc. QEMU PCIe Expander bridge
 |           \-05.0  Red Hat, Inc. QEMU PCIe Expander bridge
 +-[0000:0c]---00.0-[0d-0f]--+-00.0-[0e-0f]----00.0-[0f]----00.0  Red Hat, Inc. Virtio RNG
 |                           \-00.1  Red Hat, Inc. Virtio RNG
 \-[0000:30]---00.0-[31]----00.0  Red Hat, Inc. Virtio RNG

It's somewhat easier to see the current ACPI results vs what this
patch proposes for DT by looking at /proc/iomem than at the
firmware descriptions themselves. /proc/iomem includes
the firmware-described host bridge windows and the PCI buses and devices
enumerated below them (enumerated by EDK2 for ACPI, by the Linux kernel for DT).
There are some other subtle differences in how the memory is mapped, but
hopefully you can see the relationship between what happens with ACPI
(requiring enumeration) and the proposal in this RFC for DT which doesn't.

Relevant chunks of /proc/iomem for ACPI for the 32 bit region (the 64 bit one is similar).
//FW described main HB
10000000-103fffff : PCI Bus 0000:00
  //Everything indented discovered by enumeration of PCI bus.
  10000000-101fffff : PCI Bus 0000:01
  10200000-1023ffff : 0000:00:03.0
  10240000-10240fff : 0000:00:01.0
  10241000-10241fff : 0000:00:02.0
  10242000-10242fff : 0000:00:03.0
//FW described. 1st PXB on bus 0xc
//Key point: the start is defined as the previous enumerated device's end + some heuristic driven padding.
10400000-10700fff : PCI Bus 0000:0c
  10400000-105fffff : PCI Bus 0000:0d
    10400000-10400fff : PCI Bus 0000:0e
      10400000-10400fff : PCI Bus 0000:0f
        10400000-10400fff : 0000:0f:00.0
    10401000-10401fff : 0000:0d:00.1
  10600000-10600fff : 0000:0c:00.0
//FW described spare bit assigned to main HB
10701000-107fffff : PCI Bus 0000:00
//FW described. 2nd PXB on bus 0x30
//Key point: the start is defined as the previous enumerated device's end + some heuristic driven padding.
10800000-10a00fff : PCI Bus 0000:30
  10800000-108fffff : PCI Bus 0000:31
    10800000-10800fff : 0000:31:00.0
  10900000-10900fff : 0000:30:00.0
10a01000-3efeffff : PCI Bus 0000:00

DT version of the equivalent from this patch.  Mem split up based only on bus numbers.
//FW described.
10000000-1233f3ff : pcie@10000000
  //Everything indented is enumerated by the linux kernel
  10000000-101fffff : PCI Bus 0000:01
  10200000-1023ffff : 0000:00:03.0
  10240000-10240fff : 0000:00:01.0
  10241000-10241fff : 0000:00:02.0
  10242000-10242fff : 0000:00:03.0
//lots of space here

//FW described 1st PXB - note way above end of previous enumerated region.
1233f400-18cfcfff : pcie@1233f400
  12340000-12340fff : 0001:0c:00.0
  12400000-126fffff : PCI Bus 0001:0d
    12400000-125fffff : PCI Bus 0001:0e
      12400000-125fffff : PCI Bus 0001:0f
        12400000-12400fff : 0001:0f:00.0
  12600000-12600fff : 0001:0d:00.1
//lots of space here..

//FW described 2nd PXB - note way above end of previous enumerated region.
18cfd000-3efeffff : pcie@18cfd000
  18cfd000-18cfdfff : 0002:30:00.0
  18d00000-18efffff : PCI Bus 0002:31
    18d00000-18d00fff : 0002:31:00.0
//lots of space here.


Note the main bridge _CRS, and hence /proc/iomem, has entries after
the PXBs because QEMU provides the left over bits
of the window, even though there is no easy way to use them. For the DT
scheme those bits end up in the space for whichever PXB has the highest bus
number.

In this case we have masses of room in the DT scheme as all the devices
have small bars. I have test cases that fail to have enough room
but those are even worse to read and require hacks to the PCI test device
driver.

I'm not great at reading the DT binary from sysfs, so the DT annotations
below are taken from /proc/iomem above rather than directly from the DT.

//Only first entry of each _CRS type is actually useful.
//In DT node: pcie@10000000
Device (PCI0)
{
    Name (_HID, "PNP0A08" /* PCI Express Bus */)  // _HID: Hardware ID
    Name (_CID, "PNP0A03" /* PCI Bus */)  // _CID: Compatible ID
    Name (_SEG, Zero)  // _SEG: PCI Segment
    Name (_BBN, Zero)  // _BBN: BIOS Bus Number
...
    Name (_CRS, ResourceTemplate ()  // _CRS: Current Resource Settings
    {
//Bus numbers 0 to B :
// DT bus-range = [0, b]
        WordBusNumber (ResourceProducer, MinFixed, MaxFixed, PosDecode,
            0x0000,             // Granularity
            0x0000,             // Range Minimum
            0x000B,             // Range Maximum
            0x0000,             // Translation Offset
            0x000C,             // Length
            ,, )
//First part to cover devices enumerated + space from heuristics (bit of padding)
// DT ranges first entry 10000000-1233f3ff : pcie@10000000
        DWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, NonCacheable, ReadWrite,
            0x00000000,         // Granularity
            0x10000000,         // Range Minimum
            0x103FFFFF,         // Range Maximum
            0x00000000,         // Translation Offset
            0x00400000,         // Length
            ,, , AddressRangeMemory, TypeStatic)
//A hole from enumeration of the PXBs.  Doesn't exist in DT form...
        DWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, NonCacheable, ReadWrite,
            0x00000000,         // Granularity
            0x10701000,         // Range Minimum
            0x107FFFFF,         // Range Maximum
            0x00000000,         // Translation Offset
            0x000FF000,         // Length
            ,, , AddressRangeMemory, TypeStatic)
//Left over above PXBs. Doesn't exist in DT form where we assign 'the rest' to
//the PXB with the highest bus number.
        DWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, NonCacheable, ReadWrite,
            0x00000000,         // Granularity
            0x10A01000,         // Range Minimum
            0x3EFEFFFF,         // Range Maximum
            0x00000000,         // Translation Offset
            0x2E5EF000,         // Length
            ,, , AddressRangeMemory, TypeStatic)
//First part to cover devices enumerated (+ space from heuristics)
//I left the whole range for IO on the DT version.  Don't really care about this being available
//for modern PCIe devices below PXBs anyway.
        DWordIO (ResourceProducer, MinFixed, MaxFixed, PosDecode, EntireRange,
            0x00000000,         // Granularity
            0x00000000,         // Range Minimum
            0x00000FFF,         // Range Maximum
            0x3EFF0000,         // Translation Offset
            0x00001000,         // Length
            ,, , TypeStatic, DenseTranslation)
//Left over of original window - not used by anything
        DWordIO (ResourceProducer, MinFixed, MaxFixed, PosDecode, EntireRange,
            0x00000000,         // Granularity
            0x00003000,         // Range Minimum
            0x0000FFFF,         // Range Maximum
            0x3EFF0000,         // Translation Offset
            0x0000D000,         // Length
            ,, , TypeStatic, DenseTranslation)
//First part to cover the devices enumerated.
// DT proportional version: 8000000000-85ffffffff : pcie@10000000
// Note that in this example all the devices have small BARs so with enumeration
// they don't need as much space as I gave with DT version.
        QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, NonCacheable, ReadWrite,
            0x0000000000000000, // Granularity
            0x0000008000000000, // Range Minimum
            0x00000080000FFFFF, // Range Maximum
            0x0000000000000000, // Translation Offset
            0x0000000000100000, // Length
            ,, , AddressRangeMemory, TypeStatic)
//Left over of original window - not used by anything.
        QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, NonCacheable, ReadWrite,
            0x0000000000000000, // Granularity
            0x0000008000400000, // Range Minimum
            0x000000FFFFFFFFFF, // Range Maximum
            0x0000000000000000, // Translation Offset
            0x0000007FFFC00000, // Length
            ,, , AddressRangeMemory, TypeStatic)
    })
...

Device (PC0C)
{
    Name (_HID, "PNP0A08" /* PCI Express Bus */)  // _HID: Hardware ID
    Name (_CID, "PNP0A03" /* PCI Bus */)  // _CID: Compatible ID
    Name (_BBN, 0x0C)  // _BBN: BIOS Bus Number
    Name (_UID, 0x0C)  // _UID: Unique ID
... 
    Name (_CRS, ResourceTemplate ()  // _CRS: Current Resource Settings
    {
// IO space big enough for enumerated devices on first PXB
// For DT I gave no IO space to PXBs for now.
        DWordIO (ResourceProducer, MinFixed, MaxFixed, PosDecode, EntireRange,
            0x00000000,         // Granularity
            0x00001000,         // Range Minimum
            0x00001FFF,         // Range Maximum
            0x3EFF0000,         // Translation Offset
            0x00001000,         // Length
            ,, , TypeStatic, DenseTranslation)
//Lower mem region for enumerated devices on first PXB
//DT : 1233f400-18cfcfff : pcie@1233f400
        DWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, NonCacheable, ReadWrite,
            0x00000000,         // Granularity
            0x10400000,         // Range Minimum
            0x10700FFF,         // Range Maximum
            0x00000000,         // Translation Offset
            0x00301000,         // Length
            ,, , AddressRangeMemory, TypeStatic)
//High mem region for enumerated devices on first PXB
//DT: 8600000000-97ffffffff : pcie@1233f400
        QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, NonCacheable, ReadWrite,
            0x0000000000000000, // Granularity
            0x0000008000100000, // Range Minimum
            0x00000080002FFFFF, // Range Maximum
            0x0000000000000000, // Translation Offset
            0x0000000000200000, // Length
            ,, , AddressRangeMemory, TypeStatic)
//PXB covers bus numbers 0xc to 0xf
        WordBusNumber (ResourceProducer, MinFixed, MaxFixed, PosDecode,
            0x0000,             // Granularity
            0x000C,             // Range Minimum
            0x000F,             // Range Maximum
            0x0000,             // Translation Offset
            0x0004,             // Length
            ,, )
    })
Device (PC30)
{
    Name (_HID, "PNP0A08" /* PCI Express Bus */)  // _HID: Hardware ID
    Name (_CID, "PNP0A03" /* PCI Bus */)  // _CID: Compatible ID
    Name (_BBN, 0x30)  // _BBN: BIOS Bus Number
    Name (_UID, 0x30)  // _UID: Unique ID
    Name (_STR, Unicode ("pxb Device"))  // _STR: Description String
    Name (_CCA, One)  // _CCA: Cache Coherency Attribute
    Name (_PRT, Package (0x80)  // _PRT: PCI Routing Table
    Name (_CRS, ResourceTemplate ()  // _CRS: Current Resource Settings
    {
//IO space for 2nd PXB - punched out of the main bus space.
//DT: Again no IO space.
        DWordIO (ResourceProducer, MinFixed, MaxFixed, PosDecode, EntireRange,
            0x00000000,         // Granularity
            0x00002000,         // Range Minimum
            0x00002FFF,         // Range Maximum
            0x3EFF0000,         // Translation Offset
            0x00001000,         // Length
            ,, , TypeStatic, DenseTranslation)
//low mem space for 2nd PXB punched out of main host bridge space.
// DT: 18cfd000-3efeffff : pcie@18cfd000
        DWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, NonCacheable, ReadWrite,
            0x00000000,         // Granularity
            0x10800000,         // Range Minimum
            0x10A00FFF,         // Range Maximum
            0x00000000,         // Translation Offset
            0x00201000,         // Length
            ,, , AddressRangeMemory, TypeStatic)
//high mem space for 2nd PXB punched out main host bridge space.
// DT: 9800000000-ffffffffff : pcie@18cfd000
        QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, NonCacheable, ReadWrite,
            0x0000000000000000, // Granularity
            0x0000008000300000, // Range Minimum
            0x00000080003FFFFF, // Range Maximum
            0x0000000000000000, // Translation Offset
            0x0000000000100000, // Length
            ,, , AddressRangeMemory, TypeStatic)
// Just 2 bus numbers (directly connected device)
        WordBusNumber (ResourceProducer, MinFixed, MaxFixed, PosDecode,
            0x0000,             // Granularity
            0x0030,             // Range Minimum
            0x0031,             // Range Maximum
            0x0000,             // Translation Offset
            0x0002,             // Length
            ,, )
        })

Maybe this mess will help illustrate the issues and approaches.

I personally think the rough approach of this patch is the most sensible.
Don't enumerate the bus at all for the DT case, just do partitioning based
on bus numbers.  In corner cases that will mean the kernel enumeration
fails where things would have worked with the current ACPI / EDK2 dance,
which allows enumeration before the ACPI tables the kernel uses are written.

Jonathan

> 
> thanks
> -- PMM
Re: [RFC] hw/arm/virt: Provide DT binding generation for PCI eXpander Bridges
Posted by Peter Maydell 1 year ago
On Mon, 24 Apr 2023 at 22:56, Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Mon, 24 Apr 2023 16:45:48 +0100
> Peter Maydell <peter.maydell@linaro.org> wrote:
> > On the other hand, having QEMU enumerate PCI devices is *also* a
> > very different model from today, where we assume that the guest
> > code is the one that is going to deal with enumerating PCI devices.
> > To my mind one of the major advantages of PCI is exactly that it
> > is entirely probeable and discoverable, so that there is no need
> > for the dtb to include a lot of information that the kernel can
> > find out for itself by directly asking the hardware...
>
> I absolutely agree that QEMU enumerating PCI seem silly level of complexity
> to introduce. So easy route is to just use the bus numbers to partition
> resources. We have those available without any complexity. It's not the
> same as how it's done with ACPI, but then the alternatives are either
> (though maybe they are closer).  Note current proposed algorithm may be
> too simplistic (perhaps some alignment forcing should adjust the division
> of the resources to start on round number addresses)

I think we definitely need to talk about this later this week,
but my initial view is that if:
 (1) the guest kernel can get the information it needs to do this
     by probing the hardware
 (2) doing it in QEMU gives you "this isn't a great allocation"
     "we don't really have the info we need to do it optimally"
     "this is more of a policy decision" effects
     (which is what it's sounding like to me)

this is a really strong argument for "guest software should be
doing this". DTB-booting kernels has always meant the kernel
doing a lot of work that under ACPI/UEFI/x86-PC is typically
done by firmware, and this seems similar to me.

thanks
-- PMM
Re: [RFC] hw/arm/virt: Provide DT binding generation for PCI eXpander Bridges
Posted by Jonathan Cameron via 1 year ago
On Tue, 25 Apr 2023 09:28:44 +0100
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Mon, 24 Apr 2023 at 22:56, Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> >
> > On Mon, 24 Apr 2023 16:45:48 +0100
> > Peter Maydell <peter.maydell@linaro.org> wrote:  
> > > On the other hand, having QEMU enumerate PCI devices is *also* a
> > > very different model from today, where we assume that the guest
> > > code is the one that is going to deal with enumerating PCI devices.
> > > To my mind one of the major advantages of PCI is exactly that it
> > > is entirely probeable and discoverable, so that there is no need
> > > for the dtb to include a lot of information that the kernel can
> > > find out for itself by directly asking the hardware...  
> >
> > I absolutely agree that QEMU enumerating PCI seem silly level of complexity
> > to introduce. So easy route is to just use the bus numbers to partition
> > resources. We have those available without any complexity. It's not the
> > same as how it's done with ACPI, but then the alternatives are either
> > (though maybe they are closer).  Note current proposed algorithm may be
> > too simplistic (perhaps some alignment forcing should adjust the division
> > of the resources to start on round number addresses)  
> 
> I think we definitely need to talk about this later this week,
> but my initial view is that if:
>  (1) the guest kernel can get the information it needs to do this
>      by probing the hardware

Not currently (from a quick look) - see below. But we could potentially
make it visible by augmenting the config space of the PCIe devices
that are discoverable.  Or we could in theory ignore the bus numbers
that we do provide as QEMU parameters, but that would be rather
surprising for users.

>  (2) doing it in QEMU gives you "this isn't a great allocation"
>      "we don't really have the info we need to do it optimally"
>      "this is more of a policy decision" effects
>      (which is what it's sounding like to me)
> 
> this is a really strong argument for "guest software should be
> doing this". DTB-booting kernels has always meant the kernel
> doing a lot of work that under ACPI/UEFI/x86-PC is typically
> done by firmware, and this seems similar to me.

We could explore only solving the problem for pxb-cxl for now.
However, we would still be talking about kernel infrastructure only
to support emulated CXL devices, and I can see that being
controversial. A normal CXL host bridge is not something
we can enumerate.

I guess this may call for a PoC on the kernel side of things to
see how bad it looks. I suspect very ugly indeed :( but I could
be wrong.

I think we'll also have to augment the PXB PCI devices
that appear on the main bus to provide discoverability they don't
currently have (such as bus numbers).  Probably a DVSEC entry in PCI
extended config space; a purely hypothetical sketch follows the dump below.

Current dump of what is there:

root@debian:~# lspci -s 05.0  -xxxx -vvv
00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
        Subsystem: Red Hat, Inc. QEMU PCIe Expander bridge
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
00: 36 1b 0b 00 00 00 a0 00 00 00 00 06 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 f4 1a 00 11
30: 00 00 00 00 00 00 00 00 00 00 00 00 ff 00 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Which is pretty light on info.
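
As an illustration only of what such an augmentation might look like on
the QEMU side (this is hypothetical: the DVSEC ID, offset and payload
layout are made up, and the PXB function would need to be PCIe with
extended config space for any of this to exist; pcie_add_capability()
and the pci_set_*() accessors are existing QEMU helpers):

#include "qemu/osdep.h"
#include "hw/pci/pci_device.h"
#include "hw/pci/pcie.h"

/*
 * Hypothetical sketch, not existing QEMU code: expose the PXB's bus
 * number range via a vendor specific DVSEC so a DT-booted kernel could
 * discover how the bus number space has been partitioned.
 */
#define PXB_DVSEC_OFFSET   0x100  /* made-up location in extended config space */
#define PXB_DVSEC_LEN      0x10
#define PXB_DVSEC_ID       0x0001 /* made-up DVSEC ID meaning "PXB bus range" */

static void pxb_add_bus_range_dvsec(PCIDevice *d, uint8_t bus_nr,
                                    uint8_t max_bus_nr)
{
    pcie_add_capability(d, PCI_EXT_CAP_ID_DVSEC, 1,
                        PXB_DVSEC_OFFSET, PXB_DVSEC_LEN);
    /* DVSEC header 1 (offset 4): vendor (Red Hat, 0x1b36), rev 0, length */
    pci_set_long(d->config + PXB_DVSEC_OFFSET + 0x4,
                 0x1b36 | (PXB_DVSEC_LEN << 20));
    /* DVSEC header 2 (offset 8): DVSEC ID */
    pci_set_word(d->config + PXB_DVSEC_OFFSET + 0x8, PXB_DVSEC_ID);
    /* Made-up payload: first and last bus number owned by this expander */
    pci_set_byte(d->config + PXB_DVSEC_OFFSET + 0xa, bus_nr);
    pci_set_byte(d->config + PXB_DVSEC_OFFSET + 0xb, max_bus_nr);
}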

Jonathan


> 
> thanks
> -- PMM
Re: [RFC] hw/arm/virt: Provide DT binding generation for PCI eXpander Bridges
Posted by Peter Maydell 1 year ago
On Tue, 25 Apr 2023 at 18:37, Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
> We could explore only solving the problem for pxb-cxl for now.
> However, we would still be talking infrastructure in kernel only
> to support emulated CXL devices and I can see that being
> controversial. A normal CXL host bridge is not something
> we can enumerate.

Hmm, so what is real hardware doing that our emulation is not?

-- PMM
Re: [RFC] hw/arm/virt: Provide DT binding generation for PCI eXpander Bridges
Posted by Jonathan Cameron via 1 year ago
On Tue, 25 Apr 2023 21:15:25 +0100
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Tue, 25 Apr 2023 at 18:37, Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> > We could explore only solving the problem for pxb-cxl for now.
> > However, we would still be talking infrastructure in kernel only
> > to support emulated CXL devices and I can see that being
> > controversial. A normal CXL host bridge is not something
> > we can enumerate.  
> 
> Hmm, so what is real hardware doing that our emulation is not?

For real hardware, the host bridges would not typically share
the various windows.  Various options, but most likely they
would be in different PCI segments, with separate ECAM space,
and separate space into which the BARs would be mapped.
That separation would be needed as the Physical Address
routing in the host would need to direct the reads and
writes to the correct bit of hardware and that tends to
be done with very simple address decoders on the appropriate
interconnects - those internal routing tables are normally
effectively fixed for a given system.  We could add another
full host bridge to arm-virt.  That would require separate
emulation as I don't think we can reuse the pxb-cxl stuff.

The PXB approach is to ignore the normal restrictions on
routing being fairly fixed, by building a fixed configuration
before the OS sees it - allowing different host bridges to use
different parts of a single region.  Arguably very early
boot firmware could do that with a physical system but it
would involve some nasty impdef programming of address decoder
logic.  This would be similar to what firmware does to
change the routing depending on whether it finds a 2 socket
configuration or a 4 socket configuration in servers. What
entity would do this is implementation defined - it may well
be done before any application processor boots.

Jonathan

> 
> -- PMM