[v2] (v)PCI: extended capability handling

[PATCH v2 2/4] PCI: determine whether a device has extended config space

Posted by Jan Beulich

Legacy PCI devices don't have any extended config space. Reading any part
thereof may return all ones or other arbitrary data, e.g. in some cases
base config space contents repeatedly.

Logic follows Linux 6.19-rc's pci_cfg_space_size(), albeit leveraging our
determination of device type; in particular some comments are taken
verbatim from there.
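Once the device-type check has passed, the two probes described in the comments below can be illustrated by a minimal, self-contained sketch. It runs against a simulated 4 KiB config space image instead of real hardware. `cfg_read32()` and `probe_ext_cfg()` are hypothetical stand-ins for illustration, not Xen functions, and the PCI-X 266/533 MHz check and the device-type switch are left out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CFG_SPACE_SIZE      0x100u   /* 256-byte base config space */
#define CFG_SPACE_EXP_SIZE  0x1000u  /* 4096-byte extended config space */

/* Hypothetical stand-in for pci_conf_read32(): little-endian dword read
 * from a simulated config space image. */
static uint32_t cfg_read32(const uint8_t *cfg, unsigned int reg)
{
    return (uint32_t)cfg[reg] | ((uint32_t)cfg[reg + 1] << 8) |
           ((uint32_t)cfg[reg + 2] << 16) | ((uint32_t)cfg[reg + 3] << 24);
}

/* Mirrors the two probes done after the device type check passed. */
static bool probe_ext_cfg(const uint8_t *cfg)
{
    uint32_t sig;
    unsigned int pos;

    /* All ones at offset 0x100: extended config space not accessible. */
    if ( cfg_read32(cfg, CFG_SPACE_SIZE) == 0xffffffffu )
        return false;

    /* Base space repeated every 256 bytes: extended register bits dropped. */
    sig = cfg_read32(cfg, 0);   /* vendor/device ID */
    for ( pos = CFG_SPACE_SIZE; pos < CFG_SPACE_EXP_SIZE;
          pos += CFG_SPACE_SIZE )
        if ( cfg_read32(cfg, pos) != sig )
            break;

    /* Reaching the end means every 256-byte block aliased the base one. */
    return pos < CFG_SPACE_EXP_SIZE;
}
```

Only the first dword of each 256-byte block is compared, as in the patch, so a handful of reads suffices per device.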

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Should we skip re-evaluation when pci_mmcfg_arch_enable() takes its early
exit path?

The warning near the bottom of pci_check_extcfg() may be issued multiple
times for a single device now. Should we try to avoid that?
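One conceivable way to limit the warning to a single instance per device is a per-device "already warned" flag. The sketch below is hypothetical: `struct dev_state`, `alias_warned` and `warn_alias_once()` are illustrative names, not fields or functions added by this patch, and the approach costs one more boolean per device:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-device state; alias_warned is an assumed extra flag. */
struct dev_state {
    const char *sbdf;      /* e.g. "0000:03:00.0" */
    bool alias_warned;     /* set once the alias warning was printed */
};

/* Prints the alias warning at most once per device; returns true if printed. */
static bool warn_alias_once(struct dev_state *d)
{
    if ( d->alias_warned )
        return false;

    d->alias_warned = true;
    printf("%s: extended config space aliases base one\n", d->sbdf);
    return true;
}
```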

Note that no vPCI adjustments are done here, but they're going to be
needed: Whatever requires extended capabilities will need re-
evaluating / newly establishing / tearing down in case an invocation of
PHYSDEVOP_pci_mmcfg_reserved alters global state.

Linux also has CONFIG_PCI_QUIRKS, allowing to compile out the slightly
risky code (as reads may in principle have side effects). Should we gain
such, too?
---
v2: Major re-work to also check upon PHYSDEVOP_pci_mmcfg_reserved
    invocation.

--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -22,6 +22,8 @@ int physdev_map_pirq(struct domain *d, i
                      struct msi_info *msi);
 int physdev_unmap_pirq(struct domain *d, int pirq);
 
+int cf_check physdev_check_pci_extcfg(struct pci_dev *pdev, void *arg);
+
 #include "x86_64/mmconfig.h"
 
 #ifndef COMPAT
@@ -160,6 +162,17 @@ int physdev_unmap_pirq(struct domain *d,
 
     return ret;
 }
+
+int cf_check physdev_check_pci_extcfg(struct pci_dev *pdev, void *arg)
+{
+    const struct physdev_pci_mmcfg_reserved *info = arg;
+
+    ASSERT(pdev->seg == info->segment);
+    if ( pdev->bus >= info->start_bus && pdev->bus <= info->end_bus )
+        pci_check_extcfg(pdev);
+
+    return 0;
+}
 #endif /* COMPAT */
 
 ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
@@ -511,6 +524,11 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H
 
         ret = pci_mmcfg_reserved(info.address, info.segment,
                                  info.start_bus, info.end_bus, info.flags);
+
+        if ( !ret )
+            ret = pci_segment_iterate(info.segment, physdev_check_pci_extcfg,
+                                      &info);
+
         if ( !ret && has_vpci(currd) && (info.flags & XEN_PCI_MMCFG_RESERVED) )
         {
             /*
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -422,6 +422,9 @@ static struct pci_dev *alloc_pdev(struct
     }
 
     apply_quirks(pdev);
+
+    pci_check_extcfg(pdev);
+
     check_pdev(pdev);
 
     return pdev;
@@ -718,6 +721,11 @@ int pci_add_device(u16 seg, u8 bus, u8 d
 
                 list_add(&pdev->vf_list, &pf_pdev->vf_list);
             }
+
+            if ( !pdev->ext_cfg )
+                printk(XENLOG_WARNING
+                       "%pp: VF without extended config space?\n",
+                       &pdev->sbdf);
         }
     }
 
@@ -1041,6 +1049,75 @@ enum pdev_type pdev_type(u16 seg, u8 bus
     return pos ? DEV_TYPE_PCIe_ENDPOINT : DEV_TYPE_PCI;
 }
 
+void pci_check_extcfg(struct pci_dev *pdev)
+{
+    unsigned int pos, sig;
+
+    pdev->ext_cfg = false;
+
+    switch ( pdev->type )
+    {
+    case DEV_TYPE_PCIe_ENDPOINT:
+    case DEV_TYPE_PCIe_BRIDGE:
+    case DEV_TYPE_PCI_HOST_BRIDGE:
+    case DEV_TYPE_PCIe2PCI_BRIDGE:
+    case DEV_TYPE_PCI2PCIe_BRIDGE:
+        break;
+
+    case DEV_TYPE_LEGACY_PCI_BRIDGE:
+    case DEV_TYPE_PCI:
+        pos = pci_find_cap_offset(pdev->sbdf, PCI_CAP_ID_PCIX);
+        if ( !pos ||
+             !(pci_conf_read32(pdev->sbdf, pos + PCI_X_STATUS) &
+               (PCI_X_STATUS_266MHZ | PCI_X_STATUS_533MHZ)) )
+            return;
+        break;
+
+    default:
+        return;
+    }
+
+    /*
+     * Regular PCI devices have 256 bytes, but PCI-X 2 and PCI Express devices
+     * have 4096 bytes.  Even if the device is capable, that doesn't mean we
+     * can access it.  Maybe we don't have a way to generate extended config
+     * space accesses, or the device is behind a reverse Express bridge.  So
+     * we try reading the dword at PCI_CFG_SPACE_SIZE which must either be 0
+     * or a valid extended capability header.
+     */
+    if ( pci_conf_read32(pdev->sbdf, PCI_CFG_SPACE_SIZE) == 0xffffffffU )
+        return;
+
+    /*
+     * PCI Express to PCI/PCI-X Bridge Specification, rev 1.0, 4.1.4 says that
+     * when forwarding a type1 configuration request the bridge must check
+     * that the extended register address field is zero.  The bridge is not
+     * permitted to forward the transactions and must handle it as an
+     * Unsupported Request.  Some bridges do not follow this rule and simply
+     * drop the extended register bits, resulting in the standard config space
+     * being aliased, every 256 bytes across the entire configuration space.
+     * Test for this condition by comparing the first dword of each potential
+     * alias to the vendor/device ID.
+     * Known offenders:
+     *   ASM1083/1085 PCIe-to-PCI Reversible Bridge (1b21:1080, rev 01 & 03)
+     *   AMD/ATI SBx00 PCI to PCI Bridge (1002:4384, rev 40)
+     */
+    sig = pci_conf_read32(pdev->sbdf, PCI_VENDOR_ID);
+    for ( pos = PCI_CFG_SPACE_SIZE;
+          pos < PCI_CFG_SPACE_EXP_SIZE; pos += PCI_CFG_SPACE_SIZE )
+        if ( pci_conf_read32(pdev->sbdf, pos) != sig )
+            break;
+
+    if ( pos >= PCI_CFG_SPACE_EXP_SIZE )
+    {
+        printk(XENLOG_WARNING "%pp: extended config space aliases base one\n",
+               &pdev->sbdf);
+        return;
+    }
+
+    pdev->ext_cfg = true;
+}
+
 /*
  * find the upstream PCIe-to-PCI/PCIX bridge or PCI legacy bridge
  * return 0: the device is integrated PCI device or PCIe
@@ -1841,6 +1918,29 @@ int pci_iterate_devices(int (*handler)(s
     return pci_segments_iterate(iterate_all, &iter) ?: iter.rc;
 }
 
+/* Iterate a single PCI segment, with locking but without preemption. */
+int pci_segment_iterate(unsigned int segment,
+                        int (*handler)(struct pci_dev *pdev, void *arg),
+                        void *arg)
+{
+    struct pci_seg *seg = get_pseg(segment);
+    struct segment_iter iter = {
+        .handler = handler,
+        .arg = arg,
+    };
+
+    if ( !seg )
+        return -ENODEV;
+
+    pcidevs_lock();
+
+    iter.rc = iterate_all(seg, &iter) ?: iter.rc;
+
+    pcidevs_unlock();
+
+    return iter.rc;
+}
+
 /*
  * Local variables:
  * mode: C
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -126,6 +126,9 @@ struct pci_dev {
 
     nodeid_t node; /* NUMA node */
 
+    /* Whether the device has (accessible) extended config space. */
+    bool ext_cfg;
+
     /* Device to be quarantined, don't automatically re-assign to dom0 */
     bool quarantine;
 
@@ -242,6 +245,11 @@ void pci_check_disable_device(u16 seg, u
 int pci_iterate_devices(int (*handler)(struct pci_dev *pdev, void *arg),
                         void *arg);
 
+/* Iterate a single PCI segment, with locking but without preemption. */
+int pci_segment_iterate(unsigned int segment,
+                        int (*handler)(struct pci_dev *pdev, void *arg),
+                        void *arg);
+
 uint8_t pci_conf_read8(pci_sbdf_t sbdf, unsigned int reg);
 uint16_t pci_conf_read16(pci_sbdf_t sbdf, unsigned int reg);
 uint32_t pci_conf_read32(pci_sbdf_t sbdf, unsigned int reg);
@@ -260,6 +268,7 @@ unsigned int pci_find_next_cap_ttl(pci_s
                                    unsigned int *ttl);
 unsigned int pci_find_next_cap(pci_sbdf_t sbdf, unsigned int pos,
                                unsigned int cap);
+void pci_check_extcfg(struct pci_dev *pdev);
 unsigned int pci_find_ext_capability(const struct pci_dev *pdev,
                                      unsigned int cap);
 unsigned int pci_find_next_ext_capability(const struct pci_dev *pdev,

Re: [PATCH v2 2/4] PCI: determine whether a device has extended config space

Posted by Roger Pau Monné

On Mon, Jan 19, 2026 at 03:46:55PM +0100, Jan Beulich wrote:
> Legacy PCI devices don't have any extended config space. Reading any part
> thereof may return all ones or other arbitrary data, e.g. in some cases
> base config space contents repeatedly.
> 
> Logic follows Linux 6.19-rc's pci_cfg_space_size(), albeit leveraging our
> determination of device type; in particular some comments are taken
> verbatim from there.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Should we skip re-evaluation when pci_mmcfg_arch_enable() takes its early
> exit path?

Possibly - we expect no change in that case.  However it would need
to propagate some extra information into the callers.  I could see
that as a followup optimization.

> The warning near the bottom of pci_check_extcfg() may be issued multiple
> times for a single device now. Should we try to avoid that?

Yeah, I've made some comments about that below.  Not sure how common
those broken bridges are, the comment just mentions two specific
models.  Adding yet another boolean to track that is cumbersome, and
what we would like to do is mark the bridge as broken, instead of
every device behind it.

> Note that no vPCI adjustments are done here, but they're going to be
> needed: Whatever requires extended capabilities will need re-
> evaluating / newly establishing / tearing down in case an invocation of
> PHYSDEVOP_pci_mmcfg_reserved alters global state.

Hm, you probably want to do something similar to re-scanning the
capability list, but avoid tearing down and re-setting the vPCI header
logic to prevent unneeded p2m manipulations.  We have no easy way to
preempt this rescanning from the context of a
PHYSDEVOP_pci_mmcfg_reserved call.

> Linux also has CONFIG_PCI_QUIRKS, allowing to compile out the slightly
> risky code (as reads may in principle have side effects). Should we gain
> such, too?

I would be fine with just a command line to disable the newly added
behavior in case it causes issues.

> ---
> v2: Major re-work to also check upon PHYSDEVOP_pci_mmcfg_reserved
>     invocation.
> 
> --- a/xen/arch/x86/physdev.c
> +++ b/xen/arch/x86/physdev.c
> @@ -22,6 +22,8 @@ int physdev_map_pirq(struct domain *d, i
>                       struct msi_info *msi);
>  int physdev_unmap_pirq(struct domain *d, int pirq);
>  
> +int cf_check physdev_check_pci_extcfg(struct pci_dev *pdev, void *arg);

I'm not sure why you need the forward declaration here, the function
(in this patch) is just used after it's already defined.

> +
>  #include "x86_64/mmconfig.h"
>  
>  #ifndef COMPAT
> @@ -160,6 +162,17 @@ int physdev_unmap_pirq(struct domain *d,
>  
>      return ret;
>  }
> +
> +int cf_check physdev_check_pci_extcfg(struct pci_dev *pdev, void *arg)

You can make this static I think?

> +{
> +    const struct physdev_pci_mmcfg_reserved *info = arg;
> +
> +    ASSERT(pdev->seg == info->segment);
> +    if ( pdev->bus >= info->start_bus && pdev->bus <= info->end_bus )
> +        pci_check_extcfg(pdev);
> +
> +    return 0;
> +}
>  #endif /* COMPAT */
>  
>  ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> @@ -511,6 +524,11 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H
>  
>          ret = pci_mmcfg_reserved(info.address, info.segment,
>                                   info.start_bus, info.end_bus, info.flags);
> +
> +        if ( !ret )
> +            ret = pci_segment_iterate(info.segment, physdev_check_pci_extcfg,
> +                                      &info);
> +
>          if ( !ret && has_vpci(currd) && (info.flags & XEN_PCI_MMCFG_RESERVED) )
>          {
>              /*
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -422,6 +422,9 @@ static struct pci_dev *alloc_pdev(struct
>      }
>  
>      apply_quirks(pdev);
> +
> +    pci_check_extcfg(pdev);
> +
>      check_pdev(pdev);
>  
>      return pdev;
> @@ -718,6 +721,11 @@ int pci_add_device(u16 seg, u8 bus, u8 d
>  
>                  list_add(&pdev->vf_list, &pf_pdev->vf_list);
>              }
> +
> +            if ( !pdev->ext_cfg )
> +                printk(XENLOG_WARNING
> +                       "%pp: VF without extended config space?\n",
> +                       &pdev->sbdf);

You possibly also want to check that the PF (pf_pdev in this context I
think) also has ext_cfg == true.

>          }
>      }
>  
> @@ -1041,6 +1049,75 @@ enum pdev_type pdev_type(u16 seg, u8 bus
>      return pos ? DEV_TYPE_PCIe_ENDPOINT : DEV_TYPE_PCI;
>  }
>  
> +void pci_check_extcfg(struct pci_dev *pdev)
> +{
> +    unsigned int pos, sig;
> +
> +    pdev->ext_cfg = false;

I think I would prefer if the ext_cfg field is only modified once Xen
knows the correct value to put there.  It would also be nice to detect
cases where the device has pdev->ext_cfg == true but a new scan makes
it switch to false.  Which would signal something has likely gone very
wrong, and we should print a warning.
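The suggestion above (compute the result first, store it once, and flag a true-to-false transition) could look roughly like this. It is a hypothetical sketch: `struct fake_pdev` and `commit_ext_cfg()` are illustrative names, `probe` stands in for the outcome of the type, all-ones and alias checks, and the message text is made up:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical minimal device record for illustration. */
struct fake_pdev {
    const char *name;
    bool ext_cfg;
};

/*
 * Stores the freshly computed result in a single write; returns true (and
 * prints a hypothetical warning) if previously accessible extended config
 * space became inaccessible.
 */
static bool commit_ext_cfg(struct fake_pdev *pdev, bool probe)
{
    bool lost = pdev->ext_cfg && !probe;

    if ( lost )
        printf("%s: extended config space no longer accessible\n",
               pdev->name);

    pdev->ext_cfg = probe;
    return lost;
}
```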

> +
> +    switch ( pdev->type )
> +    {
> +    case DEV_TYPE_PCIe_ENDPOINT:
> +    case DEV_TYPE_PCIe_BRIDGE:
> +    case DEV_TYPE_PCI_HOST_BRIDGE:
> +    case DEV_TYPE_PCIe2PCI_BRIDGE:
> +    case DEV_TYPE_PCI2PCIe_BRIDGE:
> +        break;
> +
> +    case DEV_TYPE_LEGACY_PCI_BRIDGE:
> +    case DEV_TYPE_PCI:
> +        pos = pci_find_cap_offset(pdev->sbdf, PCI_CAP_ID_PCIX);
> +        if ( !pos ||
> +             !(pci_conf_read32(pdev->sbdf, pos + PCI_X_STATUS) &
> +               (PCI_X_STATUS_266MHZ | PCI_X_STATUS_533MHZ)) )
> +            return;
> +        break;
> +
> +    default:
> +        return;
> +    }
> +
> +    /*
> +     * Regular PCI devices have 256 bytes, but PCI-X 2 and PCI Express devices
> +     * have 4096 bytes.  Even if the device is capable, that doesn't mean we
> +     * can access it.  Maybe we don't have a way to generate extended config
> +     * space accesses, or the device is behind a reverse Express bridge.  So
> +     * we try reading the dword at PCI_CFG_SPACE_SIZE which must either be 0
> +     * or a valid extended capability header.
> +     */
> +    if ( pci_conf_read32(pdev->sbdf, PCI_CFG_SPACE_SIZE) == 0xffffffffU )
> +        return;
> +
> +    /*
> +     * PCI Express to PCI/PCI-X Bridge Specification, rev 1.0, 4.1.4 says that
> +     * when forwarding a type1 configuration request the bridge must check
> +     * that the extended register address field is zero.  The bridge is not
> +     * permitted to forward the transactions and must handle it as an
> +     * Unsupported Request.  Some bridges do not follow this rule and simply
> +     * drop the extended register bits, resulting in the standard config space
> +     * being aliased, every 256 bytes across the entire configuration space.
> +     * Test for this condition by comparing the first dword of each potential
> +     * alias to the vendor/device ID.
> +     * Known offenders:
> +     *   ASM1083/1085 PCIe-to-PCI Reversible Bridge (1b21:1080, rev 01 & 03)
> +     *   AMD/ATI SBx00 PCI to PCI Bridge (1002:4384, rev 40)
> +     */
> +    sig = pci_conf_read32(pdev->sbdf, PCI_VENDOR_ID);
> +    for ( pos = PCI_CFG_SPACE_SIZE;
> +          pos < PCI_CFG_SPACE_EXP_SIZE; pos += PCI_CFG_SPACE_SIZE )
> +        if ( pci_conf_read32(pdev->sbdf, pos) != sig )
> +            break;
> +
> +    if ( pos >= PCI_CFG_SPACE_EXP_SIZE )
> +    {
> +        printk(XENLOG_WARNING "%pp: extended config space aliases base one\n",
> +               &pdev->sbdf);

Hm, I think this shouldn't be very common as it seems limited to a
short list of bridges.  However every device under such bridge would
be affected and repeatedly print the message.  I wonder whether we
should make this XENLOG_DEBUG instead, there isn't much the user can
do to fix it.  More a rant than a request though.

Thanks, Roger.

Re: [PATCH v2 2/4] PCI: determine whether a device has extended config space

Posted by Jan Beulich

On 28.01.2026 18:49, Roger Pau Monné wrote:
> On Mon, Jan 19, 2026 at 03:46:55PM +0100, Jan Beulich wrote:
>> Legacy PCI devices don't have any extended config space. Reading any part
>> thereof may return all ones or other arbitrary data, e.g. in some cases
>> base config space contents repeatedly.
>>
>> Logic follows Linux 6.19-rc's pci_cfg_space_size(), albeit leveraging our
>> determination of device type; in particular some comments are taken
>> verbatim from there.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> Should we skip re-evaluation when pci_mmcfg_arch_enable() takes its early
>> exit path?
> 
> Possibly - we expect no change in that case.  However it would need
> to propagate some extra information into the callers.  I could see
> that as a followup optimization.

Okay, with Stewart also saying so I'll make this a follow-on then.

>> Note that no vPCI adjustments are done here, but they're going to be
>> needed: Whatever requires extended capabilities will need re-
>> evaluating / newly establishing / tearing down in case an invocation of
>> PHYSDEVOP_pci_mmcfg_reserved alters global state.
> 
> Hm, you probably want to do something similar to re-scanning the
> capability list, but avoid tearing down and re-setting the vPCI header
> logic to prevent unneeded p2m manipulations.  We have no easy way to
> preempt this rescanning from the context of a
> PHYSDEVOP_pci_mmcfg_reserved call.

Yes, definitely only re-evaluation of extended capabilities. Note, however,
that once we expose more of them, there might be knock-on effects on the
P2M.

>> Linux also has CONFIG_PCI_QUIRKS, allowing to compile out the slightly
>> risky code (as reads may in principle have side effects). Should we gain
>> such, too?
> 
> I would be fine with just a command line to disable the newly added
> behavior in case it causes issues.

Can do. Will need to get creative as to the name of such an option.

>> --- a/xen/arch/x86/physdev.c
>> +++ b/xen/arch/x86/physdev.c
>> @@ -22,6 +22,8 @@ int physdev_map_pirq(struct domain *d, i
>>                       struct msi_info *msi);
>>  int physdev_unmap_pirq(struct domain *d, int pirq);
>>  
>> +int cf_check physdev_check_pci_extcfg(struct pci_dev *pdev, void *arg);
> 
> I'm not sure why you need the forward declaration here, the function
> (in this patch) is just used after it's already defined.

Well, this is needed for the same reason that the two decls just above are:
The file is also used for the COMPAT variant of the hypercall, and hence
the declaration needs to be visible there, while ...

>> @@ -160,6 +162,17 @@ int physdev_unmap_pirq(struct domain *d,
>>  
>>      return ret;
>>  }
>> +
>> +int cf_check physdev_check_pci_extcfg(struct pci_dev *pdev, void *arg)
> 
> You can make this static I think?

... the definition doesn't need building a 2nd time (which hence also
can't be static).

>> @@ -718,6 +721,11 @@ int pci_add_device(u16 seg, u8 bus, u8 d
>>  
>>                  list_add(&pdev->vf_list, &pf_pdev->vf_list);
>>              }
>> +
>> +            if ( !pdev->ext_cfg )
>> +                printk(XENLOG_WARNING
>> +                       "%pp: VF without extended config space?\n",
>> +                       &pdev->sbdf);
> 
> You possibly also want to check that the PF (pf_pdev in this context I
> think) also has ext_cfg == true.

I don't think so. No extended config space on a PF means no PF in that sense
in the first place, as there then wouldn't be any SR-IOV capability.
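The reasoning relies on SR-IOV being an extended capability (ID 0x0010), whose list starts at offset 0x100; without extended config space it can't be found. A minimal sketch of walking that list over a simulated config space image follows. `rd32()`, `wr32()` and `find_ext_cap()` are hypothetical helpers (not Xen's `pci_find_ext_capability()`), and the loop bound assumes each capability occupies at least 8 bytes:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EXT_CAP_START     0x100u
#define EXT_CFG_END       0x1000u
#define EXT_CAP_ID_SRIOV  0x0010u   /* SR-IOV extended capability ID */

static uint32_t rd32(const uint8_t *cfg, unsigned int reg)
{
    return (uint32_t)cfg[reg] | ((uint32_t)cfg[reg + 1] << 8) |
           ((uint32_t)cfg[reg + 2] << 16) | ((uint32_t)cfg[reg + 3] << 24);
}

/* Little-endian dword write, used to build simulated images. */
static void wr32(uint8_t *cfg, unsigned int reg, uint32_t val)
{
    for ( unsigned int i = 0; i < 4; i++ )
        cfg[reg + i] = (uint8_t)(val >> (8 * i));
}

/* Header: ID in bits 15:0, version in 19:16, next offset in 31:20. */
static unsigned int find_ext_cap(const uint8_t *cfg, unsigned int cap)
{
    unsigned int pos = EXT_CAP_START;
    unsigned int ttl = (EXT_CFG_END - EXT_CAP_START) / 8; /* loop guard */

    while ( ttl-- )
    {
        uint32_t hdr = rd32(cfg, pos);

        if ( hdr == 0 || hdr == 0xffffffffu )  /* empty or inaccessible */
            return 0;
        if ( (hdr & 0xffff) == cap )
            return pos;
        pos = (hdr >> 20) & 0xffc;
        if ( pos < EXT_CAP_START )
            return 0;
    }

    return 0;
}
```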

>> @@ -1041,6 +1049,75 @@ enum pdev_type pdev_type(u16 seg, u8 bus
>>      return pos ? DEV_TYPE_PCIe_ENDPOINT : DEV_TYPE_PCI;
>>  }
>>  
>> +void pci_check_extcfg(struct pci_dev *pdev)
>> +{
>> +    unsigned int pos, sig;
>> +
>> +    pdev->ext_cfg = false;
> 
> I think I would prefer if the ext_cfg field is only modified once Xen
> knows the correct value to put there.

Well, my main point of doing it this way is that the code ends up being a
little easier to follow. Especially without the optimization talked about
near the top, there inevitably will be a window in time where what the
field says is wrong. With the optimization there'll be two main cases:
- MCFG becoming newly available: The field starts out false in this case,
  i.e. the write above is a no-op.
- MCFG disappearing (largely hypothetical, I think): The field may start
  out true in this case, but will go false unless we have another access
  mechanism for extended config space. It then can as well be set to
  false as early as possible.

>  It would also be nice to detect
> cases where the device has pdev->ext_cfg == true but a new scan makes
> it switch to false.  Which would signal something has likely gone very
> wrong, and we should print a warning.

Why "very wrong"? If Dom0 tells us that MCFG shouldn't be used, there's
nothing "very wrong" with that. It's simply what firmware / ACPI are
telling us.

>> +    /*
>> +     * PCI Express to PCI/PCI-X Bridge Specification, rev 1.0, 4.1.4 says that
>> +     * when forwarding a type1 configuration request the bridge must check
>> +     * that the extended register address field is zero.  The bridge is not
>> +     * permitted to forward the transactions and must handle it as an
>> +     * Unsupported Request.  Some bridges do not follow this rule and simply
>> +     * drop the extended register bits, resulting in the standard config space
>> +     * being aliased, every 256 bytes across the entire configuration space.
>> +     * Test for this condition by comparing the first dword of each potential
>> +     * alias to the vendor/device ID.
>> +     * Known offenders:
>> +     *   ASM1083/1085 PCIe-to-PCI Reversible Bridge (1b21:1080, rev 01 & 03)
>> +     *   AMD/ATI SBx00 PCI to PCI Bridge (1002:4384, rev 40)
>> +     */
>> +    sig = pci_conf_read32(pdev->sbdf, PCI_VENDOR_ID);
>> +    for ( pos = PCI_CFG_SPACE_SIZE;
>> +          pos < PCI_CFG_SPACE_EXP_SIZE; pos += PCI_CFG_SPACE_SIZE )
>> +        if ( pci_conf_read32(pdev->sbdf, pos) != sig )
>> +            break;
>> +
>> +    if ( pos >= PCI_CFG_SPACE_EXP_SIZE )
>> +    {
>> +        printk(XENLOG_WARNING "%pp: extended config space aliases base one\n",
>> +               &pdev->sbdf);
> 
> Hm, I think this shouldn't be very common as it seems limited to a
> short list of bridges.  However every device under such bridge would
> be affected and repeatedly print the message.  I wonder whether we
> should make this XENLOG_DEBUG instead, there isn't much the user can
> do to fix it.  More a rant than a request though.

XENLOG_DEBUG feels too weak for indicating a potential problem with a device.
I also don't see us marking bridges to limit the verbosity here, as the
issue may or may not be due to a bridge in between. Imo we can defer thinking
about limiting verbosity here until we see reports of this actually getting
overly verbose.

Jan

Re: [PATCH v2 2/4] PCI: determine whether a device has extended config space

Posted by Roger Pau Monné

On Thu, Jan 29, 2026 at 09:15:30AM +0100, Jan Beulich wrote:
> On 28.01.2026 18:49, Roger Pau Monné wrote:
> > On Mon, Jan 19, 2026 at 03:46:55PM +0100, Jan Beulich wrote:
> >> Legacy PCI devices don't have any extended config space. Reading any part
> >> thereof may return all ones or other arbitrary data, e.g. in some cases
> >> base config space contents repeatedly.
> >>
> >> Logic follows Linux 6.19-rc's pci_cfg_space_size(), albeit leveraging our
> >> determination of device type; in particular some comments are taken
> >> verbatim from there.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

> >> ---
> >> Should we skip re-evaluation when pci_mmcfg_arch_enable() takes its early
> >> exit path?
> > 
> > Possibly - we expect no change in that case.  However it would need
> > to propagate some extra information into the callers.  I could see
> > that as a followup optimization.
> 
> Okay, with Stewart also saying so I'll make this a follow-on then.
> 
> >> Note that no vPCI adjustments are done here, but they're going to be
> >> needed: Whatever requires extended capabilities will need re-
> >> evaluating / newly establishing / tearing down in case an invocation of
> >> PHYSDEVOP_pci_mmcfg_reserved alters global state.
> > 
> > Hm, you probably want to do something similar to re-scanning the
> > capability list, but avoid tearing down and re-setting the vPCI header
> > logic to prevent unneeded p2m manipulations.  We have no easy way to
> > preempt this rescanning from the context of a
> > PHYSDEVOP_pci_mmcfg_reserved call.
> 
> Yes, definitely only re-evaluation of extended capabilities. Note, however,
> that once we expose more of them, there might be knock-on effects on the
> P2M.

Preemption in that case will be complicated, as we would have to defer
p2m operations from multiple devices in the context of a hypercall.
I guess we will cross that bridge when we get there.

> >> Linux also has CONFIG_PCI_QUIRKS, allowing to compile out the slightly
> >> risky code (as reads may in principle have side effects). Should we gain
> >> such, too?
> > 
> > I would be fine with just a command line to disable the newly added
> > behavior in case it causes issues.
> 
> Can do. Will need to get creative as to the name of such an option.

pci=check-ext-cfg=<bool>?  Kind of a mouthful.

> >> --- a/xen/arch/x86/physdev.c
> >> +++ b/xen/arch/x86/physdev.c
> >> @@ -22,6 +22,8 @@ int physdev_map_pirq(struct domain *d, i
> >>                       struct msi_info *msi);
> >>  int physdev_unmap_pirq(struct domain *d, int pirq);
> >>  
> >> +int cf_check physdev_check_pci_extcfg(struct pci_dev *pdev, void *arg);
> > 
> > I'm not sure why you need the forward declaration here, the function
> > (in this patch) is just used after it's already defined.
> 
> Well, this is needed for the same reason that the two decls just above are:
> The file is also used for the COMPAT variant of the hypercall, and hence
> the declaration needs to be visible there, while ...
> 
> >> @@ -160,6 +162,17 @@ int physdev_unmap_pirq(struct domain *d,
> >>  
> >>      return ret;
> >>  }
> >> +
> >> +int cf_check physdev_check_pci_extcfg(struct pci_dev *pdev, void *arg)
> > 
> > You can make this static I think?
> 
> ... the definition doesn't need building a 2nd time (which hence also
> can't be static).

Oh, I see.

> >> @@ -718,6 +721,11 @@ int pci_add_device(u16 seg, u8 bus, u8 d
> >>  
> >>                  list_add(&pdev->vf_list, &pf_pdev->vf_list);
> >>              }
> >> +
> >> +            if ( !pdev->ext_cfg )
> >> +                printk(XENLOG_WARNING
> >> +                       "%pp: VF without extended config space?\n",
> >> +                       &pdev->sbdf);
> > 
> > You possibly also want to check that the PF (pf_pdev in this context I
> > think) also has ext_cfg == true.
> 
> I don't think so. No extended config space on a PF means no PF in that sense
> in the first place, as there then wouldn't be any SR-IOV capability.

Right, but won't it be possible for Xen to not be aware of the
ECAM region for that device, yet the hardware domain somehow managed
to enable SR-IOV on it and request to register a VF?

I'm not saying it's common, but it might be a useful sanity check.

> >> @@ -1041,6 +1049,75 @@ enum pdev_type pdev_type(u16 seg, u8 bus
> >>      return pos ? DEV_TYPE_PCIe_ENDPOINT : DEV_TYPE_PCI;
> >>  }
> >>  
> >> +void pci_check_extcfg(struct pci_dev *pdev)
> >> +{
> >> +    unsigned int pos, sig;
> >> +
> >> +    pdev->ext_cfg = false;
> > 
> > I think I would prefer if the ext_cfg field is only modified once Xen
> > knows the correct value to put there.
> 
> Well, my main point of doing it this way is that the code ends up being a
> little easier to follow. Especially without the optimization talked about
> near the top, there inevitably will be a window in time where what the
> field says is wrong. With the optimization there'll be two main cases:
> - MCFG becoming newly available: The field starts out false in this case,
>   i.e. the write above is a no-op.
> - MCFG disappearing (largely hypothetical, I think): The field may start
>   out true in this case, but will go false unless we have another access
>   mechanism for extended config space. It then can as well be set to
>   false as early as possible.

Yes, with the optimization to not re-parse existing MMCFGs there's no
transient window where the field is wrongly set.

I also think the registering of MMCFG areas with Xen should be done
ahead of the OS attempting to access the config space, and hence it's
not possible for there to be in-flight accesses that could see
transient invalid pdev->ext_cfg values.

> >  It would also be nice to detect
> > cases where the device has pdev->ext_cfg == true but a new scan makes
> > it switch to false.  Which would signal something has likely gone very
> > wrong, and we should print a warning.
> 
> Why "very wrong"? If Dom0 tells us that MCFG shouldn't be used, there's
> nothing "very wrong" with that. It's simply what firmware / ACPI are
> telling us.

There's also a message printed by `pci_mmcfg_arch_disable()` when the
MMCFG is disabled, so likely we don't need a message printed by each
device.

> >> +    /*
> >> +     * PCI Express to PCI/PCI-X Bridge Specification, rev 1.0, 4.1.4 says that
> >> +     * when forwarding a type1 configuration request the bridge must check
> >> +     * that the extended register address field is zero.  The bridge is not
> >> +     * permitted to forward the transactions and must handle it as an
> >> +     * Unsupported Request.  Some bridges do not follow this rule and simply
> >> +     * drop the extended register bits, resulting in the standard config space
> >> +     * being aliased, every 256 bytes across the entire configuration space.
> >> +     * Test for this condition by comparing the first dword of each potential
> >> +     * alias to the vendor/device ID.
> >> +     * Known offenders:
> >> +     *   ASM1083/1085 PCIe-to-PCI Reversible Bridge (1b21:1080, rev 01 & 03)
> >> +     *   AMD/ATI SBx00 PCI to PCI Bridge (1002:4384, rev 40)
> >> +     */
> >> +    sig = pci_conf_read32(pdev->sbdf, PCI_VENDOR_ID);
> >> +    for ( pos = PCI_CFG_SPACE_SIZE;
> >> +          pos < PCI_CFG_SPACE_EXP_SIZE; pos += PCI_CFG_SPACE_SIZE )
> >> +        if ( pci_conf_read32(pdev->sbdf, pos) != sig )
> >> +            break;
> >> +
> >> +    if ( pos >= PCI_CFG_SPACE_EXP_SIZE )
> >> +    {
> >> +        printk(XENLOG_WARNING "%pp: extended config space aliases base one\n",
> >> +               &pdev->sbdf);
> > 
> > Hm, I think this shouldn't be very common as it seems limited to a
> > short list of bridges.  However every device under such bridge would
> > be affected and repeatedly print the message.  I wonder whether we
> > should make this XENLOG_DEBUG instead, there isn't much the user can
> > do to fix it.  More a rant than a request though.
> 
> XENLOG_DEBUG feels too weak for indicating a potential problem with a device.
> I also don't see us marking bridges to limit the verbosity here, as the
> issue may or may not be due to a bridge in between. Imo we can defer thinking
> about limiting verbosity here until we see reports of this actually getting
> overly verbose.

OK, let's try with the current level then.

Thanks, Roger.

Re: [PATCH v2 2/4] PCI: determine whether a device has extended config space

Posted by Jan Beulich

On 29.01.2026 11:33, Roger Pau Monné wrote:
> On Thu, Jan 29, 2026 at 09:15:30AM +0100, Jan Beulich wrote:
>> On 28.01.2026 18:49, Roger Pau Monné wrote:
>>> On Mon, Jan 19, 2026 at 03:46:55PM +0100, Jan Beulich wrote:
>>>> Legacy PCI devices don't have any extended config space. Reading any part
>>>> thereof may return all ones or other arbitrary data, e.g. in some cases
>>>> base config space contents repeatedly.
>>>>
>>>> Logic follows Linux 6.19-rc's pci_cfg_space_size(), albeit leveraging our
>>>> determination of device type; in particular some comments are taken
>>>> verbatim from there.
>>>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks. There'll be a v3 though in any event, to add the command line option
that you asked for. I'll take the liberty to retain the ack (whereas I've
dropped Stewart's R-b).

>>>> Linux also has CONFIG_PCI_QUIRKS, allowing to compile out the slightly
>>>> risky code (as reads may in principle have side effects). Should we gain
>>>> such, too?
>>>
>>> I would be fine with just a command line to disable the newly added
>>> behavior in case it causes issues.
>>
>> Can do. Will need to get creative as to the name of such an option.
> 
> pci=check-ext-cfg=<bool>?  Kind of a mouthful.

As we already have "pci=", I've added a "pci=no-quirks" sub-option there.
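A sub-option like this amounts to one more token in a comma-separated list, with a "no-" prefix negating it. The sketch below is a self-contained illustration only: `parse_pci_opts()` and the `pci_quirks` flag are hypothetical names, and Xen's actual "pci=" handling uses its own parsing helpers, which this does not reproduce.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical global: true means the extended config space probing runs. */
static bool pci_quirks = true;

/*
 * Parse a comma-separated "pci=" value such as "no-quirks".  Only the
 * "quirks" sub-option is known here; a "no-" prefix turns it off.
 * Returns 0 on success, -1 if any sub-option was not recognised.
 */
static int parse_pci_opts(const char *s)
{
    int rc = 0;

    assert(s != NULL);
    while ( *s )
    {
        const char *end = strchr(s, ',');
        size_t len = end ? (size_t)(end - s) : strlen(s);
        bool val = true;

        if ( len > 3 && !strncmp(s, "no-", 3) )
        {
            val = false;
            s += 3;
            len -= 3;
        }

        if ( len == strlen("quirks") && !strncmp(s, "quirks", len) )
            pci_quirks = val;
        else
            rc = -1;            /* unknown sub-option; keep parsing */

        s += len;
        if ( *s == ',' )
            ++s;
    }

    return rc;
}
```

With this shape, "pci=no-quirks" clears the flag, "pci=quirks" sets it again, and unknown tokens are reported without aborting the parse of the remaining ones.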

>>>> @@ -718,6 +721,11 @@ int pci_add_device(u16 seg, u8 bus, u8 d
>>>>  
>>>>                  list_add(&pdev->vf_list, &pf_pdev->vf_list);
>>>>              }
>>>> +
>>>> +            if ( !pdev->ext_cfg )
>>>> +                printk(XENLOG_WARNING
>>>> +                       "%pp: VF without extended config space?\n",
>>>> +                       &pdev->sbdf);
>>>
>>> You possibly also want to check that the PF (pf_pdev in this context I
>>> think) also has ext_cfg == true.
>>
>> I don't think so. No extended config space on a PF means no PF in that sense
>> in the first place, for then there not being any SR-IOV capability.
> 
> Right, but won't it be possible for Xen to not be aware of the
> ECAM region for that device, yet the hardware domain somehow managed
> to enable SR-IOV on it and request to register a VF?

Then we're screwed elsewhere, when we try to read the SR-IOV capability
ourselves.

>>>> @@ -1041,6 +1049,75 @@ enum pdev_type pdev_type(u16 seg, u8 bus
>>>>      return pos ? DEV_TYPE_PCIe_ENDPOINT : DEV_TYPE_PCI;
>>>>  }
>>>>  
>>>> +void pci_check_extcfg(struct pci_dev *pdev)
>>>> +{
>>>> +    unsigned int pos, sig;
>>>> +
>>>> +    pdev->ext_cfg = false;
>>>
>>> I think I would prefer if the ext_cfg field is only modified once Xen
>>> knows the correct value to put there.
>>
>> Well, my main point of doing it this way is that the code ends up being a
>> little easier to follow. Especially without the optimization talked about
>> near the top, there inevitably will be a window in time where what the
>> field says is wrong. With the optimization there'll be two main cases:
>> - MCFG becoming newly available: The field starts out false in this case,
>>   i.e. the write above is a no-op.
>> - MCFG disappearing (largely hypothetical, I think): The field may start
>>   out true in this case, but will go false unless we have another access
>>   mechanism for extended config space. It then can as well be set to
>>   false as early as possible.
> 
> Yes, with the optimization to not re-parse existing MMCFGs there's no
> transient windows where the field is wrongly set.

When there's no change. When there is a change, there'll still be such a
window. Unavoidably, though.

> I also think the registering of MMCFG areas with Xen should be done
> ahead of the OS attempting to access the config space, and hence it's
> not possible for there to be in-flight accesses that could see
> transient invalid pdev->ext_cfg values.

Yes, for Dom0 alone all should be fine. My worry here is with dom0less
and Hyperlaunch.

Jan

Re: [PATCH v2 2/4] PCI: determine whether a device has extended config space

Posted by Stewart Hildebrand 2 weeks, 2 days ago

On 1/19/26 09:46, Jan Beulich wrote:
> Legacy PCI devices don't have any extended config space. Reading any part
> thereof may return all ones or other arbitrary data, e.g. in some cases
> base config space contents repeatedly.
> 
> Logic follows Linux 6.19-rc's pci_cfg_space_size(), albeit leveraging our
> determination of device type; in particular some comments are taken
> verbatim from there.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Stewart Hildebrand <stewart.hildebrand@amd.com>

> ---
> Should we skip re-evaluation when pci_mmcfg_arch_enable() takes its early
> exit path?

I don't have a strong opinion here, though I'm leaning toward it's OK as is.

> 
> The warning near the bottom of pci_check_extcfg() may be issued multiple
> times for a single device now. Should we try to avoid that?

If I'm reading Linux drivers/xen/pci.c correctly, my understanding is that dom0
will only invoke PHYSDEVOP_pci_mmcfg_reserved once per mmcfg segment, so in
practice it's unlikely to spam. In other words, I think it's OK as is.

> 
> Note that no vPCI adjustments are done here, but they're going to be
> needed: Whatever requires extended capabilities will need re-
> evaluating / newly establishing / tearing down in case an invocation of
> PHYSDEVOP_pci_mmcfg_reserved alters global state.

Agreed. The current patch is prerequisite for this work. Hm, perhaps we could
create a gitlab ticket for it?

Re: [PATCH v2 2/4] PCI: determine whether a device has extended config space

Posted by Jan Beulich 1 week, 6 days ago

On 23.01.2026 23:24, Stewart Hildebrand wrote:
> On 1/19/26 09:46, Jan Beulich wrote:
>> Legacy PCI devices don't have any extended config space. Reading any part
>> thereof may return all ones or other arbitrary data, e.g. in some cases
>> base config space contents repeatedly.
>>
>> Logic follows Linux 6.19-rc's pci_cfg_space_size(), albeit leveraging our
>> determination of device type; in particular some comments are taken
>> verbatim from there.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Reviewed-by: Stewart Hildebrand <stewart.hildebrand@amd.com>

Thanks, but see below (as that may change your take on it).

>> ---
>> Should we skip re-evaluation when pci_mmcfg_arch_enable() takes its early
>> exit path?
> 
> I don't have a strong opinion here, though I'm leaning toward it's OK as is.

Maybe I need to add more context here. Not short-circuiting means that for
a brief moment ->ext_cfg for a device can be wrong - between
pci_check_extcfg() clearing it and then setting it again once all checks
have passed. As long as only Dom0 is executing at that time, and assuming
Dom0 actually issues the notification ahead of itself playing with
individual devices covered by it, all is going to be fine. With
hyperlaunch, however, DomU-s can't be told "not to fiddle" with devices
they've been assigned.

With the yet-to-be-written vPCI counterpart changes the window is actually
going to get bigger for DomU-s using vPCI.

For hyperlaunch this is going to be interesting anyway, on systems like
the one you mentioned. First, without Dom0 / hwdom, how would we even
learn we can use MCFG? And even with hwdom, how would we keep DomU-s from
accessing the devices they were passed until ->ext_cfg has obtained its
final state for them (and vPCI reached proper state, too)?
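The short-circuit being discussed can be modelled in a few lines. In this hypothetical sketch, `mcfg_enable()` stands in for pci_mmcfg_arch_enable() and is assumed to report whether it took its early-exit path (range already enabled). When it did, the notification handler returns without touching any device, so no transient `ext_cfg = false` is ever visible. All types and names are illustrative, not Xen's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model: one segment, up to 8 MCFG bus ranges. */
struct mcfg_range { unsigned int start_bus, end_bus; };

static struct mcfg_range ranges[8];
static size_t nr_ranges;

struct mock_pdev { unsigned int bus; bool ext_cfg; };

/* Stand-in for pci_mmcfg_arch_enable(): false means "already enabled". */
static bool mcfg_enable(unsigned int start, unsigned int end)
{
    size_t i;

    for ( i = 0; i < nr_ranges; ++i )
        if ( ranges[i].start_bus == start && ranges[i].end_bus == end )
            return false;               /* early exit: nothing changed */

    assert(nr_ranges < sizeof(ranges) / sizeof(ranges[0]));
    ranges[nr_ranges].start_bus = start;
    ranges[nr_ranges].end_bus = end;
    ++nr_ranges;
    return true;
}

static unsigned int nr_evaluations;

/* Stand-in for pci_check_extcfg(): clears first, sets once checks pass. */
static void check_extcfg(struct mock_pdev *pdev)
{
    ++nr_evaluations;
    pdev->ext_cfg = false;              /* start of the transient window */
    /* ... probing elided; assume the device passes all checks ... */
    pdev->ext_cfg = true;
}

/* Notification handler that skips re-evaluation on a no-op enable. */
static void mmcfg_reserved(struct mock_pdev *devs, size_t nr,
                           unsigned int start, unsigned int end)
{
    size_t i;

    if ( !mcfg_enable(start, end) )
        return;

    for ( i = 0; i < nr; ++i )
        if ( devs[i].bus >= start && devs[i].bus <= end )
            check_extcfg(&devs[i]);
}
```

A repeated notification for the same bus range then leaves every device alone, while a genuinely new range still gets its devices re-evaluated (and still has the window, as noted above for the case where state actually changes).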

>> The warning near the bottom of pci_check_extcfg() may be issued multiple
>> times for a single device now. Should we try to avoid that?
> 
> If I'm reading Linux drivers/xen/pci.c correctly, my understanding is that dom0
> will only invoke PHYSDEVOP_pci_mmcfg_reserved once per mmcfg segment, so in
> practice it's unlikely to spam. In other words, I think it's OK as is.

Yes, it ought to be no more than twice, but that's still one time more
than strictly needed. Hence my (mild) concern.
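Avoiding the second message would cost one bit of per-device state. The sketch below assumes a hypothetical `alias_warned` field (not part of the patch) and uses a mock device type and `printf()` in place of Xen's `printk()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical subset of a PCI device with an added "already warned" bit. */
struct mock_pdev {
    const char *name;
    bool alias_warned;  /* hypothetical field, not in the patch */
};

static unsigned int nr_warnings;

/* Emit the aliasing warning at most once per device. */
static void warn_alias_once(struct mock_pdev *pdev)
{
    assert(pdev != NULL);
    if ( pdev->alias_warned )
        return;

    pdev->alias_warned = true;
    ++nr_warnings;
    printf("%s: extended config space aliases base one\n", pdev->name);
}
```

Whether the extra field is worth it depends on how often the re-evaluation actually happens, which per the above is at most twice per device.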

>> Note that no vPCI adjustments are done here, but they're going to be
>> needed: Whatever requires extended capabilities will need re-
>> evaluating / newly establishing / tearing down in case an invocation of
>> PHYSDEVOP_pci_mmcfg_reserved alters global state.
> 
> Agreed. The current patch is prerequisite for this work. Hm, perhaps we could
> create a gitlab ticket for it?

Personally I'm not a fan of gitlab tickets, and just in case no-one else
gets to it I have this on my todo list already anyway.

Jan

Re: [PATCH v2 2/4] PCI: determine whether a device has extended config space

Posted by Stewart Hildebrand 1 week, 5 days ago

On 1/26/26 03:58, Jan Beulich wrote:
> On 23.01.2026 23:24, Stewart Hildebrand wrote:
>> On 1/19/26 09:46, Jan Beulich wrote:
>>> Legacy PCI devices don't have any extended config space. Reading any part
>>> thereof may return all ones or other arbitrary data, e.g. in some cases
>>> base config space contents repeatedly.
>>>
>>> Logic follows Linux 6.19-rc's pci_cfg_space_size(), albeit leveraging our
>>> determination of device type; in particular some comments are taken
>>> verbatim from there.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>
>> Reviewed-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> 
> Thanks, but see below (as that may change your take on it).
> 
>>> ---
>>> Should we skip re-evaluation when pci_mmcfg_arch_enable() takes its early
>>> exit path?
>>
>> I don't have a strong opinion here, though I'm leaning toward it's OK as is.
> 
> Maybe I need to add more context here. Not short-circuiting means that for
> a brief moment ->ext_cfg for a device can be wrong - between
> pci_check_extcfg() clearing it and then setting it again once all checks
> have passed. As long as only Dom0 is executing at that time, and assuming
> Dom0 actually issues the notification ahead of itself playing with
> individual devices covered by it, all is going to be fine. With
> hyperlaunch, however, DomU-s can't be told "not to fiddle" with devices
> they've been assigned.
> 
> With the yet-to-be-written vPCI counterpart changes the window is actually
> going to get bigger for DomU-s using vPCI.
> 
> For hyperlaunch this is going to be interesting anyway, on systems like
> the one you mentioned. First, without Dom0 / hwdom, how would we even
> learn we can use MCFG? And even with hwdom, how would we keep DomU-s from
> accessing the devices they were passed until ->ext_cfg has obtained its
> final state for them (and vPCI reached proper state, too)?

Ah, I see. Thanks for the additional context.

First of all, to re-answer the original question, it still feels more of a
nice-to-have optimization than a necessity since we don't have hyperlaunch PCI
passthrough upstream yet. Of course, skipping re-evaluating ext_cfg would be a
welcome change if you're up for it. An alternative approach might be to
implement pci_check_extcfg() such that it only modifies ->ext_cfg if it needs to
be changed, but again, I don't have an issue with it as is.

With that said, what do you think if we took the stance that ->ext_cfg shouldn't
be re-evaluated for a pdev while it's assigned to a domU with vPCI? I.e. we
would return an error from the pci_mmcfg_reserved hypercall in this case.

If I understand things correctly, conceptually speaking, from a system
perspective, setting up mcfg is something that *should* be done at boot, not
ad-hoc during runtime. In the hyperlaunch model that I'm envisioning, there will
also be hardware/control domain separation, and we will want to limit the
hardware domain's ability to interfere with other domains. So I'd consider
disabling the mmcfg_reserved hypercall anyway in such a configuration. The
assumption with this model is that we would not need to rely on dom0 to enable mcfg
on the system/platform of choice.

Longer term, if we really think we need to support hyperlaunch while relying on
a dom0 to initialize mcfg, we could potentially delay assigning pdevs to
hyperlaunch domUs until ->ext_cfg has been initialized and is not expected to
change. This would imply implementing hotplug for PVH domUs (also needed for
"xl pci-attach" with PVH domUs). I wrote some patches in an internal branch to
expose an emulated bridge with pcie hotplug capability, laying some of the
groundwork to support this, and I'll plan to eventually send this work upstream.

In the scenario without a dom0, I don't have a good answer at the moment for how
Xen would learn that mcfg could be used.

Re: [PATCH v2 2/4] PCI: determine whether a device has extended config space

Posted by Jan Beulich 1 week, 5 days ago

On 27.01.2026 05:13, Stewart Hildebrand wrote:
> On 1/26/26 03:58, Jan Beulich wrote:
>> On 23.01.2026 23:24, Stewart Hildebrand wrote:
>>> On 1/19/26 09:46, Jan Beulich wrote:
>>>> Legacy PCI devices don't have any extended config space. Reading any part
>>>> thereof may return all ones or other arbitrary data, e.g. in some cases
>>>> base config space contents repeatedly.
>>>>
>>>> Logic follows Linux 6.19-rc's pci_cfg_space_size(), albeit leveraging our
>>>> determination of device type; in particular some comments are taken
>>>> verbatim from there.
>>>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>
>>> Reviewed-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>
>> Thanks, but see below (as that may change your take on it).
>>
>>>> ---
>>>> Should we skip re-evaluation when pci_mmcfg_arch_enable() takes its early
>>>> exit path?
>>>
>>> I don't have a strong opinion here, though I'm leaning toward it's OK as is.
>>
>> Maybe I need to add more context here. Not short-circuiting means that for
>> a brief moment ->ext_cfg for a device can be wrong - between
>> pci_check_extcfg() clearing it and then setting it again once all checks
>> have passed. As long as only Dom0 is executing at that time, and assuming
>> Dom0 actually issues the notification ahead of itself playing with
>> individual devices covered by it, all is going to be fine. With
>> hyperlaunch, however, DomU-s can't be told "not to fiddle" with devices
>> they've been assigned.
>>
>> With the yet-to-be-written vPCI counterpart changes the window is actually
>> going to get bigger for DomU-s using vPCI.
>>
>> For hyperlaunch this is going to be interesting anyway, on systems like
>> the one you mentioned. First, without Dom0 / hwdom, how would we even
>> learn we can use MCFG? And even with hwdom, how would we keep DomU-s from
>> accessing the devices they were passed until ->ext_cfg has obtained its
>> final state for them (and vPCI reached proper state, too)?
> Ah, I see. Thanks for the additional context.
> 
> First of all, to re-answer the original question, it still feels more of a
> nice-to-have optimization than a necessity since we don't have hyperlaunch PCI
> passthrough upstream yet.

My fear here is that an aspect like this one may easily be forgotten when
later doing the actual hyperlaunch work, or when finally making PCI properly
supported on Arm64 (where then dom0less would be equally affected, unless
Arm has found a way to avoid the dependency on Dom0's ACPI AML parsing).

> Of course, skipping re-evaluating ext_cfg would be a
> welcome change if you're up for it.

We can surely keep this as an incremental change to be made. I guess I want
to give Roger a chance to comment before deciding whether to commit the
patch here as-is.

> An alternative approach might be to
> implement pci_check_extcfg() such that it only modifies ->ext_cfg if it needs to
> be changed, but again, I don't have an issue with it as is.

That wouldn't help much imo, as there's then still a time window where what
the field says is wrong relative to what we already have accounted for in
our MCFG handling.

> With that said, what do you think if we took the stance that ->ext_cfg shouldn't
> be re-evaluated for a pdev while it's assigned to a domU with vPCI? I.e. we
> would return an error from the pci_mmcfg_reserved hypercall in this case.

I don't like this idea, as it's functionally limiting (if MCFG becomes
available only later) or functionally wrong (if, for whatever reason, MCFG
becomes unavailable later).

In no event would I consider returning an error from that hypercall. If
anything I'd see us ignore it.

> If I understand things correctly, conceptually speaking, from a system
> perspective, setting up mcfg is something that *should* be done at boot, not
> ad-hoc during runtime.

Yes, and that concept simply collides with hyperlaunch's plan to launch
more than just Dom0 right at boot. Dom0 booting is part of the system
booting, after all.

> In the hyperlaunch model that I'm envisioning, there will
> also be hardware/control domain separation, and we will want to limit the
> hardware domain's ability to interfere with other domains. So I'd consider
> disabling the mmcfg_reserved hypercall anyway in such a configuration. The
> assumption with this model is that we would not need to rely on dom0 to enable mcfg
> on the system/platform of choice.

But you need to work with the hardware you've got. For customized systems
it certainly is an option to arrange for firmware to suitably report what
Xen needs to be independent of Dom0. But for general purpose systems this
won't necessarily fly.

> Longer term, if we really think we need to support hyperlaunch while relying on
> a dom0 to initialize mcfg, we could potentially delay assigning pdevs to
> hyperlaunch domUs until ->ext_cfg has been initialized and is not expected to
> change. This would imply implementing hotplug for PVH domUs (also needed for
> "xl pci-attach" with PVH domUs). I wrote some patches in an internal branch to
> expose an emulated bridge with pcie hotplug capability, laying some of the
> groundwork to support this, and I'll plan to eventually send this work upstream.

Which isn't quite in line with what I understand to be one of hyperlaunch's goals
(to have all domains be statically configured, and hence be in final, usable shape
right when their booting completes).

Jan

[PATCH v2 1/4] PCI: handle PCI->PCIe bridges as well in alloc_pdev()
[PATCH v2 2/4] PCI: determine whether a device has extended config space
[PATCH v2 3/4] PCI: don't look for ext-caps when there's no extended cfg space
[PATCH v2 4/4] vPCI/DomU: really no ext-caps without extended config space