[PATCH 2/2] vfio/pci: Support multifunction atomic ops capability

Posted by Manojlo Pekovic 5 days, 23 hours ago
From: Manojlo Pekovic <mpekovic@amd.com>

Extend PCIe Atomic Ops completer support to multifunction vfio-pci
devices by calculating the common subset of atomic capabilities across
all functions.

For multifunction devices, we now:
1. Calculate the intersection of atomic capabilities across all functions
2. Only enable capabilities that ALL functions with atomic support share
3. Functions without atomic support are skipped (not penalized)

During vfio_realize(), other functions in a multifunction device may not
be realized yet, making it impossible to enumerate all functions. We
defer atomic ops configuration to machine_done time for multifunction
devices, ensuring all functions are realized before calculating common
capabilities.

Signed-off-by: Manojlo Pekovic <mpekovic@amd.com>
---
 hw/vfio/pci.c | 138 +++++++++++++++++++++++++++++++++++++++++++++++++-
 hw/vfio/pci.h |   1 +
 2 files changed, 137 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 6a6c8f1807..89f171f431 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1931,6 +1931,67 @@ static uint32_t vfio_get_atomic_cap(VFIOPCIDevice *vdev)
     return mask;
 }
 
+static void vfio_pci_machine_done(Notifier *notifier, void *data);
+
+/*
+ * This function iterates through all possible functions (0-7)
+ * at the same slot and returns the intersection of their capabilities.
+ *
+ * Returns: Bitmask of atomic capabilities supported by functions on slot
+ *          (PCI_EXP_DEVCAP2_ATOMIC_COMP32/64/128), or 0 if:
+ *          - Mixed device types found (vfio-pci + emulated)
+ *          - No common capabilities exist
+ */
+static uint32_t vfio_get_multifunction_atomic_cap(VFIOPCIDevice *vdev,
+                                                    PCIBus *bus)
+{
+    PCIDevice *func_dev;
+    VFIOPCIDevice *func_vdev;
+    int slot = PCI_SLOT(vdev->pdev.devfn);
+    int target_devfn;
+    uint32_t func_cap;
+    uint32_t common_mask = PCI_EXP_DEVCAP2_ATOMIC_COMP32 |
+                           PCI_EXP_DEVCAP2_ATOMIC_COMP64 |
+                           PCI_EXP_DEVCAP2_ATOMIC_COMP128;
+
+    /* Iterate through all possible functions at this slot */
+    for (int fn = 0; fn < PCI_FUNC_MAX; fn++) {
+        target_devfn = PCI_DEVFN(slot, fn);
+        func_dev = pci_find_device(bus, pci_bus_num(bus), target_devfn);
+
+        if (!func_dev) {
+            continue;
+        }
+
+        /*
+         * We can only calculate atomic caps for vfio-pci devices. If we
+         * encounter a mixed multifunction device (e.g., vfio-pci function 0
+         * + emulated e1000 function 1), we cannot safely determine the atomic
+         * capabilities. Taking the conservative approach: disable all atomic
+         * ops for mixed configurations.
+         */
+
+        func_vdev = (VFIOPCIDevice *)object_dynamic_cast(OBJECT(func_dev),
+                                                        TYPE_VFIO_PCI);
+        if (!func_vdev) {
+            return 0;
+        }
+
+        /*
+         * A function with no atomic completer support simply cannot
+         * receive and execute atomic operations.
+         * This does NOT mean other functions in the same multifunction
+         * device should be prevented from supporting atomics.
+         */
+        func_cap = vfio_get_atomic_cap(func_vdev);
+        if (func_cap != 0) {
+            common_mask &= func_cap;
+        }
+    }
+
+    return common_mask;
+}
+
 static void vfio_pci_enable_rp_atomics(VFIOPCIDevice *vdev)
 {
     PCIBus *bus = pci_get_bus(&vdev->pdev);
@@ -1948,8 +2009,7 @@ static void vfio_pci_enable_rp_atomics(VFIOPCIDevice *vdev)
     if (pci_bus_is_root(bus) || !parent || !parent->exp.exp_cap ||
         pcie_cap_get_type(parent) != PCI_EXP_TYPE_ROOT_PORT ||
         pcie_cap_get_version(parent) != PCI_EXP_FLAGS_VER2 ||
-        vdev->pdev.devfn ||
-        vdev->pdev.cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
+        vdev->pdev.devfn) {
         return;
     }
 
@@ -1962,6 +2022,25 @@ static void vfio_pci_enable_rp_atomics(VFIOPCIDevice *vdev)
         return;
     }
 
+    /*
+     * Single-function: All device info is available now, configure immediately
+     * Multifunction: Other functions may not be realized yet, defer to
+     *                machine_done time when all devices guaranteed to exist
+     */
+    if (vdev->pdev.cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
+        /*
+         * Register machine_done notifier if not already registered.
+         * The notifier will be called after all devices are realized, at which
+         * point we can safely enumerate all functions and calculate the common
+         * atomic capabilities.
+         */
+        if (!vdev->machine_done.notify) {
+            vdev->machine_done.notify = vfio_pci_machine_done;
+            qemu_add_machine_init_done_notifier(&vdev->machine_done);
+        }
+        return;  /* Actual work deferred to vfio_pci_machine_done() */
+    }
+
     mask = vfio_get_atomic_cap(vdev);
 
     if (!mask) {
@@ -1972,6 +2051,51 @@ static void vfio_pci_enable_rp_atomics(VFIOPCIDevice *vdev)
     vdev->clear_parent_atomics_on_exit = true;
 }
 
+/*
+ * Called at machine_done time, after ALL devices are realized
+ * but before the VM starts running.
+ *
+ * For multifunction devices, we need to calculate atomic capabilities
+ * across all functions. During realize of function 0, the other functions
+ * (1-7) may not exist yet. By deferring to machine_done time, we ensure
+ * all functions are realized and can be safely enumerated.
+ */
+static void vfio_pci_machine_done(Notifier *notifier, void *data)
+{
+    VFIOPCIDevice *vdev = container_of(notifier, VFIOPCIDevice, machine_done);
+    PCIBus *bus = pci_get_bus(&vdev->pdev);
+    PCIDevice *parent = pci_bridge_get_device(bus);
+    uint32_t mask;
+    uint8_t *pos;
+
+    /*
+     * Sanity checks: Should always pass since vfio_pci_enable_rp_atomics()
+     * already validated these conditions before registering this notifier.
+     */
+    if (!parent || !parent->exp.exp_cap) {
+        return;
+    }
+
+    pos = parent->config + parent->exp.exp_cap + PCI_EXP_DEVCAP2;
+
+    /* Abort if there's already an Atomic Ops configuration on the root port */
+
+    if (pci_get_long(pos) & (PCI_EXP_DEVCAP2_ATOMIC_COMP32 |
+                             PCI_EXP_DEVCAP2_ATOMIC_COMP64 |
+                             PCI_EXP_DEVCAP2_ATOMIC_COMP128)) {
+        return;
+    }
+
+    mask = vfio_get_multifunction_atomic_cap(vdev, bus);
+
+    if (!mask) {
+        return;
+    }
+
+    pci_long_test_and_set_mask(pos, mask);
+    vdev->clear_parent_atomics_on_exit = true;
+}
+
 static void vfio_pci_disable_rp_atomics(VFIOPCIDevice *vdev)
 {
     if (vdev->clear_parent_atomics_on_exit) {
@@ -3281,6 +3405,16 @@ static void vfio_exitfn(PCIDevice *pdev)
     VFIOPCIDevice *vdev = VFIO_PCI(pdev);
     VFIODevice *vbasedev = &vdev->vbasedev;
 
+    /*
+     * Remove machine_done notifier if it was registered.
+     * This happens for multifunction devices where function 0 registered
+     * a notifier to defer atomic ops configuration to machine_done time.
+     */
+    if (vdev->machine_done.notify) {
+        qemu_remove_machine_init_done_notifier(&vdev->machine_done);
+        vdev->machine_done.notify = NULL;
+    }
+
     vfio_unregister_req_notifier(vdev);
     vfio_unregister_err_notifier(vdev);
     pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 5ad090a229..9182ab0053 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -180,6 +180,7 @@ struct VFIOPCIDevice {
     bool skip_vsc_check;
     VFIODisplay *dpy;
     Notifier irqchip_change_notifier;
+    Notifier machine_done;
 };
 
 /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
-- 
2.43.0
Re: [PATCH 2/2] vfio/pci: Support multifunction atomic ops capability
Posted by Alex Williamson 5 days, 10 hours ago
On Tue, 3 Feb 2026 16:41:12 +0100
Manojlo Pekovic <manojlo.pekovic@amd.com> wrote:

> From: Manojlo Pekovic <mpekovic@amd.com>
> 
> Extend PCIe Atomic Ops completer support to multifunction vfio-pci
> devices by calculating the common subset of atomic capabilities across
> all functions.
> 
> For multifunction devices, we now:
> 1. Calculate the intersection of atomic capabilities across all functions
> 2. Only enable capabilities that ALL functions with atomic support share
> 3. Functions without atomic support are skipped (not penalized)
> 
> During vfio_realize(), other functions in a multifunction device may not
> be realized yet, making it impossible to enumerate all functions. We
> defer atomic ops configuration to machine_done time for multifunction
> devices, ensuring all functions are realized before calculating common
> capabilities.

Exactly what are we trying to enable with this support?  Why do we need
multifunction AtomicOps support versus multiple single function
assignments under separate root ports?
 
> Signed-off-by: Manojlo Pekovic <mpekovic@amd.com>
> ---
>  hw/vfio/pci.c | 138 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  hw/vfio/pci.h |   1 +
>  2 files changed, 137 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 6a6c8f1807..89f171f431 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1931,6 +1931,67 @@ static uint32_t vfio_get_atomic_cap(VFIOPCIDevice *vdev)
>      return mask;
>  }
>  
> +static void vfio_pci_machine_done(Notifier *notifier, void *data);

Why aren't these ordered to avoid the forward declaration?

> +
> +/*
> + * This function iterates through all possible functions (0-7)
> + * at the same slot and returns the intersection of their capabilities.
> + *
> + * Returns: Bitmask of atomic capabilities supported by functions on slot
> + *          (PCI_EXP_DEVCAP2_ATOMIC_COMP32/64/128), or 0 if:
> + *          - Mixed device types found (vfio-pci + emulated)
> + *          - No common capabilities exist
> + */
> +static uint32_t vfio_get_multifunction_atomic_cap(VFIOPCIDevice *vdev,
> +                                                    PCIBus *bus)
> +{
> +    PCIDevice *func_dev;
> +    VFIOPCIDevice *func_vdev;
> +    int slot = PCI_SLOT(vdev->pdev.devfn);
> +    int target_devfn;
> +    uint32_t func_cap;
> +    uint32_t common_mask = PCI_EXP_DEVCAP2_ATOMIC_COMP32 |
> +                           PCI_EXP_DEVCAP2_ATOMIC_COMP64 |
> +                           PCI_EXP_DEVCAP2_ATOMIC_COMP128;
> +
> +    /* Iterate through all possible functions at this slot */
> +    for (int fn = 0; fn < PCI_FUNC_MAX; fn++) {
> +        target_devfn = PCI_DEVFN(slot, fn);
> +        func_dev = pci_find_device(bus, pci_bus_num(bus), target_devfn);
> +
> +        if (!func_dev) {
> +            continue;
> +        }
> +
> +        /*
> +         * We can only calculate atomic caps for vfio-pci devices. If we
> +         * encounter a mixed multifunction device (e.g., vfio-pci function 0
> +         * + emulated e1000 function 1), we cannot safely determine the atomic
> +         * capabilities. Taking the conservative approach: disable all atomic
> +         * ops for mixed configurations.
> +         */
> +
> +        func_vdev = (VFIOPCIDevice *)object_dynamic_cast(OBJECT(func_dev),
> +                                                        TYPE_VFIO_PCI);
> +        if (!func_vdev) {
> +            return 0;
> +        }
> +
> +        /*
> +         * A function with no atomic completer support simply cannot
> +         * receive and execute atomic operations.
> +         * This does NOT mean other functions in the same multifunction
> +         * device should be prevented from supporting atomics.
> +         */
> +        func_cap = vfio_get_atomic_cap(func_vdev);
> +        if (func_cap != 0) {
> +            common_mask &= func_cap;
> +        }

It's not clear to me that this is a valid interpretation:

PCIe 7.0, 6.15.2:

	If any Function in a Multi-Function Device supports AtomicOp
	Completer or AtomicOp routing capability, all Functions with
	Memory Space BARs in that device must decode properly formed
	AtomicOp Requests and handle any they don’t support as an
	Unsupported Request (UR). Note that in such devices, Functions
	lacking AtomicOp Completer capability are forbidden to handle
	properly formed AtomicOp Requests as Malformed TLPs.

If we're adding an arbitrary device that may not support AtomicOps into
a virtual multifunction package, how can we know that a device that
doesn't have an AtomicOps cap wouldn't respond with a Malformed TLP
error?

> +    }
> +
> +    return common_mask;
> +}
> +
>  static void vfio_pci_enable_rp_atomics(VFIOPCIDevice *vdev)
>  {
>      PCIBus *bus = pci_get_bus(&vdev->pdev);
> @@ -1948,8 +2009,7 @@ static void vfio_pci_enable_rp_atomics(VFIOPCIDevice *vdev)
>      if (pci_bus_is_root(bus) || !parent || !parent->exp.exp_cap ||
>          pcie_cap_get_type(parent) != PCI_EXP_TYPE_ROOT_PORT ||
>          pcie_cap_get_version(parent) != PCI_EXP_FLAGS_VER2 ||
> -        vdev->pdev.devfn ||
> -        vdev->pdev.cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
> +        vdev->pdev.devfn) {
>          return;
>      }
>  
> @@ -1962,6 +2022,25 @@ static void vfio_pci_enable_rp_atomics(VFIOPCIDevice *vdev)
>          return;
>      }
>  
> +    /*
> +     * Single-function: All device info is available now, configure immediately
> +     * Multifunction: Other functions may not be realized yet, defer to
> +     *                machine_done time when all devices guaranteed to exist
> +     */

This also means that multifunction devices can only support cold-plug
whereas single function devices support AtomicOps with cold-plug and
hot-plug.  The cold-plug device can be unplugged and re-added and it
will have different behavior.

Additionally, it seems we only enable AtomicOps if function 0 supports
AtomicOps, what's the basis for that decision?  Based on the previous
comment, the more conservative approach might be that all functions
need to support AtomicOps, but...

Maybe what we really want is for each function to attempt to initialize
the AtomicOps bits on the root port.  The last realized function should
get it correct regardless of the order the functions are realized,
though we'll need to track our 'Abort if already configured' strategy
differently.

I tossed out this machine_done notifier as a possibility, but I'd
rather avoid it.
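[A rough standalone sketch of that per-function approach, for illustration only: each function, as it is realized, seeds or narrows the root port's completer mask, so the result is order-independent. The struct, helper name, and the "we configured it" flag are hypothetical; only the bit values mirror PCI_EXP_DEVCAP2_ATOMIC_COMP{32,64,128}.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Completer bits, matching PCI_EXP_DEVCAP2_ATOMIC_COMP{32,64,128} */
#define ATOMIC_COMP32  0x0080u
#define ATOMIC_COMP64  0x0100u
#define ATOMIC_COMP128 0x0200u

typedef struct {
    uint32_t mask;       /* current root port completer mask */
    bool we_configured;  /* did we set it (vs. pre-existing config)? */
} RootPortAtomics;

/* Called once per realized function with that function's completer caps */
static inline void rp_narrow_atomics(RootPortAtomics *rp, uint32_t func_caps)
{
    if (!rp->we_configured) {
        /* First realized function seeds the mask */
        rp->mask = func_caps;
        rp->we_configured = true;
    } else {
        /* Later functions can only narrow it to the intersection */
        rp->mask &= func_caps;
    }
}
```

[The last realized function then leaves the intersection in place regardless of realization order, which is the property described above.]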


> +    if (vdev->pdev.cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
> +        /*
> +         * Register machine_done notifier if not already registered.
> +         * The notifier will be called after all devices are realized, at which
> +         * point we can safely enumerate all functions and calculate the common
> +         * atomic capabilities.
> +         */
> +        if (!vdev->machine_done.notify) {
> +            vdev->machine_done.notify = vfio_pci_machine_done;
> +            qemu_add_machine_init_done_notifier(&vdev->machine_done);
> +        }
> +        return;  /* Actual work deferred to vfio_pci_machine_done() */
> +    }
> +
>      mask = vfio_get_atomic_cap(vdev);
>  
>      if (!mask) {
> @@ -1972,6 +2051,51 @@ static void vfio_pci_enable_rp_atomics(VFIOPCIDevice *vdev)
>      vdev->clear_parent_atomics_on_exit = true;
>  }
>  
> +/*
> + * Called at machine_done time, after ALL devices are realized
> + * but before the VM starts running.
> + *
> + * For multifunction devices, we need to calculate atomic capabilities
> + * across all functions. During realize of function 0, the other functions
> + * (1-7) may not exist yet. By deferring to machine_done time, we ensure
> + * all functions are realized and can be safely enumerated.
> + */
> +static void vfio_pci_machine_done(Notifier *notifier, void *data)
> +{
> +    VFIOPCIDevice *vdev = container_of(notifier, VFIOPCIDevice, machine_done);
> +    PCIBus *bus = pci_get_bus(&vdev->pdev);
> +    PCIDevice *parent = pci_bridge_get_device(bus);
> +    uint32_t mask;
> +    uint8_t *pos;
> +
> +    /*
> +     * Sanity checks: Should always pass since vfio_pci_enable_rp_atomics()
> +     * already validated these conditions before registering this notifier.
> +     */
> +    if (!parent || !parent->exp.exp_cap) {
> +        return;
> +    }
> +
> +    pos = parent->config + parent->exp.exp_cap + PCI_EXP_DEVCAP2;
> +
> +    /* Abort if there'a already an Atomic Ops configuration on the root port */
> +
> +    if (pci_get_long(pos) & (PCI_EXP_DEVCAP2_ATOMIC_COMP32 |
> +                             PCI_EXP_DEVCAP2_ATOMIC_COMP64 |
> +                             PCI_EXP_DEVCAP2_ATOMIC_COMP128)) {
> +        return;
> +    }
> +
> +    mask = vfio_get_multifunction_atomic_cap(vdev, bus);
> +
> +    if (!mask) {
> +        return;
> +    }
> +
> +    pci_long_test_and_set_mask(pos, mask);
> +    vdev->clear_parent_atomics_on_exit = true;
> +}

Duplication relative to vfio_pci_enable_rp_atomics() should be
refactored if we keep it.  Thanks,

Alex

> +
>  static void vfio_pci_disable_rp_atomics(VFIOPCIDevice *vdev)
>  {
>      if (vdev->clear_parent_atomics_on_exit) {
> @@ -3281,6 +3405,16 @@ static void vfio_exitfn(PCIDevice *pdev)
>      VFIOPCIDevice *vdev = VFIO_PCI(pdev);
>      VFIODevice *vbasedev = &vdev->vbasedev;
>  
> +    /*
> +     * Remove machine_done notifier if it was registered.
> +     * This happens for multifunction devices where function 0 registered
> +     * a notifier to defer atomic ops configuration to machine_done time.
> +     */
> +    if (vdev->machine_done.notify) {
> +        qemu_remove_machine_init_done_notifier(&vdev->machine_done);
> +        vdev->machine_done.notify = NULL;
> +    }
> +
>      vfio_unregister_req_notifier(vdev);
>      vfio_unregister_err_notifier(vdev);
>      pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 5ad090a229..9182ab0053 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -180,6 +180,7 @@ struct VFIOPCIDevice {
>      bool skip_vsc_check;
>      VFIODisplay *dpy;
>      Notifier irqchip_change_notifier;
> +    Notifier machine_done;
>  };
>  
>  /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
RE: [PATCH 2/2] vfio/pci: Support multifunction atomic ops capability
Posted by Pekovic, Manojlo 4 days, 20 hours ago

Hi Alex,
before I submit a new patch based on Cedric's email, I want to clarify a few points.

On Tue, 3 Feb 2026 16:41:12 +0100
Manojlo Pekovic <manojlo.pekovic@amd.com> wrote:

> From: Manojlo Pekovic <mpekovic@amd.com>
>
> Extend PCIe Atomic Ops completer support to multifunction vfio-pci
> devices by calculating the common subset of atomic capabilities across
> all functions.
>
> For multifunction devices, we now:
> 1. Calculate the intersection of atomic capabilities across all
> functions 2. Only enable capabilities that ALL functions with atomic
> support share 3. Functions without atomic support are skipped (not
> penalized)
>
> During vfio_realize(), other functions in a multifunction device may
> not be realized yet, making it impossible to enumerate all functions.
> We defer atomic ops configuration to machine_done time for
> multifunction devices, ensuring all functions are realized before
> calculating common capabilities.

Exactly what are we trying to enable with this support?  Why do we need multifunction AtomicOps support versus multiple single function assignments under separate root ports?

++ While systems do allow separate assignments, we have encountered more than one case where the VM's configuration mimicked the bare-metal topology.
++ Current workarounds were either hacking/forcing atomic ops on the root port, or changing the VM's topology (which then no longer matches bare metal).
++ Right now, I cannot confirm that either workaround has caused any faults; however, from a UX standpoint, the user shouldn't have to think about this at all.
++ I understand your concerns, and I can either
++ - Defer this patch and investigate further
++ - Proceed with it for completeness, given its low risk
++ What's your preference?

> Signed-off-by: Manojlo Pekovic <mpekovic@amd.com>
> ---
>  hw/vfio/pci.c | 138 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  hw/vfio/pci.h |   1 +
>  2 files changed, 137 insertions(+), 2 deletions(-)
>
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index
> 6a6c8f1807..89f171f431 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1931,6 +1931,67 @@ static uint32_t vfio_get_atomic_cap(VFIOPCIDevice *vdev)
>      return mask;
>  }
>
> +static void vfio_pci_machine_done(Notifier *notifier, void *data);

Why aren't these ordered to avoid the forward declaration?

> +
> +/*
> + * This function iterates through all possible functions (0-7)
> + * at the same slot and returns the intersection of their capabilities.
> + *
> + * Returns: Bitmask of atomic capabilities supported by functions on slot
> + *          (PCI_EXP_DEVCAP2_ATOMIC_COMP32/64/128), or 0 if:
> + *          - Mixed device types found (vfio-pci + emulated)
> + *          - No common capabilities exist
> + */
> +static uint32_t vfio_get_multifunction_atomic_cap(VFIOPCIDevice *vdev,
> +                                                    PCIBus *bus) {
> +    PCIDevice *func_dev;
> +    VFIOPCIDevice *func_vdev;
> +    int slot = PCI_SLOT(vdev->pdev.devfn);
> +    int target_devfn;
> +    uint32_t func_cap;
> +    uint32_t common_mask = PCI_EXP_DEVCAP2_ATOMIC_COMP32 |
> +                           PCI_EXP_DEVCAP2_ATOMIC_COMP64 |
> +                           PCI_EXP_DEVCAP2_ATOMIC_COMP128;
> +
> +    /* Iterate through all possible functions at this slot */
> +    for (int fn = 0; fn < PCI_FUNC_MAX; fn++) {
> +        target_devfn = PCI_DEVFN(slot, fn);
> +        func_dev = pci_find_device(bus, pci_bus_num(bus),
> + target_devfn);
> +
> +        if (!func_dev) {
> +            continue;
> +        }
> +
> +        /*
> +         * We can only calculate atomic caps for vfio-pci devices. If we
> +         * encounter a mixed multifunction device (e.g., vfio-pci function 0
> +         * + emulated e1000 function 1), we cannot safely determine the atomic
> +         * capabilities. Taking the conservative approach: disable all atomic
> +         * ops for mixed configurations.
> +         */
> +
> +        func_vdev = (VFIOPCIDevice *)object_dynamic_cast(OBJECT(func_dev),
> +                                                        TYPE_VFIO_PCI);
> +        if (!func_vdev) {
> +            return 0;
> +        }
> +
> +        /*
> +         * A function with no atomic completer support simply cannot
> +         * receive and execute atomic operations.
> +         * This does NOT mean other functions in the same multifunction
> +         * device should be prevented from supporting atomics.
> +         */
> +        func_cap = vfio_get_atomic_cap(func_vdev);
> +        if (func_cap != 0) {
> +            common_mask &= func_cap;
> +        }

It's not clear to me that this is a valid interpretation:

PCIe 7.0, 6.15.2:

        If any Function in a Multi-Function Device supports AtomicOp
        Completer or AtomicOp routing capability, all Functions with
        Memory Space BARs in that device must decode properly formed
        AtomicOp Requests and handle any they don’t support as an
        Unsupported Request (UR). Note that in such devices, Functions
        lacking AtomicOp Completer capability are forbidden to handle
        properly formed AtomicOp Requests as Malformed TLPs.

If we're adding an arbitrary device that may not support AtomicOps into a virtual multifunction package, how can we know that a device that doesn't have an AtomicOps cap wouldn't respond with a Malformed TLP error?

++ My mistake in wording the comment. The point is not that a completer cannot receive operations, but that the true state of the hardware should be passed through, so functionality is the same on both VM and bare metal.
++ If a device that doesn't support AtomicOps is passed through, it will expose the same capabilities as the actual hardware, matching bare-metal behavior (including the same faults if the hardware cannot respond with UR).
++ Because of the VFIOPCIDevice check done earlier, if a non-vfio device is present we simply return.

> +    }
> +
> +    return common_mask;
> +}
> +
>  static void vfio_pci_enable_rp_atomics(VFIOPCIDevice *vdev)  {
>      PCIBus *bus = pci_get_bus(&vdev->pdev); @@ -1948,8 +2009,7 @@
> static void vfio_pci_enable_rp_atomics(VFIOPCIDevice *vdev)
>      if (pci_bus_is_root(bus) || !parent || !parent->exp.exp_cap ||
>          pcie_cap_get_type(parent) != PCI_EXP_TYPE_ROOT_PORT ||
>          pcie_cap_get_version(parent) != PCI_EXP_FLAGS_VER2 ||
> -        vdev->pdev.devfn ||
> -        vdev->pdev.cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
> +        vdev->pdev.devfn) {
>          return;
>      }
>
> @@ -1962,6 +2022,25 @@ static void vfio_pci_enable_rp_atomics(VFIOPCIDevice *vdev)
>          return;
>      }
>
> +    /*
> +     * Single-function: All device info is available now, configure immediately
> +     * Multifunction: Other functions may not be realized yet, defer to
> +     *                machine_done time when all devices guaranteed to exist
> +     */

This also means that multifunction devices can only support cold-plug whereas single function devices support AtomicOps with cold-plug and hot-plug.  The cold-plug device can be unplugged and re-added and it will have different behavior.

Additionally, it seems we only enable AtomicOps if function 0 supports AtomicOps, what's the basis for that decision?  Based on the previous comment, the more conservative approach might be that all functions need to support AtomicOps, but...

Maybe what we really want is for each function to attempt to initialize the AtomicOps bits on the root port.  The last realized function should get it correct regardless of the order the functions are realized, though we'll need to track our 'Abort if already configured' strategy differently.

I tossed out this machine_done notifier as a possibility, but I'd rather avoid it.

++ The machine_done notifier was something I added hesitantly, trying to change as little of the current code's behavior as possible.
++ If it's appropriate, I can try changing the 'Abort if already configured' logic to support this approach.
++ What is your suggestion on this?
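[One possible shape for the changed 'Abort if already configured' logic, as a standalone illustrative sketch rather than code against the QEMU tree: both call sites funnel through one helper that refuses to touch an already-configured DEVCAP2 word and returns the clear-on-exit bookkeeping. The helper name and the bare `uint32_t *` stand-in for the config-space accessors are hypothetical.]

```c
#include <stdbool.h>
#include <stdint.h>

/* PCI_EXP_DEVCAP2_ATOMIC_COMP32 | COMP64 | COMP128 */
#define ATOMIC_COMP_ALL 0x0380u

/*
 * Apply 'mask' to the root port's DEVCAP2 word unless completer bits
 * are already present.  The return value carries the
 * clear_parent_atomics_on_exit bookkeeping both call sites need.
 */
static inline bool rp_apply_atomic_mask(uint32_t *devcap2, uint32_t mask)
{
    if (!mask || (*devcap2 & ATOMIC_COMP_ALL)) {
        return false;   /* nothing to set, or already configured */
    }
    *devcap2 |= mask;
    return true;
}
```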


> +    if (vdev->pdev.cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
> +        /*
> +         * Register machine_done notifier if not already registered.
> +         * The notifier will be called after all devices are realized, at which
> +         * point we can safely enumerate all functions and calculate the common
> +         * atomic capabilities.
> +         */
> +        if (!vdev->machine_done.notify) {
> +            vdev->machine_done.notify = vfio_pci_machine_done;
> +            qemu_add_machine_init_done_notifier(&vdev->machine_done);
> +        }
> +        return;  /* Actual work deferred to vfio_pci_machine_done() */
> +    }
> +
>      mask = vfio_get_atomic_cap(vdev);
>
>      if (!mask) {
> @@ -1972,6 +2051,51 @@ static void vfio_pci_enable_rp_atomics(VFIOPCIDevice *vdev)
>      vdev->clear_parent_atomics_on_exit = true;  }
>
> +/*
> + * Called at machine_done time, after ALL devices are realized
> + * but before the VM starts running.
> + *
> + * For multifunction devices, we need to calculate atomic
> +capabilities
> + * across all functions. During realize of function 0, the other
> +functions
> + * (1-7) may not exist yet. By deferring to machine_done time, we
> +ensure
> + * all functions are realized and can be safely enumerated.
> + */
> +static void vfio_pci_machine_done(Notifier *notifier, void *data) {
> +    VFIOPCIDevice *vdev = container_of(notifier, VFIOPCIDevice, machine_done);
> +    PCIBus *bus = pci_get_bus(&vdev->pdev);
> +    PCIDevice *parent = pci_bridge_get_device(bus);
> +    uint32_t mask;
> +    uint8_t *pos;
> +
> +    /*
> +     * Sanity checks: Should always pass since vfio_pci_enable_rp_atomics()
> +     * already validated these conditions before registering this notifier.
> +     */
> +    if (!parent || !parent->exp.exp_cap) {
> +        return;
> +    }
> +
> +    pos = parent->config + parent->exp.exp_cap + PCI_EXP_DEVCAP2;
> +
> +    /* Abort if there'a already an Atomic Ops configuration on the
> + root port */
> +
> +    if (pci_get_long(pos) & (PCI_EXP_DEVCAP2_ATOMIC_COMP32 |
> +                             PCI_EXP_DEVCAP2_ATOMIC_COMP64 |
> +                             PCI_EXP_DEVCAP2_ATOMIC_COMP128)) {
> +        return;
> +    }
> +
> +    mask = vfio_get_multifunction_atomic_cap(vdev, bus);
> +
> +    if (!mask) {
> +        return;
> +    }
> +
> +    pci_long_test_and_set_mask(pos, mask);
> +    vdev->clear_parent_atomics_on_exit = true; }

Duplication relative to vfio_pci_enable_rp_atomics() should be refactored if we keep it.  Thanks,

Alex

> +
>  static void vfio_pci_disable_rp_atomics(VFIOPCIDevice *vdev)  {
>      if (vdev->clear_parent_atomics_on_exit) { @@ -3281,6 +3405,16 @@
> static void vfio_exitfn(PCIDevice *pdev)
>      VFIOPCIDevice *vdev = VFIO_PCI(pdev);
>      VFIODevice *vbasedev = &vdev->vbasedev;
>
> +    /*
> +     * Remove machine_done notifier if it was registered.
> +     * This happens for multifunction devices where function 0 registered
> +     * a notifier to defer atomic ops configuration to machine_done time.
> +     */
> +    if (vdev->machine_done.notify) {
> +        qemu_remove_machine_init_done_notifier(&vdev->machine_done);
> +        vdev->machine_done.notify = NULL;
> +    }
> +
>      vfio_unregister_req_notifier(vdev);
>      vfio_unregister_err_notifier(vdev);
>      pci_device_set_intx_routing_notifier(&vdev->pdev, NULL); diff
> --git a/hw/vfio/pci.h b/hw/vfio/pci.h index 5ad090a229..9182ab0053
> 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -180,6 +180,7 @@ struct VFIOPCIDevice {
>      bool skip_vsc_check;
>      VFIODisplay *dpy;
>      Notifier irqchip_change_notifier;
> +    Notifier machine_done;
>  };
>
>  /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot
> match hw */


Thanks,
Manojlo
Re: [PATCH 2/2] vfio/pci: Support multifunction atomic ops capability
Posted by Alex Williamson 4 days, 11 hours ago
On Wed, Feb 4, 2026, at 11:18 AM, Pekovic, Manojlo wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Hi Alex,
> before I submit a new patch according to Cedric's email, I want to make 
> some clarifications
>
> On Tue, 3 Feb 2026 16:41:12 +0100
> Manojlo Pekovic <manojlo.pekovic@amd.com> wrote:
>
>> From: Manojlo Pekovic <mpekovic@amd.com>
>>
>> Extend PCIe Atomic Ops completer support to multifunction vfio-pci
>> devices by calculating the common subset of atomic capabilities across
>> all functions.
>>
>> For multifunction devices, we now:
>> 1. Calculate the intersection of atomic capabilities across all
>> functions 2. Only enable capabilities that ALL functions with atomic
>> support share 3. Functions without atomic support are skipped (not
>> penalized)
>>
>> During vfio_realize(), other functions in a multifunction device may
>> not be realized yet, making it impossible to enumerate all functions.
>> We defer atomic ops configuration to machine_done time for
>> multifunction devices, ensuring all functions are realized before
>> calculating common capabilities.
>
> Exactly what are we trying to enable with this support?  Why do we need 
> multifunction AtomicOps support versus multiple single function 
> assignments under separate root ports?
>
> ++ While systems do allow separate assignments, we have come across 
> more than one case where the VM's configuration mimicked the bare 
> metal topology.
> ++ The current workarounds were either forcibly hacking atomic ops 
> onto the root port, or changing the VM's topology (which then no 
> longer matches bare metal)
> ++ Right now, I cannot confirm that doing either has caused any 
> faults; however, from a UX standpoint, the user shouldn't have to 
> think about it
> ++ I understand your concerns, and I can either
> ++ - Defer this patch and investigate further
> ++ - Proceed with it for completeness, given its low risk
> ++ What's your preference?

Atomic ops routing is complicated though.  The current automatic setup focuses on a specific use case.  The vfio interface *only* reports atomic ops support relative to the root bus, so we can effectively only enable host->device or device->host atomic ops.  Automatic configuration is therefore limited to a single function downstream of a root port, without atomic ops routing enabled.

It's been a while since I implemented this and I should have left better breadcrumbs about the single function restriction, but I think the issue was not simply determining the common set of capabilities between the devices, but the implications for device->device atomic ops.

For example, imagine that we want to create a virtual topology that enables atomic ops between two single function devices, each downstream of a separate root port.  Before we can set the atomic ops routing flag on the root ports we need to evaluate whether the host system has atomic ops routing on every physical interconnect between the devices.  The current vfio interface doesn't tell us this.

Now, instead of a virtual topology placing those devices under separate root ports with explicit routing flags, what if we lump them together into a multifunction device?  Do these functions implicitly support device->device atomic ops?  Such paths are typically implementation details beyond the scope of the spec.

Therefore, it's not really clear what gremlins we're unleashing in this "low risk" extension.

Back to the point: atomic ops routing is complicated, and QEMU currently kicks anything beyond the trivial case back to the VM administrator.  If the VM administrator doesn't want to think about it (analyzing the host topology, creating a compatible VM topology, and manually setting the appropriate atomic ops bits), then the burden probably needs to move to VM builders and management tools rather than be pushed down into QEMU.  QEMU doesn't have the visibility to determine host routing and is forced to work with the topology that's been specified.
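For reference, the capability-intersection step under discussion below can be summarized as a standalone sketch.  The constant names here are stand-ins mirroring PCI_EXP_DEVCAP2_ATOMIC_COMP32/64/128, and this is not the actual patch code:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in capability bits (mirroring PCI_EXP_DEVCAP2_ATOMIC_COMP32/64/128) */
#define ATOMIC_COMP32  (1u << 0)
#define ATOMIC_COMP64  (1u << 1)
#define ATOMIC_COMP128 (1u << 2)

/*
 * Intersect atomic completer capabilities across functions of a slot.
 * A function reporting 0 (no atomic completer support) is skipped
 * rather than zeroing the common mask -- this is the interpretation
 * the patch takes, and the point Alex questions below.
 */
uint32_t common_atomic_caps(const uint32_t *func_caps, int nfuncs)
{
    uint32_t common = ATOMIC_COMP32 | ATOMIC_COMP64 | ATOMIC_COMP128;

    for (int i = 0; i < nfuncs; i++) {
        if (func_caps[i] != 0) {
            common &= func_caps[i];
        }
    }
    return common;
}
```

Note that with this skip rule, a slot where only one function supports atomics gets that function's full mask, rather than 0.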

>> +        /*
>> +         * A function with no atomic completer support simply cannot
>> +         * receive and execute atomic operations.
>> +         * This does NOT mean other functions in the same multifunction
>> +         * device should be prevented from supporting atomics.
>> +         */
>> +        func_cap = vfio_get_atomic_cap(func_vdev);
>> +        if (func_cap != 0) {
>> +            common_mask &= func_cap;
>> +        }
>
> It's not clear to me that this is a valid interpretation:
>
> PCIe 7.0, 6.15.2:
>
>         If any Function in a Multi-Function Device supports AtomicOp
>         Completer or AtomicOp routing capability, all Functions with
>         Memory Space BARs in that device must decode properly formed
>         AtomicOp Requests and handle any they don’t support as an
>         Unsupported Request (UR). Note that in such devices, Functions
>         lacking AtomicOp Completer capability are forbidden to handle
>         properly formed AtomicOp Requests as Malformed TLPs.
>
> If we're adding an arbitrary device that may not support AtomicOps into 
> a virtual multifunction package, how can we know that a device that 
> doesn't have an AtomicOps cap wouldn't respond with a Malformed TLP 
> error?
>
> ++ My mistake in wording the comment. The point is not that a 
> completer cannot receive operations, but rather that the true state 
> of the hardware should be passed through, giving the VM the same 
> functionality as bare metal

But we cannot enforce that in QEMU.

> ++ If a device that doesn't support AtomicOps is passed through, it 
> will have the same capabilities as the actual hardware, with the same 
> properties as bare metal (and the same faults if the hardware cannot 
> respond with UR)

Only if the function is part of the same multifunction device on bare metal.  QEMU cannot enforce that.

> ++ Because of the VFIOPCIDevice check we do in the preceding lines, 
> if a non-vfio device is present we will simply return

We can compose multifunction devices in QEMU from separate multi- and single-function devices on bare metal.  The fact that they are VFIOPCIDevices tells us nothing about the physical association.

>> +    }
>> +
>> +    return common_mask;
>> +}
>> +
>>  static void vfio_pci_enable_rp_atomics(VFIOPCIDevice *vdev)
>>  {
>>      PCIBus *bus = pci_get_bus(&vdev->pdev);
>> @@ -1948,8 +2009,7 @@ static void vfio_pci_enable_rp_atomics(VFIOPCIDevice *vdev)
>>      if (pci_bus_is_root(bus) || !parent || !parent->exp.exp_cap ||
>>          pcie_cap_get_type(parent) != PCI_EXP_TYPE_ROOT_PORT ||
>>          pcie_cap_get_version(parent) != PCI_EXP_FLAGS_VER2 ||
>> -        vdev->pdev.devfn ||
>> -        vdev->pdev.cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
>> +        vdev->pdev.devfn) {
>>          return;
>>      }
>>
>> @@ -1962,6 +2022,25 @@ static void vfio_pci_enable_rp_atomics(VFIOPCIDevice *vdev)
>>          return;
>>      }
>>
>> +    /*
>> +     * Single-function: All device info is available now, configure immediately
>> +     * Multifunction: Other functions may not be realized yet, defer to
>> +     *                machine_done time when all devices guaranteed to exist
>> +     */
>
> This also means that multifunction devices can only support cold-plug 
> whereas single function devices support AtomicOps with cold-plug and 
> hot-plug.  The cold-plug device can be unplugged and re-added and it 
> will have different behavior.
>
> Additionally, it seems we only enable AtomicOps if function 0 supports 
> AtomicOps, what's the basis for that decision?  Based on the previous 
> comment, the more conservative approach might be that all functions 
> need to support AtomicOps, but...
>
> Maybe what we really want is for each function to attempt to initialize 
> the AtomicOps bits on the root port.  The last realized function should 
> get it correct regardless of the order the functions are realized, 
> though we'll need to track our 'Abort if already configured' strategy 
> differently.
>
> I tossed out this machine_done notifier as a possibility, but I'd 
> rather avoid it.
>
> ++ Machine_done was something I did hesitantly, trying to change as 
> little of the current code's functionality as possible.
> ++ If it's appropriate, I can try changing the 'Abort if already 
> configured' logic to support this approach
> ++ What is your suggestion on this?
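For illustration, the per-function approach quoted above, where each realized function updates the root port so the last one leaves a correct mask regardless of realize order, could be sketched roughly as follows.  The type and helper names here are hypothetical, not actual QEMU code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a root port's enabled atomic ops state. */
typedef struct {
    uint32_t atomic_mask;
    int configured;   /* replaces a one-shot 'abort if already configured' flag */
} RootPort;

/*
 * Called by each function as it is realized.  Instead of aborting when
 * the root port is already configured, re-intersect with the new
 * function's capabilities, so the last realized function leaves a
 * correct common mask regardless of realize order.
 */
void rp_update_atomics(RootPort *rp, uint32_t func_cap)
{
    if (func_cap == 0) {
        return;                       /* no atomic support: skip */
    }
    if (!rp->configured) {
        rp->atomic_mask = func_cap;   /* first contributing function */
        rp->configured = 1;
    } else {
        rp->atomic_mask &= func_cap;  /* later functions narrow the mask */
    }
}
```

This sketch sidesteps the machine_done notifier and the cold-plug-only limitation, at the cost of the root port mask being transiently wider until the last function is realized.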

I'm not convinced it's QEMU's job, or that QEMU is even capable of serving the intended goal here.  Thanks,

Alex
RE: [PATCH 2/2] vfio/pci: Support multifunction atomic ops capability
Posted by Pekovic, Manojlo 51 minutes ago

Hi Alex,

I appreciate you taking the time to explain the atomic ops routing complexity in detail.
I now have a much better understanding of the stakes of this feature.
Given the constraints you mentioned, I'll withdraw the patch and continue handling this at the VM management layer

Thanks,
Manojlo Pekovic
Software Development Engineer 2
Cloud Software Team



-----Original Message-----
From: Alex Williamson <alex@shazbot.org>
Sent: Thursday, February 5, 2026 4:49 AM
To: Pekovic, Manojlo <Manojlo.Pekovic@amd.com>
Cc: qemu-devel@nongnu.org; Cédric Le Goater <clg@redhat.com>; Prica, Nikola <Nikola.Prica@amd.com>; Andjelkovic, Dejan <Dejan.Andjelkovic@amd.com>
Subject: Re: [PATCH 2/2] vfio/pci: Support multifunction atomic ops capability
