[PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration

Posted by Zhenzhong Duan 5 months ago
If a VFIO device in the guest switches from an IOMMU domain to a block
domain, vtd_address_space_unmap() is called to unmap the whole address space.

If that happens during migration, migration fails with the legacy VFIO
backend as below:

Status: failed (vfio_container_dma_unmap(0x561bbbd92d90, 0x100000000000, 0x100000000000) = -7 (Argument list too long))

Because legacy VFIO limits the maximum dirty bitmap size to 256MB, which
covers 8TB on a 4K-page system, the unmap_bitmap ioctl fails when a
16TB-sized UNMAP notification is sent.
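
(To spell out the numbers: a 256MB bitmap holds 2^31 bits, and at one bit per
4K page that covers 2^31 * 2^12 = 2^43 bytes = 8TB. A 16TB unmap would need a
512MB bitmap, twice the limit, hence the -7/E2BIG above.)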

There is no such limitation with the iommufd backend, but allocating a large
bitmap is still not optimal.

Optimize this by iterating over the DMAMap list and unmapping each range that
has an active mapping while migration is running. When migration is not
running, unmapping the whole address space in one go remains optimal.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Tested-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
---
 hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 83c5e44413..6876dae727 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -37,6 +37,7 @@
 #include "system/system.h"
 #include "hw/i386/apic_internal.h"
 #include "kvm/kvm_i386.h"
+#include "migration/misc.h"
 #include "migration/vmstate.h"
 #include "trace.h"
 
@@ -4423,6 +4424,42 @@ static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
     vtd_iommu_unlock(s);
 }
 
+/*
+ * Unmapping a large range in one go is not optimal during migration: a large
+ * dirty bitmap needs to be allocated even when only small mappings exist.
+ * Instead, iterate over the DMAMap list and unmap each active mapping.
+ */
+static void vtd_address_space_unmap_in_migration(VTDAddressSpace *as,
+                                                 IOMMUNotifier *n)
+{
+    const DMAMap *map;
+    const DMAMap target = {
+        .iova = n->start,
+        .size = n->end,
+    };
+    IOVATree *tree = as->iova_tree;
+
+    /*
+     * A DMAMap is created during IOMMU page table sync; it is either 4KB or
+     * huge page sized, and always a power of 2 in size. So the DMAMap range
+     * can be used for the UNMAP notification directly.
+     */
+    while ((map = iova_tree_find(tree, &target))) {
+        IOMMUTLBEvent event;
+
+        event.type = IOMMU_NOTIFIER_UNMAP;
+        event.entry.iova = map->iova;
+        event.entry.addr_mask = map->size;
+        event.entry.target_as = &address_space_memory;
+        event.entry.perm = IOMMU_NONE;
+        /* This field is meaningless for unmap */
+        event.entry.translated_addr = 0;
+        memory_region_notify_iommu_one(n, &event);
+
+        iova_tree_remove(tree, *map);
+    }
+}
+
 /* Unmap the whole range in the notifier's scope. */
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
 {
@@ -4432,6 +4469,11 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
     IntelIOMMUState *s = as->iommu_state;
     DMAMap map;
 
+    if (migration_is_running()) {
+        vtd_address_space_unmap_in_migration(as, n);
+        return;
+    }
+
     /*
      * Note: all the codes in this function has a assumption that IOVA
      * bits are no more than VTD_MGAW bits (which is restricted by
-- 
2.47.1
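
To make the effect of the loop concrete outside a QEMU tree, below is a
minimal standalone model (an illustrative sketch, not part of the patch: the
hypothetical find_overlap() stands in for iova_tree_find(), and DMAMap is
reduced to the fields used here; as in QEMU, .size is inclusive, so a 4K page
is { .iova, .size = 0xfff }):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Reduced stand-in for QEMU's DMAMap: an inclusive range [iova, iova+size]. */
typedef struct {
    uint64_t iova;
    uint64_t size;              /* length - 1, e.g. 0xfff for a 4K page */
    int live;
} DMAMap;

/* Stand-in for iova_tree_find(): first live map overlapping [start, end]. */
static DMAMap *find_overlap(DMAMap *maps, int n, uint64_t start, uint64_t end)
{
    for (int i = 0; i < n; i++) {
        if (maps[i].live && maps[i].iova <= end &&
            maps[i].iova + maps[i].size >= start) {
            return &maps[i];
        }
    }
    return NULL;
}

int main(void)
{
    /* Two real mappings in an otherwise empty, huge address space. */
    DMAMap maps[] = {
        { 0x1000,   0xfff,    1 },      /* one 4K page */
        { 0x200000, 0x1fffff, 1 },      /* one 2M huge page */
    };
    DMAMap *map;

    /*
     * The notifier scope is the whole address space, as in the block-domain
     * switch, yet only two UNMAP notifications are emitted (one per live
     * mapping) instead of one notification covering the full range.
     */
    while ((map = find_overlap(maps, 2, 0, UINT64_MAX))) {
        printf("UNMAP iova=0x%" PRIx64 " addr_mask=0x%" PRIx64 "\n",
               map->iova, map->size);
        map->live = 0;                  /* models iova_tree_remove() */
    }
    return 0;
}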
Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
Posted by Yi Liu 3 months, 3 weeks ago
On 2025/9/10 10:37, Zhenzhong Duan wrote:
> If a VFIO device in guest switches from IOMMU domain to block domain,
> vtd_address_space_unmap() is called to unmap whole address space.
> 
> If that happens during migration, migration fails with legacy VFIO
> backend as below:
> 
> Status: failed (vfio_container_dma_unmap(0x561bbbd92d90, 0x100000000000, 0x100000000000) = -7 (Argument list too long))

This should be a giant and busy VM, right? Is a Fixes tag needed, by the way?

> 
> Because legacy VFIO limits maximum bitmap size to 256MB which maps to 8TB on
> 4K page system, when 16TB sized UNMAP notification is sent, unmap_bitmap
> ioctl fails.
> 
> There is no such limitation with iommufd backend, but it's still not optimal
> to allocate large bitmap.
> 
> Optimize it by iterating over DMAMap list to unmap each range with active
> mapping when migration is active. If migration is not active, unmapping the
> whole address space in one go is optimal.
> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
> ---
>   hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 42 insertions(+)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 83c5e44413..6876dae727 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -37,6 +37,7 @@
>   #include "system/system.h"
>   #include "hw/i386/apic_internal.h"
>   #include "kvm/kvm_i386.h"
> +#include "migration/misc.h"
>   #include "migration/vmstate.h"
>   #include "trace.h"
>   
> @@ -4423,6 +4424,42 @@ static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
>       vtd_iommu_unlock(s);
>   }
>   
> +/*
> + * Unmapping a large range in one go is not optimal during migration because
> + * a large dirty bitmap needs to be allocated while there may be only small
> + * mappings, iterate over DMAMap list to unmap each range with active mapping.
> + */
> +static void vtd_address_space_unmap_in_migration(VTDAddressSpace *as,
> +                                                 IOMMUNotifier *n)
> +{
> +    const DMAMap *map;
> +    const DMAMap target = {
> +        .iova = n->start,
> +        .size = n->end,
> +    };
> +    IOVATree *tree = as->iova_tree;
> +
> +    /*
> +     * DMAMap is created during IOMMU page table sync, it's either 4KB or huge
> +     * page size and always a power of 2 in size. So the range of DMAMap could
> +     * be used for UNMAP notification directly.
> +     */
> +    while ((map = iova_tree_find(tree, &target))) {

how about an empty iova_tree? If the guest has not mapped anything for the
device, the tree is empty, and it is fine to not unmap anything. But if the
device is attached to an identity domain, the iova_tree is empty as well.
Are we sure that we need not unmap anything here? It looks like the answer
is yes. But I suspect the unmap failure will then happen on the vfio side?
If yes, we need to consider a complete fix. :)

> +        IOMMUTLBEvent event;
> +
> +        event.type = IOMMU_NOTIFIER_UNMAP;
> +        event.entry.iova = map->iova;
> +        event.entry.addr_mask = map->size;
> +        event.entry.target_as = &address_space_memory;
> +        event.entry.perm = IOMMU_NONE;
> +        /* This field is meaningless for unmap */
> +        event.entry.translated_addr = 0;
> +        memory_region_notify_iommu_one(n, &event);
> +
> +        iova_tree_remove(tree, *map);
> +    }
> +}
> +
>   /* Unmap the whole range in the notifier's scope. */
>   static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>   {
> @@ -4432,6 +4469,11 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>       IntelIOMMUState *s = as->iommu_state;
>       DMAMap map;
>   
> +    if (migration_is_running()) {

If the range is not big, it is still better to unmap in one go, right?
If so, you might add a check on the range here and fall back to the
iova_tree iteration conditionally.

> +        vtd_address_space_unmap_in_migration(as, n);
> +        return;
> +    }
> +
>       /*
>        * Note: all the codes in this function has a assumption that IOVA
>        * bits are no more than VTD_MGAW bits (which is restricted by

Regards,
Yi Liu
RE: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
Posted by Duan, Zhenzhong 3 months, 3 weeks ago

>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>migration
>
>On 2025/9/10 10:37, Zhenzhong Duan wrote:
>> If a VFIO device in guest switches from IOMMU domain to block domain,
>> vtd_address_space_unmap() is called to unmap whole address space.
>>
>> If that happens during migration, migration fails with legacy VFIO
>> backend as below:
>>
>> Status: failed (vfio_container_dma_unmap(0x561bbbd92d90,
>0x100000000000, 0x100000000000) = -7 (Argument list too long))
>
>this should be a giant and busy VM. right? Is a fix tag needed by the way?

VM size is unrelated; it's not a bug, just that the current code doesn't work well with migration.

When the device switches from an IOMMU domain to a block domain, the whole
iommu memory region is disabled. This triggers the unmap on the whole iommu
memory region, no matter how many or how large the mappings in the iommu MR are.

>
>>
>> Because legacy VFIO limits maximum bitmap size to 256MB which maps to
>8TB on
>> 4K page system, when 16TB sized UNMAP notification is sent,
>unmap_bitmap
>> ioctl fails.
>>
>> There is no such limitation with iommufd backend, but it's still not optimal
>> to allocate large bitmap.
>>
>> Optimize it by iterating over DMAMap list to unmap each range with active
>> mapping when migration is active. If migration is not active, unmapping the
>> whole address space in one go is optimal.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>> ---
>>   hw/i386/intel_iommu.c | 42
>++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 42 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 83c5e44413..6876dae727 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -37,6 +37,7 @@
>>   #include "system/system.h"
>>   #include "hw/i386/apic_internal.h"
>>   #include "kvm/kvm_i386.h"
>> +#include "migration/misc.h"
>>   #include "migration/vmstate.h"
>>   #include "trace.h"
>>
>> @@ -4423,6 +4424,42 @@ static void
>vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
>>       vtd_iommu_unlock(s);
>>   }
>>
>> +/*
>> + * Unmapping a large range in one go is not optimal during migration
>because
>> + * a large dirty bitmap needs to be allocated while there may be only small
>> + * mappings, iterate over DMAMap list to unmap each range with active
>mapping.
>> + */
>> +static void vtd_address_space_unmap_in_migration(VTDAddressSpace
>*as,
>> +
>IOMMUNotifier *n)
>> +{
>> +    const DMAMap *map;
>> +    const DMAMap target = {
>> +        .iova = n->start,
>> +        .size = n->end,
>> +    };
>> +    IOVATree *tree = as->iova_tree;
>> +
>> +    /*
>> +     * DMAMap is created during IOMMU page table sync, it's either 4KB
>or huge
>> +     * page size and always a power of 2 in size. So the range of
>DMAMap could
>> +     * be used for UNMAP notification directly.
>> +     */
>> +    while ((map = iova_tree_find(tree, &target))) {
>
>how about an empty iova_tree? If guest has not mapped anything for the
>device, the tree is empty. And it is fine to not unmap anyting. While,
>if the device is attached to an identify domain, the iova_tree is empty
>as well. Are we sure that we need not to unmap anything here? It looks
>the answer is yes. But I'm suspecting the unmap failure will happen in
>the vfio side? If yes, need to consider a complete fix. :)

I don't get what failure will happen, could you elaborate?
In the case of an identity domain, the IOMMU memory region is disabled, so no
iommu notifier will ever be triggered. vfio_listener monitors the memory
address space; if any memory region is disabled, vfio_listener will catch it
and do dirty tracking.


>
>> +        IOMMUTLBEvent event;
>> +
>> +        event.type = IOMMU_NOTIFIER_UNMAP;
>> +        event.entry.iova = map->iova;
>> +        event.entry.addr_mask = map->size;
>> +        event.entry.target_as = &address_space_memory;
>> +        event.entry.perm = IOMMU_NONE;
>> +        /* This field is meaningless for unmap */
>> +        event.entry.translated_addr = 0;
>> +        memory_region_notify_iommu_one(n, &event);
>> +
>> +        iova_tree_remove(tree, *map);
>> +    }
>> +}
>> +
>>   /* Unmap the whole range in the notifier's scope. */
>>   static void vtd_address_space_unmap(VTDAddressSpace *as,
>IOMMUNotifier *n)
>>   {
>> @@ -4432,6 +4469,11 @@ static void
>vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>>       IntelIOMMUState *s = as->iommu_state;
>>       DMAMap map;
>>
>> +    if (migration_is_running()) {
>
>If the range is not big enough, it is still better to unmap in one-go.
>right? If so, might add a check on the range here to go to the iova_tee
>iteration conditionally.

We don't want to dirty-track the IOVA holes between IOVA ranges because that is
time-consuming and useless work; a hole may be large depending on guest behavior.
Meanwhile, the time spent iterating over the iova_tree is trivial. So we prefer
tracking exactly the iova ranges that may actually be dirty. For example, with two
4K mappings at opposite ends of the space, a one-go unmap dirty-tracks the entire
span between them, while iterating touches just those two pages.

Thanks
Zhenzhong

Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
Posted by Yi Liu 3 months, 3 weeks ago
On 2025/10/13 10:50, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>> migration
>>
>> On 2025/9/10 10:37, Zhenzhong Duan wrote:
>>> If a VFIO device in guest switches from IOMMU domain to block domain,
>>> vtd_address_space_unmap() is called to unmap whole address space.
>>>
>>> If that happens during migration, migration fails with legacy VFIO
>>> backend as below:
>>>
>>> Status: failed (vfio_container_dma_unmap(0x561bbbd92d90,
>> 0x100000000000, 0x100000000000) = -7 (Argument list too long))
>>
>> this should be a giant and busy VM. right? Is a fix tag needed by the way?
> 
> VM size is unrelated, it's not a bug, just current code doesn't work well with migration.
> 
> When device switches from IOMMU domain to block domain, the whole iommu
> memory region is disabled, this trigger the unmap on the whole iommu memory
> region,

I got this part.

> no matter how many or how large the mappings are in the iommu MR.

hmmm. A more explicit question: does this error happen with 4G VM memory
as well?

>>
>>>
>>> Because legacy VFIO limits maximum bitmap size to 256MB which maps to
>> 8TB on
>>> 4K page system, when 16TB sized UNMAP notification is sent,
>> unmap_bitmap
>>> ioctl fails.
>>>
>>> There is no such limitation with iommufd backend, but it's still not optimal
>>> to allocate large bitmap.
>>>
>>> Optimize it by iterating over DMAMap list to unmap each range with active
>>> mapping when migration is active. If migration is not active, unmapping the
>>> whole address space in one go is optimal.
>>>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>>> ---
>>>    hw/i386/intel_iommu.c | 42
>> ++++++++++++++++++++++++++++++++++++++++++
>>>    1 file changed, 42 insertions(+)
>>>
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index 83c5e44413..6876dae727 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -37,6 +37,7 @@
>>>    #include "system/system.h"
>>>    #include "hw/i386/apic_internal.h"
>>>    #include "kvm/kvm_i386.h"
>>> +#include "migration/misc.h"
>>>    #include "migration/vmstate.h"
>>>    #include "trace.h"
>>>
>>> @@ -4423,6 +4424,42 @@ static void
>> vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
>>>        vtd_iommu_unlock(s);
>>>    }
>>>
>>> +/*
>>> + * Unmapping a large range in one go is not optimal during migration
>> because
>>> + * a large dirty bitmap needs to be allocated while there may be only small
>>> + * mappings, iterate over DMAMap list to unmap each range with active
>> mapping.
>>> + */
>>> +static void vtd_address_space_unmap_in_migration(VTDAddressSpace
>> *as,
>>> +
>> IOMMUNotifier *n)
>>> +{
>>> +    const DMAMap *map;
>>> +    const DMAMap target = {
>>> +        .iova = n->start,
>>> +        .size = n->end,
>>> +    };
>>> +    IOVATree *tree = as->iova_tree;
>>> +
>>> +    /*
>>> +     * DMAMap is created during IOMMU page table sync, it's either 4KB
>> or huge
>>> +     * page size and always a power of 2 in size. So the range of
>> DMAMap could
>>> +     * be used for UNMAP notification directly.
>>> +     */
>>> +    while ((map = iova_tree_find(tree, &target))) {
>>
>> how about an empty iova_tree? If guest has not mapped anything for the
>> device, the tree is empty. And it is fine to not unmap anyting. While,
>> if the device is attached to an identify domain, the iova_tree is empty
>> as well. Are we sure that we need not to unmap anything here? It looks
>> the answer is yes. But I'm suspecting the unmap failure will happen in
>> the vfio side? If yes, need to consider a complete fix. :)
> 
> Not get what failure will happen, could you elaborate?
> In case of identity domain, IOMMU memory region is disabled, no iommu
> notifier will ever be triggered. vfio_listener monitors memory address space,
> if any memory region is disabled, vfio_listener will catch it and do dirty tracking.

My question comes from the reason why the DMA unmap fails: a big range is
given to the kernel, which the kernel does not support. So if VFIO gives a
big range as well, it should fail as well. And this is possible when a guest
(a VM with a large memory size) switches from an identity domain to a paging
domain. In this case, vfio_listener will unmap all the system MRs, and that
can be a big range if the VM size is big enough.

>>
>>> +        IOMMUTLBEvent event;
>>> +
>>> +        event.type = IOMMU_NOTIFIER_UNMAP;
>>> +        event.entry.iova = map->iova;
>>> +        event.entry.addr_mask = map->size;
>>> +        event.entry.target_as = &address_space_memory;
>>> +        event.entry.perm = IOMMU_NONE;
>>> +        /* This field is meaningless for unmap */
>>> +        event.entry.translated_addr = 0;
>>> +        memory_region_notify_iommu_one(n, &event);
>>> +
>>> +        iova_tree_remove(tree, *map);
>>> +    }
>>> +}
>>> +
>>>    /* Unmap the whole range in the notifier's scope. */
>>>    static void vtd_address_space_unmap(VTDAddressSpace *as,
>> IOMMUNotifier *n)
>>>    {
>>> @@ -4432,6 +4469,11 @@ static void
>> vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>>>        IntelIOMMUState *s = as->iommu_state;
>>>        DMAMap map;
>>>
>>> +    if (migration_is_running()) {
>>
>> If the range is not big enough, it is still better to unmap in one-go.
>> right? If so, might add a check on the range here to go to the iova_tee
>> iteration conditionally.
> 
> We don't want to ditry track IOVA holes between IOVA ranges because it's time consuming and useless work. The hole may be large depending on guest behavior.
> Meanwhile the time iterating on iova_tree is trivial. So we prefer tracking the exact iova ranges that may be dirty actually.

I see. So this is the optimization, and it also works around the above DMA
unmap issue, right? If so, you may want to call that out in the
commit message.

Regards,
Yi Liu
RE: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
Posted by Duan, Zhenzhong 3 months, 3 weeks ago

>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>migration
>
>On 2025/10/13 10:50, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>> migration
>>>
>>> On 2025/9/10 10:37, Zhenzhong Duan wrote:
>>>> If a VFIO device in guest switches from IOMMU domain to block domain,
>>>> vtd_address_space_unmap() is called to unmap whole address space.
>>>>
>>>> If that happens during migration, migration fails with legacy VFIO
>>>> backend as below:
>>>>
>>>> Status: failed (vfio_container_dma_unmap(0x561bbbd92d90,
>>> 0x100000000000, 0x100000000000) = -7 (Argument list too long))
>>>
>>> this should be a giant and busy VM. right? Is a fix tag needed by the way?
>>
>> VM size is unrelated, it's not a bug, just current code doesn't work well with
>migration.
>>
>> When device switches from IOMMU domain to block domain, the whole
>iommu
>> memory region is disabled, this trigger the unmap on the whole iommu
>memory
>> region,
>
>I got this part.
>
>> no matter how many or how large the mappings are in the iommu MR.
>
>hmmm. A more explicit question: does this error happen with 4G VM memory
>as well?

Coincidentally, I remember the QAT team reported this issue with just 4G of VM memory.

>
>>>
>>>>
>>>> Because legacy VFIO limits maximum bitmap size to 256MB which maps
>to
>>> 8TB on
>>>> 4K page system, when 16TB sized UNMAP notification is sent,
>>> unmap_bitmap
>>>> ioctl fails.
>>>>
>>>> There is no such limitation with iommufd backend, but it's still not optimal
>>>> to allocate large bitmap.
>>>>
>>>> Optimize it by iterating over DMAMap list to unmap each range with
>active
>>>> mapping when migration is active. If migration is not active, unmapping
>the
>>>> whole address space in one go is optimal.
>>>>
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>>>> ---
>>>>    hw/i386/intel_iommu.c | 42
>>> ++++++++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 42 insertions(+)
>>>>
>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>> index 83c5e44413..6876dae727 100644
>>>> --- a/hw/i386/intel_iommu.c
>>>> +++ b/hw/i386/intel_iommu.c
>>>> @@ -37,6 +37,7 @@
>>>>    #include "system/system.h"
>>>>    #include "hw/i386/apic_internal.h"
>>>>    #include "kvm/kvm_i386.h"
>>>> +#include "migration/misc.h"
>>>>    #include "migration/vmstate.h"
>>>>    #include "trace.h"
>>>>
>>>> @@ -4423,6 +4424,42 @@ static void
>>> vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
>>>>        vtd_iommu_unlock(s);
>>>>    }
>>>>
>>>> +/*
>>>> + * Unmapping a large range in one go is not optimal during migration
>>> because
>>>> + * a large dirty bitmap needs to be allocated while there may be only
>small
>>>> + * mappings, iterate over DMAMap list to unmap each range with active
>>> mapping.
>>>> + */
>>>> +static void vtd_address_space_unmap_in_migration(VTDAddressSpace
>>> *as,
>>>> +
>>> IOMMUNotifier *n)
>>>> +{
>>>> +    const DMAMap *map;
>>>> +    const DMAMap target = {
>>>> +        .iova = n->start,
>>>> +        .size = n->end,
>>>> +    };
>>>> +    IOVATree *tree = as->iova_tree;
>>>> +
>>>> +    /*
>>>> +     * DMAMap is created during IOMMU page table sync, it's either
>4KB
>>> or huge
>>>> +     * page size and always a power of 2 in size. So the range of
>>> DMAMap could
>>>> +     * be used for UNMAP notification directly.
>>>> +     */
>>>> +    while ((map = iova_tree_find(tree, &target))) {
>>>
>>> how about an empty iova_tree? If guest has not mapped anything for the
>>> device, the tree is empty. And it is fine to not unmap anyting. While,
>>> if the device is attached to an identify domain, the iova_tree is empty
>>> as well. Are we sure that we need not to unmap anything here? It looks
>>> the answer is yes. But I'm suspecting the unmap failure will happen in
>>> the vfio side? If yes, need to consider a complete fix. :)
>>
>> Not get what failure will happen, could you elaborate?
>> In case of identity domain, IOMMU memory region is disabled, no iommu
>> notifier will ever be triggered. vfio_listener monitors memory address
>space,
>> if any memory region is disabled, vfio_listener will catch it and do dirty
>tracking.
>
>My question comes from the reason why DMA unmap fails. It is due to
>a big range is given to kernel while kernel does not support. So if
>VFIO gives a big range as well, it should fail as well. And this is
>possible when guest (a VM with large size memory) switches from identify
>domain to a paging domain. In this case, vfio_listener will unmap all
>the system MRs. And it can be a big range if VM size is big enough.

Got your point. Yes, currently the vfio_type1 driver limits unmap_bitmap to an 8TB size.
If guest memory is large enough to produce a memory region of more than 8TB,
unmap_bitmap will fail. Live-migrating a VM with more than 8TB of memory is a rare case;
instead of fixing it in QEMU with a complex change, I'd suggest bumping the macro below
to enlarge the limit in the kernel, or switching to iommufd, which doesn't have such a limit.

/*
 * Input argument of number of bits to bitmap_set() is unsigned integer, which
 * further casts to signed integer for unaligned multi-bit operation,
 * __bitmap_set().
 * Then maximum bitmap size supported is 2^31 bits divided by 2^3 bits/byte,
 * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K page
 * system.
 */
#define DIRTY_BITMAP_PAGES_MAX   ((u64)INT_MAX)
#define DIRTY_BITMAP_SIZE_MAX    DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)

>
>>>
>>>> +        IOMMUTLBEvent event;
>>>> +
>>>> +        event.type = IOMMU_NOTIFIER_UNMAP;
>>>> +        event.entry.iova = map->iova;
>>>> +        event.entry.addr_mask = map->size;
>>>> +        event.entry.target_as = &address_space_memory;
>>>> +        event.entry.perm = IOMMU_NONE;
>>>> +        /* This field is meaningless for unmap */
>>>> +        event.entry.translated_addr = 0;
>>>> +        memory_region_notify_iommu_one(n, &event);
>>>> +
>>>> +        iova_tree_remove(tree, *map);
>>>> +    }
>>>> +}
>>>> +
>>>>    /* Unmap the whole range in the notifier's scope. */
>>>>    static void vtd_address_space_unmap(VTDAddressSpace *as,
>>> IOMMUNotifier *n)
>>>>    {
>>>> @@ -4432,6 +4469,11 @@ static void
>>> vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>>>>        IntelIOMMUState *s = as->iommu_state;
>>>>        DMAMap map;
>>>>
>>>> +    if (migration_is_running()) {
>>>
>>> If the range is not big enough, it is still better to unmap in one-go.
>>> right? If so, might add a check on the range here to go to the iova_tee
>>> iteration conditionally.
>>
>> We don't want to ditry track IOVA holes between IOVA ranges because it's
>time consuming and useless work. The hole may be large depending on guest
>behavior.
>> Meanwhile the time iterating on iova_tree is trivial. So we prefer tracking
>the exact iova ranges that may be dirty actually.
>
>I see. So this is the optimization. And it also WA the above DMA
>unmap issue as well. right? If so, you may want to call out in the
>commit message.

Yes, the main purpose of this patch is to fix the unmap_bitmap issue; the
optimization comes second. I'll rephrase the description and subject.

Thanks
Zhenzhong
Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
Posted by Yi Liu 3 months, 3 weeks ago
On 2025/10/14 10:31, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>> migration
>>
>> On 2025/10/13 10:50, Duan, Zhenzhong wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>>> migration
>>>>
>>>> On 2025/9/10 10:37, Zhenzhong Duan wrote:
>>>>> If a VFIO device in guest switches from IOMMU domain to block domain,
>>>>> vtd_address_space_unmap() is called to unmap whole address space.
>>>>>
>>>>> If that happens during migration, migration fails with legacy VFIO
>>>>> backend as below:
>>>>>
>>>>> Status: failed (vfio_container_dma_unmap(0x561bbbd92d90,
>>>> 0x100000000000, 0x100000000000) = -7 (Argument list too long))
>>>>
>>>> this should be a giant and busy VM. right? Is a fix tag needed by the way?
>>>
>>> VM size is unrelated, it's not a bug, just current code doesn't work well with
>> migration.
>>>
>>> When device switches from IOMMU domain to block domain, the whole
>> iommu
>>> memory region is disabled, this trigger the unmap on the whole iommu
>> memory
>>> region,
>>
>> I got this part.
>>
>>> no matter how many or how large the mappings are in the iommu MR.
>>
>> hmmm. A more explicit question: does this error happen with 4G VM memory
>> as well?
> 
> Coincidently, I remember QAT team reported this issue just with 4G VM memory.

ok. this might happen with a legacy vIOMMU as the guest triggers map/unmap,
and it can be a large range. But it's still not clear to me how the guest can
map a range of more than 4G if the VM only has 4G of memory.

> 
>>
>>>>
>>>>>
>>>>> Because legacy VFIO limits maximum bitmap size to 256MB which maps
>> to
>>>> 8TB on
>>>>> 4K page system, when 16TB sized UNMAP notification is sent,
>>>> unmap_bitmap
>>>>> ioctl fails.
>>>>>
>>>>> There is no such limitation with iommufd backend, but it's still not optimal
>>>>> to allocate large bitmap.
>>>>>
>>>>> Optimize it by iterating over DMAMap list to unmap each range with
>> active
>>>>> mapping when migration is active. If migration is not active, unmapping
>> the
>>>>> whole address space in one go is optimal.
>>>>>
>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>>>>> ---
>>>>>     hw/i386/intel_iommu.c | 42
>>>> ++++++++++++++++++++++++++++++++++++++++++
>>>>>     1 file changed, 42 insertions(+)
>>>>>
>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>> index 83c5e44413..6876dae727 100644
>>>>> --- a/hw/i386/intel_iommu.c
>>>>> +++ b/hw/i386/intel_iommu.c
>>>>> @@ -37,6 +37,7 @@
>>>>>     #include "system/system.h"
>>>>>     #include "hw/i386/apic_internal.h"
>>>>>     #include "kvm/kvm_i386.h"
>>>>> +#include "migration/misc.h"
>>>>>     #include "migration/vmstate.h"
>>>>>     #include "trace.h"
>>>>>
>>>>> @@ -4423,6 +4424,42 @@ static void
>>>> vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
>>>>>         vtd_iommu_unlock(s);
>>>>>     }
>>>>>
>>>>> +/*
>>>>> + * Unmapping a large range in one go is not optimal during migration
>>>> because
>>>>> + * a large dirty bitmap needs to be allocated while there may be only
>> small
>>>>> + * mappings, iterate over DMAMap list to unmap each range with active
>>>> mapping.
>>>>> + */
>>>>> +static void vtd_address_space_unmap_in_migration(VTDAddressSpace
>>>> *as,
>>>>> +
>>>> IOMMUNotifier *n)
>>>>> +{
>>>>> +    const DMAMap *map;
>>>>> +    const DMAMap target = {
>>>>> +        .iova = n->start,
>>>>> +        .size = n->end,
>>>>> +    };
>>>>> +    IOVATree *tree = as->iova_tree;
>>>>> +
>>>>> +    /*
>>>>> +     * DMAMap is created during IOMMU page table sync, it's either
>> 4KB
>>>> or huge
>>>>> +     * page size and always a power of 2 in size. So the range of
>>>> DMAMap could
>>>>> +     * be used for UNMAP notification directly.
>>>>> +     */
>>>>> +    while ((map = iova_tree_find(tree, &target))) {
>>>>
>>>> how about an empty iova_tree? If guest has not mapped anything for the
>>>> device, the tree is empty. And it is fine to not unmap anyting. While,
>>>> if the device is attached to an identify domain, the iova_tree is empty
>>>> as well. Are we sure that we need not to unmap anything here? It looks
>>>> the answer is yes. But I'm suspecting the unmap failure will happen in
>>>> the vfio side? If yes, need to consider a complete fix. :)
>>>
>>> Not get what failure will happen, could you elaborate?
>>> In case of identity domain, IOMMU memory region is disabled, no iommu
>>> notifier will ever be triggered. vfio_listener monitors memory address
>> space,
>>> if any memory region is disabled, vfio_listener will catch it and do dirty
>> tracking.
>>
>> My question comes from the reason why DMA unmap fails. It is due to
>> a big range is given to kernel while kernel does not support. So if
>> VFIO gives a big range as well, it should fail as well. And this is
>> possible when guest (a VM with large size memory) switches from identify
>> domain to a paging domain. In this case, vfio_listener will unmap all
>> the system MRs. And it can be a big range if VM size is big enough.
> 
> Got you point. Yes, currently vfio_type1 driver limits unmap_bitmap to 8TB size.
> If guest memory is large enough and lead to a memory region of more than 8TB size,
> unmap_bitmap will fail. It's a rare case to live migrate VM with more than 8TB memory,
> instead of fixing it in qemu with complex change, I'd suggest to bump below MACRO
> value to enlarge the limit in kernel, or switch to use iommufd which doesn't have such limit.

This limit shall not affect the usage of device dirty tracking, right?
If so, add something to tell the user that the iommufd backend is better, e.g.
if the memory size is bigger than vfio iommu type1's dirty bitmap limit
(query cap_mig.max_dirty_bitmap_size), then fail the user if they want the
migration capability.
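
A sketch of that query against the type1 uAPI (an illustrative sketch, not
existing QEMU code; it assumes container_fd is an already-set-up VFIO
container and uses only structures and ioctls from <linux/vfio.h>):

#include <linux/vfio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

/* Return type1's max_dirty_bitmap_size, or 0 if the cap is not reported. */
static uint64_t max_dirty_bitmap_size(int container_fd)
{
    struct vfio_iommu_type1_info *info;
    uint32_t argsz = sizeof(*info);
    uint64_t max = 0;

    /* First call learns the real argsz, second call fetches the cap chain. */
    info = calloc(1, argsz);
    info->argsz = argsz;
    if (!ioctl(container_fd, VFIO_IOMMU_GET_INFO, info) &&
        info->argsz > argsz) {
        argsz = info->argsz;
        info = realloc(info, argsz);
        memset(info, 0, argsz);
        info->argsz = argsz;
        ioctl(container_fd, VFIO_IOMMU_GET_INFO, info);
    }

    if ((info->flags & VFIO_IOMMU_INFO_CAPS) && info->cap_offset) {
        struct vfio_info_cap_header *hdr =
            (void *)((char *)info + info->cap_offset);

        for (;;) {
            if (hdr->id == VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION) {
                struct vfio_iommu_type1_info_cap_migration *mig = (void *)hdr;

                max = mig->max_dirty_bitmap_size;
                break;
            }
            if (!hdr->next) {
                break;
            }
            hdr = (void *)((char *)info + hdr->next);
        }
    }
    free(info);
    return max;
}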

> /*
>   * Input argument of number of bits to bitmap_set() is unsigned integer, which
>   * further casts to signed integer for unaligned multi-bit operation,
>   * __bitmap_set().
>   * Then maximum bitmap size supported is 2^31 bits divided by 2^3 bits/byte,
>   * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K page
>   * system.
>   */
> #define DIRTY_BITMAP_PAGES_MAX   ((u64)INT_MAX)
> #define DIRTY_BITMAP_SIZE_MAX    DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
> 
>>
>>>>
>>>>> +        IOMMUTLBEvent event;
>>>>> +
>>>>> +        event.type = IOMMU_NOTIFIER_UNMAP;
>>>>> +        event.entry.iova = map->iova;
>>>>> +        event.entry.addr_mask = map->size;
>>>>> +        event.entry.target_as = &address_space_memory;
>>>>> +        event.entry.perm = IOMMU_NONE;
>>>>> +        /* This field is meaningless for unmap */
>>>>> +        event.entry.translated_addr = 0;
>>>>> +        memory_region_notify_iommu_one(n, &event);
>>>>> +
>>>>> +        iova_tree_remove(tree, *map);
>>>>> +    }
>>>>> +}
>>>>> +
>>>>>     /* Unmap the whole range in the notifier's scope. */
>>>>>     static void vtd_address_space_unmap(VTDAddressSpace *as,
>>>> IOMMUNotifier *n)
>>>>>     {
>>>>> @@ -4432,6 +4469,11 @@ static void
>>>> vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>>>>>         IntelIOMMUState *s = as->iommu_state;
>>>>>         DMAMap map;
>>>>>
>>>>> +    if (migration_is_running()) {
>>>>
>>>> If the range is not big enough, it is still better to unmap in one-go.
>>>> right? If so, might add a check on the range here to go to the iova_tee
>>>> iteration conditionally.
>>>
>>> We don't want to ditry track IOVA holes between IOVA ranges because it's
>> time consuming and useless work. The hole may be large depending on guest
>> behavior.
>>> Meanwhile the time iterating on iova_tree is trivial. So we prefer tracking
>> the exact iova ranges that may be dirty actually.
>>
>> I see. So this is the optimization. And it also WA the above DMA
>> unmap issue as well. right? If so, you may want to call out in the
>> commit message.
> 
> Yes, the main purpose of this patch is to fix the unmap_bitmap issue, then the optimization.
> I'll rephrase the description and subject.

yes. The commit message gives me the impression this is a bug fix, while the
subject says optimization. BTW, perhaps calling it an optimization is
clearer, since this smells more like an optimization. For a fix, I guess
you may need to consider the vfio_listener as well.

Regards,
Yi Liu
RE: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
Posted by Duan, Zhenzhong 3 months, 3 weeks ago

>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>migration
>
>On 2025/10/14 10:31, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>> migration
>>>
>>> On 2025/10/13 10:50, Duan, Zhenzhong wrote:
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>>>> migration
>>>>>
>>>>> On 2025/9/10 10:37, Zhenzhong Duan wrote:
>>>>>> If a VFIO device in guest switches from IOMMU domain to block
>domain,
>>>>>> vtd_address_space_unmap() is called to unmap whole address space.
>>>>>>
>>>>>> If that happens during migration, migration fails with legacy VFIO
>>>>>> backend as below:
>>>>>>
>>>>>> Status: failed (vfio_container_dma_unmap(0x561bbbd92d90,
>>>>> 0x100000000000, 0x100000000000) = -7 (Argument list too long))
>>>>>
>>>>> this should be a giant and busy VM. right? Is a fix tag needed by the
>way?
>>>>
>>>> VM size is unrelated, it's not a bug, just current code doesn't work well
>with
>>> migration.
>>>>
>>>> When device switches from IOMMU domain to block domain, the whole
>>> iommu
>>>> memory region is disabled, this trigger the unmap on the whole iommu
>>> memory
>>>> region,
>>>
>>> I got this part.
>>>
>>>> no matter how many or how large the mappings are in the iommu MR.
>>>
>>> hmmm. A more explicit question: does this error happen with 4G VM
>memory
>>> as well?
>>
>> Coincidently, I remember QAT team reported this issue just with 4G VM
>memory.
>
>ok. this might happen with legacy vIOMMU as guest triggers map/unmap.
>It can be a large range. But it's still not clear to me how can guest
>map a range more than 4G if VM only has 4G memory.

It happens when the guest switches from a DMA domain to a block domain; the below sequence is triggered:

vtd_context_device_invalidate
	vtd_address_space_sync
		vtd_address_space_unmap

You can see the whole iommu address space is unmapped; it's unrelated to the actual mappings in the guest.

>
>>
>>>
>>>>>
>>>>>>
>>>>>> Because legacy VFIO limits maximum bitmap size to 256MB which
>maps
>>> to
>>>>> 8TB on
>>>>>> 4K page system, when 16TB sized UNMAP notification is sent,
>>>>> unmap_bitmap
>>>>>> ioctl fails.
>>>>>>
>>>>>> There is no such limitation with iommufd backend, but it's still not
>optimal
>>>>>> to allocate large bitmap.
>>>>>>
>>>>>> Optimize it by iterating over DMAMap list to unmap each range with
>>> active
>>>>>> mapping when migration is active. If migration is not active,
>unmapping
>>> the
>>>>>> whole address space in one go is optimal.
>>>>>>
>>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>>>>>> ---
>>>>>>     hw/i386/intel_iommu.c | 42
>>>>> ++++++++++++++++++++++++++++++++++++++++++
>>>>>>     1 file changed, 42 insertions(+)
>>>>>>
>>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>>> index 83c5e44413..6876dae727 100644
>>>>>> --- a/hw/i386/intel_iommu.c
>>>>>> +++ b/hw/i386/intel_iommu.c
>>>>>> @@ -37,6 +37,7 @@
>>>>>>     #include "system/system.h"
>>>>>>     #include "hw/i386/apic_internal.h"
>>>>>>     #include "kvm/kvm_i386.h"
>>>>>> +#include "migration/misc.h"
>>>>>>     #include "migration/vmstate.h"
>>>>>>     #include "trace.h"
>>>>>>
>>>>>> @@ -4423,6 +4424,42 @@ static void
>>>>> vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
>>>>>>         vtd_iommu_unlock(s);
>>>>>>     }
>>>>>>
>>>>>> +/*
>>>>>> + * Unmapping a large range in one go is not optimal during migration
>>>>> because
>>>>>> + * a large dirty bitmap needs to be allocated while there may be only
>>> small
>>>>>> + * mappings, iterate over DMAMap list to unmap each range with
>active
>>>>> mapping.
>>>>>> + */
>>>>>> +static void
>vtd_address_space_unmap_in_migration(VTDAddressSpace
>>>>> *as,
>>>>>> +
>>>>> IOMMUNotifier *n)
>>>>>> +{
>>>>>> +    const DMAMap *map;
>>>>>> +    const DMAMap target = {
>>>>>> +        .iova = n->start,
>>>>>> +        .size = n->end,
>>>>>> +    };
>>>>>> +    IOVATree *tree = as->iova_tree;
>>>>>> +
>>>>>> +    /*
>>>>>> +     * DMAMap is created during IOMMU page table sync, it's either
>>> 4KB
>>>>> or huge
>>>>>> +     * page size and always a power of 2 in size. So the range of
>>>>> DMAMap could
>>>>>> +     * be used for UNMAP notification directly.
>>>>>> +     */
>>>>>> +    while ((map = iova_tree_find(tree, &target))) {
>>>>>
>>>>> how about an empty iova_tree? If guest has not mapped anything for
>the
>>>>> device, the tree is empty. And it is fine to not unmap anyting. While,
>>>>> if the device is attached to an identify domain, the iova_tree is empty
>>>>> as well. Are we sure that we need not to unmap anything here? It looks
>>>>> the answer is yes. But I'm suspecting the unmap failure will happen in
>>>>> the vfio side? If yes, need to consider a complete fix. :)
>>>>
>>>> Not get what failure will happen, could you elaborate?
>>>> In case of identity domain, IOMMU memory region is disabled, no iommu
>>>> notifier will ever be triggered. vfio_listener monitors memory address
>>> space,
>>>> if any memory region is disabled, vfio_listener will catch it and do dirty
>>> tracking.
>>>
>>> My question comes from the reason why DMA unmap fails. It is due to
>>> a big range is given to kernel while kernel does not support. So if
>>> VFIO gives a big range as well, it should fail as well. And this is
>>> possible when guest (a VM with large size memory) switches from identify
>>> domain to a paging domain. In this case, vfio_listener will unmap all
>>> the system MRs. And it can be a big range if VM size is big enough.
>>
>> Got you point. Yes, currently vfio_type1 driver limits unmap_bitmap to 8TB
>size.
>> If guest memory is large enough and lead to a memory region of more than
>8TB size,
>> unmap_bitmap will fail. It's a rare case to live migrate VM with more than
>8TB memory,
>> instead of fixing it in qemu with complex change, I'd suggest to bump below
>MACRO
>> value to enlarge the limit in kernel, or switch to use iommufd which doesn't
>have such limit.
>
>This limit shall not affect the usage of device dirty tracking. right?
>If yes, add something to tell user use iommufd backend is better. e.g
>if memory size is bigger than the limit of vfio iommu type1's dirty
>bitmap limit (query cap_mig.max_dirty_bitmap_size), then fail user if
>user wants migration capability.

Do you mean just dirty tracking rather than migration, like dirty rate?
In that case, there is the error print above; I think that's enough as a hint?

I guess you mean to add a migration blocker if the limit is reached? That's hard,
because the limit only matters for the identity domain; a DMA domain in the guest
doesn't have such a limit, and we can't know the guest's choice of domain type
for each attached VFIO device.

>
>> /*
>>   * Input argument of number of bits to bitmap_set() is unsigned integer,
>which
>>   * further casts to signed integer for unaligned multi-bit operation,
>>   * __bitmap_set().
>>   * Then maximum bitmap size supported is 2^31 bits divided by 2^3
>bits/byte,
>>   * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K
>page
>>   * system.
>>   */
>> #define DIRTY_BITMAP_PAGES_MAX   ((u64)INT_MAX)
>> #define DIRTY_BITMAP_SIZE_MAX
>DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
>>
>>>
>>>>>
>>>>>> +        IOMMUTLBEvent event;
>>>>>> +
>>>>>> +        event.type = IOMMU_NOTIFIER_UNMAP;
>>>>>> +        event.entry.iova = map->iova;
>>>>>> +        event.entry.addr_mask = map->size;
>>>>>> +        event.entry.target_as = &address_space_memory;
>>>>>> +        event.entry.perm = IOMMU_NONE;
>>>>>> +        /* This field is meaningless for unmap */
>>>>>> +        event.entry.translated_addr = 0;
>>>>>> +        memory_region_notify_iommu_one(n, &event);
>>>>>> +
>>>>>> +        iova_tree_remove(tree, *map);
>>>>>> +    }
>>>>>> +}
>>>>>> +
>>>>>>     /* Unmap the whole range in the notifier's scope. */
>>>>>>     static void vtd_address_space_unmap(VTDAddressSpace *as,
>>>>> IOMMUNotifier *n)
>>>>>>     {
>>>>>> @@ -4432,6 +4469,11 @@ static void
>>>>> vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>>>>>>         IntelIOMMUState *s = as->iommu_state;
>>>>>>         DMAMap map;
>>>>>>
>>>>>> +    if (migration_is_running()) {
>>>>>
>>>>> If the range is not big enough, it is still better to unmap in one-go.
>>>>> right? If so, might add a check on the range here to go to the iova_tee
>>>>> iteration conditionally.
>>>>
>>>> We don't want to ditry track IOVA holes between IOVA ranges because
>it's
>>> time consuming and useless work. The hole may be large depending on
>guest
>>> behavior.
>>>> Meanwhile the time iterating on iova_tree is trivial. So we prefer tracking
>>> the exact iova ranges that may be dirty actually.
>>>
>>> I see. So this is the optimization. And it also WA the above DMA
>>> unmap issue as well. right? If so, you may want to call out in the
>>> commit message.
>>
>> Yes, the main purpose of this patch is to fix the unmap_bitmap issue, then
>the optimization.
>> I'll rephrase the description and subject.
>
>yes. The commit message gives me the impression this is bug fix. While
>subject is optimization. BTW. perhaps call it as an optimization is
>clearer since this smells more like an optimization. For fix, I guess
>you may need to consider the vfio_listener as well.

Do you have any idea to fix it with vfio_listener?

My understanding is, it's hard to fix this issue from the vfio core, because
vfio_listener doesn't know the mapping details in the guest; only the vIOMMU
has them cached through DMAMap.

Thanks
Zhenzhong
Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
Posted by Yi Liu 3 months, 3 weeks ago
On 2025/10/15 15:48, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>> migration
>>
>> On 2025/10/14 10:31, Duan, Zhenzhong wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>>> migration
>>>>
>>>> On 2025/10/13 10:50, Duan, Zhenzhong wrote:
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>>>>> migration
>>>>>>
>>>>>> On 2025/9/10 10:37, Zhenzhong Duan wrote:
>>>>>>> If a VFIO device in guest switches from IOMMU domain to block
>> domain,
>>>>>>> vtd_address_space_unmap() is called to unmap whole address space.
>>>>>>>
>>>>>>> If that happens during migration, migration fails with legacy VFIO
>>>>>>> backend as below:
>>>>>>>
>>>>>>> Status: failed (vfio_container_dma_unmap(0x561bbbd92d90,
>>>>>> 0x100000000000, 0x100000000000) = -7 (Argument list too long))
>>>>>>
>>>>>> this should be a giant and busy VM. right? Is a fix tag needed by the
>> way?
>>>>>
>>>>> VM size is unrelated, it's not a bug, just current code doesn't work well
>> with
>>>> migration.
>>>>>
>>>>> When device switches from IOMMU domain to block domain, the whole
>>>> iommu
>>>>> memory region is disabled, this trigger the unmap on the whole iommu
>>>> memory
>>>>> region,
>>>>
>>>> I got this part.
>>>>
>>>>> no matter how many or how large the mappings are in the iommu MR.
>>>>
>>>> hmmm. A more explicit question: does this error happen with 4G VM
>> memory
>>>> as well?
>>>
>>> Coincidently, I remember QAT team reported this issue just with 4G VM
>> memory.
>>
>> ok. this might happen with legacy vIOMMU as guest triggers map/unmap.
>> It can be a large range. But it's still not clear to me how can guest
>> map a range more than 4G if VM only has 4G memory.
> 
> It happens when guest switch from DMA domain to block domain, below sequence is triggered:
> 
> vtd_context_device_invalidate
> 	vtd_address_space_sync
> 		vtd_address_space_unmap
> 
> You can see the whole iommu address space is unmapped, it's unrelated to actual mapping in guest.

got it.

>>
>>>
>>>>
>>>>>>
>>>>>>>
>>>>>>> Because legacy VFIO limits maximum bitmap size to 256MB which
>> maps
>>>> to
>>>>>> 8TB on
>>>>>>> 4K page system, when 16TB sized UNMAP notification is sent,
>>>>>> unmap_bitmap
>>>>>>> ioctl fails.
>>>>>>>
>>>>>>> There is no such limitation with iommufd backend, but it's still not
>> optimal
>>>>>>> to allocate large bitmap.
>>>>>>>
>>>>>>> Optimize it by iterating over DMAMap list to unmap each range with
>>>> active
>>>>>>> mapping when migration is active. If migration is not active,
>> unmapping
>>>> the
>>>>>>> whole address space in one go is optimal.
>>>>>>>
>>>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>>>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>>>>>>> ---
>>>>>>>      hw/i386/intel_iommu.c | 42
>>>>>> ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>      1 file changed, 42 insertions(+)
>>>>>>>
>>>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>>>> index 83c5e44413..6876dae727 100644
>>>>>>> --- a/hw/i386/intel_iommu.c
>>>>>>> +++ b/hw/i386/intel_iommu.c
>>>>>>> @@ -37,6 +37,7 @@
>>>>>>>      #include "system/system.h"
>>>>>>>      #include "hw/i386/apic_internal.h"
>>>>>>>      #include "kvm/kvm_i386.h"
>>>>>>> +#include "migration/misc.h"
>>>>>>>      #include "migration/vmstate.h"
>>>>>>>      #include "trace.h"
>>>>>>>
>>>>>>> @@ -4423,6 +4424,42 @@ static void
>>>>>> vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
>>>>>>>          vtd_iommu_unlock(s);
>>>>>>>      }
>>>>>>>
>>>>>>> +/*
>>>>>>> + * Unmapping a large range in one go is not optimal during migration
>>>>>> because
>>>>>>> + * a large dirty bitmap needs to be allocated while there may be only
>>>> small
>>>>>>> + * mappings, iterate over DMAMap list to unmap each range with
>> active
>>>>>> mapping.
>>>>>>> + */
>>>>>>> +static void
>> vtd_address_space_unmap_in_migration(VTDAddressSpace
>>>>>> *as,
>>>>>>> +
>>>>>> IOMMUNotifier *n)
>>>>>>> +{
>>>>>>> +    const DMAMap *map;
>>>>>>> +    const DMAMap target = {
>>>>>>> +        .iova = n->start,
>>>>>>> +        .size = n->end,
>>>>>>> +    };
>>>>>>> +    IOVATree *tree = as->iova_tree;
>>>>>>> +
>>>>>>> +    /*
>>>>>>> +     * DMAMap is created during IOMMU page table sync, it's either
>>>> 4KB
>>>>>> or huge
>>>>>>> +     * page size and always a power of 2 in size. So the range of
>>>>>> DMAMap could
>>>>>>> +     * be used for UNMAP notification directly.
>>>>>>> +     */
>>>>>>> +    while ((map = iova_tree_find(tree, &target))) {
>>>>>>
>>>>>> how about an empty iova_tree? If guest has not mapped anything for
>> the
>>>>>> device, the tree is empty. And it is fine to not unmap anyting. While,
>>>>>> if the device is attached to an identify domain, the iova_tree is empty
>>>>>> as well. Are we sure that we need not to unmap anything here? It looks
>>>>>> the answer is yes. But I'm suspecting the unmap failure will happen in
>>>>>> the vfio side? If yes, need to consider a complete fix. :)
>>>>>
>>>>> Not get what failure will happen, could you elaborate?
>>>>> In case of identity domain, IOMMU memory region is disabled, no iommu
>>>>> notifier will ever be triggered. vfio_listener monitors memory address
>>>> space,
>>>>> if any memory region is disabled, vfio_listener will catch it and do dirty
>>>> tracking.
>>>>
>>>> My question comes from the reason why DMA unmap fails. It is due to
>>>> a big range is given to kernel while kernel does not support. So if
>>>> VFIO gives a big range as well, it should fail as well. And this is
>>>> possible when guest (a VM with large size memory) switches from identify
>>>> domain to a paging domain. In this case, vfio_listener will unmap all
>>>> the system MRs. And it can be a big range if VM size is big enough.
>>>
>>> Got you point. Yes, currently vfio_type1 driver limits unmap_bitmap to 8TB
>> size.
>>> If guest memory is large enough and lead to a memory region of more than
>> 8TB size,
>>> unmap_bitmap will fail. It's a rare case to live migrate VM with more than
>> 8TB memory,
>>> instead of fixing it in qemu with complex change, I'd suggest to bump below
>> MACRO
>>> value to enlarge the limit in kernel, or switch to use iommufd which doesn't
>> have such limit.
>>
>> This limit shall not affect the usage of device dirty tracking. right?
>> If yes, add something to tell user use iommufd backend is better. e.g
>> if memory size is bigger than the limit of vfio iommu type1's dirty
>> bitmap limit (query cap_mig.max_dirty_bitmap_size), then fail user if
>> user wants migration capability.
> 
> Do you mean just dirty tracking instead of migration, like dirtyrate?
> In that case, there is error print as above, I think that's enough as a hint?

it's not related to dirty rate.

> I guess you mean to add a migration blocker if limit is reached? It's hard
> because the limit is only helpful for identity domain, DMA domain in guest
> doesn't have such limit, and we can't know guest's choice of domain type
> of each VFIO device attached.

I meant a blocker for booting QEMU if the limit is hit, something like below:

	if (VM memory > 8TB && legacy_container_backend && migration_enabled)
		fail the VM boot.
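
In standalone C, the condition could look roughly like this (an illustrative
sketch of the check above, not QEMU code; the 256MB constant mirrors the
type1 DIRTY_BITMAP_SIZE_MAX quoted earlier and a 4K page size is assumed):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define DIRTY_BITMAP_SIZE_MAX (256ULL << 20)    /* 256MB, from vfio type1 */
#define PAGE_SIZE             4096ULL

/* Would a one-go dirty bitmap for ram_bytes fit under the type1 limit? */
static bool dirty_bitmap_fits(uint64_t ram_bytes)
{
    uint64_t bitmap_bytes = ram_bytes / PAGE_SIZE / 8;  /* 1 bit per page */

    return bitmap_bytes <= DIRTY_BITMAP_SIZE_MAX;
}

int main(void)
{
    printf("8TB:  %s\n", dirty_bitmap_fits(8ULL << 40) ? "ok" : "fail boot");
    printf("16TB: %s\n", dirty_bitmap_fits(16ULL << 40) ? "ok" : "fail boot");
    return 0;
}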

>>
>>> /*
>>>    * Input argument of number of bits to bitmap_set() is unsigned integer,
>> which
>>>    * further casts to signed integer for unaligned multi-bit operation,
>>>    * __bitmap_set().
>>>    * Then maximum bitmap size supported is 2^31 bits divided by 2^3
>> bits/byte,
>>>    * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K
>> page
>>>    * system.
>>>    */
>>> #define DIRTY_BITMAP_PAGES_MAX   ((u64)INT_MAX)
>>> #define DIRTY_BITMAP_SIZE_MAX
>> DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
>>>
>>>>
>>>>>>
>>>>>>> +        IOMMUTLBEvent event;
>>>>>>> +
>>>>>>> +        event.type = IOMMU_NOTIFIER_UNMAP;
>>>>>>> +        event.entry.iova = map->iova;
>>>>>>> +        event.entry.addr_mask = map->size;
>>>>>>> +        event.entry.target_as = &address_space_memory;
>>>>>>> +        event.entry.perm = IOMMU_NONE;
>>>>>>> +        /* This field is meaningless for unmap */
>>>>>>> +        event.entry.translated_addr = 0;
>>>>>>> +        memory_region_notify_iommu_one(n, &event);
>>>>>>> +
>>>>>>> +        iova_tree_remove(tree, *map);
>>>>>>> +    }
>>>>>>> +}
>>>>>>> +
>>>>>>>      /* Unmap the whole range in the notifier's scope. */
>>>>>>>      static void vtd_address_space_unmap(VTDAddressSpace *as,
>>>>>> IOMMUNotifier *n)
>>>>>>>      {
>>>>>>> @@ -4432,6 +4469,11 @@ static void
>>>>>> vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>>>>>>>          IntelIOMMUState *s = as->iommu_state;
>>>>>>>          DMAMap map;
>>>>>>>
>>>>>>> +    if (migration_is_running()) {
>>>>>>
>>>>>> If the range is not big enough, it is still better to unmap in one-go.
>>>>>> right? If so, might add a check on the range here to go to the iova_tee
>>>>>> iteration conditionally.
>>>>>
>>>>> We don't want to ditry track IOVA holes between IOVA ranges because
>> it's
>>>> time consuming and useless work. The hole may be large depending on
>> guest
>>>> behavior.
>>>>> Meanwhile the time iterating on iova_tree is trivial. So we prefer tracking
>>>> the exact iova ranges that may be dirty actually.
>>>>
>>>> I see. So this is the optimization. And it also WA the above DMA
>>>> unmap issue as well. right? If so, you may want to call out in the
>>>> commit message.
>>>
>>> Yes, the main purpose of this patch is to fix the unmap_bitmap issue, then
>> the optimization.
>>> I'll rephrase the description and subject.
>>
>> yes. The commit message gives me the impression this is bug fix. While
>> subject is optimization. BTW. perhaps call it as an optimization is
>> clearer since this smells more like an optimization. For fix, I guess
>> you may need to consider the vfio_listener as well.
> 
> Do you have any idea to fix it with vfio_listener?

no good idea.

> My understanding is, it's hard to fix this issue from vfio core, because vfio_listener doesn't
> know the mapping details in the guest, only vIOMMU cached them through DMAMap.

yes, that's why I suggested the above fix/WA to avoid it.

Regards,
Yi Liu
RE: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
Posted by Duan, Zhenzhong 3 months, 3 weeks ago

>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>migration
>
>On 2025/10/15 15:48, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>> migration
>>>
>>> On 2025/10/14 10:31, Duan, Zhenzhong wrote:
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>>>> migration
>>>>>
>>>>> On 2025/10/13 10:50, Duan, Zhenzhong wrote:
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>>>>>> migration
>>>>>>>
>>>>>>> On 2025/9/10 10:37, Zhenzhong Duan wrote:
>>>>>>>> If a VFIO device in guest switches from IOMMU domain to block
>>> domain,
>>>>>>>> vtd_address_space_unmap() is called to unmap whole address
>space.
>>>>>>>>
>>>>>>>> If that happens during migration, migration fails with legacy VFIO
>>>>>>>> backend as below:
>>>>>>>>
>>>>>>>> Status: failed (vfio_container_dma_unmap(0x561bbbd92d90, 0x100000000000, 0x100000000000) = -7 (Argument list too long))
>>>>>>>
>>>>>>> this should be a giant and busy VM, right? Is a Fixes tag needed, by the way?
>>>>>>
>>>>>> VM size is unrelated; it's not a bug, just that the current code doesn't
>>>>>> work well with migration.
>>>>>>
>>>>>> When the device switches from IOMMU domain to block domain, the whole
>>>>>> iommu memory region is disabled; this triggers the unmap on the whole
>>>>>> iommu memory region,
>>>>>
>>>>> I got this part.
>>>>>
>>>>>> no matter how many or how large the mappings are in the iommu MR.
>>>>>
>>>>> hmmm. A more explicit question: does this error happen with 4G VM memory
>>>>> as well?
>>>>
>>>> Coincidentally, I remember the QAT team reported this issue just with 4G VM
>>>> memory.
>>>
>>> ok. this might happen with legacy vIOMMU as guest triggers map/unmap.
>>> It can be a large range. But it's still not clear to me how a guest can
>>> map a range larger than 4G if the VM only has 4G memory.
>>
>> It happens when the guest switches from DMA domain to block domain; the
>> below sequence is triggered:
>>
>> vtd_context_device_invalidate
>> 	vtd_address_space_sync
>> 		vtd_address_space_unmap
>>
>> You can see the whole iommu address space is unmapped; it's unrelated to the
>> actual mappings in the guest.
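
To connect this to the failure log above (illustrative arithmetic only, not
QEMU code): the logged unmap covers a 16TB range, which needs roughly twice
the bitmap bits that type1's ~INT_MAX cap allows, hence errno 7, i.e. E2BIG
("Argument list too long"):

    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t unmap_size  = 0x100000000000ULL; /* 16 TB, from the log  */
        uint64_t bitmap_bits = unmap_size >> 12;  /* one bit per 4K page  */
        uint64_t type1_cap   = 1ULL << 31;        /* ~INT_MAX bitmap bits */

        printf("needs %" PRIu64 " bits, cap is %" PRIu64 " bits -> %s\n",
               bitmap_bits, type1_cap,
               bitmap_bits > type1_cap ? "fails with -E2BIG" : "ok");
        return 0;
    }
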
>
>got it.
>
>>>
>>>>
>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Because legacy VFIO limits maximum bitmap size to 256MB which maps to
>>>>>>>> 8TB on 4K page system, when 16TB sized UNMAP notification is sent,
>>>>>>>> unmap_bitmap ioctl fails.
>>>>>>>>
>>>>>>>> There is no such limitation with iommufd backend, but it's still not
>>>>>>>> optimal to allocate large bitmap.
>>>>>>>>
>>>>>>>> Optimize it by iterating over DMAMap list to unmap each range with
>>>>>>>> active mapping when migration is active. If migration is not active,
>>>>>>>> unmapping the whole address space in one go is optimal.
>>>>>>>>
>>>>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>>>>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>>>>>>>> ---
>>>>>>>>      hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>      1 file changed, 42 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>>>>> index 83c5e44413..6876dae727 100644
>>>>>>>> --- a/hw/i386/intel_iommu.c
>>>>>>>> +++ b/hw/i386/intel_iommu.c
>>>>>>>> @@ -37,6 +37,7 @@
>>>>>>>>      #include "system/system.h"
>>>>>>>>      #include "hw/i386/apic_internal.h"
>>>>>>>>      #include "kvm/kvm_i386.h"
>>>>>>>> +#include "migration/misc.h"
>>>>>>>>      #include "migration/vmstate.h"
>>>>>>>>      #include "trace.h"
>>>>>>>>
>>>>>>>> @@ -4423,6 +4424,42 @@ static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
>>>>>>>>          vtd_iommu_unlock(s);
>>>>>>>>      }
>>>>>>>>
>>>>>>>> +/*
>>>>>>>> + * Unmapping a large range in one go is not optimal during migration because
>>>>>>>> + * a large dirty bitmap needs to be allocated while there may be only small
>>>>>>>> + * mappings, iterate over DMAMap list to unmap each range with active mapping.
>>>>>>>> + */
>>>>>>>> +static void vtd_address_space_unmap_in_migration(VTDAddressSpace *as,
>>>>>>>> +                                                 IOMMUNotifier *n)
>>>>>>>> +{
>>>>>>>> +    const DMAMap *map;
>>>>>>>> +    const DMAMap target = {
>>>>>>>> +        .iova = n->start,
>>>>>>>> +        .size = n->end,
>>>>>>>> +    };
>>>>>>>> +    IOVATree *tree = as->iova_tree;
>>>>>>>> +
>>>>>>>> +    /*
>>>>>>>> +     * DMAMap is created during IOMMU page table sync, it's either 4KB or huge
>>>>>>>> +     * page size and always a power of 2 in size. So the range of DMAMap could
>>>>>>>> +     * be used for UNMAP notification directly.
>>>>>>>> +     */
>>>>>>>> +    while ((map = iova_tree_find(tree, &target))) {
>>>>>>>
>>>>>>> How about an empty iova_tree? If the guest has not mapped anything for the
>>>>>>> device, the tree is empty, and it is fine to not unmap anything. While,
>>>>>>> if the device is attached to an identity domain, the iova_tree is empty
>>>>>>> as well. Are we sure that we don't need to unmap anything here? It looks
>>>>>>> like the answer is yes. But I'm suspecting the unmap failure will happen on
>>>>>>> the vfio side? If yes, need to consider a complete fix. :)
>>>>>>
>>>>>> I don't get what failure will happen, could you elaborate?
>>>>>> In case of identity domain, the IOMMU memory region is disabled, so no
>>>>>> iommu notifier will ever be triggered. vfio_listener monitors the memory
>>>>>> address space; if any memory region is disabled, vfio_listener will catch
>>>>>> it and do dirty tracking.
>>>>>
>>>>> My question comes from the reason why DMA unmap fails. It is due to
>>>>> a big range being given to the kernel while the kernel does not support it.
>>>>> So if VFIO gives a big range as well, it should fail as well. And this is
>>>>> possible when the guest (a VM with large size memory) switches from an
>>>>> identity domain to a paging domain. In this case, vfio_listener will unmap
>>>>> all the system MRs. And it can be a big range if VM size is big enough.
>>>>
>>>> Got your point. Yes, currently the vfio_type1 driver limits unmap_bitmap to
>>>> 8TB size.
>>>> If guest memory is large enough to lead to a memory region of more than
>>>> 8TB size, unmap_bitmap will fail. It's a rare case to live migrate a VM
>>>> with more than 8TB memory; instead of fixing it in qemu with a complex
>>>> change, I'd suggest bumping the below MACRO value to enlarge the limit in
>>>> the kernel, or switching to iommufd which doesn't have such a limit.
>>>
>>> This limit shall not affect the usage of device dirty tracking, right?
>>> If yes, add something to tell the user that the iommufd backend is better,
>>> e.g. if memory size is bigger than vfio iommu type1's dirty bitmap limit
>>> (query cap_mig.max_dirty_bitmap_size), then fail the user if the user
>>> wants migration capability.
>>
>> Do you mean just dirty tracking instead of migration, like dirtyrate?
>> In that case, there is an error print as above; I think that's enough as a hint?
>
>it's not related to dirtyrate.
>
>> I guess you mean to add a migration blocker if the limit is reached? It's hard
>> because the limit is only relevant for an identity domain; a DMA domain in the
>> guest doesn't have such a limit, and we can't know the guest's choice of domain
>> type for each attached VFIO device.
>
>I meant a blocker to boot QEMU if the limit applies, something like below:
>
>	if (VM memory > 8TB && legacy_container_backend && migration_enabled)
>		fail the VM boot.

OK, will add below to vfio_migration_realize() with an extra patch:

    if (!vbasedev->iommufd && current_machine->ram_size > 8 * TiB) {
        /*
         * The 8TB comes from the default kernel and QEMU config; it may be
         * conservative here as the VM can use large pages or run with vIOMMU,
         * so the limitation may be relaxed. But 8TB is already quite
         * large for live migration. One can also switch to the IOMMUFD
         * backend if there is a need to migrate a large VM.
         */
        error_setg(&err, "%s: Migration is currently not supported "
                   "for VMs with more than approximately 8TB of memory "
                   "due to a limitation in the VFIO type1 driver", vbasedev->name);
        goto add_blocker;
    }

Thanks
Zhenzhong
Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
Posted by Yi Liu 3 months, 3 weeks ago
On 2025/10/16 16:48, Duan, Zhenzhong wrote:
> 
>>>>>>>> How about an empty iova_tree? If the guest has not mapped anything for the
>>>>>>>> device, the tree is empty, and it is fine to not unmap anything. While,
>>>>>>>> if the device is attached to an identity domain, the iova_tree is empty
>>>>>>>> as well. Are we sure that we don't need to unmap anything here? It looks
>>>>>>>> like the answer is yes. But I'm suspecting the unmap failure will happen on
>>>>>>>> the vfio side? If yes, need to consider a complete fix. :)
>>>>>>>
>>>>>>> I don't get what failure will happen, could you elaborate?
>>>>>>> In case of identity domain, the IOMMU memory region is disabled, so no
>>>>>>> iommu notifier will ever be triggered. vfio_listener monitors the memory
>>>>>>> address space; if any memory region is disabled, vfio_listener will catch
>>>>>>> it and do dirty tracking.
>>>>>>
>>>>>> My question comes from the reason why DMA unmap fails. It is due to
>>>>>> a big range being given to the kernel while the kernel does not support it.
>>>>>> So if VFIO gives a big range as well, it should fail as well. And this is
>>>>>> possible when the guest (a VM with large size memory) switches from an
>>>>>> identity domain to a paging domain. In this case, vfio_listener will unmap
>>>>>> all the system MRs. And it can be a big range if VM size is big enough.
>>>>>
>>>>> Got your point. Yes, currently the vfio_type1 driver limits unmap_bitmap to
>>>>> 8TB size.
>>>>> If guest memory is large enough to lead to a memory region of more than
>>>>> 8TB size, unmap_bitmap will fail. It's a rare case to live migrate a VM
>>>>> with more than 8TB memory; instead of fixing it in qemu with a complex
>>>>> change, I'd suggest bumping the below MACRO value to enlarge the limit in
>>>>> the kernel, or switching to iommufd which doesn't have such a limit.
>>>>
>>>> This limit shall not affect the usage of device dirty tracking, right?
>>>> If yes, add something to tell the user that the iommufd backend is better,
>>>> e.g. if memory size is bigger than vfio iommu type1's dirty bitmap limit
>>>> (query cap_mig.max_dirty_bitmap_size), then fail the user if the user
>>>> wants migration capability.
>>>
>>> Do you mean just dirty tracking instead of migration, like dirtyrate?
>>> In that case, there is an error print as above; I think that's enough as a hint?
>>
>> it's not related to dirtyrate.
>>
>>> I guess you mean to add a migration blocker if the limit is reached? It's hard
>>> because the limit is only relevant for an identity domain; a DMA domain in the
>>> guest doesn't have such a limit, and we can't know the guest's choice of domain
>>> type for each attached VFIO device.
>>
>> I meant a blocker to boot QEMU if the limit applies, something like below:
>>
>> 	if (VM memory > 8TB && legacy_container_backend && migration_enabled)
>> 		fail the VM boot.
> 
> OK, will add below to vfio_migration_realize() with an extra patch:

yeah, let's see Alex and Cedric's feedback.

>      if (!vbasedev->iommufd && current_machine->ram_size > 8 * TiB) {
>          /*
>           * The 8TB comes from the default kernel and QEMU config; it may be
>           * conservative here as the VM can use large pages or run with vIOMMU,
>           * so the limitation may be relaxed. But 8TB is already quite
>           * large for live migration. One can also switch to the IOMMUFD
>           * backend if there is a need to migrate a large VM.
>           */

Instead of hard coding 8TB, you may convert cap_mig.max_dirty_bitmap_size to a
memory size. :)
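
Something like the below untested sketch; the cached fields and the helper
name (dirty_pgsizes, max_dirty_bitmap_size, vfio_type1_max_trackable_size)
are assumptions for illustration, not a final implementation:

    /*
     * Derive the largest memory size the type1 dirty bitmap can cover
     * from the kernel-reported cap, instead of hard coding 8TB.
     */
    static uint64_t vfio_type1_max_trackable_size(VFIOContainerBase *bcontainer)
    {
        /* Smallest supported dirty-tracking page size, typically 4K */
        uint64_t pgsize = 1ULL << ctz64(bcontainer->dirty_pgsizes);

        /* One bitmap bit tracks one page */
        return bcontainer->max_dirty_bitmap_size * BITS_PER_BYTE * pgsize;
    }

and the realize check would then compare against it:

    if (!vbasedev->iommufd &&
        current_machine->ram_size > vfio_type1_max_trackable_size(bcontainer)) {
        ...
        goto add_blocker;
    }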

>          error_setg(&err, "%s: Migration is currently not supported "
>                     "for VMs with more than approximately 8TB of memory "
>                     "due to a limitation in the VFIO type1 driver", vbasedev->name);
>          goto add_blocker;
>      }
> 
> Thanks
> Zhenzhong
Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
Posted by Cédric Le Goater 3 months, 3 weeks ago
On 10/16/25 11:53, Yi Liu wrote:
> On 2025/10/16 16:48, Duan, Zhenzhong wrote:
>>
>>>>>>>>> How about an empty iova_tree? If the guest has not mapped anything for the
>>>>>>>>> device, the tree is empty, and it is fine to not unmap anything. While,
>>>>>>>>> if the device is attached to an identity domain, the iova_tree is empty
>>>>>>>>> as well. Are we sure that we don't need to unmap anything here? It looks
>>>>>>>>> like the answer is yes. But I'm suspecting the unmap failure will happen on
>>>>>>>>> the vfio side? If yes, need to consider a complete fix. :)
>>>>>>>>
>>>>>>>> I don't get what failure will happen, could you elaborate?
>>>>>>>> In case of identity domain, the IOMMU memory region is disabled, so no
>>>>>>>> iommu notifier will ever be triggered. vfio_listener monitors the memory
>>>>>>>> address space; if any memory region is disabled, vfio_listener will catch
>>>>>>>> it and do dirty tracking.
>>>>>>>
>>>>>>> My question comes from the reason why DMA unmap fails. It is due to
>>>>>>> a big range being given to the kernel while the kernel does not support it.
>>>>>>> So if VFIO gives a big range as well, it should fail as well. And this is
>>>>>>> possible when the guest (a VM with large size memory) switches from an
>>>>>>> identity domain to a paging domain. In this case, vfio_listener will unmap
>>>>>>> all the system MRs. And it can be a big range if VM size is big enough.
>>>>>>
>>>>>> Got your point. Yes, currently the vfio_type1 driver limits unmap_bitmap to
>>>>>> 8TB size.
>>>>>> If guest memory is large enough to lead to a memory region of more than
>>>>>> 8TB size, unmap_bitmap will fail. It's a rare case to live migrate a VM
>>>>>> with more than 8TB memory; instead of fixing it in qemu with a complex
>>>>>> change, I'd suggest bumping the below MACRO value to enlarge the limit in
>>>>>> the kernel, or switching to iommufd which doesn't have such a limit.
>>>>>
>>>>> This limit shall not affect the usage of device dirty tracking, right?
>>>>> If yes, add something to tell the user that the iommufd backend is better,
>>>>> e.g. if memory size is bigger than vfio iommu type1's dirty bitmap limit
>>>>> (query cap_mig.max_dirty_bitmap_size), then fail the user if the user
>>>>> wants migration capability.
>>>>
>>>> Do you mean just dirty tracking instead of migration, like dirtyrate?
>>>> In that case, there is an error print as above; I think that's enough as a hint?
>>>
>>> it's not related to dirtyrate.
>>>
>>>> I guess you mean to add a migration blocker if the limit is reached? It's hard
>>>> because the limit is only relevant for an identity domain; a DMA domain in the
>>>> guest doesn't have such a limit, and we can't know the guest's choice of domain
>>>> type for each attached VFIO device.
>>>
>>> I meant a blocker to boot QEMU if the limit applies, something like below:
>>>
>>>     if (VM memory > 8TB && legacy_container_backend && migration_enabled)
>>>         fail the VM boot.
>>
>> OK, will add below to vfio_migration_realize() with an extra patch:
> 
> yeah, let's see Alex and Cedric's feedback.
> 
>>      if (!vbasedev->iommufd && current_machine->ram_size > 8 * TiB) {
>>          /*
>>           * The 8TB comes from the default kernel and QEMU config; it may be
>>           * conservative here as the VM can use large pages or run with vIOMMU,
>>           * so the limitation may be relaxed. But 8TB is already quite
>>           * large for live migration. One can also switch to the IOMMUFD
>>           * backend if there is a need to migrate a large VM.
>>           */
> 
> Instead of hard coding 8TB, you may convert cap_mig.max_dirty_bitmap_size to a
> memory size. :)
Yes. It would better reflect that it's a VFIO dirty tracking limitation.


Zhenzhong,

Soft freeze is w45. I plan to send a PR next week, w43, and I will be out
w44. I will have some (limited) time to address more changes on w45.

Thanks,

C.



RE: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
Posted by Duan, Zhenzhong 3 months, 3 weeks ago

>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>migration
>
>On 10/16/25 11:53, Yi Liu wrote:
>> On 2025/10/16 16:48, Duan, Zhenzhong wrote:
>>>
>>>>>>>>>> How about an empty iova_tree? If the guest has not mapped anything for the
>>>>>>>>>> device, the tree is empty, and it is fine to not unmap anything. While,
>>>>>>>>>> if the device is attached to an identity domain, the iova_tree is empty
>>>>>>>>>> as well. Are we sure that we don't need to unmap anything here? It looks
>>>>>>>>>> like the answer is yes. But I'm suspecting the unmap failure will happen on
>>>>>>>>>> the vfio side? If yes, need to consider a complete fix. :)
>>>>>>>>>
>>>>>>>>> I don't get what failure will happen, could you elaborate?
>>>>>>>>> In case of identity domain, the IOMMU memory region is disabled, so no
>>>>>>>>> iommu notifier will ever be triggered. vfio_listener monitors the memory
>>>>>>>>> address space; if any memory region is disabled, vfio_listener will catch
>>>>>>>>> it and do dirty tracking.
>>>>>>>>
>>>>>>>> My question comes from the reason why DMA unmap fails. It is due to
>>>>>>>> a big range being given to the kernel while the kernel does not support it.
>>>>>>>> So if VFIO gives a big range as well, it should fail as well. And this is
>>>>>>>> possible when the guest (a VM with large size memory) switches from an
>>>>>>>> identity domain to a paging domain. In this case, vfio_listener will unmap
>>>>>>>> all the system MRs. And it can be a big range if VM size is big enough.
>>>>>>>
>>>>>>> Got your point. Yes, currently the vfio_type1 driver limits unmap_bitmap to
>>>>>>> 8TB size.
>>>>>>> If guest memory is large enough to lead to a memory region of more than
>>>>>>> 8TB size, unmap_bitmap will fail. It's a rare case to live migrate a VM
>>>>>>> with more than 8TB memory; instead of fixing it in qemu with a complex
>>>>>>> change, I'd suggest bumping the below MACRO value to enlarge the limit in
>>>>>>> the kernel, or switching to iommufd which doesn't have such a limit.
>>>>>>
>>>>>> This limit shall not affect the usage of device dirty tracking, right?
>>>>>> If yes, add something to tell the user that the iommufd backend is better,
>>>>>> e.g. if memory size is bigger than vfio iommu type1's dirty bitmap limit
>>>>>> (query cap_mig.max_dirty_bitmap_size), then fail the user if the user
>>>>>> wants migration capability.
>>>>>
>>>>> Do you mean just dirty tracking instead of migration, like dirtyrate?
>>>>> In that case, there is an error print as above; I think that's enough as a hint?
>>>>
>>>> it's not related to dirtyrate.
>>>>
>>>>> I guess you mean to add a migration blocker if the limit is reached? It's hard
>>>>> because the limit is only relevant for an identity domain; a DMA domain in the
>>>>> guest doesn't have such a limit, and we can't know the guest's choice of domain
>>>>> type for each attached VFIO device.
>>>>
>>>> I meant a blocker to boot QEMU if the limit applies, something like below:
>>>>
>>>>     if (VM memory > 8TB && legacy_container_backend && migration_enabled)
>>>>         fail the VM boot.
>>>
>>> OK, will add below to vfio_migration_realize() with an extra patch:
>>
>> yeah, let's see Alex and Cedric's feedback.
>>
>>>      if (!vbasedev->iommufd && current_machine->ram_size > 8 * TiB) {
>>>          /*
>>>           * The 8TB comes from the default kernel and QEMU config; it may be
>>>           * conservative here as the VM can use large pages or run with vIOMMU,
>>>           * so the limitation may be relaxed. But 8TB is already quite
>>>           * large for live migration. One can also switch to the IOMMUFD
>>>           * backend if there is a need to migrate a large VM.
>>>           */
>>
>> Instead of hard coding 8TB, you may convert cap_mig.max_dirty_bitmap_size to a
>> memory size. :)
>Yes. It would better reflect that it's a VFIO dirty tracking limitation.
>
>
>Zhenzhong,
>
>Soft freeze is w45. I plan to send a PR next week, w43, and I will be out
>w44. I will have some (limited) time to address more changes on w45.

Got it, I'll send a new version soon.

Thanks
Zhenzhong