If a VFIO device in the guest switches from a passthrough (PT) domain to a
block domain, the whole memory address space is unmapped, but we passed a
NULL iotlb entry to unmap_bitmap, so the bitmap query didn't happen and we
lost dirty pages.

By constructing an iotlb entry with iova = gpa for unmap_bitmap, the dirty
bits can be set correctly.

For the IOMMU address space, we still send a NULL iotlb because VFIO
doesn't know the actual mappings in the guest. It's the vIOMMU's
responsibility to send the actual unmapping notifications, e.g.,
vtd_address_space_unmap_in_migration().
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Tested-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
---
hw/vfio/listener.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index 2109101158..3b48f6796c 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -713,14 +713,27 @@ static void vfio_listener_region_del(MemoryListener *listener,
if (try_unmap) {
bool unmap_all = false;
+ IOMMUTLBEntry entry = {}, *iotlb = NULL;
if (int128_eq(llsize, int128_2_64())) {
assert(!iova);
unmap_all = true;
llsize = int128_zero();
}
+
+ /*
+ * Fake an IOTLB entry for identity mapping which is needed by dirty
+ * tracking. In fact, in unmap_bitmap, only translated_addr field is
+ * used to set dirty bitmap.
+ */
+ if (!memory_region_is_iommu(section->mr)) {
+ entry.iova = iova;
+ entry.translated_addr = iova;
+ iotlb = &entry;
+ }
+
ret = vfio_container_dma_unmap(bcontainer, iova, int128_get64(llsize),
- NULL, unmap_all);
+ iotlb, unmap_all);
if (ret) {
error_report("vfio_container_dma_unmap(%p, 0x%"HWADDR_PRIx", "
"0x%"HWADDR_PRIx") = %d (%s)",
--
2.47.1
On 2025/11/6 12:20, Zhenzhong Duan wrote:
> If a VFIO device in the guest switches from a passthrough (PT) domain to
> a block domain, the whole memory address space is unmapped, but we passed
> a NULL iotlb entry to unmap_bitmap, so the bitmap query didn't happen and
> we lost dirty pages.
This is a good catch. :) Have you observed the problem in testing, or just
identified it during patch iteration?
> By constructing an iotlb entry with iova = gpa for unmap_bitmap, the
> dirty bits can be set correctly.
>
> For the IOMMU address space, we still send a NULL iotlb because VFIO
> doesn't know the actual mappings in the guest. It's the vIOMMU's
> responsibility to send the actual unmapping notifications, e.g.,
> vtd_address_space_unmap_in_migration().
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Tested-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
> ---
> hw/vfio/listener.c | 15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
> index 2109101158..3b48f6796c 100644
> --- a/hw/vfio/listener.c
> +++ b/hw/vfio/listener.c
> @@ -713,14 +713,27 @@ static void vfio_listener_region_del(MemoryListener *listener,
>
> if (try_unmap) {
> bool unmap_all = false;
> + IOMMUTLBEntry entry = {}, *iotlb = NULL;
>
> if (int128_eq(llsize, int128_2_64())) {
> assert(!iova);
> unmap_all = true;
> llsize = int128_zero();
> }
> +
> + /*
> + * Fake an IOTLB entry for identity mapping which is needed by dirty
> + * tracking. In fact, in unmap_bitmap, only translated_addr field is
> + * used to set dirty bitmap.
Just say a dirty sync is needed per unmap. You may add an in_migration
check as well; if not in migration, there is no need to do it.
> + */
> + if (!memory_region_is_iommu(section->mr)) {
> + entry.iova = iova;
> + entry.translated_addr = iova;
> + iotlb = &entry;
> + }
> +
Meanwhile, I'm still wondering how to deal with the iommu MR case. Let's
look at a scenario first. When switching from a DMA domain to PT, QEMU
will switch to the PT address space. This triggers
vfio_listener_region_del() and unregisters the iommu notifier, which means
the vIOMMU side needs to do the unmap prior to switching the AS; if not,
the iommu notifier is gone by the time the vIOMMU wants to unmap with an
IOTLBEvent. The virtual intel_iommu calls
vtd_address_space_unmap_in_migration() prior to calling
vtd_switch_address_space(). So I think you need to tweak intel_iommu a
bit to suit the ordering requirement. :)
BTW, should iommu MRs even go down this try_unmap branch? I think such
MRs rely on the vIOMMU to unmap explicitly (hence triggering
vfio_iommu_map_notify()).
> ret = vfio_container_dma_unmap(bcontainer, iova, int128_get64(llsize),
> - NULL, unmap_all);
> + iotlb, unmap_all);
> if (ret) {
> error_report("vfio_container_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> "0x%"HWADDR_PRIx") = %d (%s)",
Regards,
Yi Liu
>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v5 7/9] vfio/listener: Construct iotlb entry when unmap memory address space
>
>On 2025/11/6 12:20, Zhenzhong Duan wrote:
>> If a VFIO device in the guest switches from a passthrough (PT) domain
>> to a block domain, the whole memory address space is unmapped, but we
>> passed a NULL iotlb entry to unmap_bitmap, so the bitmap query didn't
>> happen and we lost dirty pages.
>
>This is a good catch. :) Have you observed the problem in testing, or just
>identified it during patch iteration?
Patch iteration.
>
>> By constructing an iotlb entry with iova = gpa for unmap_bitmap, the
>> dirty bits can be set correctly.
>>
>> For the IOMMU address space, we still send a NULL iotlb because VFIO
>> doesn't know the actual mappings in the guest. It's the vIOMMU's
>> responsibility to send the actual unmapping notifications, e.g.,
>> vtd_address_space_unmap_in_migration().
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> Tested-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
>> ---
>> hw/vfio/listener.c | 15 ++++++++++++++-
>> 1 file changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
>> index 2109101158..3b48f6796c 100644
>> --- a/hw/vfio/listener.c
>> +++ b/hw/vfio/listener.c
>> @@ -713,14 +713,27 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>
>> if (try_unmap) {
>> bool unmap_all = false;
>> + IOMMUTLBEntry entry = {}, *iotlb = NULL;
>>
>> if (int128_eq(llsize, int128_2_64())) {
>> assert(!iova);
>> unmap_all = true;
>> llsize = int128_zero();
>> }
>> +
>> + /*
>> + * Fake an IOTLB entry for identity mapping which is needed by dirty
>> + * tracking. In fact, in unmap_bitmap, only translated_addr field is
>> + * used to set dirty bitmap.
>
>Just say a dirty sync is needed per unmap. You may add an in_migration
>check as well; if not in migration, there is no need to do it.
Dirty tracking is not only for migration, but also for dirty rate/limit,
so a non-NULL iotlb is always passed for a RAM MR. That iotlb pointer is
used only when vfio_container_dirty_tracking_is_started() returns true.

I can add a check on global_dirty_tracking if you prefer.
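
Something like this (an untested sketch; it treats the existing
global_dirty_tracking bitmask as a boolean, exact placement to be
confirmed):

    if (!memory_region_is_iommu(section->mr) && global_dirty_tracking) {
        /* Identity mapping; unmap_bitmap only consumes translated_addr
         * to set the dirty bitmap. */
        entry.iova = iova;
        entry.translated_addr = iova;
        iotlb = &entry;
    }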
>
>> + */
>> + if (!memory_region_is_iommu(section->mr)) {
>> + entry.iova = iova;
>> + entry.translated_addr = iova;
>> + iotlb = &entry;
>> + }
>> +
>
>Meanwhile, I'm still wondering how to deal with the iommu MR case. Let's
>look at a scenario first. When switching from a DMA domain to PT, QEMU
>will switch to the PT address space. This triggers
>vfio_listener_region_del() and unregisters the iommu notifier, which means
>the vIOMMU side needs to do the unmap prior to switching the AS; if not,
>the iommu notifier is gone by the time the vIOMMU wants to unmap with an
>IOTLBEvent. The virtual intel_iommu calls
>vtd_address_space_unmap_in_migration() prior to calling
>vtd_switch_address_space(). So I think you need to tweak intel_iommu a
>bit to suit the ordering requirement. :)
VT-d doesn't support switching from a DMA domain to PT atomically, so it
switches to the block domain in between; see intel_iommu_attach_device()
in the kernel.

So with the sequence DMA -> block -> PT domain, haven't we guaranteed the
order you described? See vtd_pasid_cache_sync_locked().
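
For reference, my understanding of the resulting ordering (simplified):

    DMA domain -> block domain:
        vIOMMU invalidation -> vtd_address_space_unmap_in_migration()
        (iommu notifier still registered, so unmaps are notified)
    block domain -> PT domain:
        vtd_switch_address_space() -> vfio_listener_region_del()
        (iommu notifier unregistered, nothing left to unmap)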
>
>BTW, should iommu MRs even go down this try_unmap branch? I think such
>MRs rely on the vIOMMU to unmap explicitly (hence triggering
>vfio_iommu_map_notify()).
Yes, it's unnecessary, but it's hard for VFIO to distinguish whether
try_unmap is due to a domain switch or a real unmap. I think it's harmless
because the second try_unmap unmaps nothing.
Thanks
Zhenzhong
>
>> ret = vfio_container_dma_unmap(bcontainer, iova, int128_get64(llsize),
>> - NULL, unmap_all);
>> + iotlb, unmap_all);
>> if (ret) {
>> error_report("vfio_container_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>> "0x%"HWADDR_PRIx") = %d (%s)",
>
>Regards,
>Yi Liu
On 2025/11/26 13:45, Duan, Zhenzhong wrote:
>
>
>> -----Original Message-----
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Subject: Re: [PATCH v5 7/9] vfio/listener: Construct iotlb entry when unmap memory address space
>>
>> On 2025/11/6 12:20, Zhenzhong Duan wrote:
>>> If a VFIO device in the guest switches from a passthrough (PT) domain
>>> to a block domain, the whole memory address space is unmapped, but we
>>> passed a NULL iotlb entry to unmap_bitmap, so the bitmap query didn't
>>> happen and we lost dirty pages.
>>
>> This is a good catch. :) Have you observed the problem in testing, or just
>> identified it during patch iteration?
>
> Patch iteration.
>
>>
>>> By constructing an iotlb entry with iova = gpa for unmap_bitmap, the
>>> dirty bits can be set correctly.
>>>
>>> For the IOMMU address space, we still send a NULL iotlb because VFIO
>>> doesn't know the actual mappings in the guest. It's the vIOMMU's
>>> responsibility to send the actual unmapping notifications, e.g.,
>>> vtd_address_space_unmap_in_migration().
>>>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> Tested-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
>>> ---
>>> hw/vfio/listener.c | 15 ++++++++++++++-
>>> 1 file changed, 14 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
>>> index 2109101158..3b48f6796c 100644
>>> --- a/hw/vfio/listener.c
>>> +++ b/hw/vfio/listener.c
>>> @@ -713,14 +713,27 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>>
>>> if (try_unmap) {
>>> bool unmap_all = false;
>>> + IOMMUTLBEntry entry = {}, *iotlb = NULL;
>>>
>>> if (int128_eq(llsize, int128_2_64())) {
>>> assert(!iova);
>>> unmap_all = true;
>>> llsize = int128_zero();
>>> }
>>> +
>>> + /*
>>> + * Fake an IOTLB entry for identity mapping which is needed by dirty
>>> + * tracking. In fact, in unmap_bitmap, only translated_addr field is
>>> + * used to set dirty bitmap.
>>
>> Just say a dirty sync is needed per unmap. You may add an in_migration
>> check as well; if not in migration, there is no need to do it.
>
> Dirty tracking is not only for migration, but also for dirty rate/limit,
> so a non-NULL iotlb is always passed for a RAM MR. That iotlb pointer is
> used only when vfio_container_dirty_tracking_is_started() returns true.
>
> I can add a check on global_dirty_tracking if you prefer.
Yeah, this would be helpful.
>
>>
>>> + */
>>> + if (!memory_region_is_iommu(section->mr)) {
>>> + entry.iova = iova;
>>> + entry.translated_addr = iova;
>>> + iotlb = &entry;
>>> + }
>>> +
>>
>> Meanwhile, I'm still wondering how to deal with the iommu MR case. Let's
>> look at a scenario first. When switching from a DMA domain to PT, QEMU
>> will switch to the PT address space. This triggers
>> vfio_listener_region_del() and unregisters the iommu notifier, which means
>> the vIOMMU side needs to do the unmap prior to switching the AS; if not,
>> the iommu notifier is gone by the time the vIOMMU wants to unmap with an
>> IOTLBEvent. The virtual intel_iommu calls
>> vtd_address_space_unmap_in_migration() prior to calling
>> vtd_switch_address_space(). So I think you need to tweak intel_iommu a
>> bit to suit the ordering requirement. :)
>
> VT-d doesn't support switching from a DMA domain to PT atomically, so it
> switches to the block domain in between; see intel_iommu_attach_device()
> in the kernel.
>
> So with the sequence DMA -> block -> PT domain, haven't we guaranteed the
> order you described? See vtd_pasid_cache_sync_locked().
I see, so the guest helps here. This might be a bit weak since we rely on
guest behavior. I think you may add a TODO or a comment to note it.

BTW, I think the subject can be refined, since the real purpose is to make
dirty-page tracking for the unmap happen in region_del:

vfio/listener: Add missing dirty tracking in region_del

With this and the prior check, this patch looks good to me.
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
>>
>> BTW, should iommu MRs even go down this try_unmap branch? I think such
>> MRs rely on the vIOMMU to unmap explicitly (hence triggering
>> vfio_iommu_map_notify()).
>
> Yes, it's unnecessary, but it's hard for VFIO to distinguish whether
> try_unmap is due to a domain switch or a real unmap. I think it's harmless
> because the second try_unmap unmaps nothing.
Hmmm, can an unmap path go to region_del()? I don't quite get the second
try_unmap; do you mean the vIOMMU unmaps everything via
vfio_iommu_map_notify() and then switches the AS, which triggers
region_del() and this try_unmap branch?
Regards,
Yi Liu
>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v5 7/9] vfio/listener: Construct iotlb entry when unmap memory address space
>
>On 2025/11/26 13:45, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>> Subject: Re: [PATCH v5 7/9] vfio/listener: Construct iotlb entry when unmap memory address space
>>>
>>> On 2025/11/6 12:20, Zhenzhong Duan wrote:
>>>> If a VFIO device in the guest switches from a passthrough (PT) domain
>>>> to a block domain, the whole memory address space is unmapped, but we
>>>> passed a NULL iotlb entry to unmap_bitmap, so the bitmap query didn't
>>>> happen and we lost dirty pages.
>>>
>>> This is a good catch. :) Have you observed the problem in testing, or just
>>> identified it during patch iteration?
>>
>> Patch iteration.
>>
>>>
>>>> By constructing an iotlb entry with iova = gpa for unmap_bitmap, the
>>>> dirty bits can be set correctly.
>>>>
>>>> For the IOMMU address space, we still send a NULL iotlb because VFIO
>>>> doesn't know the actual mappings in the guest. It's the vIOMMU's
>>>> responsibility to send the actual unmapping notifications, e.g.,
>>>> vtd_address_space_unmap_in_migration().
>>>>
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>> Tested-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
>>>> ---
>>>> hw/vfio/listener.c | 15 ++++++++++++++-
>>>> 1 file changed, 14 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
>>>> index 2109101158..3b48f6796c 100644
>>>> --- a/hw/vfio/listener.c
>>>> +++ b/hw/vfio/listener.c
>>>> @@ -713,14 +713,27 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>>>
>>>> if (try_unmap) {
>>>> bool unmap_all = false;
>>>> + IOMMUTLBEntry entry = {}, *iotlb = NULL;
>>>>
>>>> if (int128_eq(llsize, int128_2_64())) {
>>>> assert(!iova);
>>>> unmap_all = true;
>>>> llsize = int128_zero();
>>>> }
>>>> +
>>>> + /*
>>>> + * Fake an IOTLB entry for identity mapping which is needed by dirty
>>>> + * tracking. In fact, in unmap_bitmap, only translated_addr field is
>>>> + * used to set dirty bitmap.
>>>
>>> Just say a dirty sync is needed per unmap. You may add an in_migration
>>> check as well; if not in migration, there is no need to do it.
>>
>> Dirty tracking is not only for migration, but also for dirty rate/limit,
>> so a non-NULL iotlb is always passed for a RAM MR. That iotlb pointer is
>> used only when vfio_container_dirty_tracking_is_started() returns true.
>>
>> I can add a check on global_dirty_tracking if you prefer.
>
>Yeah, this would be helpful.
>
>>
>>>
>>>> + */
>>>> + if (!memory_region_is_iommu(section->mr)) {
>>>> + entry.iova = iova;
>>>> + entry.translated_addr = iova;
>>>> + iotlb = &entry;
>>>> + }
>>>> +
>>>
>>> Meanwhile, I'm still wondering how to deal with the iommu MR case. Let's
>>> look at a scenario first. When switching from a DMA domain to PT, QEMU
>>> will switch to the PT address space. This triggers
>>> vfio_listener_region_del() and unregisters the iommu notifier, which means
>>> the vIOMMU side needs to do the unmap prior to switching the AS; if not,
>>> the iommu notifier is gone by the time the vIOMMU wants to unmap with an
>>> IOTLBEvent. The virtual intel_iommu calls
>>> vtd_address_space_unmap_in_migration() prior to calling
>>> vtd_switch_address_space(). So I think you need to tweak intel_iommu a
>>> bit to suit the ordering requirement. :)
>>
>> VT-d doesn't support switching from a DMA domain to PT atomically, so it
>> switches to the block domain in between; see intel_iommu_attach_device()
>> in the kernel.
>>
>> So with the sequence DMA -> block -> PT domain, haven't we guaranteed the
>> order you described? See vtd_pasid_cache_sync_locked().
>
>I see, so the guest helps here. This might be a bit weak since we rely on
>guest behavior. I think you may add a TODO or a comment to note it.
Makes sense, will add.
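
Perhaps something along these lines (wording to be polished):

    /*
     * TODO: for iommu MRs, dirty tracking relies on the vIOMMU sending
     * unmap notifications before it switches the address space, e.g.
     * vtd_address_space_unmap_in_migration() before
     * vtd_switch_address_space(). This depends on the guest going
     * through the block domain in between.
     */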
>
>BTW, I think the subject can be refined, since the real purpose is to make
>dirty-page tracking for the unmap happen in region_del:
>
>vfio/listener: Add missing dirty tracking in region_del
Will do.
>
>With this and the prior check, this patch looks good to me.
>
>Reviewed-by: Yi Liu <yi.l.liu@intel.com>
>
>>>
>>> BTW, should iommu MRs even go down this try_unmap branch? I think such
>>> MRs rely on the vIOMMU to unmap explicitly (hence triggering
>>> vfio_iommu_map_notify()).
>>
>> Yes, it's unnecessary, but it's hard for VFIO to distinguish whether
>> try_unmap is due to a domain switch or a real unmap. I think it's harmless
>> because the second try_unmap unmaps nothing.
>
>Hmmm, can an unmap path go to region_del()? I don't quite get the second
>try_unmap; do you mean the vIOMMU unmaps everything via
>vfio_iommu_map_notify() and then switches the AS, which triggers
>region_del() and this try_unmap branch?
Yes, vfio_iommu_map_notify() is the first try_unmap; the unmap in
region_del is the second try_unmap.
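
Roughly, my understanding of the flow:

    1st: vIOMMU invalidation -> vfio_iommu_map_notify() ->
         vfio_container_dma_unmap() with a real iotlb (dirty bits synced)
    2nd: vtd_switch_address_space() -> vfio_listener_region_del() ->
         try_unmap -> vfio_container_dma_unmap() on an already-empty
         range (effectively a no-op)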
Thanks
Zhenzhong