[PATCH] intel_iommu: Optimize out some unnecessary UNMAP calls

Zhenzhong Duan posted 1 patch 11 months, 2 weeks ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20230523080702.179363-1-zhenzhong.duan@intel.com
Maintainers: "Michael S. Tsirkin" <mst@redhat.com>, Peter Xu <peterx@redhat.com>, Jason Wang <jasowang@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, Richard Henderson <richard.henderson@linaro.org>, Eduardo Habkost <eduardo@habkost.net>, Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
hw/i386/intel_iommu.c | 31 ++++++++++++++-----------------
1 file changed, 14 insertions(+), 17 deletions(-)
[PATCH] intel_iommu: Optimize out some unnecessary UNMAP calls
Posted by Zhenzhong Duan 11 months, 2 weeks ago
Commit 63b88968f1 ("intel-iommu: rework the page walk logic") adds logic
to record mapped IOVA ranges so that we only send MAP or UNMAP when
necessary. But there are still a few corner cases of unnecessary UNMAP.

One is the address space switch. When switching to the iommu address
space, all the original mappings have already been dropped by the VFIO
memory listener, so we don't need to unmap again in replay. The other is
invalidation: we only need to unmap when there are recorded mapped IOVA
ranges. This presumes that most OSes allocate IOVA ranges contiguously,
e.g. on x86, Linux sets up mappings from 0xffffffff downwards.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
Tested on x86 with a net card passed through or hotplugged to a KVM guest;
ping/ssh pass.

 hw/i386/intel_iommu.c | 31 ++++++++++++++-----------------
 1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 94d52f4205d2..6afd6428aaaa 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3743,6 +3743,7 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
     hwaddr start = n->start;
     hwaddr end = n->end;
     IntelIOMMUState *s = as->iommu_state;
+    IOMMUTLBEvent event;
     DMAMap map;
 
     /*
@@ -3762,22 +3763,25 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
     assert(start <= end);
     size = remain = end - start + 1;
 
+    event.type = IOMMU_NOTIFIER_UNMAP;
+    event.entry.target_as = &address_space_memory;
+    event.entry.perm = IOMMU_NONE;
+    /* This field is meaningless for unmap */
+    event.entry.translated_addr = 0;
+
     while (remain >= VTD_PAGE_SIZE) {
-        IOMMUTLBEvent event;
         uint64_t mask = dma_aligned_pow2_mask(start, end, s->aw_bits);
         uint64_t size = mask + 1;
 
         assert(size);
 
-        event.type = IOMMU_NOTIFIER_UNMAP;
-        event.entry.iova = start;
-        event.entry.addr_mask = mask;
-        event.entry.target_as = &address_space_memory;
-        event.entry.perm = IOMMU_NONE;
-        /* This field is meaningless for unmap */
-        event.entry.translated_addr = 0;
-
-        memory_region_notify_iommu_one(n, &event);
+        map.iova = start;
+        map.size = size;
+        if (iova_tree_find(as->iova_tree, &map)) {
+            event.entry.iova = start;
+            event.entry.addr_mask = mask;
+            memory_region_notify_iommu_one(n, &event);
+        }
 
         start += size;
         remain -= size;
@@ -3826,13 +3830,6 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
     uint8_t bus_n = pci_bus_num(vtd_as->bus);
     VTDContextEntry ce;
 
-    /*
-     * The replay can be triggered by either a invalidation or a newly
-     * created entry. No matter what, we release existing mappings
-     * (it means flushing caches for UNMAP-only registers).
-     */
-    vtd_address_space_unmap(vtd_as, n);
-
     if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
         trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" :
                                   "legacy mode",
-- 
2.34.1
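
The filtering idea in the first hunk can be illustrated outside QEMU. The
following is only a minimal sketch under stated assumptions: `Range`,
`overlaps_any()` and `count_unmaps()` are hypothetical names, the array of
recorded ranges stands in for the IOVA tree, and ranges are written as
inclusive [first, last] pairs to sidestep any question about how the real
DMAMap encodes its size. It counts how many UNMAP notifications would be
sent when a chunk is only notified if it overlaps a recorded mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive address range [first, last]; a hypothetical stand-in type. */
typedef struct {
    uint64_t first;
    uint64_t last;
} Range;

/* Return 1 if [first, last] overlaps any recorded mapping, else 0. */
static int overlaps_any(const Range *mapped, size_t nmapped,
                        uint64_t first, uint64_t last)
{
    for (size_t i = 0; i < nmapped; i++) {
        if (mapped[i].first <= last && first <= mapped[i].last) {
            return 1;
        }
    }
    return 0;
}

/*
 * Walk the chunks an unmap would be split into and count how many of
 * them would actually trigger a notification: only chunks that overlap
 * a recorded mapping are notified, the rest are skipped.
 */
static unsigned count_unmaps(const Range *chunks, size_t nchunks,
                             const Range *mapped, size_t nmapped)
{
    unsigned sent = 0;

    for (size_t i = 0; i < nchunks; i++) {
        assert(chunks[i].first <= chunks[i].last);
        if (overlaps_any(mapped, nmapped, chunks[i].first, chunks[i].last)) {
            sent++;
        }
    }
    return sent;
}
```

With a single recorded mapping just below 4 GiB (as in the Linux example
from the commit message), only the one chunk that covers it is notified;
with nothing recorded, no UNMAP is sent at all.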
Re: [PATCH] intel_iommu: Optimize out some unnecessary UNMAP calls
Posted by Peter Xu 11 months, 2 weeks ago
Hi, Zhenzhong,

On Tue, May 23, 2023 at 04:07:02PM +0800, Zhenzhong Duan wrote:
> Commit 63b88968f1 ("intel-iommu: rework the page walk logic") adds logic
> to record mapped IOVA ranges so we only need to send MAP or UNMAP when
> necessary. But there are still a few corner cases of unnecessary UNMAP.
> 
> One is address space switch. During switching to iommu address space,
> all the original mappings have been dropped by VFIO memory listener,
> we don't need to unmap again in replay. The other is invalidation,
> we only need to unmap when there are recorded mapped IOVA ranges,
> presuming most of OSes allocating IOVA range continuously,
> ex. on x86, linux sets up mapping from 0xffffffff downwards.
> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> Tested on x86 with a net card passed or hotpluged to kvm guest,
> ping/ssh pass.

Since this is a performance related patch, do you have any number to show
the effect?

> 
>  hw/i386/intel_iommu.c | 31 ++++++++++++++-----------------
>  1 file changed, 14 insertions(+), 17 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 94d52f4205d2..6afd6428aaaa 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -3743,6 +3743,7 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>      hwaddr start = n->start;
>      hwaddr end = n->end;
>      IntelIOMMUState *s = as->iommu_state;
> +    IOMMUTLBEvent event;
>      DMAMap map;
>  
>      /*
> @@ -3762,22 +3763,25 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>      assert(start <= end);
>      size = remain = end - start + 1;
>  
> +    event.type = IOMMU_NOTIFIER_UNMAP;
> +    event.entry.target_as = &address_space_memory;
> +    event.entry.perm = IOMMU_NONE;
> +    /* This field is meaningless for unmap */
> +    event.entry.translated_addr = 0;
> +
>      while (remain >= VTD_PAGE_SIZE) {
> -        IOMMUTLBEvent event;
>          uint64_t mask = dma_aligned_pow2_mask(start, end, s->aw_bits);
>          uint64_t size = mask + 1;
>  
>          assert(size);
>  
> -        event.type = IOMMU_NOTIFIER_UNMAP;
> -        event.entry.iova = start;
> -        event.entry.addr_mask = mask;
> -        event.entry.target_as = &address_space_memory;
> -        event.entry.perm = IOMMU_NONE;
> -        /* This field is meaningless for unmap */
> -        event.entry.translated_addr = 0;
> -
> -        memory_region_notify_iommu_one(n, &event);
> +        map.iova = start;
> +        map.size = size;
> +        if (iova_tree_find(as->iova_tree, &map)) {
> +            event.entry.iova = start;
> +            event.entry.addr_mask = mask;
> +            memory_region_notify_iommu_one(n, &event);
> +        }

This one looks fine to me, but I'm not sure how much benefit we'll get here
either as this path should be rare afaiu.

>  
>          start += size;
>          remain -= size;
> @@ -3826,13 +3830,6 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>      uint8_t bus_n = pci_bus_num(vtd_as->bus);
>      VTDContextEntry ce;
>  
> -    /*
> -     * The replay can be triggered by either a invalidation or a newly
> -     * created entry. No matter what, we release existing mappings
> -     * (it means flushing caches for UNMAP-only registers).
> -     */
> -    vtd_address_space_unmap(vtd_as, n);

IIUC this is needed to satisfy current replay() semantics:

    /**
     * @replay:
     *
     * Called to handle memory_region_iommu_replay().
     *
     * The default implementation of memory_region_iommu_replay() is to
     * call the IOMMU translate method for every page in the address space
     * with flag == IOMMU_NONE and then call the notifier if translate
     * returns a valid mapping. If this method is implemented then it
     * overrides the default behaviour, and must provide the full semantics
     * of memory_region_iommu_replay(), by calling @notifier for every
     * translation present in the IOMMU.

The problem is vtd_page_walk() currently by default only notifies on page
changes, so we'll notify all MAP only if we unmap all of them first.

I assumed it was not a major issue with/without it before because
previously AFAIU the major path to trigger this is when someone hot plugs a
vfio-pci into an existing guest IOMMU domain, so the unmap_all() is indeed
a no-op.  However from semantics level it seems unmap_all() is still needed.

The other thing is when I am looking at the new code I found that we
actually extended the replay() to be used also in dirty tracking of vfio,
in vfio_sync_dirty_bitmap().  For that, maybe it's already broken by the
unmap_all(), because afaiu log_sync() can be called in migration thread
anytime during DMA so I think it means the device is prone to DMA with the
IOMMU pgtable quickly erased and rebuilt here, which means the DMA could
fail unexpectedly.  Copy Alex, Kirti and Neo.

Perhaps to fix it we'll need to teach the vtd pgtable walker to notify all
existing MAP events without touching the IOVA tree at all.
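
To make the two behaviours concrete, here is a small self-contained sketch
(not QEMU code). Everything in it is an assumption made for illustration:
`Mapping`, `Recorded`, `walk_one()`, `walk()` and the `replay_only` flag are
hypothetical, and a fixed array stands in for the IOVA tree. In the default
mode an entry that is already recorded with the same translation is skipped,
so a second walk over an unchanged page table notifies nothing. In the
`replay_only` mode every present entry is notified and the recorded state is
left untouched, which is the direction suggested above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One translation: IOVA, length and translated address (hypothetical). */
typedef struct {
    uint64_t iova;
    uint64_t size;
    uint64_t translated;
} Mapping;

/* Fixed-size stand-in for the per-address-space record of mappings. */
typedef struct {
    Mapping m[16];
    size_t n;
} Recorded;

/* Look up a recorded mapping by its starting IOVA. */
static Mapping *find(Recorded *r, uint64_t iova)
{
    for (size_t i = 0; i < r->n; i++) {
        if (r->m[i].iova == iova) {
            return &r->m[i];
        }
    }
    return NULL;
}

/* Handle one present entry; return true if the notifier would be called. */
static bool walk_one(Recorded *r, const Mapping *pte, bool replay_only)
{
    Mapping *old;

    if (replay_only) {
        /* Report the entry as-is and leave the recorded state alone. */
        return true;
    }

    old = find(r, pte->iova);
    if (old && old->size == pte->size && old->translated == pte->translated) {
        /* Exactly the same translation is already recorded: skip. */
        return false;
    }
    if (old) {
        *old = *pte;               /* translation changed: update record */
    } else {
        assert(r->n < 16);
        r->m[r->n++] = *pte;       /* new translation: record it */
    }
    return true;
}

/* Walk all present entries and count notifier invocations. */
static unsigned walk(Recorded *r, const Mapping *pt, size_t n, bool replay_only)
{
    unsigned notified = 0;

    for (size_t i = 0; i < n; i++) {
        if (walk_one(r, &pt[i], replay_only)) {
            notified++;
        }
    }
    return notified;
}
```

In this model, walking the same two-entry page table twice in the default
mode notifies two entries and then none, while a `replay_only` walk notifies
both again without changing what is recorded.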

> -
>      if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
>          trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" :
>                                    "legacy mode",
> -- 
> 2.34.1
> 

-- 
Peter Xu
RE: [PATCH] intel_iommu: Optimize out some unnecessary UNMAP calls
Posted by Duan, Zhenzhong 11 months, 2 weeks ago
Hi Peter,

See inline.
>-----Original Message-----
>From: Peter Xu <peterx@redhat.com>
>Sent: Thursday, May 25, 2023 12:59 AM
>Subject: Re: [PATCH] intel_iommu: Optimize out some unnecessary UNMAP
>calls
>
>Hi, Zhenzhong,
>
>On Tue, May 23, 2023 at 04:07:02PM +0800, Zhenzhong Duan wrote:
>> Commit 63b88968f1 ("intel-iommu: rework the page walk logic") adds
>> logic to record mapped IOVA ranges so we only need to send MAP or
>> UNMAP when necessary. But there are still a few corner cases of
>unnecessary UNMAP.
>>
>> One is address space switch. During switching to iommu address space,
>> all the original mappings have been dropped by VFIO memory listener,
>> we don't need to unmap again in replay. The other is invalidation, we
>> only need to unmap when there are recorded mapped IOVA ranges,
>> presuming most of OSes allocating IOVA range continuously, ex. on x86,
>> linux sets up mapping from 0xffffffff downwards.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> Tested on x86 with a net card passed or hotpluged to kvm guest,
>> ping/ssh pass.
>
>Since this is a performance related patch, do you have any number to show
>the effect?

I straced the UNMAP ioctl: each call takes 0.000014s (about 14us), and we issue
28 ioctl() calls because the two notifier ranges on x86 are split into power-of-2 pieces.

ioctl(48, VFIO_DEVICE_QUERY_GFX_PLANE or VFIO_IOMMU_UNMAP_DMA, 0x7ffffd5c42f0) = 0 <0.000014>
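
As a rough cross-check, 28 calls at about 14us each come to roughly 392us
in total. The count of 28 can also be reproduced with the sketch below,
under assumptions that are not stated in this thread: that the two notifier
ranges are [0x0, 0xfedfffff] and [0xfef00000, 2^48 - 1] (i.e. a 48-bit
address width with a hole for the interrupt range), and that each range is
split greedily into naturally aligned power-of-2 chunks. `pow2_floor()`,
`next_chunk()` and `count_chunks()` are hypothetical helpers written for
this note; they are not QEMU's dma_aligned_pow2_mask().

```c
#include <assert.h>
#include <stdint.h>

/* Largest power of two that is <= x (x must be non-zero). */
static uint64_t pow2_floor(uint64_t x)
{
    uint64_t p = 1;

    while (p <= x / 2) {
        p <<= 1;
    }
    return p;
}

/*
 * Size of the next chunk starting at 'start': the largest power of two
 * that fits in [start, end] and that 'start' is naturally aligned to.
 */
static uint64_t next_chunk(uint64_t start, uint64_t end)
{
    uint64_t fit = pow2_floor(end - start + 1);
    uint64_t align = start & (~start + 1);   /* lowest set bit, 0 if start == 0 */

    return (align != 0 && align < fit) ? align : fit;
}

/* Number of aligned power-of-2 chunks needed to cover [start, end]. */
static unsigned count_chunks(uint64_t start, uint64_t end)
{
    unsigned n = 0;

    /* The full 64-bit range is excluded so that the length fits in 64 bits. */
    assert(start <= end && end - start != UINT64_MAX);
    for (;;) {
        uint64_t size = next_chunk(start, end);

        n++;
        if (end - start == size - 1) {
            break;                 /* this chunk reaches 'end' */
        }
        start += size;
    }
    return n;
}
```

Under these assumptions the first range splits into 10 chunks (starting
with 0x80000000 and 0x40000000 bytes) and the second into 18, which gives
28 in total.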

>
>>
>>  hw/i386/intel_iommu.c | 31 ++++++++++++++-----------------
>>  1 file changed, 14 insertions(+), 17 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index
>> 94d52f4205d2..6afd6428aaaa 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -3743,6 +3743,7 @@ static void
>vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>>      hwaddr start = n->start;
>>      hwaddr end = n->end;
>>      IntelIOMMUState *s = as->iommu_state;
>> +    IOMMUTLBEvent event;
>>      DMAMap map;
>>
>>      /*
>> @@ -3762,22 +3763,25 @@ static void
>vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>>      assert(start <= end);
>>      size = remain = end - start + 1;
>>
>> +    event.type = IOMMU_NOTIFIER_UNMAP;
>> +    event.entry.target_as = &address_space_memory;
>> +    event.entry.perm = IOMMU_NONE;
>> +    /* This field is meaningless for unmap */
>> +    event.entry.translated_addr = 0;
>> +
>>      while (remain >= VTD_PAGE_SIZE) {
>> -        IOMMUTLBEvent event;
>>          uint64_t mask = dma_aligned_pow2_mask(start, end, s->aw_bits);
>>          uint64_t size = mask + 1;
>>
>>          assert(size);
>>
>> -        event.type = IOMMU_NOTIFIER_UNMAP;
>> -        event.entry.iova = start;
>> -        event.entry.addr_mask = mask;
>> -        event.entry.target_as = &address_space_memory;
>> -        event.entry.perm = IOMMU_NONE;
>> -        /* This field is meaningless for unmap */
>> -        event.entry.translated_addr = 0;
>> -
>> -        memory_region_notify_iommu_one(n, &event);
>> +        map.iova = start;
>> +        map.size = size;
>> +        if (iova_tree_find(as->iova_tree, &map)) {
>> +            event.entry.iova = start;
>> +            event.entry.addr_mask = mask;
>> +            memory_region_notify_iommu_one(n, &event);
>> +        }
>
>This one looks fine to me, but I'm not sure how much benefit we'll get here
>either as this path should be rare afaiu.

Yes, I only see such UNMAP calls at cold bootup/shutdown, hot plug and unplug.

In fact, the other purpose of this patch is to eliminate a noisy error log when
we work with IOMMUFD. It looks like the duplicate UNMAP call fails with IOMMUFD
while it always succeeds with the legacy container. This behavior difference leads
to the error log below for IOMMUFD:

IOMMU_IOAS_UNMAP failed: No such file or directory
vfio_container_dma_unmap(0x562012d6b6d0, 0x0, 0x80000000) = -2 (No such file or directory)
IOMMU_IOAS_UNMAP failed: No such file or directory
vfio_container_dma_unmap(0x562012d6b6d0, 0x80000000, 0x40000000) = -2 (No such file or directory)

>
>>
>>          start += size;
>>          remain -= size;
>> @@ -3826,13 +3830,6 @@ static void
>vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>>      uint8_t bus_n = pci_bus_num(vtd_as->bus);
>>      VTDContextEntry ce;
>>
>> -    /*
>> -     * The replay can be triggered by either a invalidation or a newly
>> -     * created entry. No matter what, we release existing mappings
>> -     * (it means flushing caches for UNMAP-only registers).
>> -     */
>> -    vtd_address_space_unmap(vtd_as, n);
>
>IIUC this is needed to satisfy current replay() semantics:
>
>    /**
>     * @replay:
>     *
>     * Called to handle memory_region_iommu_replay().
>     *
>     * The default implementation of memory_region_iommu_replay() is to
>     * call the IOMMU translate method for every page in the address space
>     * with flag == IOMMU_NONE and then call the notifier if translate
>     * returns a valid mapping. If this method is implemented then it
>     * overrides the default behaviour, and must provide the full semantics
>     * of memory_region_iommu_replay(), by calling @notifier for every
>     * translation present in the IOMMU.
The above semantics require calling @notifier for every translation present in the
IOMMU, but they don't say whether @notifier should be called for non-present
translations. I checked other custom replay() callbacks, e.g. virtio_iommu_replay()
and spapr_tce_replay(); it looks like only intel_iommu is special in calling
unmap_all() before rebuilding the mappings.

>
>The problem is vtd_page_walk() currently by default only notifies on page
>changes, so we'll notify all MAP only if we unmap all of them first.
Hmm, I didn't get this point. I checked vtd_page_walk_one(): it rebuilds the
mapping unless the DMAMap is exactly the same, in which case it skips it. See below:

    /* Update local IOVA mapped ranges */
    if (event->type == IOMMU_NOTIFIER_MAP) {
        if (mapped) {
            /* If it's exactly the same translation, skip */
            if (!memcmp(mapped, &target, sizeof(target))) {
                trace_vtd_page_walk_one_skip_map(entry->iova, entry->addr_mask,
                                                 entry->translated_addr);
                return 0;
            } else {
                /*
                 * Translation changed.  Normally this should not
                 * happen, but it can happen when with buggy guest

>
>I assumed it was not a major issue with/without it before because previously
>AFAIU the major path to trigger this is when someone hot plug a vfio-pci into
>an existing guest IOMMU domain, so the unmap_all() is indeed no-op.
Yes, same for cold plug.

>However from semantics level it seems unmap_all() is still needed.
>
>The other thing is when I am looking at the new code I found that we actually
>extended the replay() to be used also in dirty tracking of vfio, in
>vfio_sync_dirty_bitmap().  For that maybe it's already broken if
>unmap_all() because afaiu log_sync() can be called in migration thread
>anytime during DMA so I think it means the device is prone to DMA with the
>IOMMU pgtable quickly erased and rebuilt here, which means the DMA could
>fail unexpectedly.  Copy Alex, Kirti and Neo.
Good catch, indeed.

Thanks
Zhenzhong
>
>Perhaps to fix it we'll need to teach the vtd pgtable walker to notify all existing
>MAP events without touching the IOVA tree at all.
>
>> -
>>      if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
>>          trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" :
>>                                    "legacy mode",
>> --
>> 2.34.1
>>
>
>--
>Peter Xu
Re: [PATCH] intel_iommu: Optimize out some unnecessary UNMAP calls
Posted by Peter Xu 11 months, 2 weeks ago
On Thu, May 25, 2023 at 11:29:34AM +0000, Duan, Zhenzhong wrote:
> Hi Peter,
> 
> See inline.
> >-----Original Message-----
> >From: Peter Xu <peterx@redhat.com>
> >Sent: Thursday, May 25, 2023 12:59 AM
> >Subject: Re: [PATCH] intel_iommu: Optimize out some unnecessary UNMAP
> >calls
> >
> >Hi, Zhenzhong,
> >
> >On Tue, May 23, 2023 at 04:07:02PM +0800, Zhenzhong Duan wrote:
> >> Commit 63b88968f1 ("intel-iommu: rework the page walk logic") adds
> >> logic to record mapped IOVA ranges so we only need to send MAP or
> >> UNMAP when necessary. But there are still a few corner cases of
> >unnecessary UNMAP.
> >>
> >> One is address space switch. During switching to iommu address space,
> >> all the original mappings have been dropped by VFIO memory listener,
> >> we don't need to unmap again in replay. The other is invalidation, we
> >> only need to unmap when there are recorded mapped IOVA ranges,
> >> presuming most of OSes allocating IOVA range continuously, ex. on x86,
> >> linux sets up mapping from 0xffffffff downwards.
> >>
> >> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> >> ---
> >> Tested on x86 with a net card passed or hotpluged to kvm guest,
> >> ping/ssh pass.
> >
> >Since this is a performance related patch, do you have any number to show
> >the effect?
> 
> I straced the time of UNMAP ioctl, its time is 0.000014us and we have 28 ioctl() due to
> the two notifiers in x86 are split into power of 2 pieces.
> 
> ioctl(48, VFIO_DEVICE_QUERY_GFX_PLANE or VFIO_IOMMU_UNMAP_DMA, 0x7ffffd5c42f0) = 0 <0.000014>

Could you add some information like this into the commit message when
repost?  E.g. UNMAP was xxx sec before, and this patch reduces it to yyy.

> 
> >
> >>
> >>  hw/i386/intel_iommu.c | 31 ++++++++++++++-----------------
> >>  1 file changed, 14 insertions(+), 17 deletions(-)
> >>
> >> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index
> >> 94d52f4205d2..6afd6428aaaa 100644
> >> --- a/hw/i386/intel_iommu.c
> >> +++ b/hw/i386/intel_iommu.c
> >> @@ -3743,6 +3743,7 @@ static void
> >vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
> >>      hwaddr start = n->start;
> >>      hwaddr end = n->end;
> >>      IntelIOMMUState *s = as->iommu_state;
> >> +    IOMMUTLBEvent event;
> >>      DMAMap map;
> >>
> >>      /*
> >> @@ -3762,22 +3763,25 @@ static void
> >vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
> >>      assert(start <= end);
> >>      size = remain = end - start + 1;
> >>
> >> +    event.type = IOMMU_NOTIFIER_UNMAP;
> >> +    event.entry.target_as = &address_space_memory;
> >> +    event.entry.perm = IOMMU_NONE;
> >> +    /* This field is meaningless for unmap */
> >> +    event.entry.translated_addr = 0;
> >> +
> >>      while (remain >= VTD_PAGE_SIZE) {
> >> -        IOMMUTLBEvent event;
> >>          uint64_t mask = dma_aligned_pow2_mask(start, end, s->aw_bits);
> >>          uint64_t size = mask + 1;
> >>
> >>          assert(size);
> >>
> >> -        event.type = IOMMU_NOTIFIER_UNMAP;
> >> -        event.entry.iova = start;
> >> -        event.entry.addr_mask = mask;
> >> -        event.entry.target_as = &address_space_memory;
> >> -        event.entry.perm = IOMMU_NONE;
> >> -        /* This field is meaningless for unmap */
> >> -        event.entry.translated_addr = 0;
> >> -
> >> -        memory_region_notify_iommu_one(n, &event);
> >> +        map.iova = start;
> >> +        map.size = size;
> >> +        if (iova_tree_find(as->iova_tree, &map)) {
> >> +            event.entry.iova = start;
> >> +            event.entry.addr_mask = mask;
> >> +            memory_region_notify_iommu_one(n, &event);
> >> +        }
> >
> >This one looks fine to me, but I'm not sure how much benefit we'll get here
> >either as this path should be rare afaiu.
> 
> Yes, I only see such UNMAP call at cold bootup/shutdown, hot plug and unplug.
> 
> In fact, the other purpose of this patch is to eliminate noisy error log when
> we work with IOMMUFD. It looks the duplicate UNMAP call will fail with IOMMUFD
> while always succeed with legacy container. This behavior difference lead to below
> error log for IOMMUFD:
> 
> IOMMU_IOAS_UNMAP failed: No such file or directory
> vfio_container_dma_unmap(0x562012d6b6d0, 0x0, 0x80000000) = -2 (No such file or directory)
> IOMMU_IOAS_UNMAP failed: No such file or directory
> vfio_container_dma_unmap(0x562012d6b6d0, 0x80000000, 0x40000000) = -2 (No such file or directory)

I see.  Please also mention this in the commit log, that'll help reviewers
understand the goal of the patch, thanks!

> 
> >
> >>
> >>          start += size;
> >>          remain -= size;
> >> @@ -3826,13 +3830,6 @@ static void
> >vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> >>      uint8_t bus_n = pci_bus_num(vtd_as->bus);
> >>      VTDContextEntry ce;
> >>
> >> -    /*
> >> -     * The replay can be triggered by either a invalidation or a newly
> >> -     * created entry. No matter what, we release existing mappings
> >> -     * (it means flushing caches for UNMAP-only registers).
> >> -     */
> >> -    vtd_address_space_unmap(vtd_as, n);
> >
> >IIUC this is needed to satisfy current replay() semantics:
> >
> >    /**
> >     * @replay:
> >     *
> >     * Called to handle memory_region_iommu_replay().
> >     *
> >     * The default implementation of memory_region_iommu_replay() is to
> >     * call the IOMMU translate method for every page in the address space
> >     * with flag == IOMMU_NONE and then call the notifier if translate
> >     * returns a valid mapping. If this method is implemented then it
> >     * overrides the default behaviour, and must provide the full semantics
> >     * of memory_region_iommu_replay(), by calling @notifier for every
> >     * translation present in the IOMMU.
> Above semantics claims calling @notifier for every translation present in the IOMMU
> But it doesn't claim if calling @notifier for non-present translation.
> I checked other custom replay() callback, ex. virtio_iommu_replay(), spapr_tce_replay()
> it looks only intel_iommu is special by calling unmap_all() before rebuild mapping.

Yes, and I'll reply below for this..

> 
> >
> >The problem is vtd_page_walk() currently by default only notifies on page
> >changes, so we'll notify all MAP only if we unmap all of them first.
> Hmm, I didn't get this point. Checked vtd_page_walk_one(), it will rebuild the
> mapping except the DMAMap is exactly same which it will skip. See below:
> 
>     /* Update local IOVA mapped ranges */
>     if (event->type == IOMMU_NOTIFIER_MAP) {
>         if (mapped) {
>             /* If it's exactly the same translation, skip */
>             if (!memcmp(mapped, &target, sizeof(target))) {
>                 trace_vtd_page_walk_one_skip_map(entry->iova, entry->addr_mask,
>                                                  entry->translated_addr);
>                 return 0;
>             } else {
>                 /*
>                  * Translation changed.  Normally this should not
>                  * happen, but it can happen when with buggy guest

So I haven't touched the vIOMMU code for a few years, but IIRC if we
replay() on an address space that has mappings already, then without the
unmap_all() at the start we'll just notify nothing, because "mapped" will
be true for all the existing mappings, and memcmp() should return 0 too if
nothing changed?

I think (and agree) it could be a "bug" for vtd only, mostly not affecting
anything at least before vfio migration.

Do you agree, and perhaps want to fix it altogether?  If so I suppose it'll
also fix the issue below on vfio dirty sync.

Thanks,

> 
> >
> >I assumed it was not a major issue with/without it before because previously
> >AFAIU the major path to trigger this is when someone hot plug a vfio-pci into
> >an existing guest IOMMU domain, so the unmap_all() is indeed no-op.
> Yes, same for cold plug.
> 
> >However from semantics level it seems unmap_all() is still needed.
> >
> >The other thing is when I am looking at the new code I found that we actually
> >extended the replay() to be used also in dirty tracking of vfio, in
> >vfio_sync_dirty_bitmap().  For that maybe it's already broken if
> >unmap_all() because afaiu log_sync() can be called in migration thread
> >anytime during DMA so I think it means the device is prone to DMA with the
> >IOMMU pgtable quickly erased and rebuilt here, which means the DMA could
> >fail unexpectedly.  Copy Alex, Kirti and Neo.
> Good catch, indeed.
> 
> Thanks
> Zhenzhong
> >
> >Perhaps to fix it we'll need to teach the vtd pgtable walker to notify all existing
> >MAP events without touching the IOVA tree at all.
> >
> >> -
> >>      if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> >>          trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" :
> >>                                    "legacy mode",
> >> --
> >> 2.34.1
> >>
> >
> >--
> >Peter Xu

-- 
Peter Xu
RE: [PATCH] intel_iommu: Optimize out some unnecessary UNMAP calls
Posted by Duan, Zhenzhong 11 months, 1 week ago
>-----Original Message-----
>From: Peter Xu <peterx@redhat.com>
>Sent: Thursday, May 25, 2023 9:54 PM
>Subject: Re: [PATCH] intel_iommu: Optimize out some unnecessary UNMAP
>calls
>
>On Thu, May 25, 2023 at 11:29:34AM +0000, Duan, Zhenzhong wrote:
>> Hi Peter,
>>
>> See inline.
>> >-----Original Message-----
>> >From: Peter Xu <peterx@redhat.com>
>> >Sent: Thursday, May 25, 2023 12:59 AM
>> >Subject: Re: [PATCH] intel_iommu: Optimize out some unnecessary UNMAP
>> >calls
>> >
>> >Hi, Zhenzhong,
>> >
>> >On Tue, May 23, 2023 at 04:07:02PM +0800, Zhenzhong Duan wrote:
>> >> Commit 63b88968f1 ("intel-iommu: rework the page walk logic") adds
>> >> logic to record mapped IOVA ranges so we only need to send MAP or
>> >> UNMAP when necessary. But there are still a few corner cases of
>> >unnecessary UNMAP.
>> >>
>> >> One is address space switch. During switching to iommu address
>> >> space, all the original mappings have been dropped by VFIO memory
>> >> listener, we don't need to unmap again in replay. The other is
>> >> invalidation, we only need to unmap when there are recorded mapped
>> >> IOVA ranges, presuming most of OSes allocating IOVA range
>> >> continuously, ex. on x86, linux sets up mapping from 0xffffffff
>downwards.
>> >>
>> >> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> >> ---
>> >> Tested on x86 with a net card passed or hotpluged to kvm guest,
>> >> ping/ssh pass.
>> >
>> >Since this is a performance related patch, do you have any number to
>> >show the effect?
>>
>> I straced the time of UNMAP ioctl, its time is 0.000014us and we have
>> 28 ioctl() due to the two notifiers in x86 are split into power of 2 pieces.
>>
>> ioctl(48, VFIO_DEVICE_QUERY_GFX_PLANE or VFIO_IOMMU_UNMAP_DMA,
>> 0x7ffffd5c42f0) = 0 <0.000014>
>
>Could you add some information like this into the commit message when
>repost?  E.g. UNMAP was xxx sec before, and this patch reduces it to yyy.
Sure, will do.

>
>>
>> >
>> >>
>> >>  hw/i386/intel_iommu.c | 31 ++++++++++++++-----------------
>> >>  1 file changed, 14 insertions(+), 17 deletions(-)
>> >>
>> >> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index
>> >> 94d52f4205d2..6afd6428aaaa 100644
>> >> --- a/hw/i386/intel_iommu.c
>> >> +++ b/hw/i386/intel_iommu.c
>> >> @@ -3743,6 +3743,7 @@ static void
>> >vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>> >>      hwaddr start = n->start;
>> >>      hwaddr end = n->end;
>> >>      IntelIOMMUState *s = as->iommu_state;
>> >> +    IOMMUTLBEvent event;
>> >>      DMAMap map;
>> >>
>> >>      /*
>> >> @@ -3762,22 +3763,25 @@ static void
>> >vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>> >>      assert(start <= end);
>> >>      size = remain = end - start + 1;
>> >>
>> >> +    event.type = IOMMU_NOTIFIER_UNMAP;
>> >> +    event.entry.target_as = &address_space_memory;
>> >> +    event.entry.perm = IOMMU_NONE;
>> >> +    /* This field is meaningless for unmap */
>> >> +    event.entry.translated_addr = 0;
>> >> +
>> >>      while (remain >= VTD_PAGE_SIZE) {
>> >> -        IOMMUTLBEvent event;
>> >>          uint64_t mask = dma_aligned_pow2_mask(start, end, s->aw_bits);
>> >>          uint64_t size = mask + 1;
>> >>
>> >>          assert(size);
>> >>
>> >> -        event.type = IOMMU_NOTIFIER_UNMAP;
>> >> -        event.entry.iova = start;
>> >> -        event.entry.addr_mask = mask;
>> >> -        event.entry.target_as = &address_space_memory;
>> >> -        event.entry.perm = IOMMU_NONE;
>> >> -        /* This field is meaningless for unmap */
>> >> -        event.entry.translated_addr = 0;
>> >> -
>> >> -        memory_region_notify_iommu_one(n, &event);
>> >> +        map.iova = start;
>> >> +        map.size = size;
>> >> +        if (iova_tree_find(as->iova_tree, &map)) {
>> >> +            event.entry.iova = start;
>> >> +            event.entry.addr_mask = mask;
>> >> +            memory_region_notify_iommu_one(n, &event);
>> >> +        }
>> >
>> >This one looks fine to me, but I'm not sure how much benefit we'll
>> >get here either as this path should be rare afaiu.
>>
>> Yes, I only see such UNMAP call at cold bootup/shutdown, hot plug and
>unplug.
>>
>> In fact, the other purpose of this patch is to eliminate noisy error
>> log when we work with IOMMUFD. It looks the duplicate UNMAP call will
>> fail with IOMMUFD while always succeed with legacy container. This
>> behavior difference lead to below error log for IOMMUFD:
>>
>> IOMMU_IOAS_UNMAP failed: No such file or directory
>> vfio_container_dma_unmap(0x562012d6b6d0, 0x0, 0x80000000) = -2 (No such file or directory)
>> IOMMU_IOAS_UNMAP failed: No such file or directory
>> vfio_container_dma_unmap(0x562012d6b6d0, 0x80000000, 0x40000000) = -2 (No such file or directory)
>
>I see.  Please also mention this in the commit log, that'll help reviewers
>understand the goal of the patch, thanks!
Will do.

>
>>
>> >
>> >>
>> >>          start += size;
>> >>          remain -= size;
>> >> @@ -3826,13 +3830,6 @@ static void
>> >vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier
>*n)
>> >>      uint8_t bus_n = pci_bus_num(vtd_as->bus);
>> >>      VTDContextEntry ce;
>> >>
>> >> -    /*
>> >> -     * The replay can be triggered by either a invalidation or a newly
>> >> -     * created entry. No matter what, we release existing mappings
>> >> -     * (it means flushing caches for UNMAP-only registers).
>> >> -     */
>> >> -    vtd_address_space_unmap(vtd_as, n);
>> >
>> >IIUC this is needed to satisfy current replay() semantics:
>> >
>> >    /**
>> >     * @replay:
>> >     *
>> >     * Called to handle memory_region_iommu_replay().
>> >     *
>> >     * The default implementation of memory_region_iommu_replay() is to
>> >     * call the IOMMU translate method for every page in the address space
>> >     * with flag == IOMMU_NONE and then call the notifier if translate
>> >     * returns a valid mapping. If this method is implemented then it
>> >     * overrides the default behaviour, and must provide the full semantics
>> >     * of memory_region_iommu_replay(), by calling @notifier for every
>> >     * translation present in the IOMMU.
>> Above semantics claims calling @notifier for every translation present
>> in the IOMMU But it doesn't claim if calling @notifier for non-present
>> translation.
>> I checked other custom replay() callback, ex. virtio_iommu_replay(),
>> spapr_tce_replay() it looks only intel_iommu is special by calling unmap_all()
>> before rebuild mapping.
>
>Yes, and I'll reply below for this..
>
>>
>> >
>> >The problem is vtd_page_walk() currently by default only notifies on
>> >page changes, so we'll notify all MAP only if we unmap all of them first.
>> Hmm, I didn't get this point. Checked vtd_page_walk_one(), it will
>> rebuild the mapping except the DMAMap is exactly same which it will skip.
>> See below:
>>
>>     /* Update local IOVA mapped ranges */
>>     if (event->type == IOMMU_NOTIFIER_MAP) {
>>         if (mapped) {
>>             /* If it's exactly the same translation, skip */
>>             if (!memcmp(mapped, &target, sizeof(target))) {
>>                 trace_vtd_page_walk_one_skip_map(entry->iova, entry->addr_mask,
>>                                                  entry->translated_addr);
>>                 return 0;
>>             } else {
>>                 /*
>>                  * Translation changed.  Normally this should not
>>                  * happen, but it can happen when with buggy guest
>
>So I haven't touched the vIOMMU code for a few years, but IIRC if we
>replay() on an address space that has mapping already, then if without the
>unmap_all() at the start we'll just notify nothing, because "mapped" will be
>true for all the existing mappings, and memcmp() should return 0 too if
>nothing changed?
Understood, you are right. VFIO migration dirty sync needs to be notified
even if the mapping is unchanged.
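
To make the point concrete, here is a minimal toy model of the skip logic discussed above. It is only a sketch: `toy_dmamap`, `toy_as`, `toy_walk_one()` and `toy_replay()` are hypothetical stand-ins invented for illustration, not the QEMU structures or functions, and the recorded-range lookup is a plain array scan rather than an IOVA tree. It assumes, as in the quoted snippet, that an identical recorded translation causes the MAP notification to be skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy recorded mapping: three uint64_t fields, so memcmp() sees no padding. */
struct toy_dmamap {
    uint64_t iova;
    uint64_t translated_addr;
    uint64_t size;
};

#define TOY_MAX_MAPS 8

/* Toy per-address-space record of mapped ranges (hypothetical). */
struct toy_as {
    struct toy_dmamap recorded[TOY_MAX_MAPS];
    size_t count;
};

static struct toy_dmamap *toy_find(struct toy_as *as, uint64_t iova)
{
    for (size_t i = 0; i < as->count; i++) {
        if (as->recorded[i].iova == iova) {
            return &as->recorded[i];
        }
    }
    return NULL;
}

/* Returns 1 if a MAP notification would be sent for this entry, 0 if skipped. */
static int toy_walk_one(struct toy_as *as, const struct toy_dmamap *target)
{
    struct toy_dmamap *mapped = toy_find(as, target->iova);

    if (mapped) {
        if (!memcmp(mapped, target, sizeof(*target))) {
            return 0;           /* identical translation: skipped */
        }
        *mapped = *target;      /* changed translation: record and notify */
        return 1;
    }
    assert(as->count < TOY_MAX_MAPS);
    as->recorded[as->count++] = *target;
    return 1;
}

/* Replays all entries; optionally forgets every recorded range first. */
static int toy_replay(struct toy_as *as, const struct toy_dmamap *entries,
                      size_t n, int unmap_all_first)
{
    int notified = 0;

    if (unmap_all_first) {
        as->count = 0;
    }
    for (size_t i = 0; i < n; i++) {
        notified += toy_walk_one(as, &entries[i]);
    }
    return notified;
}
```

In this model, a second `toy_replay()` over unchanged entries with `unmap_all_first == 0` notifies nothing, while passing `unmap_all_first == 1` notifies every entry again, which is the behaviour a listener needing a full replay depends on.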

>
>I think (and agree) it could be a "bug" for vtd only, mostly not affecting
>anything at least before vfio migration.
>
>Do you agree, and perhaps want to fix it altogether?  If so I suppose it'll also
>fix the issue below on vfio dirty sync.
Yes, I'll write an implementation.

Thanks
Zhenzhong
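
For readers following the unmap hunk quoted above, the gated loop can be sketched in isolation as follows. This is an illustration under stated assumptions, not the patch itself: `toy_aligned_pow2_mask()` is a simplified, hypothetical stand-in for `dma_aligned_pow2_mask()`, `toy_range_is_mapped()` stands in for the `iova_tree_find()` check with a linear overlap scan, and `TOY_PAGE_SIZE` and `TOY_AW_BITS` are made-up constants. The helper only counts the pieces walked and the UNMAPs that would be sent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096ULL   /* stand-in for VTD_PAGE_SIZE */
#define TOY_AW_BITS   48        /* assumed address width cap */

/* Hypothetical record of one mapped IOVA range [iova, iova + size - 1]. */
struct toy_map {
    uint64_t iova;
    uint64_t size;
};

/* True if [iova, iova + size - 1] overlaps any recorded range. */
static bool toy_range_is_mapped(const struct toy_map *maps, size_t n,
                                uint64_t iova, uint64_t size)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t last = maps[i].iova + maps[i].size - 1;
        if (iova <= last && maps[i].iova <= iova + size - 1) {
            return true;
        }
    }
    return false;
}

/* Mask of the largest naturally aligned power-of-two block that starts
 * at 'start' and still fits inside [start, end]. */
static uint64_t toy_aligned_pow2_mask(uint64_t start, uint64_t end)
{
    uint64_t size = TOY_PAGE_SIZE;

    assert(start % TOY_PAGE_SIZE == 0 && end - start >= TOY_PAGE_SIZE - 1);
    while (size < (1ULL << TOY_AW_BITS) &&
           (start & (2 * size - 1)) == 0 &&
           2 * size - 1 <= end - start) {
        size *= 2;
    }
    return size - 1;
}

/* Walks [start, end] in power-of-two pieces; counts the pieces and the
 * UNMAP notifications that would actually be sent. */
static void toy_unmap_range(const struct toy_map *maps, size_t n,
                            uint64_t start, uint64_t end,
                            unsigned *pieces, unsigned *unmaps)
{
    uint64_t remain = end - start + 1;

    *pieces = 0;
    *unmaps = 0;
    while (remain >= TOY_PAGE_SIZE) {
        uint64_t mask = toy_aligned_pow2_mask(start, end);
        uint64_t size = mask + 1;

        (*pieces)++;
        if (toy_range_is_mapped(maps, n, start, size)) {
            (*unmaps)++;        /* the real code would notify here */
        }
        start += size;
        remain -= size;
    }
}
```

With a range of 0x1000..0x7fff and a single recorded page at 0x4000, the walk produces three pieces (one, two and four pages) but only the last one overlaps a recorded range, so only one UNMAP would be issued.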
Re: [PATCH] intel_iommu: Optimize out some unnecessary UNMAP calls
Posted by Jason Wang 11 months, 1 week ago
On Fri, May 26, 2023 at 2:22 PM Duan, Zhenzhong
<zhenzhong.duan@intel.com> wrote:
>
>
> >-----Original Message-----
> >From: Peter Xu <peterx@redhat.com>
> >Sent: Thursday, May 25, 2023 9:54 PM
> >Subject: Re: [PATCH] intel_iommu: Optimize out some unnecessary UNMAP
> >calls
> >
> >On Thu, May 25, 2023 at 11:29:34AM +0000, Duan, Zhenzhong wrote:
> >> Hi Peter,
> >>
> >> See inline.
> >> >-----Original Message-----
> >> >From: Peter Xu <peterx@redhat.com>
> >> >Sent: Thursday, May 25, 2023 12:59 AM
> >> >Subject: Re: [PATCH] intel_iommu: Optimize out some unnecessary UNMAP
> >> >calls
> >> >
> >> >Hi, Zhenzhong,
> >> >
> >> >On Tue, May 23, 2023 at 04:07:02PM +0800, Zhenzhong Duan wrote:
> >> >> Commit 63b88968f1 ("intel-iommu: rework the page walk logic") adds
> >> >> logic to record mapped IOVA ranges so we only need to send MAP or
> >> >> UNMAP when necessary. But there are still a few corner cases of
> >> >> unnecessary UNMAP.
> >> >>
> >> >> One is address space switch. During switching to iommu address
> >> >> space, all the original mappings have been dropped by VFIO memory
> >> >> listener, we don't need to unmap again in replay. The other is
> >> >> invalidation, we only need to unmap when there are recorded mapped
> >> >> IOVA ranges, presuming most of OSes allocating IOVA range
> >> >> continuously, ex. on x86, linux sets up mapping from 0xffffffff
> >> >> downwards.
> >> >>
> >> >> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> >> >> ---
> >> >> Tested on x86 with a net card passed or hotpluged to kvm guest,
> >> >> ping/ssh pass.
> >> >
> >> >Since this is a performance related patch, do you have any number to
> >> >show the effect?
> >>
> >> I straced the time of UNMAP ioctl, its time is 0.000014s (14us) and we have
> >> 28 ioctl() due to the two notifiers in x86 are split into power of 2 pieces.
> >>
> >> ioctl(48, VFIO_DEVICE_QUERY_GFX_PLANE or VFIO_IOMMU_UNMAP_DMA,
> >> 0x7ffffd5c42f0) = 0 <0.000014>
> >
> >Could you add some information like this into the commit message when
> >repost?  E.g. UNMAP was xxx sec before, and this patch reduces it to yyy.
> Sure, will do.
>
> >
> >>
> >> >
> >> >>
> >> >>  hw/i386/intel_iommu.c | 31 ++++++++++++++-----------------
> >> >>  1 file changed, 14 insertions(+), 17 deletions(-)
> >> >>
> >> >> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> >> >> index 94d52f4205d2..6afd6428aaaa 100644
> >> >> --- a/hw/i386/intel_iommu.c
> >> >> +++ b/hw/i386/intel_iommu.c
> >> >> @@ -3743,6 +3743,7 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
> >> >>      hwaddr start = n->start;
> >> >>      hwaddr end = n->end;
> >> >>      IntelIOMMUState *s = as->iommu_state;
> >> >> +    IOMMUTLBEvent event;
> >> >>      DMAMap map;
> >> >>
> >> >>      /*
> >> >> @@ -3762,22 +3763,25 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
> >> >>      assert(start <= end);
> >> >>      size = remain = end - start + 1;
> >> >>
> >> >> +    event.type = IOMMU_NOTIFIER_UNMAP;
> >> >> +    event.entry.target_as = &address_space_memory;
> >> >> +    event.entry.perm = IOMMU_NONE;
> >> >> +    /* This field is meaningless for unmap */
> >> >> +    event.entry.translated_addr = 0;
> >> >> +
> >> >>      while (remain >= VTD_PAGE_SIZE) {
> >> >> -        IOMMUTLBEvent event;
> >> >>          uint64_t mask = dma_aligned_pow2_mask(start, end, s->aw_bits);
> >> >>          uint64_t size = mask + 1;
> >> >>
> >> >>          assert(size);
> >> >>
> >> >> -        event.type = IOMMU_NOTIFIER_UNMAP;
> >> >> -        event.entry.iova = start;
> >> >> -        event.entry.addr_mask = mask;
> >> >> -        event.entry.target_as = &address_space_memory;
> >> >> -        event.entry.perm = IOMMU_NONE;
> >> >> -        /* This field is meaningless for unmap */
> >> >> -        event.entry.translated_addr = 0;
> >> >> -
> >> >> -        memory_region_notify_iommu_one(n, &event);
> >> >> +        map.iova = start;
> >> >> +        map.size = size;
> >> >> +        if (iova_tree_find(as->iova_tree, &map)) {
> >> >> +            event.entry.iova = start;
> >> >> +            event.entry.addr_mask = mask;
> >> >> +            memory_region_notify_iommu_one(n, &event);
> >> >> +        }
> >> >
> >> >This one looks fine to me, but I'm not sure how much benefit we'll
> >> >get here either as this path should be rare afaiu.
> >>
> >> Yes, I only see such UNMAP call at cold bootup/shutdown, hot plug and
> >> unplug.
> >>
> >> In fact, the other purpose of this patch is to eliminate noisy error
> >> log when we work with IOMMUFD. It looks the duplicate UNMAP call will
> >> fail with IOMMUFD while always succeed with legacy container. This
> >> behavior difference lead to below error log for IOMMUFD:

A dumb question: should IOMMUFD stick to the same behaviour as the legacy container?

Thanks

> >>
> >> IOMMU_IOAS_UNMAP failed: No such file or directory
> >> vfio_container_dma_unmap(0x562012d6b6d0, 0x0, 0x80000000) = -2 (No such file or directory)
> >> IOMMU_IOAS_UNMAP failed: No such file or directory
> >> vfio_container_dma_unmap(0x562012d6b6d0, 0x80000000, 0x40000000) = -2 (No such file or directory)
> >
> >I see.  Please also mention this in the commit log, that'll help reviewers
> >understand the goal of the patch, thanks!
> Will do.
>
> >
> >>
> >> >
> >> >>
> >> >>          start += size;
> >> >>          remain -= size;
> >> >> @@ -3826,13 +3830,6 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> >> >>      uint8_t bus_n = pci_bus_num(vtd_as->bus);
> >> >>      VTDContextEntry ce;
> >> >>
> >> >> -    /*
> >> >> -     * The replay can be triggered by either a invalidation or a newly
> >> >> -     * created entry. No matter what, we release existing mappings
> >> >> -     * (it means flushing caches for UNMAP-only registers).
> >> >> -     */
> >> >> -    vtd_address_space_unmap(vtd_as, n);
> >> >
> >> >IIUC this is needed to satisfy current replay() semantics:
> >> >
> >> >    /**
> >> >     * @replay:
> >> >     *
> >> >     * Called to handle memory_region_iommu_replay().
> >> >     *
> >> >     * The default implementation of memory_region_iommu_replay() is to
> >> >     * call the IOMMU translate method for every page in the address space
> >> >     * with flag == IOMMU_NONE and then call the notifier if translate
> >> >     * returns a valid mapping. If this method is implemented then it
> >> >     * overrides the default behaviour, and must provide the full semantics
> >> >     * of memory_region_iommu_replay(), by calling @notifier for every
> >> >     * translation present in the IOMMU.
> >> Above semantics claims calling @notifier for every translation present
> >> in the IOMMU But it doesn't claim if calling @notifier for non-present
> >> translation.
> >> I checked other custom replay() callback, ex. virtio_iommu_replay(),
> >> spapr_tce_replay() it looks only intel_iommu is special by calling unmap_all()
> >> before rebuild mapping.
> >
> >Yes, and I'll reply below for this..
> >
> >>
> >> >
> >> >The problem is vtd_page_walk() currently by default only notifies on
> >> >page changes, so we'll notify all MAP only if we unmap all of them first.
> >> Hmm, I didn't get this point. Checked vtd_page_walk_one(), it will
> >> rebuild the mapping except the DMAMap is exactly same which it will skip.
> >> See below:
> >>
> >>     /* Update local IOVA mapped ranges */
> >>     if (event->type == IOMMU_NOTIFIER_MAP) {
> >>         if (mapped) {
> >>             /* If it's exactly the same translation, skip */
> >>             if (!memcmp(mapped, &target, sizeof(target))) {
> >>                 trace_vtd_page_walk_one_skip_map(entry->iova, entry->addr_mask,
> >>                                                  entry->translated_addr);
> >>                 return 0;
> >>             } else {
> >>                 /*
> >>                  * Translation changed.  Normally this should not
> >>                  * happen, but it can happen when with buggy guest
> >
> >So I haven't touched the vIOMMU code for a few years, but IIRC if we
> >replay() on an address space that has mapping already, then if without the
> >unmap_all() at the start we'll just notify nothing, because "mapped" will be
> >true for all the existing mappings, and memcmp() should return 0 too if
> >nothing changed?
> Understood, you are right. VFIO migration dirty sync needs to be notified
> even if mapping is unchanged.
>
> >
> >I think (and agree) it could be a "bug" for vtd only, mostly not affecting
> >anything at least before vfio migration.
> >
> >Do you agree, and perhaps want to fix it altogether?  If so I suppose it'll also
> >fix the issue below on vfio dirty sync.
> Yes, I'll write an implementation.
>
> Thanks
> Zhenzhong
RE: [PATCH] intel_iommu: Optimize out some unnecessary UNMAP calls
Posted by Liu, Yi L 11 months, 1 week ago
> From: Jason Wang <jasowang@redhat.com>
> Sent: Friday, May 26, 2023 2:28 PM
> 
> On Fri, May 26, 2023 at 2:22 PM Duan, Zhenzhong
> <zhenzhong.duan@intel.com> wrote:
> >
> >
> > >-----Original Message-----
> > >From: Peter Xu <peterx@redhat.com>
> > >Sent: Thursday, May 25, 2023 9:54 PM
> > >Subject: Re: [PATCH] intel_iommu: Optimize out some unnecessary UNMAP
> > >calls
> > >
> > >On Thu, May 25, 2023 at 11:29:34AM +0000, Duan, Zhenzhong wrote:
> > >> Hi Peter,
> > >>
> > >> See inline.
> > >> >-----Original Message-----
> > >> >From: Peter Xu <peterx@redhat.com>
> > >> >Sent: Thursday, May 25, 2023 12:59 AM
> > >> >Subject: Re: [PATCH] intel_iommu: Optimize out some unnecessary UNMAP
> > >> >calls
> > >> >
> > >> >Hi, Zhenzhong,
> > >> >
> > >> >On Tue, May 23, 2023 at 04:07:02PM +0800, Zhenzhong Duan wrote:
> > >> >> Commit 63b88968f1 ("intel-iommu: rework the page walk logic") adds
> > >> >> logic to record mapped IOVA ranges so we only need to send MAP or
> > >> >> UNMAP when necessary. But there are still a few corner cases of
> > >> >> unnecessary UNMAP.
> > >> >>
> > >> >> One is address space switch. During switching to iommu address
> > >> >> space, all the original mappings have been dropped by VFIO memory
> > >> >> listener, we don't need to unmap again in replay. The other is
> > >> >> invalidation, we only need to unmap when there are recorded mapped
> > >> >> IOVA ranges, presuming most of OSes allocating IOVA range
> > >> >> continuously, ex. on x86, linux sets up mapping from 0xffffffff
> > >> >> downwards.
> > >> >>
> > >> >> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> > >> >> ---
> > >> >> Tested on x86 with a net card passed or hotpluged to kvm guest,
> > >> >> ping/ssh pass.
> > >> >
> > >> >Since this is a performance related patch, do you have any number to
> > >> >show the effect?
> > >>
> > >> I straced the time of UNMAP ioctl, its time is 0.000014s (14us) and we have
> > >> 28 ioctl() due to the two notifiers in x86 are split into power of 2 pieces.
> > >>
> > >> ioctl(48, VFIO_DEVICE_QUERY_GFX_PLANE or VFIO_IOMMU_UNMAP_DMA,
> > >> 0x7ffffd5c42f0) = 0 <0.000014>
> > >
> > >Could you add some information like this into the commit message when
> > >repost?  E.g. UNMAP was xxx sec before, and this patch reduces it to yyy.
> > Sure, will do.
> >
> > >
> > >>
> > >> >
> > >> >>
> > >> >>  hw/i386/intel_iommu.c | 31 ++++++++++++++-----------------
> > >> >>  1 file changed, 14 insertions(+), 17 deletions(-)
> > >> >>
> > >> >> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > >> >> index 94d52f4205d2..6afd6428aaaa 100644
> > >> >> --- a/hw/i386/intel_iommu.c
> > >> >> +++ b/hw/i386/intel_iommu.c
> > >> >> @@ -3743,6 +3743,7 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
> > >> >>      hwaddr start = n->start;
> > >> >>      hwaddr end = n->end;
> > >> >>      IntelIOMMUState *s = as->iommu_state;
> > >> >> +    IOMMUTLBEvent event;
> > >> >>      DMAMap map;
> > >> >>
> > >> >>      /*
> > >> >> @@ -3762,22 +3763,25 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
> > >> >>      assert(start <= end);
> > >> >>      size = remain = end - start + 1;
> > >> >>
> > >> >> +    event.type = IOMMU_NOTIFIER_UNMAP;
> > >> >> +    event.entry.target_as = &address_space_memory;
> > >> >> +    event.entry.perm = IOMMU_NONE;
> > >> >> +    /* This field is meaningless for unmap */
> > >> >> +    event.entry.translated_addr = 0;
> > >> >> +
> > >> >>      while (remain >= VTD_PAGE_SIZE) {
> > >> >> -        IOMMUTLBEvent event;
> > >> >>          uint64_t mask = dma_aligned_pow2_mask(start, end, s->aw_bits);
> > >> >>          uint64_t size = mask + 1;
> > >> >>
> > >> >>          assert(size);
> > >> >>
> > >> >> -        event.type = IOMMU_NOTIFIER_UNMAP;
> > >> >> -        event.entry.iova = start;
> > >> >> -        event.entry.addr_mask = mask;
> > >> >> -        event.entry.target_as = &address_space_memory;
> > >> >> -        event.entry.perm = IOMMU_NONE;
> > >> >> -        /* This field is meaningless for unmap */
> > >> >> -        event.entry.translated_addr = 0;
> > >> >> -
> > >> >> -        memory_region_notify_iommu_one(n, &event);
> > >> >> +        map.iova = start;
> > >> >> +        map.size = size;
> > >> >> +        if (iova_tree_find(as->iova_tree, &map)) {
> > >> >> +            event.entry.iova = start;
> > >> >> +            event.entry.addr_mask = mask;
> > >> >> +            memory_region_notify_iommu_one(n, &event);
> > >> >> +        }
> > >> >
> > >> >This one looks fine to me, but I'm not sure how much benefit we'll
> > >> >get here either as this path should be rare afaiu.
> > >>
> > >> Yes, I only see such UNMAP call at cold bootup/shutdown, hot plug and
> > >> unplug.
> > >>
> > >> In fact, the other purpose of this patch is to eliminate noisy error
> > >> log when we work with IOMMUFD. It looks the duplicate UNMAP call will
> > >> fail with IOMMUFD while always succeed with legacy container. This
> > >> behavior difference lead to below error log for IOMMUFD:
> 
> A dumb question: should IOMMUFD stick to the same behaviour as the legacy container?

We may need to hear from JasonG. 😊 Should IOMMU_IOAS_UNMAP return an error or
success if the IOVA is not found?

Regards,
Yi Liu

> Thanks
> 
> > >>
> > >> IOMMU_IOAS_UNMAP failed: No such file or directory
> > >> vfio_container_dma_unmap(0x562012d6b6d0, 0x0, 0x80000000) = -2 (No such file or directory)
> > >> IOMMU_IOAS_UNMAP failed: No such file or directory
> > >> vfio_container_dma_unmap(0x562012d6b6d0, 0x80000000, 0x40000000) = -2 (No such file or directory)
> > >
> > >I see.  Please also mention this in the commit log, that'll help reviewers
> > >understand the goal of the patch, thanks!
> > Will do.
> >
> > >
> > >>
> > >> >
> > >> >>
> > >> >>          start += size;
> > >> >>          remain -= size;
> > >> >> @@ -3826,13 +3830,6 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> > >> >>      uint8_t bus_n = pci_bus_num(vtd_as->bus);
> > >> >>      VTDContextEntry ce;
> > >> >>
> > >> >> -    /*
> > >> >> -     * The replay can be triggered by either a invalidation or a newly
> > >> >> -     * created entry. No matter what, we release existing mappings
> > >> >> -     * (it means flushing caches for UNMAP-only registers).
> > >> >> -     */
> > >> >> -    vtd_address_space_unmap(vtd_as, n);
> > >> >
> > >> >IIUC this is needed to satisfy current replay() semantics:
> > >> >
> > >> >    /**
> > >> >     * @replay:
> > >> >     *
> > >> >     * Called to handle memory_region_iommu_replay().
> > >> >     *
> > >> >     * The default implementation of memory_region_iommu_replay() is to
> > >> >     * call the IOMMU translate method for every page in the address space
> > >> >     * with flag == IOMMU_NONE and then call the notifier if translate
> > >> >     * returns a valid mapping. If this method is implemented then it
> > >> >     * overrides the default behaviour, and must provide the full semantics
> > >> >     * of memory_region_iommu_replay(), by calling @notifier for every
> > >> >     * translation present in the IOMMU.
> > >> Above semantics claims calling @notifier for every translation present
> > >> in the IOMMU But it doesn't claim if calling @notifier for non-present
> > >> translation.
> > >> I checked other custom replay() callback, ex. virtio_iommu_replay(),
> > >> spapr_tce_replay() it looks only intel_iommu is special by calling unmap_all()
> > >> before rebuild mapping.
> > >
> > >Yes, and I'll reply below for this..
> > >
> > >>
> > >> >
> > >> >The problem is vtd_page_walk() currently by default only notifies on
> > >> >page changes, so we'll notify all MAP only if we unmap all of them first.
> > >> Hmm, I didn't get this point. Checked vtd_page_walk_one(), it will
> > >> rebuild the mapping except the DMAMap is exactly same which it will skip.
> > >> See below:
> > >>
> > >>     /* Update local IOVA mapped ranges */
> > >>     if (event->type == IOMMU_NOTIFIER_MAP) {
> > >>         if (mapped) {
> > >>             /* If it's exactly the same translation, skip */
> > >>             if (!memcmp(mapped, &target, sizeof(target))) {
> > >>                 trace_vtd_page_walk_one_skip_map(entry->iova, entry->addr_mask,
> > >>                                                  entry->translated_addr);
> > >>                 return 0;
> > >>             } else {
> > >>                 /*
> > >>                  * Translation changed.  Normally this should not
> > >>                  * happen, but it can happen when with buggy guest
> > >
> > >So I haven't touched the vIOMMU code for a few years, but IIRC if we
> > >replay() on an address space that has mapping already, then if without the
> > >unmap_all() at the start we'll just notify nothing, because "mapped" will be
> > >true for all the existing mappings, and memcmp() should return 0 too if
> > >nothing changed?
> > Understood, you are right. VFIO migration dirty sync needs to be notified
> > even if mapping is unchanged.
> >
> > >
> > >I think (and agree) it could be a "bug" for vtd only, mostly not affecting
> > >anything at least before vfio migration.
> > >
> > >Do you agree, and perhaps want to fix it altogether?  If so I suppose it'll also
> > >fix the issue below on vfio dirty sync.
> > Yes, I'll write an implementation.
> >
> > Thanks
> > Zhenzhong

Re: [PATCH] intel_iommu: Optimize out some unnecessary UNMAP calls
Posted by Jason Gunthorpe 11 months, 1 week ago
On Fri, May 26, 2023 at 08:44:29AM +0000, Liu, Yi L wrote:

> > > >> In fact, the other purpose of this patch is to eliminate noisy error
> > > >> log when we work with IOMMUFD. It looks the duplicate UNMAP call will
> > > >> fail with IOMMUFD while always succeed with legacy container. This
> > > >> behavior difference lead to below error log for IOMMUFD:
> > 
> > A dumb question: should IOMMUFD stick to the same behaviour as the legacy container?
> 
> We may need to hear from JasonG. 😊 Should IOMMU_IOAS_UNMAP return an error or
> success if the IOVA is not found?

The native iommufd functions will return failure if they could not
unmap anything.

Otherwise they return the number of consecutive bytes unmapped.

The VFIO emulation functions should do whatever VFIO does; is there a
mistake there?

Jason
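
The two return-value policies contrasted in this exchange can be modelled with a small sketch. Everything below is hypothetical and exists only to illustrate the difference described in the thread: `toy_ioas`, `toy_unmap_strict()` and `toy_unmap_lenient()` are invented names, not kernel or QEMU interfaces, and the model tracks a single mapping. The strict variant fails with `-ENOENT` (the `-2` seen in the quoted log) when nothing was unmapped and otherwise returns the byte count; the lenient variant always succeeds, as the thread reports for the legacy container.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy address space holding at most one mapping; not a kernel structure. */
struct toy_ioas {
    uint64_t iova;
    uint64_t size;
    bool mapped;
};

/* Removes the mapping if [iova, iova + size) fully covers it.
 * Returns the number of bytes removed, or 0 if nothing matched. */
static uint64_t toy_remove(struct toy_ioas *ioas, uint64_t iova, uint64_t size)
{
    assert(size > 0);
    if (ioas->mapped && iova <= ioas->iova &&
        ioas->iova + ioas->size <= iova + size) {
        ioas->mapped = false;
        return ioas->size;
    }
    return 0;
}

/* Strict policy: -ENOENT when nothing was unmapped, else the byte count. */
static int64_t toy_unmap_strict(struct toy_ioas *ioas,
                                uint64_t iova, uint64_t size)
{
    uint64_t bytes = toy_remove(ioas, iova, size);

    return bytes ? (int64_t)bytes : -ENOENT;
}

/* Lenient policy: always succeeds; the byte count reported may be 0. */
static int toy_unmap_lenient(struct toy_ioas *ioas, uint64_t iova,
                             uint64_t size, uint64_t *unmapped)
{
    *unmapped = toy_remove(ioas, iova, size);
    return 0;
}
```

Under the strict policy, repeating an unmap of the same range fails the second time, which is why a caller that issues duplicate UNMAPs logs errors only with that policy.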