[RFC v2 1/2] vhost-vdpa: Implement IOVA->GPA tree

Posted by Jonah Palmer 1 month, 2 weeks ago
Implements the IOVA->GPA tree for handling mapping, unmapping, and
translations for guest memory regions.

When the guest has overlapping memory regions, an HVA to IOVA translation
may return an incorrect IOVA when searching the IOVA->HVA tree. This is
due to one HVA range being contained (overlapping) in another HVA range
in the IOVA->HVA tree. By creating an IOVA->GPA tree, we can use GPAs to
translate and find the correct IOVA for guest memory regions.

Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
---
 hw/virtio/vhost-iova-tree.c | 78 +++++++++++++++++++++++++++++++++++--
 hw/virtio/vhost-iova-tree.h |  5 +++
 hw/virtio/vhost-vdpa.c      | 20 ++++++----
 3 files changed, 92 insertions(+), 11 deletions(-)

diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
index 3d03395a77..e33fd56225 100644
--- a/hw/virtio/vhost-iova-tree.c
+++ b/hw/virtio/vhost-iova-tree.c
@@ -28,12 +28,15 @@ struct VhostIOVATree {
 
     /* IOVA address to qemu memory maps. */
     IOVATree *iova_taddr_map;
+
+    /* IOVA address to guest memory maps. */
+    IOVATree *iova_gpa_map;
 };
 
 /**
- * Create a new IOVA tree
+ * Create a new VhostIOVATree
  *
- * Returns the new IOVA tree
+ * Returns the new VhostIOVATree
  */
 VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
 {
@@ -44,6 +47,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
     tree->iova_last = iova_last;
 
     tree->iova_taddr_map = iova_tree_new();
+    tree->iova_gpa_map = iova_tree_new();
     return tree;
 }
 
@@ -53,6 +57,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
 void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
 {
     iova_tree_destroy(iova_tree->iova_taddr_map);
+    iova_tree_destroy(iova_tree->iova_gpa_map);
     g_free(iova_tree);
 }
 
@@ -71,7 +76,7 @@ const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *tree,
 }
 
 /**
- * Allocate a new mapping
+ * Allocate a new mapping in the IOVA->HVA tree
  *
  * @tree: The iova tree
  * @map: The iova map
@@ -108,3 +113,70 @@ void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map)
 {
     iova_tree_remove(iova_tree->iova_taddr_map, map);
 }
+
+/**
+ * Find the IOVA address stored from a guest memory address
+ *
+ * @tree: The VhostIOVATree
+ * @map: The map with the guest memory address
+ *
+ * Return the stored mapping, or NULL if not found.
+ */
+const DMAMap *vhost_iova_gpa_tree_find_iova(const VhostIOVATree *tree,
+                                            const DMAMap *map)
+{
+    return iova_tree_find_iova(tree->iova_gpa_map, map);
+}
+
+/**
+ * Allocate new mappings in the IOVA->HVA & IOVA->GPA trees
+ *
+ * @tree: The VhostIOVATree
+ * @map: The iova map
+ * @gpa: The guest physical address (GPA)
+ *
+ * Returns:
+ * - IOVA_OK if the map fits both containers
+ * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
+ * - IOVA_ERR_NOMEM if the IOVA->HVA tree cannot allocate more space
+ *
+ * It returns an assigned iova in map->iova if return value is IOVA_OK.
+ */
+int vhost_iova_tree_map_alloc_gpa(VhostIOVATree *tree, DMAMap *map, hwaddr gpa)
+{
+    int ret;
+
+    /* Some vhost devices don't like addr 0. Skip first page */
+    hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size();
+
+    if (map->translated_addr + map->size < map->translated_addr ||
+        map->perm == IOMMU_NONE) {
+        return IOVA_ERR_INVALID;
+    }
+
+    /* Allocate a node in the IOVA->HVA tree */
+    ret = iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
+                              tree->iova_last);
+    if (unlikely(ret != IOVA_OK)) {
+        return ret;
+    }
+
+    /* Insert a node in the IOVA->GPA tree */
+    map->translated_addr = gpa;
+    return iova_tree_insert(tree->iova_gpa_map, map);
+}
+
+/**
+ * Remove existing mappings from the IOVA->HVA & IOVA->GPA trees
+ *
+ * @iova_tree: The VhostIOVATree
+ * @map: The map to remove
+ */
+void vhost_iova_tree_remove_gpa(VhostIOVATree *iova_tree, DMAMap map)
+{
+    /* Remove the existing mapping from the IOVA->GPA tree */
+    iova_tree_remove(iova_tree->iova_gpa_map, map);
+
+    /* Remove the corresponding mapping from the IOVA->HVA tree */
+    iova_tree_remove(iova_tree->iova_taddr_map, map);
+}
diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
index 4adfd79ff0..511c6d18ae 100644
--- a/hw/virtio/vhost-iova-tree.h
+++ b/hw/virtio/vhost-iova-tree.h
@@ -23,5 +23,10 @@ const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
                                         const DMAMap *map);
 int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
 void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map);
+const DMAMap *vhost_iova_gpa_tree_find_iova(const VhostIOVATree *iova_tree,
+                                            const DMAMap *map);
+int vhost_iova_tree_map_alloc_gpa(VhostIOVATree *iova_tree, DMAMap *map,
+                                  hwaddr gpa);
+void vhost_iova_tree_remove_gpa(VhostIOVATree *iova_tree, DMAMap map);
 
 #endif
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 3cdaa12ed5..591ff426e7 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -365,9 +365,16 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
         mem_region.size = int128_get64(llsize) - 1,
         mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
 
-        r = vhost_iova_tree_map_alloc(s->iova_tree, &mem_region);
+        r = vhost_iova_tree_map_alloc_gpa(s->iova_tree, &mem_region,
+                                          section->offset_within_address_space);
         if (unlikely(r != IOVA_OK)) {
             error_report("Can't allocate a mapping (%d)", r);
+
+            /* Insertion to IOVA->GPA tree failed */
+            if (mem_region.translated_addr ==
+                section->offset_within_address_space) {
+                goto fail_map;
+            }
             goto fail;
         }
 
@@ -386,7 +393,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
 
 fail_map:
     if (s->shadow_data) {
-        vhost_iova_tree_remove(s->iova_tree, mem_region);
+        vhost_iova_tree_remove_gpa(s->iova_tree, mem_region);
     }
 
 fail:
@@ -440,21 +447,18 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
 
     if (s->shadow_data) {
         const DMAMap *result;
-        const void *vaddr = memory_region_get_ram_ptr(section->mr) +
-            section->offset_within_region +
-            (iova - section->offset_within_address_space);
         DMAMap mem_region = {
-            .translated_addr = (hwaddr)(uintptr_t)vaddr,
+            .translated_addr = section->offset_within_address_space,
             .size = int128_get64(llsize) - 1,
         };
 
-        result = vhost_iova_tree_find_iova(s->iova_tree, &mem_region);
+        result = vhost_iova_gpa_tree_find_iova(s->iova_tree, &mem_region);
         if (!result) {
             /* The memory listener map wasn't mapped */
             return;
         }
         iova = result->iova;
-        vhost_iova_tree_remove(s->iova_tree, *result);
+        vhost_iova_tree_remove_gpa(s->iova_tree, *result);
     }
     vhost_vdpa_iotlb_batch_begin_once(s);
     /*
-- 
2.43.5
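
An illustration of the ambiguity the new tree resolves (the IOVAs, GPAs and
HVAs below are made up; the inclusive .size convention follows the existing
code, where size is the length minus one):

    /*
     * Two sections whose HVA ranges alias: the second host range is fully
     * contained in the first, while their GPA ranges are disjoint.
     */
    DMAMap big   = { .iova = 0x1000,     .translated_addr = 0x7f1000000000,
                     .size = 0x3fffffff };  /* backs GPA 0x80000000, 1 GiB   */
    DMAMap small = { .iova = 0x40001000, .translated_addr = 0x7f1020000000,
                     .size = 0x0fffffff };  /* backs GPA 0xc0000000, 256 MiB */

    /*
     * An IOVA->HVA lookup for HVA 0x7f1020000000 can hit either entry, so
     * the IOVA picked for 'small' may be wrong.  A lookup keyed by GPA
     * (0x80000000 vs 0xc0000000) is unambiguous, which is what the
     * IOVA->GPA tree provides for guest-memory-backed addresses.
     */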
Re: [RFC v2 1/2] vhost-vdpa: Implement IOVA->GPA tree
Posted by Eugenio Perez Martin 1 month, 2 weeks ago
On Fri, Oct 4, 2024 at 2:45 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
> Implements the IOVA->GPA tree for handling mapping, unmapping, and
> translations for guest memory regions.
>
> When the guest has overlapping memory regions, an HVA to IOVA translation
> may return an incorrect IOVA when searching the IOVA->HVA tree. This is
> due to one HVA range being contained (overlapping) in another HVA range
> in the IOVA->HVA tree. By creating an IOVA->GPA tree, we can use GPAs to
> translate and find the correct IOVA for guest memory regions.
>

Yes, this first patch is super close to what I meant, just one issue
and a pair of nits here and there.

I'd leave the second patch as an optimization on top, if the numbers
prove that adding the code is worth it.

> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
> ---
>  hw/virtio/vhost-iova-tree.c | 78 +++++++++++++++++++++++++++++++++++--
>  hw/virtio/vhost-iova-tree.h |  5 +++
>  hw/virtio/vhost-vdpa.c      | 20 ++++++----
>  3 files changed, 92 insertions(+), 11 deletions(-)
>
> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> index 3d03395a77..e33fd56225 100644
> --- a/hw/virtio/vhost-iova-tree.c
> +++ b/hw/virtio/vhost-iova-tree.c
> @@ -28,12 +28,15 @@ struct VhostIOVATree {
>
>      /* IOVA address to qemu memory maps. */
>      IOVATree *iova_taddr_map;
> +
> +    /* IOVA address to guest memory maps. */
> +    IOVATree *iova_gpa_map;
>  };
>
>  /**
> - * Create a new IOVA tree
> + * Create a new VhostIOVATree
>   *
> - * Returns the new IOVA tree
> + * Returns the new VhostIOVATree
>   */
>  VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>  {
> @@ -44,6 +47,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>      tree->iova_last = iova_last;
>
>      tree->iova_taddr_map = iova_tree_new();
> +    tree->iova_gpa_map = iova_tree_new();
>      return tree;
>  }
>
> @@ -53,6 +57,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>  void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
>  {
>      iova_tree_destroy(iova_tree->iova_taddr_map);
> +    iova_tree_destroy(iova_tree->iova_gpa_map);
>      g_free(iova_tree);
>  }
>
> @@ -71,7 +76,7 @@ const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *tree,
>  }
>
>  /**
> - * Allocate a new mapping
> + * Allocate a new mapping in the IOVA->HVA tree
>   *
>   * @tree: The iova tree
>   * @map: The iova map
> @@ -108,3 +113,70 @@ void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map)
>  {
>      iova_tree_remove(iova_tree->iova_taddr_map, map);
>  }
> +
> +/**
> + * Find the IOVA address stored from a guest memory address
> + *
> + * @tree: The VhostIOVATree
> + * @map: The map with the guest memory address
> + *
> + * Return the stored mapping, or NULL if not found.
> + */
> +const DMAMap *vhost_iova_gpa_tree_find_iova(const VhostIOVATree *tree,
> +                                            const DMAMap *map)

Nit: Not an english native, but I find vhost_iova_tree should not be
broken for coherency with the rest of the functions. What about
vhost_iova_tree_find_iova_gpa, like _gpa variant?

> +{
> +    return iova_tree_find_iova(tree->iova_gpa_map, map);
> +}
> +
> +/**
> + * Allocate new mappings in the IOVA->HVA & IOVA->GPA trees
> + *
> + * @tree: The VhostIOVATree
> + * @map: The iova map
> + * @gpa: The guest physical address (GPA)
> + *
> + * Returns:
> + * - IOVA_OK if the map fits both containers
> + * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
> + * - IOVA_ERR_NOMEM if the IOVA->HVA tree cannot allocate more space
> + *
> + * It returns an assigned iova in map->iova if return value is IOVA_OK.
> + */
> +int vhost_iova_tree_map_alloc_gpa(VhostIOVATree *tree, DMAMap *map, hwaddr gpa)
> +{
> +    int ret;
> +
> +    /* Some vhost devices don't like addr 0. Skip first page */
> +    hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size();
> +
> +    if (map->translated_addr + map->size < map->translated_addr ||
> +        map->perm == IOMMU_NONE) {
> +        return IOVA_ERR_INVALID;
> +    }
> +
> +    /* Allocate a node in the IOVA->HVA tree */
> +    ret = iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
> +                              tree->iova_last);

Why not call vhost_iova_tree_map_alloc instead of duplicating it here?

> +    if (unlikely(ret != IOVA_OK)) {
> +        return ret;
> +    }
> +
> +    /* Insert a node in the IOVA->GPA tree */
> +    map->translated_addr = gpa;
> +    return iova_tree_insert(tree->iova_gpa_map, map);
> +}
> +
> +/**
> + * Remove existing mappings from the IOVA->HVA & IOVA->GPA trees
> + *
> + * @iova_tree: The VhostIOVATree
> + * @map: The map to remove
> + */
> +void vhost_iova_tree_remove_gpa(VhostIOVATree *iova_tree, DMAMap map)
> +{
> +    /* Remove the existing mapping from the IOVA->GPA tree */
> +    iova_tree_remove(iova_tree->iova_gpa_map, map);
> +
> +    /* Remove the corresponding mapping from the IOVA->HVA tree */
> +    iova_tree_remove(iova_tree->iova_taddr_map, map);

If we remove it blindly from both trees, we are keeping the bug, isn't it?

I think the remove should receive the "gpa" as a parameter, same as
alloc_gpa. After that, vhost_iova_tree_remove_gpa looks the right iova
into iova_gpa_map. And only after that, it removes that iova from
iova_tree_remove.

If it makes things easier it could receive (hwaddr gpa, size_t len) or
all of the info in a DMAMap. What do you think?
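
A rough sketch of that suggestion (the signature is hypothetical, and it
assumes iova_tree_remove() matches entries by their IOVA range, as the
existing callers rely on):

void vhost_iova_tree_remove_gpa(VhostIOVATree *iova_tree, hwaddr gpa,
                                size_t len)
{
    const DMAMap needle = { .translated_addr = gpa, .size = len - 1 };
    const DMAMap *gpa_map;

    /* Find the entry (and thus the IOVA) through the GPA first */
    gpa_map = iova_tree_find_iova(iova_tree->iova_gpa_map, &needle);
    if (!gpa_map) {
        return;
    }

    /* Only then remove that IOVA range from both trees */
    iova_tree_remove(iova_tree->iova_gpa_map, *gpa_map);
    iova_tree_remove(iova_tree->iova_taddr_map, *gpa_map);
}

As the follow-ups below note, both callers already hold the correct IOVA by
the time they call the removal function, so the extra lookup may turn out to
be unnecessary.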

> +}
> diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
> index 4adfd79ff0..511c6d18ae 100644
> --- a/hw/virtio/vhost-iova-tree.h
> +++ b/hw/virtio/vhost-iova-tree.h
> @@ -23,5 +23,10 @@ const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
>                                          const DMAMap *map);
>  int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
>  void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map);
> +const DMAMap *vhost_iova_gpa_tree_find_iova(const VhostIOVATree *iova_tree,
> +                                            const DMAMap *map);
> +int vhost_iova_tree_map_alloc_gpa(VhostIOVATree *iova_tree, DMAMap *map,
> +                                  hwaddr gpa);
> +void vhost_iova_tree_remove_gpa(VhostIOVATree *iova_tree, DMAMap map);
>
>  #endif
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 3cdaa12ed5..591ff426e7 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -365,9 +365,16 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>          mem_region.size = int128_get64(llsize) - 1,
>          mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
>
> -        r = vhost_iova_tree_map_alloc(s->iova_tree, &mem_region);
> +        r = vhost_iova_tree_map_alloc_gpa(s->iova_tree, &mem_region,
> +                                          section->offset_within_address_space);
>          if (unlikely(r != IOVA_OK)) {
>              error_report("Can't allocate a mapping (%d)", r);
> +
> +            /* Insertion to IOVA->GPA tree failed */
> +            if (mem_region.translated_addr ==
> +                section->offset_within_address_space) {
> +                goto fail_map;
> +            }

We can move this cleanup code into vhost_iova_tree_map_alloc_gpa, isn't it?
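
Combined with the earlier suggestion to reuse vhost_iova_tree_map_alloc(),
the helper could end up looking roughly like this (a sketch, not the final
patch; it assumes iova_tree_remove() matches the entry by its IOVA range):

int vhost_iova_tree_map_alloc_gpa(VhostIOVATree *tree, DMAMap *map, hwaddr gpa)
{
    int ret;

    /* Allocate a node in the IOVA->HVA tree, reusing the existing helper */
    ret = vhost_iova_tree_map_alloc(tree, map);
    if (unlikely(ret != IOVA_OK)) {
        return ret;
    }

    /* Insert a node in the IOVA->GPA tree, keyed by the same IOVA */
    map->translated_addr = gpa;
    ret = iova_tree_insert(tree->iova_gpa_map, map);
    if (unlikely(ret != IOVA_OK)) {
        /* Roll back the IOVA->HVA allocation; removal matches by IOVA */
        iova_tree_remove(tree->iova_taddr_map, *map);
    }

    return ret;
}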

>              goto fail;
>          }
>
> @@ -386,7 +393,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>
>  fail_map:
>      if (s->shadow_data) {
> -        vhost_iova_tree_remove(s->iova_tree, mem_region);
> +        vhost_iova_tree_remove_gpa(s->iova_tree, mem_region);
>      }
>
>  fail:
> @@ -440,21 +447,18 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
>
>      if (s->shadow_data) {
>          const DMAMap *result;
> -        const void *vaddr = memory_region_get_ram_ptr(section->mr) +
> -            section->offset_within_region +
> -            (iova - section->offset_within_address_space);
>          DMAMap mem_region = {
> -            .translated_addr = (hwaddr)(uintptr_t)vaddr,
> +            .translated_addr = section->offset_within_address_space,
>              .size = int128_get64(llsize) - 1,
>          };
>
> -        result = vhost_iova_tree_find_iova(s->iova_tree, &mem_region);
> +        result = vhost_iova_gpa_tree_find_iova(s->iova_tree, &mem_region);
>          if (!result) {
>              /* The memory listener map wasn't mapped */
>              return;
>          }
>          iova = result->iova;
> -        vhost_iova_tree_remove(s->iova_tree, *result);
> +        vhost_iova_tree_remove_gpa(s->iova_tree, *result);
>      }
>      vhost_vdpa_iotlb_batch_begin_once(s);
>      /*
> --
> 2.43.5
>
Re: [RFC v2 1/2] vhost-vdpa: Implement IOVA->GPA tree
Posted by Jonah Palmer 1 month, 2 weeks ago

On 10/4/24 11:17 AM, Eugenio Perez Martin wrote:
> On Fri, Oct 4, 2024 at 2:45 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>
>> Implements the IOVA->GPA tree for handling mapping, unmapping, and
>> translations for guest memory regions.
>>
>> When the guest has overlapping memory regions, an HVA to IOVA translation
>> may return an incorrect IOVA when searching the IOVA->HVA tree. This is
>> due to one HVA range being contained (overlapping) in another HVA range
>> in the IOVA->HVA tree. By creating an IOVA->GPA tree, we can use GPAs to
>> translate and find the correct IOVA for guest memory regions.
>>
> 
> Yes, this first patch is super close to what I meant, just one issue
> and a pair of nits here and there.
> 
> I'd leave the second patch as an optimization on top, if the numbers
> prove that adding the code is worth it.
> 

Ah okay, gotcha. I also wasn't sure if what you mentioned below on the 
previous series you also wanted implemented or if these would also be 
optimizations on top.

[Adding code to the vhost_iova_tree layer for handling multiple buffers 
returned from translation for the memory area where each iovec covers]:
-----------------------------------------------------------------------
"Let's say that SVQ wants to translate the HVA range
0xfeda0000-0xfedd0000. So it makes available for the device two
chained buffers: One with addr=0x1000 len=0x20000 and the other one
with addr=(0x20000c1000 len=0x10000).

The VirtIO device should be able to translate these two buffers in
isolation and chain them. Not optimal but it helps to keep QEMU source
clean, as the device already must support it."

[Adding a permission check to iova_tree_find_address_iterator and match 
the range by permissions]:
-----------------------------------------------------------------------
"About the permissions, maybe we can make the permissions to be part of
the lookup? Instead of returning them at iova_tree_find_iova, make
them match at iova_tree_find_address_iterator."

But I understand that the problem with this is that we're assuming the 
SVQ translation will always be done in a transient manner.

>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
>> ---
>>   hw/virtio/vhost-iova-tree.c | 78 +++++++++++++++++++++++++++++++++++--
>>   hw/virtio/vhost-iova-tree.h |  5 +++
>>   hw/virtio/vhost-vdpa.c      | 20 ++++++----
>>   3 files changed, 92 insertions(+), 11 deletions(-)
>>
>> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
>> index 3d03395a77..e33fd56225 100644
>> --- a/hw/virtio/vhost-iova-tree.c
>> +++ b/hw/virtio/vhost-iova-tree.c
>> @@ -28,12 +28,15 @@ struct VhostIOVATree {
>>
>>       /* IOVA address to qemu memory maps. */
>>       IOVATree *iova_taddr_map;
>> +
>> +    /* IOVA address to guest memory maps. */
>> +    IOVATree *iova_gpa_map;
>>   };
>>
>>   /**
>> - * Create a new IOVA tree
>> + * Create a new VhostIOVATree
>>    *
>> - * Returns the new IOVA tree
>> + * Returns the new VhostIOVATree
>>    */
>>   VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>>   {
>> @@ -44,6 +47,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>>       tree->iova_last = iova_last;
>>
>>       tree->iova_taddr_map = iova_tree_new();
>> +    tree->iova_gpa_map = iova_tree_new();
>>       return tree;
>>   }
>>
>> @@ -53,6 +57,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>>   void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
>>   {
>>       iova_tree_destroy(iova_tree->iova_taddr_map);
>> +    iova_tree_destroy(iova_tree->iova_gpa_map);
>>       g_free(iova_tree);
>>   }
>>
>> @@ -71,7 +76,7 @@ const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *tree,
>>   }
>>
>>   /**
>> - * Allocate a new mapping
>> + * Allocate a new mapping in the IOVA->HVA tree
>>    *
>>    * @tree: The iova tree
>>    * @map: The iova map
>> @@ -108,3 +113,70 @@ void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map)
>>   {
>>       iova_tree_remove(iova_tree->iova_taddr_map, map);
>>   }
>> +
>> +/**
>> + * Find the IOVA address stored from a guest memory address
>> + *
>> + * @tree: The VhostIOVATree
>> + * @map: The map with the guest memory address
>> + *
>> + * Return the stored mapping, or NULL if not found.
>> + */
>> +const DMAMap *vhost_iova_gpa_tree_find_iova(const VhostIOVATree *tree,
>> +                                            const DMAMap *map)
> 
> Nit: Not an english native, but I find vhost_iova_tree should not be
> broken for coherency with the rest of the functions. What about
> vhost_iova_tree_find_iova_gpa, like _gpa variant?
> 

Yea, I totally understand what you mean here and I have *no problem* 
making it into vhost_iova_tree_find_iova_gpa.

Just to add my two cents on this, for what it's worth, now that we have 
both an IOVA->HVA tree and a IOVA->GPA tree, coming up with function 
names that operate on this new tree while conforming to the 
vhost_iova_tree convention and being descriptive in the naming is a bit 
difficult.

For example, to me, vhost_iova_tree_find_iova_gpa would seem a bit 
misleading to me if I didn't know about it beforehand (or was just 
seeing it for the first time). Like, are we finding the IOVA or GPA or 
both? And what tree are we operating on?

If this was some personal code I was writing and I had free reign over 
it, I personally would go with a format like:

vhost_<tree this function concerns>_tree_<action>

So a name like vhost_iova_gpa_tree_find_iova communicates to me that 
we're operating on the iova_gpa (IOVA->GPA) tree and our action is to 
find_iova (find the IOVA).

Similarly for something like vhost_iova_gpa_tree_remove or 
vhost_iova_hva_tree_remove, etc.

But obviously this is the complete opposite of personal code and 
certainly not something that's needed so I'm totally okay with renaming 
it to vhost_iova_tree_find_iova_gpa :)

>> +{
>> +    return iova_tree_find_iova(tree->iova_gpa_map, map);
>> +}
>> +
>> +/**
>> + * Allocate new mappings in the IOVA->HVA & IOVA->GPA trees
>> + *
>> + * @tree: The VhostIOVATree
>> + * @map: The iova map
>> + * @gpa: The guest physical address (GPA)
>> + *
>> + * Returns:
>> + * - IOVA_OK if the map fits both containers
>> + * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
>> + * - IOVA_ERR_NOMEM if the IOVA->HVA tree cannot allocate more space
>> + *
>> + * It returns an assigned iova in map->iova if return value is IOVA_OK.
>> + */
>> +int vhost_iova_tree_map_alloc_gpa(VhostIOVATree *tree, DMAMap *map, hwaddr gpa)
>> +{
>> +    int ret;
>> +
>> +    /* Some vhost devices don't like addr 0. Skip first page */
>> +    hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size();
>> +
>> +    if (map->translated_addr + map->size < map->translated_addr ||
>> +        map->perm == IOMMU_NONE) {
>> +        return IOVA_ERR_INVALID;
>> +    }
>> +
>> +    /* Allocate a node in the IOVA->HVA tree */
>> +    ret = iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
>> +                              tree->iova_last);
> 
> Why not call vhost_iova_tree_map_alloc instead of duplicating it here?
> 

Great question with no good answer! For some reason I thought against 
putting it in there but will do that in the next series.

>> +    if (unlikely(ret != IOVA_OK)) {
>> +        return ret;
>> +    }
>> +
>> +    /* Insert a node in the IOVA->GPA tree */
>> +    map->translated_addr = gpa;
>> +    return iova_tree_insert(tree->iova_gpa_map, map);
>> +}
>> +
>> +/**
>> + * Remove existing mappings from the IOVA->HVA & IOVA->GPA trees
>> + *
>> + * @iova_tree: The VhostIOVATree
>> + * @map: The map to remove
>> + */
>> +void vhost_iova_tree_remove_gpa(VhostIOVATree *iova_tree, DMAMap map)
>> +{
>> +    /* Remove the existing mapping from the IOVA->GPA tree */
>> +    iova_tree_remove(iova_tree->iova_gpa_map, map);
>> +
>> +    /* Remove the corresponding mapping from the IOVA->HVA tree */
>> +    iova_tree_remove(iova_tree->iova_taddr_map, map);
> 
> If we remove it blindly from both trees, we are keeping the bug, isn't it?
> 
> I think the remove should receive the "gpa" as a parameter, same as
> alloc_gpa. After that, vhost_iova_tree_remove_gpa looks the right iova
> into iova_gpa_map. And only after that, it removes that iova from
> iova_tree_remove.
> 
> If it makes things easier it could receive (hwaddr gpa, size_t len) or
> all of the info in a DMAMap. What do you think?
> 

Initially that was my plan but this only gets called in 
vhost_vdpa_listener_region_add/del and in both functions, by the time 
this removal function is called, we already have the correct IOVA.

In vhost_vdpa_listener_region_add, we just allocated that IOVA and saved 
it in mem_region. In vhost_vdpa_listener_region_del, we already found 
the IOVA via vhost_iova_gpa_tree_find_iova prior to calling the removal 
function.

But I could be misunderstanding something here. Let me know if I am.

>> +}
>> diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
>> index 4adfd79ff0..511c6d18ae 100644
>> --- a/hw/virtio/vhost-iova-tree.h
>> +++ b/hw/virtio/vhost-iova-tree.h
>> @@ -23,5 +23,10 @@ const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
>>                                           const DMAMap *map);
>>   int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
>>   void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map);
>> +const DMAMap *vhost_iova_gpa_tree_find_iova(const VhostIOVATree *iova_tree,
>> +                                            const DMAMap *map);
>> +int vhost_iova_tree_map_alloc_gpa(VhostIOVATree *iova_tree, DMAMap *map,
>> +                                  hwaddr gpa);
>> +void vhost_iova_tree_remove_gpa(VhostIOVATree *iova_tree, DMAMap map);
>>
>>   #endif
>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>> index 3cdaa12ed5..591ff426e7 100644
>> --- a/hw/virtio/vhost-vdpa.c
>> +++ b/hw/virtio/vhost-vdpa.c
>> @@ -365,9 +365,16 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>>           mem_region.size = int128_get64(llsize) - 1,
>>           mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
>>
>> -        r = vhost_iova_tree_map_alloc(s->iova_tree, &mem_region);
>> +        r = vhost_iova_tree_map_alloc_gpa(s->iova_tree, &mem_region,
>> +                                          section->offset_within_address_space);
>>           if (unlikely(r != IOVA_OK)) {
>>               error_report("Can't allocate a mapping (%d)", r);
>> +
>> +            /* Insertion to IOVA->GPA tree failed */
>> +            if (mem_region.translated_addr ==
>> +                section->offset_within_address_space) {
>> +                goto fail_map;
>> +            }
> 
> We can move this cleanup code into vhost_iova_tree_map_alloc_gpa, isn't it?
> 

Sure can. We'd still need to check if vhost_iova_tree_map_alloc_gpa 
returned IOVA_OK though and goto the fail label.
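
With the cleanup moved inside the helper, the caller in
vhost_vdpa_listener_region_add() could then shrink to something like this
(sketch only):

        r = vhost_iova_tree_map_alloc_gpa(s->iova_tree, &mem_region,
                                          section->offset_within_address_space);
        if (unlikely(r != IOVA_OK)) {
            error_report("Can't allocate a mapping (%d)", r);
            goto fail;
        }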

>>               goto fail;
>>           }
>>
>> @@ -386,7 +393,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>>
>>   fail_map:
>>       if (s->shadow_data) {
>> -        vhost_iova_tree_remove(s->iova_tree, mem_region);
>> +        vhost_iova_tree_remove_gpa(s->iova_tree, mem_region);
>>       }
>>
>>   fail:
>> @@ -440,21 +447,18 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
>>
>>       if (s->shadow_data) {
>>           const DMAMap *result;
>> -        const void *vaddr = memory_region_get_ram_ptr(section->mr) +
>> -            section->offset_within_region +
>> -            (iova - section->offset_within_address_space);
>>           DMAMap mem_region = {
>> -            .translated_addr = (hwaddr)(uintptr_t)vaddr,
>> +            .translated_addr = section->offset_within_address_space,
>>               .size = int128_get64(llsize) - 1,
>>           };
>>
>> -        result = vhost_iova_tree_find_iova(s->iova_tree, &mem_region);
>> +        result = vhost_iova_gpa_tree_find_iova(s->iova_tree, &mem_region);
>>           if (!result) {
>>               /* The memory listener map wasn't mapped */
>>               return;
>>           }
>>           iova = result->iova;
>> -        vhost_iova_tree_remove(s->iova_tree, *result);
>> +        vhost_iova_tree_remove_gpa(s->iova_tree, *result);
>>       }
>>       vhost_vdpa_iotlb_batch_begin_once(s);
>>       /*
>> --
>> 2.43.5
>>
> 


Re: [RFC v2 1/2] vhost-vdpa: Implement IOVA->GPA tree
Posted by Eugenio Perez Martin 1 month, 2 weeks ago
On Fri, Oct 4, 2024 at 8:48 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
>
>
> On 10/4/24 11:17 AM, Eugenio Perez Martin wrote:
> > On Fri, Oct 4, 2024 at 2:45 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>
> >> Implements the IOVA->GPA tree for handling mapping, unmapping, and
> >> translations for guest memory regions.
> >>
> >> When the guest has overlapping memory regions, an HVA to IOVA translation
> >> may return an incorrect IOVA when searching the IOVA->HVA tree. This is
> >> due to one HVA range being contained (overlapping) in another HVA range
> >> in the IOVA->HVA tree. By creating an IOVA->GPA tree, we can use GPAs to
> >> translate and find the correct IOVA for guest memory regions.
> >>
> >
> > Yes, this first patch is super close to what I meant, just one issue
> > and a pair of nits here and there.
> >
> > I'd leave the second patch as an optimization on top, if the numbers
> > prove that adding the code is worth it.
> >
>
> Ah okay, gotcha. I also wasn't sure if what you mentioned below on the
> previous series you also wanted implemented or if these would also be
> optimizations on top.
>
> [Adding code to the vhost_iova_tree layer for handling multiple buffers
> returned from translation for the memory area where each iovec covers]:
> -----------------------------------------------------------------------
> "Let's say that SVQ wants to translate the HVA range
> 0xfeda0000-0xfedd0000. So it makes available for the device two
> chained buffers: One with addr=0x1000 len=0x20000 and the other one
> with addr=(0x20000c1000 len=0x10000).
>
> The VirtIO device should be able to translate these two buffers in
> isolation and chain them. Not optimal but it helps to keep QEMU source
> clean, as the device already must support it."
>

This is 100% in the device and QEMU is already able to split the
buffers that way, so we don't need any change in QEMU.

> [Adding a permission check to iova_tree_find_address_iterator and match
> the range by permissions]:
> -----------------------------------------------------------------------
> "About the permissions, maybe we can make the permissions to be part of
> the lookup? Instead of returning them at iova_tree_find_iova, make
> them match at iova_tree_find_address_iterator."
>

Ouch, I forgot this part. This is a small change we also need, can you
add it for the next version? Thanks for reminding me!
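
For reference, the kind of match the iterator would need, written as a
standalone predicate (illustrative only, not the actual util/iova-tree.c
code; 'candidate' is a stored entry and 'needle' the lookup key):

static bool dma_map_translated_match(const DMAMap *candidate,
                                     const DMAMap *needle)
{
    /* Ranges must overlap (inclusive .size, as in the rest of the code) */
    if (candidate->translated_addr + candidate->size < needle->translated_addr ||
        needle->translated_addr + needle->size < candidate->translated_addr) {
        return false;
    }

    /*
     * Permission filter: only consider entries that grant everything the
     * lookup asks for; IOMMU_NONE in the needle keeps today's behaviour.
     */
    if (needle->perm != IOMMU_NONE &&
        (candidate->perm & needle->perm) != needle->perm) {
        return false;
    }

    return true;
}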

> But I understand that the problem with this is that we're assuming the
> SVQ translation will always be done in a transient manner.
>
> >> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
> >> ---
> >>   hw/virtio/vhost-iova-tree.c | 78 +++++++++++++++++++++++++++++++++++--
> >>   hw/virtio/vhost-iova-tree.h |  5 +++
> >>   hw/virtio/vhost-vdpa.c      | 20 ++++++----
> >>   3 files changed, 92 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> >> index 3d03395a77..e33fd56225 100644
> >> --- a/hw/virtio/vhost-iova-tree.c
> >> +++ b/hw/virtio/vhost-iova-tree.c
> >> @@ -28,12 +28,15 @@ struct VhostIOVATree {
> >>
> >>       /* IOVA address to qemu memory maps. */
> >>       IOVATree *iova_taddr_map;
> >> +
> >> +    /* IOVA address to guest memory maps. */
> >> +    IOVATree *iova_gpa_map;
> >>   };
> >>
> >>   /**
> >> - * Create a new IOVA tree
> >> + * Create a new VhostIOVATree
> >>    *
> >> - * Returns the new IOVA tree
> >> + * Returns the new VhostIOVATree
> >>    */
> >>   VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
> >>   {
> >> @@ -44,6 +47,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
> >>       tree->iova_last = iova_last;
> >>
> >>       tree->iova_taddr_map = iova_tree_new();
> >> +    tree->iova_gpa_map = iova_tree_new();
> >>       return tree;
> >>   }
> >>
> >> @@ -53,6 +57,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
> >>   void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
> >>   {
> >>       iova_tree_destroy(iova_tree->iova_taddr_map);
> >> +    iova_tree_destroy(iova_tree->iova_gpa_map);
> >>       g_free(iova_tree);
> >>   }
> >>
> >> @@ -71,7 +76,7 @@ const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *tree,
> >>   }
> >>
> >>   /**
> >> - * Allocate a new mapping
> >> + * Allocate a new mapping in the IOVA->HVA tree
> >>    *
> >>    * @tree: The iova tree
> >>    * @map: The iova map
> >> @@ -108,3 +113,70 @@ void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map)
> >>   {
> >>       iova_tree_remove(iova_tree->iova_taddr_map, map);
> >>   }
> >> +
> >> +/**
> >> + * Find the IOVA address stored from a guest memory address
> >> + *
> >> + * @tree: The VhostIOVATree
> >> + * @map: The map with the guest memory address
> >> + *
> >> + * Return the stored mapping, or NULL if not found.
> >> + */
> >> +const DMAMap *vhost_iova_gpa_tree_find_iova(const VhostIOVATree *tree,
> >> +                                            const DMAMap *map)
> >
> > Nit: Not an english native, but I find vhost_iova_tree should not be
> > broken for coherency with the rest of the functions. What about
> > vhost_iova_tree_find_iova_gpa, like _gpa variant?
> >
>
> Yea, I totally understand what you mean here and I have *no problem*
> making it into vhost_iova_tree_find_iova_gpa.
>
> Just to add my two cents on this, for what it's worth, now that we have
> both an IOVA->HVA tree and a IOVA->GPA tree, coming up with function
> names that operate on this new tree while conforming to the
> vhost_iova_tree convention and being descriptive in the naming is a bit
> difficult.
>
> For example, to me, vhost_iova_tree_find_iova_gpa would seem a bit
> misleading to me if I didn't know about it beforehand (or was just
> seeing it for the first time). Like, are we finding the IOVA or GPA or
> both? And what tree are we operating on?
>
> If this was some personal code I was writing and I had free reign over
> it, I personally would go with a format like:
>
> vhost_<tree this function concerns>_tree_<action>
>
> So a name like vhost_iova_gpa_tree_find_iova communicates to me that
> we're operating on the iova_gpa (IOVA->GPA) tree and our action is to
> find_iova (find the IOVA).
>
> Similarly for something like vhost_iova_gpa_tree_remove or
> vhost_iova_hva_tree_remove, etc.
>
> But obviously this is the complete opposite of personal code and
> certainly not something that's needed so I'm totally okay with renaming
> it to vhost_iova_tree_find_iova_gpa :)
>

You're creating the patch and you (and everybody) can comment on it,
of course :).

To me, the fact that GPA is stored in a separated *tree* should be an
implementation detail not exposed by the function names. Ideally, the
user of the VhostIOVATree just knows that some entries are tagged with
the GPA and others are not. Saying otherwise, we could replace the GPA
for another struct, or merge both trees, and the API would remain
unchanged if we just add the _gpa as suffix.

Having said that, I'm happy with both names so feel free to keep yours.



> >> +{
> >> +    return iova_tree_find_iova(tree->iova_gpa_map, map);
> >> +}
> >> +
> >> +/**
> >> + * Allocate new mappings in the IOVA->HVA & IOVA->GPA trees
> >> + *
> >> + * @tree: The VhostIOVATree
> >> + * @map: The iova map
> >> + * @gpa: The guest physical address (GPA)
> >> + *
> >> + * Returns:
> >> + * - IOVA_OK if the map fits both containers
> >> + * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
> >> + * - IOVA_ERR_NOMEM if the IOVA->HVA tree cannot allocate more space
> >> + *
> >> + * It returns an assigned iova in map->iova if return value is IOVA_OK.
> >> + */
> >> +int vhost_iova_tree_map_alloc_gpa(VhostIOVATree *tree, DMAMap *map, hwaddr gpa)
> >> +{
> >> +    int ret;
> >> +
> >> +    /* Some vhost devices don't like addr 0. Skip first page */
> >> +    hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size();
> >> +
> >> +    if (map->translated_addr + map->size < map->translated_addr ||
> >> +        map->perm == IOMMU_NONE) {
> >> +        return IOVA_ERR_INVALID;
> >> +    }
> >> +
> >> +    /* Allocate a node in the IOVA->HVA tree */
> >> +    ret = iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
> >> +                              tree->iova_last);
> >
> > Why not call vhost_iova_tree_map_alloc instead of duplicating it here?
> >
>
> Great question with no good answer! For some reason I thought against
> putting it in there but will do that in the next series.
>
> >> +    if (unlikely(ret != IOVA_OK)) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    /* Insert a node in the IOVA->GPA tree */
> >> +    map->translated_addr = gpa;
> >> +    return iova_tree_insert(tree->iova_gpa_map, map);
> >> +}
> >> +
> >> +/**
> >> + * Remove existing mappings from the IOVA->HVA & IOVA->GPA trees
> >> + *
> >> + * @iova_tree: The VhostIOVATree
> >> + * @map: The map to remove
> >> + */
> >> +void vhost_iova_tree_remove_gpa(VhostIOVATree *iova_tree, DMAMap map)
> >> +{
> >> +    /* Remove the existing mapping from the IOVA->GPA tree */
> >> +    iova_tree_remove(iova_tree->iova_gpa_map, map);
> >> +
> >> +    /* Remove the corresponding mapping from the IOVA->HVA tree */
> >> +    iova_tree_remove(iova_tree->iova_taddr_map, map);
> >
> > If we remove it blindly from both trees, we are keeping the bug, isn't it?
> >
> > I think the remove should receive the "gpa" as a parameter, same as
> > alloc_gpa. After that, vhost_iova_tree_remove_gpa looks the right iova
> > into iova_gpa_map. And only after that, it removes that iova from
> > iova_tree_remove.
> >
> > If it makes things easier it could receive (hwaddr gpa, size_t len) or
> > all of the info in a DMAMap. What do you think?
> >
>
> Initially that was my plan but this only gets called in
> vhost_vdpa_listener_region_add/del and in both functions, by the time
> this removal function is called, we already have the correct IOVA.
>
> In vhost_vdpa_listener_region_add, we just allocated that IOVA and saved
> it in mem_region. In vhost_vdpa_listener_region_del, we already found
> the IOVA via vhost_iova_gpa_tree_find_iova prior to calling the removal
> function.
>
> But I could be misunderstanding something here. Let me know if I am.
>

Ok I missed that. I think you're totally right.

> >> +}
> >> diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
> >> index 4adfd79ff0..511c6d18ae 100644
> >> --- a/hw/virtio/vhost-iova-tree.h
> >> +++ b/hw/virtio/vhost-iova-tree.h
> >> @@ -23,5 +23,10 @@ const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
> >>                                           const DMAMap *map);
> >>   int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
> >>   void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map);
> >> +const DMAMap *vhost_iova_gpa_tree_find_iova(const VhostIOVATree *iova_tree,
> >> +                                            const DMAMap *map);
> >> +int vhost_iova_tree_map_alloc_gpa(VhostIOVATree *iova_tree, DMAMap *map,
> >> +                                  hwaddr gpa);
> >> +void vhost_iova_tree_remove_gpa(VhostIOVATree *iova_tree, DMAMap map);
> >>
> >>   #endif
> >> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> >> index 3cdaa12ed5..591ff426e7 100644
> >> --- a/hw/virtio/vhost-vdpa.c
> >> +++ b/hw/virtio/vhost-vdpa.c
> >> @@ -365,9 +365,16 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> >>           mem_region.size = int128_get64(llsize) - 1,
> >>           mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> >>
> >> -        r = vhost_iova_tree_map_alloc(s->iova_tree, &mem_region);
> >> +        r = vhost_iova_tree_map_alloc_gpa(s->iova_tree, &mem_region,
> >> +                                          section->offset_within_address_space);
> >>           if (unlikely(r != IOVA_OK)) {
> >>               error_report("Can't allocate a mapping (%d)", r);
> >> +
> >> +            /* Insertion to IOVA->GPA tree failed */
> >> +            if (mem_region.translated_addr ==
> >> +                section->offset_within_address_space) {
> >> +                goto fail_map;
> >> +            }
> >
> > We can move this cleanup code into vhost_iova_tree_map_alloc_gpa, isn't it?
> >
>
> Sure can. We'd still need to check if vhost_iova_tree_map_alloc_gpa
> returned IOVA_OK though and goto the fail label.
>

Yes, right.

Thanks!

> >>               goto fail;
> >>           }
> >>
> >> @@ -386,7 +393,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> >>
> >>   fail_map:
> >>       if (s->shadow_data) {
> >> -        vhost_iova_tree_remove(s->iova_tree, mem_region);
> >> +        vhost_iova_tree_remove_gpa(s->iova_tree, mem_region);
> >>       }
> >>
> >>   fail:
> >> @@ -440,21 +447,18 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
> >>
> >>       if (s->shadow_data) {
> >>           const DMAMap *result;
> >> -        const void *vaddr = memory_region_get_ram_ptr(section->mr) +
> >> -            section->offset_within_region +
> >> -            (iova - section->offset_within_address_space);
> >>           DMAMap mem_region = {
> >> -            .translated_addr = (hwaddr)(uintptr_t)vaddr,
> >> +            .translated_addr = section->offset_within_address_space,
> >>               .size = int128_get64(llsize) - 1,
> >>           };
> >>
> >> -        result = vhost_iova_tree_find_iova(s->iova_tree, &mem_region);
> >> +        result = vhost_iova_gpa_tree_find_iova(s->iova_tree, &mem_region);
> >>           if (!result) {
> >>               /* The memory listener map wasn't mapped */
> >>               return;
> >>           }
> >>           iova = result->iova;
> >> -        vhost_iova_tree_remove(s->iova_tree, *result);
> >> +        vhost_iova_tree_remove_gpa(s->iova_tree, *result);
> >>       }
> >>       vhost_vdpa_iotlb_batch_begin_once(s);
> >>       /*
> >> --
> >> 2.43.5
> >>
> >
>
Re: [RFC v2 1/2] vhost-vdpa: Implement IOVA->GPA tree
Posted by Si-Wei Liu 1 month, 2 weeks ago

On 10/7/2024 6:51 AM, Eugenio Perez Martin wrote:
> On Fri, Oct 4, 2024 at 8:48 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>
>>
>> On 10/4/24 11:17 AM, Eugenio Perez Martin wrote:
>>> On Fri, Oct 4, 2024 at 2:45 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>> Implements the IOVA->GPA tree for handling mapping, unmapping, and
>>>> translations for guest memory regions.
>>>>
>>>> When the guest has overlapping memory regions, an HVA to IOVA translation
>>>> may return an incorrect IOVA when searching the IOVA->HVA tree. This is
>>>> due to one HVA range being contained (overlapping) in another HVA range
>>>> in the IOVA->HVA tree. By creating an IOVA->GPA tree, we can use GPAs to
>>>> translate and find the correct IOVA for guest memory regions.
>>>>
>>> Yes, this first patch is super close to what I meant, just one issue
>>> and a pair of nits here and there.
>>>
>>> I'd leave the second patch as an optimization on top, if the numbers
>>> prove that adding the code is worth it.
>>>
>> Ah okay, gotcha. I also wasn't sure if what you mentioned below on the
>> previous series you also wanted implemented or if these would also be
>> optimizations on top.
>>
>> [Adding code to the vhost_iova_tree layer for handling multiple buffers
>> returned from translation for the memory area where each iovec covers]:
>> -----------------------------------------------------------------------
>> "Let's say that SVQ wants to translate the HVA range
>> 0xfeda0000-0xfedd0000. So it makes available for the device two
>> chained buffers: One with addr=0x1000 len=0x20000 and the other one
>> with addr=(0x20000c1000 len=0x10000).
>>
>> The VirtIO device should be able to translate these two buffers in
>> isolation and chain them. Not optimal but it helps to keep QEMU source
>> clean, as the device already must support it."
>>
> This is 100% in the device and QEMU is already able to split the
> buffers that way, so we don't need any change in QEMU.
Note that when working with the full HVA tree directly, the internal 
linear iterator iova_tree_find_address_iterator() today doesn't 
guarantee that the IOVA range it returns covers the entire length of 
the iov. An aliased range smaller than the requested one may be hit 
first in the linear search and returned, and that kind of fragmentation 
can't be guarded against by the VirtIO device or the DMA API mentioned 
above.

The second patch in this series mitigates this side effect by looking 
up the backing ram_block with qemu_ram_block_from_host() when the iov 
is backed by guest memory, so it doesn't really rely on 
vhost_iova_gpa_tree_find_iova() to find the matching IOVA; instead it 
(somewhat implicitly) ensures the fragmentation side effect described 
above can never happen. I'm not saying I like the way it is 
implemented, but I wanted to point out the implication if the second 
patch has to be removed - either add special handling code to the 
iova-tree iterator to size the range (the same way range selection 
based upon permission will be done), or add special code in the SVQ 
layer to deal with IOVA ranges fragmented by aliasing.
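
To make the second option concrete, the SVQ side would need a chunking
loop along these lines (a hypothetical helper, not existing code; writing
the translated IOVA back into iov_base is an assumption about the SVQ
convention here):

/* Translate one HVA range that several IOVA mappings may back */
static int svq_translate_hva_range(const VhostIOVATree *tree, hwaddr hva,
                                   size_t len, struct iovec *iov, int max_iov)
{
    int n = 0;

    while (len > 0) {
        const DMAMap needle = { .translated_addr = hva, .size = 0 };
        const DMAMap *map = vhost_iova_tree_find_iova(tree, &needle);
        size_t chunk;

        if (!map || n == max_iov) {
            return -1;            /* unmapped hole or too fragmented */
        }

        /* Bytes this mapping covers starting at 'hva' (.size is inclusive) */
        chunk = MIN(len, map->translated_addr + map->size + 1 - hva);
        iov[n].iov_base = (void *)(uintptr_t)(map->iova +
                                              (hva - map->translated_addr));
        iov[n].iov_len = chunk;
        n++;

        hva += chunk;
        len -= chunk;
    }

    return n;
}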


Regards,
-Siwei

>
>> [Adding a permission check to iova_tree_find_address_iterator and match
>> the range by permissions]:
>> -----------------------------------------------------------------------
>> "About the permissions, maybe we can make the permissions to be part of
>> the lookup? Instead of returning them at iova_tree_find_iova, make
>> them match at iova_tree_find_address_iterator."
>>
> Ouch, I forgot this part. This is a small change we also need, can you
> add it for the next version? Thanks for reminding me!
>
>> But I understand that the problem with this is that we're assuming the
>> SVQ translation will always be done in a transient manner.
>>
>>>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
>>>> ---
>>>>    hw/virtio/vhost-iova-tree.c | 78 +++++++++++++++++++++++++++++++++++--
>>>>    hw/virtio/vhost-iova-tree.h |  5 +++
>>>>    hw/virtio/vhost-vdpa.c      | 20 ++++++----
>>>>    3 files changed, 92 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
>>>> index 3d03395a77..e33fd56225 100644
>>>> --- a/hw/virtio/vhost-iova-tree.c
>>>> +++ b/hw/virtio/vhost-iova-tree.c
>>>> @@ -28,12 +28,15 @@ struct VhostIOVATree {
>>>>
>>>>        /* IOVA address to qemu memory maps. */
>>>>        IOVATree *iova_taddr_map;
>>>> +
>>>> +    /* IOVA address to guest memory maps. */
>>>> +    IOVATree *iova_gpa_map;
>>>>    };
>>>>
>>>>    /**
>>>> - * Create a new IOVA tree
>>>> + * Create a new VhostIOVATree
>>>>     *
>>>> - * Returns the new IOVA tree
>>>> + * Returns the new VhostIOVATree
>>>>     */
>>>>    VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>>>>    {
>>>> @@ -44,6 +47,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>>>>        tree->iova_last = iova_last;
>>>>
>>>>        tree->iova_taddr_map = iova_tree_new();
>>>> +    tree->iova_gpa_map = iova_tree_new();
>>>>        return tree;
>>>>    }
>>>>
>>>> @@ -53,6 +57,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>>>>    void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
>>>>    {
>>>>        iova_tree_destroy(iova_tree->iova_taddr_map);
>>>> +    iova_tree_destroy(iova_tree->iova_gpa_map);
>>>>        g_free(iova_tree);
>>>>    }
>>>>
>>>> @@ -71,7 +76,7 @@ const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *tree,
>>>>    }
>>>>
>>>>    /**
>>>> - * Allocate a new mapping
>>>> + * Allocate a new mapping in the IOVA->HVA tree
>>>>     *
>>>>     * @tree: The iova tree
>>>>     * @map: The iova map
>>>> @@ -108,3 +113,70 @@ void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map)
>>>>    {
>>>>        iova_tree_remove(iova_tree->iova_taddr_map, map);
>>>>    }
>>>> +
>>>> +/**
>>>> + * Find the IOVA address stored from a guest memory address
>>>> + *
>>>> + * @tree: The VhostIOVATree
>>>> + * @map: The map with the guest memory address
>>>> + *
>>>> + * Return the stored mapping, or NULL if not found.
>>>> + */
>>>> +const DMAMap *vhost_iova_gpa_tree_find_iova(const VhostIOVATree *tree,
>>>> +                                            const DMAMap *map)
>>> Nit: Not an english native, but I find vhost_iova_tree should not be
>>> broken for coherency with the rest of the functions. What about
>>> vhost_iova_tree_find_iova_gpa, like _gpa variant?
>>>
>> Yea, I totally understand what you mean here and I have *no problem*
>> making it into vhost_iova_tree_find_iova_gpa.
>>
>> Just to add my two cents on this, for what it's worth, now that we have
>> both an IOVA->HVA tree and a IOVA->GPA tree, coming up with function
>> names that operate on this new tree while conforming to the
>> vhost_iova_tree convention and being descriptive in the naming is a bit
>> difficult.
>>
>> For example, to me, vhost_iova_tree_find_iova_gpa would seem a bit
>> misleading to me if I didn't know about it beforehand (or was just
>> seeing it for the first time). Like, are we finding the IOVA or GPA or
>> both? And what tree are we operating on?
>>
>> If this was some personal code I was writing and I had free reign over
>> it, I personally would go with a format like:
>>
>> vhost_<tree this function concerns>_tree_<action>
>>
>> So a name like vhost_iova_gpa_tree_find_iova communicates to me that
>> we're operating on the iova_gpa (IOVA->GPA) tree and our action is to
>> find_iova (find the IOVA).
>>
>> Similarly for something like vhost_iova_gpa_tree_remove or
>> vhost_iova_hva_tree_remove, etc.
>>
>> But obviously this is the complete opposite of personal code and
>> certainly not something that's needed so I'm totally okay with renaming
>> it to vhost_iova_tree_find_iova_gpa :)
>>
> You're creating the patch and you (and everybody) can comment on it,
> of course :).
>
> To me, the fact that GPA is stored in a separated *tree* should be an
> implementation detail not exposed by the function names. Ideally, the
> user of the VhostIOVATree just knows that some entries are tagged with
> the GPA and others are not. Saying otherwise, we could replace the GPA
> for another struct, or merge both trees, and the API would remain
> unchanged if we just add the _gpa as suffix.
>
> Having said that, I'm happy with both names so feel free to keep yours.
>
>
>
>>>> +{
>>>> +    return iova_tree_find_iova(tree->iova_gpa_map, map);
>>>> +}
>>>> +
>>>> +/**
>>>> + * Allocate new mappings in the IOVA->HVA & IOVA->GPA trees
>>>> + *
>>>> + * @tree: The VhostIOVATree
>>>> + * @map: The iova map
>>>> + * @gpa: The guest physical address (GPA)
>>>> + *
>>>> + * Returns:
>>>> + * - IOVA_OK if the map fits both containers
>>>> + * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
>>>> + * - IOVA_ERR_NOMEM if the IOVA->HVA tree cannot allocate more space
>>>> + *
>>>> + * It returns an assigned iova in map->iova if return value is IOVA_OK.
>>>> + */
>>>> +int vhost_iova_tree_map_alloc_gpa(VhostIOVATree *tree, DMAMap *map, hwaddr gpa)
>>>> +{
>>>> +    int ret;
>>>> +
>>>> +    /* Some vhost devices don't like addr 0. Skip first page */
>>>> +    hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size();
>>>> +
>>>> +    if (map->translated_addr + map->size < map->translated_addr ||
>>>> +        map->perm == IOMMU_NONE) {
>>>> +        return IOVA_ERR_INVALID;
>>>> +    }
>>>> +
>>>> +    /* Allocate a node in the IOVA->HVA tree */
>>>> +    ret = iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
>>>> +                              tree->iova_last);
>>> Why not call vhost_iova_tree_map_alloc instead of duplicating it here?
>>>
>> Great question with no good answer! For some reason I thought against
>> putting it in there but will do that in the next series.
>>
>>>> +    if (unlikely(ret != IOVA_OK)) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    /* Insert a node in the IOVA->GPA tree */
>>>> +    map->translated_addr = gpa;
>>>> +    return iova_tree_insert(tree->iova_gpa_map, map);
>>>> +}
>>>> +
>>>> +/**
>>>> + * Remove existing mappings from the IOVA->HVA & IOVA->GPA trees
>>>> + *
>>>> + * @iova_tree: The VhostIOVATree
>>>> + * @map: The map to remove
>>>> + */
>>>> +void vhost_iova_tree_remove_gpa(VhostIOVATree *iova_tree, DMAMap map)
>>>> +{
>>>> +    /* Remove the existing mapping from the IOVA->GPA tree */
>>>> +    iova_tree_remove(iova_tree->iova_gpa_map, map);
>>>> +
>>>> +    /* Remove the corresponding mapping from the IOVA->HVA tree */
>>>> +    iova_tree_remove(iova_tree->iova_taddr_map, map);
>>> If we remove it blindly from both trees, we are keeping the bug, isn't it?
>>>
>>> I think the remove should receive the "gpa" as a parameter, same as
>>> alloc_gpa. After that, vhost_iova_tree_remove_gpa looks the right iova
>>> into iova_gpa_map. And only after that, it removes that iova from
>>> iova_tree_remove.
>>>
>>> If it makes things easier it could receive (hwaddr gpa, size_t len) or
>>> all of the info in a DMAMap. What do you think?
>>>
>> Initially that was my plan but this only gets called in
>> vhost_vdpa_listener_region_add/del and in both functions, by the time
>> this removal function is called, we already have the correct IOVA.
>>
>> In vhost_vdpa_listener_region_add, we just allocated that IOVA and saved
>> it in mem_region. In vhost_vdpa_listener_region_del, we already found
>> the IOVA via vhost_iova_gpa_tree_find_iova prior to calling the removal
>> function.
>>
>> But I could be misunderstanding something here. Let me know if I am.
>>
> Ok I missed that. I think you're totally right.
>
>>>> +}
>>>> diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
>>>> index 4adfd79ff0..511c6d18ae 100644
>>>> --- a/hw/virtio/vhost-iova-tree.h
>>>> +++ b/hw/virtio/vhost-iova-tree.h
>>>> @@ -23,5 +23,10 @@ const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
>>>>                                            const DMAMap *map);
>>>>    int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
>>>>    void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map);
>>>> +const DMAMap *vhost_iova_gpa_tree_find_iova(const VhostIOVATree *iova_tree,
>>>> +                                            const DMAMap *map);
>>>> +int vhost_iova_tree_map_alloc_gpa(VhostIOVATree *iova_tree, DMAMap *map,
>>>> +                                  hwaddr gpa);
>>>> +void vhost_iova_tree_remove_gpa(VhostIOVATree *iova_tree, DMAMap map);
>>>>
>>>>    #endif
>>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>>> index 3cdaa12ed5..591ff426e7 100644
>>>> --- a/hw/virtio/vhost-vdpa.c
>>>> +++ b/hw/virtio/vhost-vdpa.c
>>>> @@ -365,9 +365,16 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>>>>            mem_region.size = int128_get64(llsize) - 1,
>>>>            mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
>>>>
>>>> -        r = vhost_iova_tree_map_alloc(s->iova_tree, &mem_region);
>>>> +        r = vhost_iova_tree_map_alloc_gpa(s->iova_tree, &mem_region,
>>>> +                                          section->offset_within_address_space);
>>>>            if (unlikely(r != IOVA_OK)) {
>>>>                error_report("Can't allocate a mapping (%d)", r);
>>>> +
>>>> +            /* Insertion to IOVA->GPA tree failed */
>>>> +            if (mem_region.translated_addr ==
>>>> +                section->offset_within_address_space) {
>>>> +                goto fail_map;
>>>> +            }
>>> We can move this cleanup code into vhost_iova_tree_map_alloc_gpa, isn't it?
>>>
>> Sure can. We'd still need to check if vhost_iova_tree_map_alloc_gpa
>> returned IOVA_OK though and goto the fail label.
>>
> Yes, right.
>
> Thanks!
>
>>>>                goto fail;
>>>>            }
>>>>
>>>> @@ -386,7 +393,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>>>>
>>>>    fail_map:
>>>>        if (s->shadow_data) {
>>>> -        vhost_iova_tree_remove(s->iova_tree, mem_region);
>>>> +        vhost_iova_tree_remove_gpa(s->iova_tree, mem_region);
>>>>        }
>>>>
>>>>    fail:
>>>> @@ -440,21 +447,18 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
>>>>
>>>>        if (s->shadow_data) {
>>>>            const DMAMap *result;
>>>> -        const void *vaddr = memory_region_get_ram_ptr(section->mr) +
>>>> -            section->offset_within_region +
>>>> -            (iova - section->offset_within_address_space);
>>>>            DMAMap mem_region = {
>>>> -            .translated_addr = (hwaddr)(uintptr_t)vaddr,
>>>> +            .translated_addr = section->offset_within_address_space,
>>>>                .size = int128_get64(llsize) - 1,
>>>>            };
>>>>
>>>> -        result = vhost_iova_tree_find_iova(s->iova_tree, &mem_region);
>>>> +        result = vhost_iova_gpa_tree_find_iova(s->iova_tree, &mem_region);
>>>>            if (!result) {
>>>>                /* The memory listener map wasn't mapped */
>>>>                return;
>>>>            }
>>>>            iova = result->iova;
>>>> -        vhost_iova_tree_remove(s->iova_tree, *result);
>>>> +        vhost_iova_tree_remove_gpa(s->iova_tree, *result);
>>>>        }
>>>>        vhost_vdpa_iotlb_batch_begin_once(s);
>>>>        /*
>>>> --
>>>> 2.43.5
>>>>
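
A rough sketch of the cleanup discussed above - folding the rollback into
vhost_iova_tree_map_alloc_gpa() so the caller keeps a single failure path -
could look like the following. The helper name and structure are assumptions
for illustration, built on the existing iova_tree_insert()/iova_tree_remove()
API, and are not the patch's final code:

    /*
     * Illustrative sketch only. Assumes map->iova has already been
     * allocated in the IOVA->HVA tree; on a failed IOVA->GPA insertion
     * it rolls that allocation back, so callers are left with a single
     * failure path.
     */
    static int vhost_iova_tree_insert_gpa(VhostIOVATree *tree,
                                          const DMAMap *map, hwaddr gpa)
    {
        DMAMap gpa_map = {
            .iova = map->iova,
            .translated_addr = gpa,
            .size = map->size,
            .perm = map->perm,
        };
        int ret = iova_tree_insert(tree->iova_gpa_map, &gpa_map);

        if (unlikely(ret != IOVA_OK)) {
            /* Keep both trees consistent: undo the IOVA->HVA allocation */
            iova_tree_remove(tree->iova_taddr_map, *map);
        }

        return ret;
    }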


Re: [RFC v2 1/2] vhost-vdpa: Implement IOVA->GPA tree
Posted by Eugenio Perez Martin 1 month, 2 weeks ago
On Tue, Oct 8, 2024 at 2:14 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 10/7/2024 6:51 AM, Eugenio Perez Martin wrote:
> > On Fri, Oct 4, 2024 at 8:48 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>
> >>
> >> On 10/4/24 11:17 AM, Eugenio Perez Martin wrote:
> >>> On Fri, Oct 4, 2024 at 2:45 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>>> Implements the IOVA->GPA tree for handling mapping, unmapping, and
> >>>> translations for guest memory regions.
> >>>>
> >>>> When the guest has overlapping memory regions, an HVA to IOVA translation
> >>>> may return an incorrect IOVA when searching the IOVA->HVA tree. This is
> >>>> due to one HVA range being contained (overlapping) in another HVA range
> >>>> in the IOVA->HVA tree. By creating an IOVA->GPA tree, we can use GPAs to
> >>>> translate and find the correct IOVA for guest memory regions.
> >>>>
> >>> Yes, this first patch is super close to what I meant, just one issue
> >>> and a pair of nits here and there.
> >>>
> >>> I'd leave the second patch as an optimization on top, if the numbers
> >>> prove that adding the code is worth it.
> >>>
> >> Ah okay, gotcha. I also wasn't sure if what you mentioned below on the
> >> previous series you also wanted implemented or if these would also be
> >> optimizations on top.
> >>
> >> [Adding code to the vhost_iova_tree layer for handling multiple buffers
> >> returned from translation for the memory area where each iovec covers]:
> >> -----------------------------------------------------------------------
> >> "Let's say that SVQ wants to translate the HVA range
> >> 0xfeda0000-0xfedd0000. So it makes available for the device two
> >> chained buffers: One with addr=0x1000 len=0x20000 and the other one
> >> with addr=(0x20000c1000 len=0x10000).
> >>
> >> The VirtIO device should be able to translate these two buffers in
> >> isolation and chain them. Not optimal but it helps to keep QEMU source
> >> clean, as the device already must support it."
> >>
> > This is 100% in the device and QEMU is already able to split the
> > buffers that way, so we don't need any change in QEMU.
> Noted that if working with the full HVA tree directly, the internal iova
> tree linear iterator iova_tree_find_address_iterator() today doesn't
> guarantee the iova range returned can cover the entire length of the
> iov, so things could happen like that the aliased range with smaller
> size (than the requested) happens to be hit first in the linear search
> and be returned, the fragmentation of which can't be guarded against by
> the VirtIO device or the DMA API mentioned above.
>
> The second patch in this series kind of mitigated this side effect by
> sorting out the backing ram_block with the help of
> qemu_ram_block_from_host() in case of guest memory backed iov, so it
> doesn't really count on vhost_iova_gpa_tree_find_iova() to find the
> matching IOVA, but instead (somehow implicitly) avoids the fragmentation
> side effect as mentioned above would never happen. Not saying I like the
> way how it is implemented, but just wanted to point out the implication
> if the second patch has to be removed - either add special handling code
> to the iova-tree iterator sizing the range (same as how range selection
> based upon permission will be done), or add special code in SVQ layer to
> deal with fragmented IOVA ranges due to aliasing.
>

This special code in SVQ is already there. And it will be needed even
if we look for the buffers by GPA instead of by vaddr, the same way
virtqueue_map_desc needs to handle it even if it works with GPA.
Continuous GPA does not imply continuous vaddr.
Re: [RFC v2 1/2] vhost-vdpa: Implement IOVA->GPA tree
Posted by Jonah Palmer 1 month, 2 weeks ago

On 10/8/24 2:51 AM, Eugenio Perez Martin wrote:
> On Tue, Oct 8, 2024 at 2:14 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>>
>> On 10/7/2024 6:51 AM, Eugenio Perez Martin wrote:
>>> On Fri, Oct 4, 2024 at 8:48 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>
>>>>
>>>> On 10/4/24 11:17 AM, Eugenio Perez Martin wrote:
>>>>> On Fri, Oct 4, 2024 at 2:45 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>>> Implements the IOVA->GPA tree for handling mapping, unmapping, and
>>>>>> translations for guest memory regions.
>>>>>>
>>>>>> When the guest has overlapping memory regions, an HVA to IOVA translation
>>>>>> may return an incorrect IOVA when searching the IOVA->HVA tree. This is
>>>>>> due to one HVA range being contained (overlapping) in another HVA range
>>>>>> in the IOVA->HVA tree. By creating an IOVA->GPA tree, we can use GPAs to
>>>>>> translate and find the correct IOVA for guest memory regions.
>>>>>>
>>>>> Yes, this first patch is super close to what I meant, just one issue
>>>>> and a pair of nits here and there.
>>>>>
>>>>> I'd leave the second patch as an optimization on top, if the numbers
>>>>> prove that adding the code is worth it.
>>>>>
>>>> Ah okay, gotcha. I also wasn't sure if what you mentioned below on the
>>>> previous series you also wanted implemented or if these would also be
>>>> optimizations on top.
>>>>
>>>> [Adding code to the vhost_iova_tree layer for handling multiple buffers
>>>> returned from translation for the memory area where each iovec covers]:
>>>> -----------------------------------------------------------------------
>>>> "Let's say that SVQ wants to translate the HVA range
>>>> 0xfeda0000-0xfedd0000. So it makes available for the device two
>>>> chained buffers: One with addr=0x1000 len=0x20000 and the other one
>>>> with addr=(0x20000c1000 len=0x10000).
>>>>
>>>> The VirtIO device should be able to translate these two buffers in
>>>> isolation and chain them. Not optimal but it helps to keep QEMU source
>>>> clean, as the device already must support it."
>>>>
>>> This is 100% in the device and QEMU is already able to split the
>>> buffers that way, so we don't need any change in QEMU.
>> Noted that if working with the full HVA tree directly, the internal iova
>> tree linear iterator iova_tree_find_address_iterator() today doesn't
>> guarantee the iova range returned can cover the entire length of the
>> iov, so things could happen like that the aliased range with smaller
>> size (than the requested) happens to be hit first in the linear search
>> and be returned, the fragmentation of which can't be guarded against by
>> the VirtIO device or the DMA API mentioned above.
>>
>> The second patch in this series kind of mitigated this side effect by
>> sorting out the backing ram_block with the help of
>> qemu_ram_block_from_host() in case of guest memory backed iov, so it
>> doesn't really count on vhost_iova_gpa_tree_find_iova() to find the
>> matching IOVA, but instead (somehow implicitly) avoids the fragmentation
>> side effect as mentioned above would never happen. Not saying I like the
>> way how it is implemented, but just wanted to point out the implication
>> if the second patch has to be removed - either add special handling code
>> to the iova-tree iterator sizing the range (same as how range selection
>> based upon permission will be done), or add special code in SVQ layer to
>> deal with fragmented IOVA ranges due to aliasing.
>>
> 
> This special code in SVQ is already there. And it will be needed even
> if we look for the buffers by GPA instead of by vaddr, the same way
> virtqueue_map_desc needs to handle it even if it works with GPA.
> Continuous GPA does not imply continuous vaddr.
> 

My apologies if I misunderstood something here regarding size &
permission matching, but I attempted to add both the size and
permission checks to iova_tree_find_address_iterator(). However, there's
a size mismatch in the vhost_svq_translate_addr() code path that's
causing vhost-net to fail to start.

qemu-system-x86_64: unable to start vhost net: 22: falling back on
userspace virtio

More specifically, in vhost_svq_add_split(), the first call to 
vhost_svq_vring_write_descs() returns false:

     ok = vhost_svq_vring_write_descs(svq, sgs, out_sg, out_num, in_num >
                                      0, false);
     if (unlikely(!ok)) {
         return false;
     }

Maybe I misunderstood the proposal, but in 
iova_tree_find_address_iterator I'm checking for an exact match for sizes:

     if (map->size != needle->size || map->perm != needle->perm) {
         return false;
     }

During the device setup phase, vhost_svq_add_split() -> 
vhost_svq_vring_write_descs() -> vhost_svq_translate_addr() -> 
vhost_iova_tree_find_iova() -> iova_tree_find_iova() is called, but in 
iova_tree_find_address_iterator() map->size & needle->size mismatch. I 
inserted some printf's to give more information:

...
[    8.019672] ahci 0000:00:1f.2: 6/6 ports implemented (port mask 0x3f)
[    8.020694] ahci 0000:00:1f.2: flags: 64bit ncq only
[    8.022403] scsi host0: ahci
[    8.023511] scsi host1: ahci
[    8.024446] scsi host2: ahci
[    8.025308
vhost_svq_translate_addr: iovec[i].iov_len = 0x8
iova_tree_find_address_iterator: mismatched sizes
map->size: 0xfff, needle->size: 0x8
map->perm: 1, needle->perm: 1
vhost_svq_add_split: _write_descs fail, sgs: 0x7fd85018ea80, out_sg: 
0x7ff8649fbb50, out_num: 1, in_sg: 0x7ff8649fbb60, in_num: 1, 
more_descs: true, write: false
vhost_vdpa_svq_unmap_ring
vhost_vdpa_svq_unmap_ring
vhost_vdpa_listener_region_del
vhost_vdpa_listener_region_del
vhost_vdpa_listener_region_del
vhost_vdpa_listener_region_del
vhost_vdpa_listener_region_del
vhost_vdpa_svq_unmap_ring
vhost_vdpa_svq_unmap_ring
vhost_vdpa_svq_unmap_ring
vhost_vdpa_svq_unmap_ring
2024-10-08T15:12:22.184902Z qemu-system-x86_64: unable to start vhost 
net: 22: falling back on userspace virtio
] scsi host3: ahci
[   10.921733] scsi host4: ahci
[   10.922946] scsi host5: ahci
[   10.923486] virtio_net virtio1 enp0s2: renamed from eth0
...

This is with similar QEMU args to Si-Wei's from way back when:

-m size=128G,slots=8,maxmem=256G

There are also some errors caught with just the permission check (i.e.
without also matching sizes), but it appears the vhost-net device is
still able to start up.

Re: [RFC v2 1/2] vhost-vdpa: Implement IOVA->GPA tree
Posted by Si-Wei Liu 1 month, 2 weeks ago

On 10/8/2024 8:40 AM, Jonah Palmer wrote:
>
>
> On 10/8/24 2:51 AM, Eugenio Perez Martin wrote:
>> On Tue, Oct 8, 2024 at 2:14 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>
>>>
>>>
>>> On 10/7/2024 6:51 AM, Eugenio Perez Martin wrote:
>>>> On Fri, Oct 4, 2024 at 8:48 PM Jonah Palmer 
>>>> <jonah.palmer@oracle.com> wrote:
>>>>>
>>>>>
>>>>> On 10/4/24 11:17 AM, Eugenio Perez Martin wrote:
>>>>>> On Fri, Oct 4, 2024 at 2:45 PM Jonah Palmer 
>>>>>> <jonah.palmer@oracle.com> wrote:
>>>>>>> Implements the IOVA->GPA tree for handling mapping, unmapping, and
>>>>>>> translations for guest memory regions.
>>>>>>>
>>>>>>> When the guest has overlapping memory regions, an HVA to IOVA 
>>>>>>> translation
>>>>>>> may return an incorrect IOVA when searching the IOVA->HVA tree. 
>>>>>>> This is
>>>>>>> due to one HVA range being contained (overlapping) in another 
>>>>>>> HVA range
>>>>>>> in the IOVA->HVA tree. By creating an IOVA->GPA tree, we can use 
>>>>>>> GPAs to
>>>>>>> translate and find the correct IOVA for guest memory regions.
>>>>>>>
>>>>>> Yes, this first patch is super close to what I meant, just one issue
>>>>>> and a pair of nits here and there.
>>>>>>
>>>>>> I'd leave the second patch as an optimization on top, if the numbers
>>>>>> prove that adding the code is worth it.
>>>>>>
>>>>> Ah okay, gotcha. I also wasn't sure if what you mentioned below on 
>>>>> the
>>>>> previous series you also wanted implemented or if these would also be
>>>>> optimizations on top.
>>>>>
>>>>> [Adding code to the vhost_iova_tree layer for handling multiple 
>>>>> buffers
>>>>> returned from translation for the memory area where each iovec 
>>>>> covers]:
>>>>> ----------------------------------------------------------------------- 
>>>>>
>>>>> "Let's say that SVQ wants to translate the HVA range
>>>>> 0xfeda0000-0xfedd0000. So it makes available for the device two
>>>>> chained buffers: One with addr=0x1000 len=0x20000 and the other one
>>>>> with addr=(0x20000c1000 len=0x10000).
>>>>>
>>>>> The VirtIO device should be able to translate these two buffers in
>>>>> isolation and chain them. Not optimal but it helps to keep QEMU 
>>>>> source
>>>>> clean, as the device already must support it."
>>>>>
>>>> This is 100% in the device and QEMU is already able to split the
>>>> buffers that way, so we don't need any change in QEMU.
>>> Noted that if working with the full HVA tree directly, the internal 
>>> iova
>>> tree linear iterator iova_tree_find_address_iterator() today doesn't
>>> guarantee the iova range returned can cover the entire length of the
>>> iov, so things could happen like that the aliased range with smaller
>>> size (than the requested) happens to be hit first in the linear search
>>> and be returned, the fragmentation of which can't be guarded against by
>>> the VirtIO device or the DMA API mentioned above.
>>>
>>> The second patch in this series kind of mitigated this side effect by
>>> sorting out the backing ram_block with the help of
>>> qemu_ram_block_from_host() in case of guest memory backed iov, so it
>>> doesn't really count on vhost_iova_gpa_tree_find_iova() to find the
>>> matching IOVA, but instead (somehow implicitly) avoids the 
>>> fragmentation
>>> side effect as mentioned above would never happen. Not saying I like 
>>> the
>>> way how it is implemented, but just wanted to point out the implication
>>> if the second patch has to be removed - either add special handling 
>>> code
>>> to the iova-tree iterator sizing the range (same as how range selection
>>> based upon permission will be done), or add special code in SVQ 
>>> layer to
>>> deal with fragmented IOVA ranges due to aliasing.
>>>
>>
>> This special code in SVQ is already there. And it will be needed even
>> if we look for the buffers by GPA instead of by vaddr, the same way
>> virtqueue_map_desc needs to handle it even if it works with GPA.
>> Continuous GPA does not imply continuous vaddr.
>>
>
> My apologies if I misunderstood something here regarding size & 
> permission matching, but I attempted to implement both the size and 
> permission check to iova_tree_find_address_iterator(), however, 
> there's a sizing mismatch in the vhost_svq_translate_addr() code path 
> that's causing vhost-net to fail to start.
>
> qemu-system-x86_64: unable to start vhost net: 22: falling back on
> userspace virtio
>
> More specifically, in vhost_svq_add_split(), the first call to 
> vhost_svq_vring_write_descs() returns false:
>
>     ok = vhost_svq_vring_write_descs(svq, sgs, out_sg, out_num, in_num >
>                                      0, false);
>     if (unlikely(!ok)) {
>         return false;
>     }
>
> Maybe I misunderstood the proposal, but in 
> iova_tree_find_address_iterator I'm checking for an exact match for 
> sizes:
>
>     if (map->size != needle->size || map->perm != needle->perm) {
>         return false;
>     }
The permission needs to match exactly, while the size doesn't have to. The
current iova_tree_find_address_iterator() has the following check:

     if (map->translated_addr + map->size < needle->translated_addr ||
         needle->translated_addr + needle->size < map->translated_addr) {
         return false;
     }

So essentially it does make sure the starting translated_addr on the
needle is greater than or equal to the starting translated_addr on the
map, and the first match of the map range in an ordered linear search
will be returned. However, it doesn't guarantee that the ending address
on the needle (needle->translated_addr + needle->size) will be covered
by the ending address on the map (map->translated_addr + map->size), so
this implementation is subject to fragmented IOVA ranges backing a
contiguous HVA range. I don't see the current SVQ handling this well,
and the reason for the IOVA fragmentation is overlapped regions or
memory aliasing (unlike the GPA tree implementation, we have no
additional info to help us identify the exact IOVA when two or more
overlapping HVA ranges are given), which is orthogonal to the HVA
fragmentation (with contiguous GPA) handled in virtqueue_map_desc().

How to handle this situation? Well, I guess Eugenio may have a different
opinion, but to me the only way seems to be to continue searching until a
well-covering IOVA range can be found, or we may have to return multiple
IOVA ranges splitting a contiguous HVA buffer...
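
For illustration, a minimal sketch of such a covering check, written as an
iova_tree_find_address_iterator()-style GTraverseFunc. It assumes QEMU's
internal IOVATreeFindIOVAArgs helper and the tree's inclusive size
convention (size == length - 1); permission handling is discussed further
below and is left out here:

    /* Sketch only: returning FALSE keeps the traversal searching */
    static gboolean iova_tree_find_covering_iterator(gpointer key,
                                                     gpointer value,
                                                     gpointer data)
    {
        const DMAMap *map = key;
        IOVATreeFindIOVAArgs *args = data;
        const DMAMap *needle = args->needle;

        /* Keep searching unless the map fully contains the needle */
        if (needle->translated_addr < map->translated_addr ||
            needle->translated_addr + needle->size >
            map->translated_addr + map->size) {
            return FALSE;
        }

        args->result = map;
        return TRUE;
    }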

Regards,
-Siwei



>
> During the device setup phase, vhost_svq_add_split() -> 
> vhost_svq_vring_write_descs() -> vhost_svq_translate_addr() -> 
> vhost_iova_tree_find_iova() -> iova_tree_find_iova() is called, but in 
> iova_tree_find_address_iterator() map->size & needle->size mismatch. I 
> inserted some printf's to give more information:
>
> ...
> [    8.019672] ahci 0000:00:1f.2: 6/6 ports implemented (port mask 0x3f)
> [    8.020694] ahci 0000:00:1f.2: flags: 64bit ncq only
> [    8.022403] scsi host0: ahci
> [    8.023511] scsi host1: ahci
> [    8.024446] scsi host2: ahci
> [    8.025308
> vhost_svq_translate_addr: iovec[i].iov_len = 0x8
> iova_tree_find_address_iterator: mismatched sizes
> map->size: 0xfff, needle->size: 0x8
> map->perm: 1, needle->perm: 1
> vhost_svq_add_split: _write_descs fail, sgs: 0x7fd85018ea80, out_sg: 
> 0x7ff8649fbb50, out_num: 1, in_sg: 0x7ff8649fbb60, in_num: 1, 
> more_descs: true, write: false
> vhost_vdpa_svq_unmap_ring
> vhost_vdpa_svq_unmap_ring
> vhost_vdpa_listener_region_del
> vhost_vdpa_listener_region_del
> vhost_vdpa_listener_region_del
> vhost_vdpa_listener_region_del
> vhost_vdpa_listener_region_del
> vhost_vdpa_svq_unmap_ring
> vhost_vdpa_svq_unmap_ring
> vhost_vdpa_svq_unmap_ring
> vhost_vdpa_svq_unmap_ring
> 2024-10-08T15:12:22.184902Z qemu-system-x86_64: unable to start vhost 
> net: 22: falling back on userspace virtio
> ] scsi host3: ahci
> [   10.921733] scsi host4: ahci
> [   10.922946] scsi host5: ahci
> [   10.923486] virtio_net virtio1 enp0s2: renamed from eth0
> ...
>
> This is with similar Qemu args as Si-Wei's from way back when:
>
> -m size=128G,slots=8,maxmem=256G
>
> There are also some error catches with just the permission check but 
> it appears the vhost-net device is still able to start up (when not 
> matching sizes also).


Re: [RFC v2 1/2] vhost-vdpa: Implement IOVA->GPA tree
Posted by Eugenio Perez Martin 1 month, 2 weeks ago
On Tue, Oct 8, 2024 at 10:30 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 10/8/2024 8:40 AM, Jonah Palmer wrote:
> >
> >
> > On 10/8/24 2:51 AM, Eugenio Perez Martin wrote:
> >> On Tue, Oct 8, 2024 at 2:14 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>
> >>>
> >>>
> >>> On 10/7/2024 6:51 AM, Eugenio Perez Martin wrote:
> >>>> On Fri, Oct 4, 2024 at 8:48 PM Jonah Palmer
> >>>> <jonah.palmer@oracle.com> wrote:
> >>>>>
> >>>>>
> >>>>> On 10/4/24 11:17 AM, Eugenio Perez Martin wrote:
> >>>>>> On Fri, Oct 4, 2024 at 2:45 PM Jonah Palmer
> >>>>>> <jonah.palmer@oracle.com> wrote:
> >>>>>>> Implements the IOVA->GPA tree for handling mapping, unmapping, and
> >>>>>>> translations for guest memory regions.
> >>>>>>>
> >>>>>>> When the guest has overlapping memory regions, an HVA to IOVA
> >>>>>>> translation
> >>>>>>> may return an incorrect IOVA when searching the IOVA->HVA tree.
> >>>>>>> This is
> >>>>>>> due to one HVA range being contained (overlapping) in another
> >>>>>>> HVA range
> >>>>>>> in the IOVA->HVA tree. By creating an IOVA->GPA tree, we can use
> >>>>>>> GPAs to
> >>>>>>> translate and find the correct IOVA for guest memory regions.
> >>>>>>>
> >>>>>> Yes, this first patch is super close to what I meant, just one issue
> >>>>>> and a pair of nits here and there.
> >>>>>>
> >>>>>> I'd leave the second patch as an optimization on top, if the numbers
> >>>>>> prove that adding the code is worth it.
> >>>>>>
> >>>>> Ah okay, gotcha. I also wasn't sure if what you mentioned below on
> >>>>> the
> >>>>> previous series you also wanted implemented or if these would also be
> >>>>> optimizations on top.
> >>>>>
> >>>>> [Adding code to the vhost_iova_tree layer for handling multiple
> >>>>> buffers
> >>>>> returned from translation for the memory area where each iovec
> >>>>> covers]:
> >>>>> -----------------------------------------------------------------------
> >>>>>
> >>>>> "Let's say that SVQ wants to translate the HVA range
> >>>>> 0xfeda0000-0xfedd0000. So it makes available for the device two
> >>>>> chained buffers: One with addr=0x1000 len=0x20000 and the other one
> >>>>> with addr=(0x20000c1000 len=0x10000).
> >>>>>
> >>>>> The VirtIO device should be able to translate these two buffers in
> >>>>> isolation and chain them. Not optimal but it helps to keep QEMU
> >>>>> source
> >>>>> clean, as the device already must support it."
> >>>>>
> >>>> This is 100% in the device and QEMU is already able to split the
> >>>> buffers that way, so we don't need any change in QEMU.
> >>> Noted that if working with the full HVA tree directly, the internal
> >>> iova
> >>> tree linear iterator iova_tree_find_address_iterator() today doesn't
> >>> guarantee the iova range returned can cover the entire length of the
> >>> iov, so things could happen like that the aliased range with smaller
> >>> size (than the requested) happens to be hit first in the linear search
> >>> and be returned, the fragmentation of which can't be guarded against by
> >>> the VirtIO device or the DMA API mentioned above.
> >>>
> >>> The second patch in this series kind of mitigated this side effect by
> >>> sorting out the backing ram_block with the help of
> >>> qemu_ram_block_from_host() in case of guest memory backed iov, so it
> >>> doesn't really count on vhost_iova_gpa_tree_find_iova() to find the
> >>> matching IOVA, but instead (somehow implicitly) avoids the
> >>> fragmentation
> >>> side effect as mentioned above would never happen. Not saying I like
> >>> the
> >>> way how it is implemented, but just wanted to point out the implication
> >>> if the second patch has to be removed - either add special handling
> >>> code
> >>> to the iova-tree iterator sizing the range (same as how range selection
> >>> based upon permission will be done), or add special code in SVQ
> >>> layer to
> >>> deal with fragmented IOVA ranges due to aliasing.
> >>>
> >>
> >> This special code in SVQ is already there. And it will be needed even
> >> if we look for the buffers by GPA instead of by vaddr, the same way
> >> virtqueue_map_desc needs to handle it even if it works with GPA.
> >> Continuous GPA does not imply continuous vaddr.
> >>
> >
> > My apologies if I misunderstood something here regarding size &
> > permission matching, but I attempted to implement both the size and
> > permission check to iova_tree_find_address_iterator(), however,
> > there's a sizing mismatch in the vhost_svq_translate_addr() code path
> > that's causing vhost-net to fail to start.
> >
> > qemu-system-x86_64: unable to start vhost net: 22: falling back on
> > userspace virtio
> >
> > More specifically, in vhost_svq_add_split(), the first call to
> > vhost_svq_vring_write_descs() returns false:
> >
> >     ok = vhost_svq_vring_write_descs(svq, sgs, out_sg, out_num, in_num >
> >                                      0, false);
> >     if (unlikely(!ok)) {
> >         return false;
> >     }
> >
> > Maybe I misunderstood the proposal, but in
> > iova_tree_find_address_iterator I'm checking for an exact match for
> > sizes:
> >
> >     if (map->size != needle->size || map->perm != needle->perm) {
> >         return false;
> >     }
> The permission needs to exact match,

Careful with this: read-only buffers can live in the guest's RW memory. I
think the only actual condition is that if the buffer is writable, the
memory area must be writable too. If they don't match, we should keep
looking for the right area.
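
A minimal sketch of that condition (the helper name below is made up purely
for illustration; QEMU's DMAMap and IOMMUAccessFlags definitions are
assumed):

    /*
     * Illustrative only: a map's permissions are good enough for the
     * needle as long as a writable buffer lands on a writable area.
     * Read-only buffers may live in RW guest memory, so the read-only
     * case always passes.
     */
    static bool iova_map_perm_compatible(const DMAMap *map,
                                         const DMAMap *needle)
    {
        if (needle->perm & IOMMU_WO) {
            return (map->perm & IOMMU_WO) != 0;
        }
        return true;
    }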


> while the size doesn't have to. The
> current iova_tree_find_address_iterator() has the following check:
>
>      if (map->translated_addr + map->size < needle->translated_addr ||
>          needle->translated_addr + needle->size < map->translated_addr) {
>          return false;
>      }
>
> So essentially it does make sure the starting translated_addr on the
> needle is greater than or equal to the starting translated_addr on the
> map, and the first match of the map range in an ordered linear search
> will be returned, but it doesn't guarantee the ending address on the
> needle (needle->translated_addr + needle->size) will be covered by the
> ending address on the map (map->translated_addr + map->size), so this
> implementation is subjected to fragmented iova ranges with contiguous
> HVA address. I don't see the current SVQ handles this well, and the
> reason of iova fragmentation is due to overlapped region or memory
> aliasing (unlike the GPA tree implementation, we have no additional info
> to help us identify the exact IOVA when two or more overlapped HVA
> ranges are given), which is orthogonal to the HVA fragmentation (with
> contiguous GPA) handling in virtqueue_map_desc().
>
> How to handle this situation? Well, I guess Eugenio may have different
> opinion, but to me the only way seems to continue to search till a
> well-covered IOVA range can be found, or we may have to return multiple
> IOVA ranges splitting a contiguous HVA buffer...
>

Not sure if I followed this 100%, but we must be capable of returning
multiple IOVA ranges if we trust in SVQ to do the translation anyway.

When SVQ starts, the guest memory listener already gives the GPA ->
HVA translations fragmented and unordered, as QEMU can hotplug memory
for example. Let me put a simple example:

GPA [0, 0x1000) -> HVA [0x10000, 0x11000)
GPA [0x1000, 0x2000) -> HVA [0x20000, 0x21000)

Now we send the IOVA to the device this way:
IOVA [0x2000, 0x3000) -> HVA [0x20000, 0x21000)
IOVA [0x3000, 0x4000) -> HVA [0x10000, 0x11000)

So we have this translation tree (assuming we already store them as GPA->IOVA):
GPA [0, 0x1000) -> IOVA [0x3000, 0x4000)
GPA [0x1000, 0x2000) -> IOVA [0x2000, 0x3000)

Now the guest sends this single buffer in a single descriptor:
GPA 0x500, len 0x1000.

We need to split it into two descriptors anyway. Without memory
aliases we're covered at this moment, as virtqueue_pop already solves
these splits for us in virtqueue_map_desc.

Now I realize that SVQ may not be able to cover these splits with
aliases, as we might need to return more addresses than MAX(out_num,
in_num) in vhost_svq_vring_write_descs. So maybe we need to allocate
the translated addresses array in vhost_svq_vring_write_descs and
return it, or allow it to grow.
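
To make that concrete, a rough sketch of chopping a GPA-contiguous buffer
into per-map IOVA chunks. It is purely illustrative: vhost_iova_gpa_tree_find_iova()
is the lookup added by this patch, while the function name, the
caller-provided output array and its sizing are assumptions, not proposed
code:

    /*
     * Sketch: translate [gpa, gpa + len) into as many IOVA chunks as the
     * IOVA->GPA tree requires. Sizes follow the tree's inclusive
     * convention (size == length - 1). Returns the number of chunks
     * written, or -1 if a map is missing or the output array is full.
     */
    static int svq_translate_gpa(const VhostIOVATree *tree, hwaddr gpa,
                                 size_t len, DMAMap *out, int max_out)
    {
        int n = 0;

        while (len > 0) {
            const DMAMap needle = { .translated_addr = gpa, .size = 0 };
            const DMAMap *map = vhost_iova_gpa_tree_find_iova(tree, &needle);
            size_t chunk;

            if (!map || n == max_out) {
                return -1;
            }
            /* Clamp to what is left of the map that contains gpa */
            chunk = MIN(len, map->size + 1 - (gpa - map->translated_addr));
            out[n].iova = map->iova + (gpa - map->translated_addr);
            out[n].size = chunk - 1;
            n++;
            gpa += chunk;
            len -= chunk;
        }

        return n;
    }

The output array here is exactly the thing that would have to grow beyond
MAX(out_num, in_num) in vhost_svq_vring_write_descs.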

Having said that, I'd keep the plan of building something simple that
works first, re-prioritize if we want to focus on reducing the
downtime or on increasing the SVQ throughput, and then act in
consequence by profiling the advantages of the changes. I'm really
looking forward to switching to the (GPA|HVA)->IOVA tree, as I'm sure
it will increase the performance of the solution, but reviewing all of
it in one shot is beyond my bandwidth at the moment.

> Regards,
> -Siwei
>
>
>
> >
> > During the device setup phase, vhost_svq_add_split() ->
> > vhost_svq_vring_write_descs() -> vhost_svq_translate_addr() ->
> > vhost_iova_tree_find_iova() -> iova_tree_find_iova() is called, but in
> > iova_tree_find_address_iterator() map->size & needle->size mismatch. I
> > inserted some printf's to give more information:
> >
> > ...
> > [    8.019672] ahci 0000:00:1f.2: 6/6 ports implemented (port mask 0x3f)
> > [    8.020694] ahci 0000:00:1f.2: flags: 64bit ncq only
> > [    8.022403] scsi host0: ahci
> > [    8.023511] scsi host1: ahci
> > [    8.024446] scsi host2: ahci
> > [    8.025308
> > vhost_svq_translate_addr: iovec[i].iov_len = 0x8
> > iova_tree_find_address_iterator: mismatched sizes
> > map->size: 0xfff, needle->size: 0x8
> > map->perm: 1, needle->perm: 1
> > vhost_svq_add_split: _write_descs fail, sgs: 0x7fd85018ea80, out_sg:
> > 0x7ff8649fbb50, out_num: 1, in_sg: 0x7ff8649fbb60, in_num: 1,
> > more_descs: true, write: false
> > vhost_vdpa_svq_unmap_ring
> > vhost_vdpa_svq_unmap_ring
> > vhost_vdpa_listener_region_del
> > vhost_vdpa_listener_region_del
> > vhost_vdpa_listener_region_del
> > vhost_vdpa_listener_region_del
> > vhost_vdpa_listener_region_del
> > vhost_vdpa_svq_unmap_ring
> > vhost_vdpa_svq_unmap_ring
> > vhost_vdpa_svq_unmap_ring
> > vhost_vdpa_svq_unmap_ring
> > 2024-10-08T15:12:22.184902Z qemu-system-x86_64: unable to start vhost
> > net: 22: falling back on userspace virtio
> > ] scsi host3: ahci
> > [   10.921733] scsi host4: ahci
> > [   10.922946] scsi host5: ahci
> > [   10.923486] virtio_net virtio1 enp0s2: renamed from eth0
> > ...
> >
> > This is with similar Qemu args as Si-Wei's from way back when:
> >
> > -m size=128G,slots=8,maxmem=256G
> >
> > There are also some error catches with just the permission check but
> > it appears the vhost-net device is still able to start up (when not
> > matching sizes also).
>

I think this is because of the forced check for equal sizes, but let
me know if it is not caused by that!

Thanks!
Re: [RFC v2 1/2] vhost-vdpa: Implement IOVA->GPA tree
Posted by Si-Wei Liu 1 month, 2 weeks ago

On 10/9/2024 2:29 AM, Eugenio Perez Martin wrote:
> On Tue, Oct 8, 2024 at 10:30 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 10/8/2024 8:40 AM, Jonah Palmer wrote:
>>>
>>> On 10/8/24 2:51 AM, Eugenio Perez Martin wrote:
>>>> On Tue, Oct 8, 2024 at 2:14 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>
>>>>>
>>>>> On 10/7/2024 6:51 AM, Eugenio Perez Martin wrote:
>>>>>> On Fri, Oct 4, 2024 at 8:48 PM Jonah Palmer
>>>>>> <jonah.palmer@oracle.com> wrote:
>>>>>>>
>>>>>>> On 10/4/24 11:17 AM, Eugenio Perez Martin wrote:
>>>>>>>> On Fri, Oct 4, 2024 at 2:45 PM Jonah Palmer
>>>>>>>> <jonah.palmer@oracle.com> wrote:
>>>>>>>>> Implements the IOVA->GPA tree for handling mapping, unmapping, and
>>>>>>>>> translations for guest memory regions.
>>>>>>>>>
>>>>>>>>> When the guest has overlapping memory regions, an HVA to IOVA
>>>>>>>>> translation
>>>>>>>>> may return an incorrect IOVA when searching the IOVA->HVA tree.
>>>>>>>>> This is
>>>>>>>>> due to one HVA range being contained (overlapping) in another
>>>>>>>>> HVA range
>>>>>>>>> in the IOVA->HVA tree. By creating an IOVA->GPA tree, we can use
>>>>>>>>> GPAs to
>>>>>>>>> translate and find the correct IOVA for guest memory regions.
>>>>>>>>>
>>>>>>>> Yes, this first patch is super close to what I meant, just one issue
>>>>>>>> and a pair of nits here and there.
>>>>>>>>
>>>>>>>> I'd leave the second patch as an optimization on top, if the numbers
>>>>>>>> prove that adding the code is worth it.
>>>>>>>>
>>>>>>> Ah okay, gotcha. I also wasn't sure if what you mentioned below on
>>>>>>> the
>>>>>>> previous series you also wanted implemented or if these would also be
>>>>>>> optimizations on top.
>>>>>>>
>>>>>>> [Adding code to the vhost_iova_tree layer for handling multiple
>>>>>>> buffers
>>>>>>> returned from translation for the memory area where each iovec
>>>>>>> covers]:
>>>>>>> -----------------------------------------------------------------------
>>>>>>>
>>>>>>> "Let's say that SVQ wants to translate the HVA range
>>>>>>> 0xfeda0000-0xfedd0000. So it makes available for the device two
>>>>>>> chained buffers: One with addr=0x1000 len=0x20000 and the other one
>>>>>>> with addr=(0x20000c1000 len=0x10000).
>>>>>>>
>>>>>>> The VirtIO device should be able to translate these two buffers in
>>>>>>> isolation and chain them. Not optimal but it helps to keep QEMU
>>>>>>> source
>>>>>>> clean, as the device already must support it."
>>>>>>>
>>>>>> This is 100% in the device and QEMU is already able to split the
>>>>>> buffers that way, so we don't need any change in QEMU.
>>>>> Noted that if working with the full HVA tree directly, the internal
>>>>> iova
>>>>> tree linear iterator iova_tree_find_address_iterator() today doesn't
>>>>> guarantee the iova range returned can cover the entire length of the
>>>>> iov, so things could happen like that the aliased range with smaller
>>>>> size (than the requested) happens to be hit first in the linear search
>>>>> and be returned, the fragmentation of which can't be guarded against by
>>>>> the VirtIO device or the DMA API mentioned above.
>>>>>
>>>>> The second patch in this series kind of mitigated this side effect by
>>>>> sorting out the backing ram_block with the help of
>>>>> qemu_ram_block_from_host() in case of guest memory backed iov, so it
>>>>> doesn't really count on vhost_iova_gpa_tree_find_iova() to find the
>>>>> matching IOVA, but instead (somehow implicitly) avoids the
>>>>> fragmentation
>>>>> side effect as mentioned above would never happen. Not saying I like
>>>>> the
>>>>> way how it is implemented, but just wanted to point out the implication
>>>>> if the second patch has to be removed - either add special handling
>>>>> code
>>>>> to the iova-tree iterator sizing the range (same as how range selection
>>>>> based upon permission will be done), or add special code in SVQ
>>>>> layer to
>>>>> deal with fragmented IOVA ranges due to aliasing.
>>>>>
>>>> This special code in SVQ is already there. And it will be needed even
>>>> if we look for the buffers by GPA instead of by vaddr, the same way
>>>> virtqueue_map_desc needs to handle it even if it works with GPA.
>>>> Continuous GPA does not imply continuous vaddr.
>>>>
>>> My apologies if I misunderstood something here regarding size &
>>> permission matching, but I attempted to implement both the size and
>>> permission check to iova_tree_find_address_iterator(), however,
>>> there's a sizing mismatch in the vhost_svq_translate_addr() code path
>>> that's causing vhost-net to fail to start.
>>>
>>> qemu-system-x86_64: unable to start vhost net: 22: falling back on
>>> userspace virtio
>>>
>>> More specifically, in vhost_svq_add_split(), the first call to
>>> vhost_svq_vring_write_descs() returns false:
>>>
>>>      ok = vhost_svq_vring_write_descs(svq, sgs, out_sg, out_num, in_num >
>>>                                       0, false);
>>>      if (unlikely(!ok)) {
>>>          return false;
>>>      }
>>>
>>> Maybe I misunderstood the proposal, but in
>>> iova_tree_find_address_iterator I'm checking for an exact match for
>>> sizes:
>>>
>>>      if (map->size != needle->size || map->perm != needle->perm) {
>>>          return false;
>>>      }
>> The permission needs to exact match,
> Care with this, read only buffers can live in the guest's RW memory. I
> think the only actual condition is that if the buffer is writable, the
> memory area must be writable. If they don't match, we should keep
> looking for the right area.
Indeed.

>
>
>> while the size doesn't have to. The
>> current iova_tree_find_address_iterator() has the following check:
>>
>>       if (map->translated_addr + map->size < needle->translated_addr ||
>>           needle->translated_addr + needle->size < map->translated_addr) {
>>           return false;
>>       }
>>
>> So essentially it does make sure the starting translated_addr on the
>> needle is greater than or equal to the starting translated_addr on the
>> map, and the first match of the map range in an ordered linear search
>> will be returned, but it doesn't guarantee the ending address on the
>> needle (needle->translated_addr + needle->size) will be covered by the
>> ending address on the map (map->translated_addr + map->size), so this
>> implementation is subjected to fragmented iova ranges with contiguous
>> HVA address. I don't see the current SVQ handles this well, and the
>> reason of iova fragmentation is due to overlapped region or memory
>> aliasing (unlike the GPA tree implementation, we have no additional info
>> to help us identify the exact IOVA when two or more overlapped HVA
>> ranges are given), which is orthogonal to the HVA fragmentation (with
>> contiguous GPA) handling in virtqueue_map_desc().
>>
>> How to handle this situation? Well, I guess Eugenio may have different
>> opinion, but to me the only way seems to continue to search till a
>> well-covered IOVA range can be found, or we may have to return multiple
>> IOVA ranges splitting a contiguous HVA buffer...
>>
> Not sure if I followed this 100%, but we must be capable of returning
> multiple IOVA ranges if we trust in SVQ to do the translation anyway.
>
> When SVQ starts, the guest memory listener already gives the GPA ->
> HVA translations fragmented and unordered, as QEMU can hotplug memory
> for example. Let me put a simple example:
>
> GPA [0, 0x1000) -> HVA [0x10000, 0x10100)
> GPA [0x1000, 0x2000) -> HVA [0x20000, 0x20100)
>
> Now we send the IOVA to the device this way:
> IOVA [0x2000, 0x3000) -> HVA [0x20000, 0x20100)
> IOVA [0x3000, 0x4000) -> HVA [0x10000, 0x10100)
>
> So we have this translation tree (assuming we already store them as GPA->IOVA):
> GPA [0, 0x1000) -> IOVA [0x3000, 0x4000)
> GPA [0x1000, 0x2000) -> IOVA [0x2000, 0x3000)
>
> Now the guest sends this single buffer in a single descriptor:
> GPA 0x500, len 0x1000.
>
> We need to split it into two descriptors anyway. Without memory
> aliases we're covered at this moment, as virtqueue_pop already solves
> these splits for us in virtqueue_map_desc.
That is what I said above: this case (contiguous GPA backed by fragmented
HVA ranges) is well covered by the DMA API called by virtqueue_map_desc(). I
was aware this case has been handled in the current SVQ code, i.e. it is
reflected by the comments below in vhost_handle_guest_kick(). This is
definitely clear to me.

                     /*
                      * This condition is possible since a contiguous buffer in
                      * GPA does not imply a contiguous buffer in qemu's VA
                      * scatter-gather segments. If that happens, the buffer
                      * exposed to the device needs to be a chain of descriptors
                      * at this moment.
                      *
                      * SVQ cannot hold more available buffers if we are here:
                      * queue the current guest descriptor and ignore kicks
                      * until some elements are used.
                      */


> Now I realize that SVQ may not be able to cover these splits with
> aliases, as we might need to return more addresses than MAX(out_num,
> in_num) in vhost_svq_vring_write_descs.
Right, this is the case (fragmented IOVA ranges with a contiguous HVA
range due to aliasing) I was referring to, and it is not handled at all in
the current SVQ or vhost-iova-tree code. It means we have to handle a
1:N mapping for each iovec, which essentially introduces unnecessary
complexity rather than the simple or straightforward change it was claimed
to be. Sorry for speaking straight, but I don't really see a major benefit
in performance, extensibility or flexibility from following this
seemingly interim solution.

On the contrary, instead of simplicity this is becoming more than a
headache now - Jonah and I have to take extra care not to break any
other device that uses an aliased map differently from how the mch
device does, as the latter is the only case Jonah used to test this
patch series. That's why in the morning meeting I asked if more test
cases can be shared, like how Lei Yang did in his previous testing of
your early page pinning series (I recall one failing case seemed to be
related to hot plug), as the early pinning would open up a Pandora's box
that could potentially break a lot of things while the memory layout
keeps changing during early initialization.

Even so, with one additional test case it is not sufficient to prove
everything gets done in the right way to unblock all the memory
aliasing use cases, as our direct customers may use a vdpa device in
various VM setups together with whatever devices - GPU, VGA, VNC or other
devices we don't know of yet but that use aliased maps. It'd be going crazy
to exercise every possible device, and I don't see the real benefit of
building yet another sub-system outside of the QEMU memory system's realm
just to support all these existing devices, given the relevant alias
supporting code is built in and working quite well within QEMU's own
memory system. Don't get me wrong, it's not me who asked to add this code,
but patch #2 is definitely not an optimization patch - we have to make sure
the code is done in the right way no matter what. Functional correctness
always speaks louder than performance or anything else.

>   So maybe we need to allocate
> the translated addresses array in vhost_svq_vring_write_descs and
> return it, or allow it to grow.
>
> Having said that, I'd keep the plan as building something simple that
> works first, re-prioritize if we want to focus on reducing the
> downtime or in increasing the SVQ throughput, and then act in
> consequence by profiling the advantages of the changes. I'm really
> looking forward to the switching to (GPA|HVA)->IOVA tree, as I'm sure
> it will increase the performance of the solution, but to review all of
> it in one shot is out of my bandwith capabilities at the moment.
No problem, and thank you for your time in reviewing. I think after I'm
almost done with the other assigned work I can work with Jonah to
implement the desired code changes for the (GPA|HVA)->IOVA tree. It's not
for optimization or performance - I just feel it's much simpler than
taking on the extra burden of implementing and maintaining unwarranted,
duplicated or preventative code in the SVQ or vhost-iova-tree layer just
to get around various issues in memory aliasing or overlapping use cases.
Actually, I'm already starting to regret not giving Jonah the right
guidance in the first place to implement all the necessary virtqueue
layer supporting code rather than use the ram block API hack; believe me,
it's not a lot, the code change is really simple and easy to follow or
review, and there's zero cost in GPA-HVA synchronization compared to what
the current code already does. Otherwise, past lessons learned tell me it
would cost 10x more if we don't build it right in the first place -
there'll be more time and effort spent in the future to refactor or
rewrite all the code at a later point once all kinds of features are
already built up.


Thanks,
-Siwei




>
>> Regards,
>> -Siwei
>>
>>
>>
>>> During the device setup phase, vhost_svq_add_split() ->
>>> vhost_svq_vring_write_descs() -> vhost_svq_translate_addr() ->
>>> vhost_iova_tree_find_iova() -> iova_tree_find_iova() is called, but in
>>> iova_tree_find_address_iterator() map->size & needle->size mismatch. I
>>> inserted some printf's to give more information:
>>>
>>> ...
>>> [    8.019672] ahci 0000:00:1f.2: 6/6 ports implemented (port mask 0x3f)
>>> [    8.020694] ahci 0000:00:1f.2: flags: 64bit ncq only
>>> [    8.022403] scsi host0: ahci
>>> [    8.023511] scsi host1: ahci
>>> [    8.024446] scsi host2: ahci
>>> [    8.025308
>>> vhost_svq_translate_addr: iovec[i].iov_len = 0x8
>>> iova_tree_find_address_iterator: mismatched sizes
>>> map->size: 0xfff, needle->size: 0x8
>>> map->perm: 1, needle->perm: 1
>>> vhost_svq_add_split: _write_descs fail, sgs: 0x7fd85018ea80, out_sg:
>>> 0x7ff8649fbb50, out_num: 1, in_sg: 0x7ff8649fbb60, in_num: 1,
>>> more_descs: true, write: false
>>> vhost_vdpa_svq_unmap_ring
>>> vhost_vdpa_svq_unmap_ring
>>> vhost_vdpa_listener_region_del
>>> vhost_vdpa_listener_region_del
>>> vhost_vdpa_listener_region_del
>>> vhost_vdpa_listener_region_del
>>> vhost_vdpa_listener_region_del
>>> vhost_vdpa_svq_unmap_ring
>>> vhost_vdpa_svq_unmap_ring
>>> vhost_vdpa_svq_unmap_ring
>>> vhost_vdpa_svq_unmap_ring
>>> 2024-10-08T15:12:22.184902Z qemu-system-x86_64: unable to start vhost
>>> net: 22: falling back on userspace virtio
>>> ] scsi host3: ahci
>>> [   10.921733] scsi host4: ahci
>>> [   10.922946] scsi host5: ahci
>>> [   10.923486] virtio_net virtio1 enp0s2: renamed from eth0
>>> ...
>>>
>>> This is with similar Qemu args as Si-Wei's from way back when:
>>>
>>> -m size=128G,slots=8,maxmem=256G
>>>
>>> There are also some error catches with just the permission check but
>>> it appears the vhost-net device is still able to start up (when not
>>> matching sizes also).
> I think this is because of the forced check for equal sizes, but let
> me know if it is not caused by that!
>
> Thanks!
>


Re: [RFC v2 1/2] vhost-vdpa: Implement IOVA->GPA tree
Posted by Eugenio Perez Martin 1 month, 2 weeks ago
On Thu, Oct 10, 2024 at 9:00 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 10/9/2024 2:29 AM, Eugenio Perez Martin wrote:
> > On Tue, Oct 8, 2024 at 10:30 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 10/8/2024 8:40 AM, Jonah Palmer wrote:
> >>>
> >>> On 10/8/24 2:51 AM, Eugenio Perez Martin wrote:
> >>>> On Tue, Oct 8, 2024 at 2:14 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>
> >>>>>
> >>>>> On 10/7/2024 6:51 AM, Eugenio Perez Martin wrote:
> >>>>>> On Fri, Oct 4, 2024 at 8:48 PM Jonah Palmer
> >>>>>> <jonah.palmer@oracle.com> wrote:
> >>>>>>>
> >>>>>>> On 10/4/24 11:17 AM, Eugenio Perez Martin wrote:
> >>>>>>>> On Fri, Oct 4, 2024 at 2:45 PM Jonah Palmer
> >>>>>>>> <jonah.palmer@oracle.com> wrote:
> >>>>>>>>> Implements the IOVA->GPA tree for handling mapping, unmapping, and
> >>>>>>>>> translations for guest memory regions.
> >>>>>>>>>
> >>>>>>>>> When the guest has overlapping memory regions, an HVA to IOVA
> >>>>>>>>> translation
> >>>>>>>>> may return an incorrect IOVA when searching the IOVA->HVA tree.
> >>>>>>>>> This is
> >>>>>>>>> due to one HVA range being contained (overlapping) in another
> >>>>>>>>> HVA range
> >>>>>>>>> in the IOVA->HVA tree. By creating an IOVA->GPA tree, we can use
> >>>>>>>>> GPAs to
> >>>>>>>>> translate and find the correct IOVA for guest memory regions.
> >>>>>>>>>
> >>>>>>>> Yes, this first patch is super close to what I meant, just one issue
> >>>>>>>> and a pair of nits here and there.
> >>>>>>>>
> >>>>>>>> I'd leave the second patch as an optimization on top, if the numbers
> >>>>>>>> prove that adding the code is worth it.
> >>>>>>>>
> >>>>>>> Ah okay, gotcha. I also wasn't sure if what you mentioned below on
> >>>>>>> the
> >>>>>>> previous series you also wanted implemented or if these would also be
> >>>>>>> optimizations on top.
> >>>>>>>
> >>>>>>> [Adding code to the vhost_iova_tree layer for handling multiple
> >>>>>>> buffers
> >>>>>>> returned from translation for the memory area where each iovec
> >>>>>>> covers]:
> >>>>>>> -----------------------------------------------------------------------
> >>>>>>>
> >>>>>>> "Let's say that SVQ wants to translate the HVA range
> >>>>>>> 0xfeda0000-0xfedd0000. So it makes available for the device two
> >>>>>>> chained buffers: One with addr=0x1000 len=0x20000 and the other one
> >>>>>>> with addr=(0x20000c1000 len=0x10000).
> >>>>>>>
> >>>>>>> The VirtIO device should be able to translate these two buffers in
> >>>>>>> isolation and chain them. Not optimal but it helps to keep QEMU
> >>>>>>> source
> >>>>>>> clean, as the device already must support it."
> >>>>>>>
> >>>>>> This is 100% in the device and QEMU is already able to split the
> >>>>>> buffers that way, so we don't need any change in QEMU.
> >>>>> Noted that if working with the full HVA tree directly, the internal
> >>>>> iova
> >>>>> tree linear iterator iova_tree_find_address_iterator() today doesn't
> >>>>> guarantee the iova range returned can cover the entire length of the
> >>>>> iov, so things could happen like that the aliased range with smaller
> >>>>> size (than the requested) happens to be hit first in the linear search
> >>>>> and be returned, the fragmentation of which can't be guarded against by
> >>>>> the VirtIO device or the DMA API mentioned above.
> >>>>>
> >>>>> The second patch in this series kind of mitigated this side effect by
> >>>>> sorting out the backing ram_block with the help of
> >>>>> qemu_ram_block_from_host() in case of guest memory backed iov, so it
> >>>>> doesn't really count on vhost_iova_gpa_tree_find_iova() to find the
> >>>>> matching IOVA, but instead (somehow implicitly) avoids the
> >>>>> fragmentation
> >>>>> side effect as mentioned above would never happen. Not saying I like
> >>>>> the
> >>>>> way how it is implemented, but just wanted to point out the implication
> >>>>> if the second patch has to be removed - either add special handling
> >>>>> code
> >>>>> to the iova-tree iterator sizing the range (same as how range selection
> >>>>> based upon permission will be done), or add special code in SVQ
> >>>>> layer to
> >>>>> deal with fragmented IOVA ranges due to aliasing.
> >>>>>
> >>>> This special code in SVQ is already there. And it will be needed even
> >>>> if we look for the buffers by GPA instead of by vaddr, the same way
> >>>> virtqueue_map_desc needs to handle it even if it works with GPA.
> >>>> Continuous GPA does not imply continuous vaddr.
> >>>>
> >>> My apologies if I misunderstood something here regarding size &
> >>> permission matching, but I attempted to implement both the size and
> >>> permission check to iova_tree_find_address_iterator(), however,
> >>> there's a sizing mismatch in the vhost_svq_translate_addr() code path
> >>> that's causing vhost-net to fail to start.
> >>>
> >>> qemu-system-x86_64: unable to start vhost net: 22: falling back on
> >>> userspace virtio
> >>>
> >>> More specifically, in vhost_svq_add_split(), the first call to
> >>> vhost_svq_vring_write_descs() returns false:
> >>>
> >>>      ok = vhost_svq_vring_write_descs(svq, sgs, out_sg, out_num, in_num >
> >>>                                       0, false);
> >>>      if (unlikely(!ok)) {
> >>>          return false;
> >>>      }
> >>>
> >>> Maybe I misunderstood the proposal, but in
> >>> iova_tree_find_address_iterator I'm checking for an exact match for
> >>> sizes:
> >>>
> >>>      if (map->size != needle->size || map->perm != needle->perm) {
> >>>          return false;
> >>>      }
> >> The permission needs to exact match,
> > Care with this, read only buffers can live in the guest's RW memory. I
> > think the only actual condition is that if the buffer is writable, the
> > memory area must be writable. If they don't match, we should keep
> > looking for the right area.
> Indeed.
>
> >
> >
> >> while the size doesn't have to. The
> >> current iova_tree_find_address_iterator() has the following check:
> >>
> >>       if (map->translated_addr + map->size < needle->translated_addr ||
> >>           needle->translated_addr + needle->size < map->translated_addr) {
> >>           return false;
> >>       }
> >>
> >> So essentially it does make sure the starting translated_addr on the
> >> needle is greater than or equal to the starting translated_addr on the
> >> map, and the first match of the map range in an ordered linear search
> >> will be returned, but it doesn't guarantee the ending address on the
> >> needle (needle->translated_addr + needle->size) will be covered by the
> >> ending address on the map (map->translated_addr + map->size), so this
> >> implementation is subjected to fragmented iova ranges with contiguous
> >> HVA address. I don't see the current SVQ handles this well, and the
> >> reason of iova fragmentation is due to overlapped region or memory
> >> aliasing (unlike the GPA tree implementation, we have no additional info
> >> to help us identify the exact IOVA when two or more overlapped HVA
> >> ranges are given), which is orthogonal to the HVA fragmentation (with
> >> contiguous GPA) handling in virtqueue_map_desc().
> >>
> >> How to handle this situation? Well, I guess Eugenio may have different
> >> opinion, but to me the only way seems to continue to search till a
> >> well-covered IOVA range can be found, or we may have to return multiple
> >> IOVA ranges splitting a contiguous HVA buffer...
> >>
> > Not sure if I followed this 100%, but we must be capable of returning
> > multiple IOVA ranges if we trust in SVQ to do the translation anyway.
> >
> > When SVQ starts, the guest memory listener already gives the GPA ->
> > HVA translations fragmented and unordered, as QEMU can hotplug memory
> > for example. Let me put a simple example:
> >
> > GPA [0, 0x1000) -> HVA [0x10000, 0x10100)
> > GPA [0x1000, 0x2000) -> HVA [0x20000, 0x20100)
> >
> > Now we send the IOVA to the device this way:
> > IOVA [0x2000, 0x3000) -> HVA [0x20000, 0x20100)
> > IOVA [0x3000, 0x4000) -> HVA [0x10000, 0x10100)
> >
> > So we have this translation tree (assuming we already store them as GPA->IOVA):
> > GPA [0, 0x1000) -> IOVA [0x3000, 0x4000)
> > GPA [0x1000, 0x2000) -> IOVA [0x2000, 0x3000)
> >
> > Now the guest sends this single buffer in a single descriptor:
> > GPA 0x500, len 0x1000.
> >
> > We need to split it into two descriptors anyway. Without memory
> > aliases we're covered at this moment, as virtqueue_pop already solves
> > these splits for us in virtqueue_map_desc.
> That is what I said above, this case (contiguous GA backed by fragmented
> HVA ranges) is well covered by DMA API called by virtqueue_map_desc(). I
> was aware this case has been handled in the current SVQ code, i.e.
> reflected by below comments in vhost_handle_guest_kick(). This is
> definitively clear to me.
>
>                      /*
>                       * This condition is possible since a contiguous
> buffer in
>                       * GPA does not imply a contiguous buffer in qemu's VA
>                       * scatter-gather segments. If that happens, the buffer
>                       * exposed to the device needs to be a chain of
> descriptors
>                       * at this moment.
>                       *
>                       * SVQ cannot hold more available buffers if we are
> here:
>                       * queue the current guest descriptor and ignore kicks
>                       * until some elements are used.
>                       */
>
>
> > Now I realize that SVQ may not be able to cover these splits with
> > aliases, as we might need to return more addresses than MAX(out_num,
> > in_num) in vhost_svq_vring_write_descs.
> Right, this is the case (fragmented iova ranges with contiguous HVA
> range due to aliasing) I was referring to, and is not handled at all in
> the current SVQ or vhost-iova-tree code. This means we have to handle
> the 1:N mapping for each iovec, which essentially introduces unnecessary
> complexity rather than simple or straightforward change as claimed to
> be. Sorry for speaking straight, but I don't really see major benefit in
> either performance, extensibility or flexibility following this
> seemingly interim solution.
>
> On the contrary, instead of simplicity this is becoming more of a
> headache now - Jonah and I have to take extra care not to break any other
> device that uses aliased maps differently from the mch device, as the
> latter is the only case Jonah used to test this patch series. That's why
> in the morning meeting I asked if more test cases can be shared, like how
> Lei Yang did in his previous testing of your early page pinning series (I
> recall one failing case seemed to be related to hot plug), as the early
> pinning would open up a Pandora's box that could potentially break a lot
> of things while the memory layout keeps changing during early
> initialization.
>
> Even with one additional test case, that is not sufficient to prove
> everything is done in a way that unblocks all the memory aliasing use
> cases, as our direct customers may use vdpa devices in various VM setups
> together with whatever devices - GPU, VGA, VNC or other devices we don't
> know about yet - that use aliased maps. It would be crazy to exercise
> every possible device, and I don't see the real benefit of building yet
> another sub-system outside the QEMU memory system's realm just to support
> all these existing devices, given the relevant alias-supporting code is
> built in and working quite well within QEMU's own memory system. Don't
> get me wrong, it's not me who asked to add this code, but patch #2 is
> definitely not an optimization patch - we have to make sure the code is
> done the right way no matter what. Functional correctness always speaks
> louder than performance or anything else.
>
> >   So maybe we need to allocate
> > the translated addresses array in vhost_svq_vring_write_descs and
> > return it, or allow it to grow.
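A rough sketch of the "allow it to grow" option: collect the translated
chunks in a GArray so that one guest iovec element can expand into several
IOVA ranges. translate_chunks() stands in for the split loop sketched
earlier; every name below is illustrative, not existing QEMU code.

    static struct iovec *svq_translate_grow(VhostShadowVirtqueue *svq,
                                            const struct iovec *sg,
                                            size_t num, bool write,
                                            size_t *out_num)
    {
        /* Start with one slot per guest element; the array grows on demand */
        GArray *vec = g_array_sized_new(false, true,
                                        sizeof(struct iovec), num);

        for (size_t i = 0; i < num; i++) {
            if (!translate_chunks(svq, vec, &sg[i], write)) {
                g_array_free(vec, true);
                return NULL;
            }
        }

        *out_num = vec->len;  /* may exceed 'num' when aliases force splits */
        return (struct iovec *)g_array_free(vec, false); /* caller g_free()s */
    }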
> >
> > Having said that, I'd keep the plan as building something simple that
> > works first, re-prioritize if we want to focus on reducing the
> > downtime or increasing the SVQ throughput, and then act accordingly
> > by profiling the advantages of the changes. I'm really looking forward
> > to switching to a (GPA|HVA)->IOVA tree, as I'm sure it will increase
> > the performance of the solution, but reviewing all of it in one shot
> > is beyond my bandwidth at the moment.
> No problem, and thank you for your time in reviewing. I think once I am
> mostly done with my other assigned work I can work with Jonah to implement
> the desired code changes for the (GPA|HVA)->IOVA tree. It's not for
> optimization or performance - I just feel it's much simpler than taking on
> the extra burden of implementing and maintaining unwarranted, duplicated
> or preventative code in the SVQ or vhost-iova-tree layer just to get
> around various issues in memory aliasing or overlapping use cases.
> Actually, I already regret not giving Jonah the right guidance in the
> first place to implement all the necessary virtqueue layer supporting code
> rather than use the ram block API hack; believe me, it's not a lot, the
> code change is really simple and easy to follow or review, and there is
> zero extra cost in GPA-HVA synchronization compared to what the current
> code already does. Otherwise, past lessons tell me it costs 10x more when
> things aren't built right in the first place - there will be more time and
> effort spent later to refactor or rewrite all the code once all kinds of
> features are already built up.
>

Got it. Looking forward to the next version then!
Re: [RFC v2 1/2] vhost-vdpa: Implement IOVA->GPA tree
Posted by Jonah Palmer 1 month, 2 weeks ago

On 10/7/24 9:51 AM, Eugenio Perez Martin wrote:
> On Fri, Oct 4, 2024 at 8:48 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>
>>
>>
>> On 10/4/24 11:17 AM, Eugenio Perez Martin wrote:
>>> On Fri, Oct 4, 2024 at 2:45 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>
>>>> Implements the IOVA->GPA tree for handling mapping, unmapping, and
>>>> translations for guest memory regions.
>>>>
>>>> When the guest has overlapping memory regions, an HVA to IOVA translation
>>>> may return an incorrect IOVA when searching the IOVA->HVA tree. This is
>>>> due to one HVA range being contained (overlapping) in another HVA range
>>>> in the IOVA->HVA tree. By creating an IOVA->GPA tree, we can use GPAs to
>>>> translate and find the correct IOVA for guest memory regions.
>>>>
>>>
>>> Yes, this first patch is super close to what I meant, just one issue
>>> and a pair of nits here and there.
>>>
>>> I'd leave the second patch as an optimization on top, if the numbers
>>> prove that adding the code is worth it.
>>>
>>
>> Ah okay, gotcha. I also wasn't sure if what you mentioned below on the
>> previous series you also wanted implemented or if these would also be
>> optimizations on top.
>>
>> [Adding code to the vhost_iova_tree layer for handling multiple buffers
>> returned from translation for the memory area where each iovec covers]:
>> -----------------------------------------------------------------------
>> "Let's say that SVQ wants to translate the HVA range
>> 0xfeda0000-0xfedd0000. So it makes available for the device two
>> chained buffers: One with addr=0x1000 len=0x20000 and the other one
>> with addr=(0x20000c1000 len=0x10000).
>>
>> The VirtIO device should be able to translate these two buffers in
>> isolation and chain them. Not optimal but it helps to keep QEMU source
>> clean, as the device already must support it."
>>
> 
> This is 100% in the device and QEMU is already able to split the
> buffers that way, so we don't need any change in QEMU.
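To make "chain them" concrete, this is roughly what the two buffers look
like on the vring, using the addr/len values quoted above. Purely
illustrative: real code goes through the SVQ descriptor helpers and
endianness macros, not raw assignments.

    #include "standard-headers/linux/virtio_ring.h"

    static void example_chain(struct vring_desc *desc)
    {
        desc[0].addr  = 0x1000;             /* first translated chunk   */
        desc[0].len   = 0x20000;
        desc[0].flags = VRING_DESC_F_NEXT;  /* buffer continues ...     */
        desc[0].next  = 1;                  /* ... in descriptor 1      */

        desc[1].addr  = 0x20000c1000ULL;    /* second translated chunk  */
        desc[1].len   = 0x10000;
        desc[1].flags = 0;                  /* end of chain             */
    }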
> 
>> [Adding a permission check to iova_tree_find_address_iterator and match
>> the range by permissions]:
>> -----------------------------------------------------------------------
>> "About the permissions, maybe we can make the permissions to be part of
>> the lookup? Instead of returning them at iova_tree_find_iova, make
>> them match at iova_tree_find_address_iterator."
>>
> 
> Ouch, I forgot this part. This is a small change we also need; can you
> add it for the next version? Thanks for reminding me!
> 

Sure can!

I apologize for my lack of understanding on this, but for example in 
vhost_svq_translate_addr, how do we know what the permissions are for 
the addresses we're translating?

I'm not sure if this is always true or not, but if the address that 
we're translating is backed by guest memory, then can we always say that 
the permission for the mapping would be IOMMU_RW (since the OS needs to 
be able to modify it)? Likewise, for addresses backed by host-only 
memory, can we always say that the permission for the mapping would be 
IOMMU_RO to avoid accidental modifications by the guest?

If so, this would mean that these mappings would never have the 
IOMMU_NONE or IOMMU_WO permissions, right?

>> But I understand that the problem with this is that we're assuming the
>> SVQ translation will always be done in a transient manner.
>>
>>>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
>>>> ---
>>>>    hw/virtio/vhost-iova-tree.c | 78 +++++++++++++++++++++++++++++++++++--
>>>>    hw/virtio/vhost-iova-tree.h |  5 +++
>>>>    hw/virtio/vhost-vdpa.c      | 20 ++++++----
>>>>    3 files changed, 92 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
>>>> index 3d03395a77..e33fd56225 100644
>>>> --- a/hw/virtio/vhost-iova-tree.c
>>>> +++ b/hw/virtio/vhost-iova-tree.c
>>>> @@ -28,12 +28,15 @@ struct VhostIOVATree {
>>>>
>>>>        /* IOVA address to qemu memory maps. */
>>>>        IOVATree *iova_taddr_map;
>>>> +
>>>> +    /* IOVA address to guest memory maps. */
>>>> +    IOVATree *iova_gpa_map;
>>>>    };
>>>>
>>>>    /**
>>>> - * Create a new IOVA tree
>>>> + * Create a new VhostIOVATree
>>>>     *
>>>> - * Returns the new IOVA tree
>>>> + * Returns the new VhostIOVATree
>>>>     */
>>>>    VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>>>>    {
>>>> @@ -44,6 +47,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>>>>        tree->iova_last = iova_last;
>>>>
>>>>        tree->iova_taddr_map = iova_tree_new();
>>>> +    tree->iova_gpa_map = iova_tree_new();
>>>>        return tree;
>>>>    }
>>>>
>>>> @@ -53,6 +57,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>>>>    void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
>>>>    {
>>>>        iova_tree_destroy(iova_tree->iova_taddr_map);
>>>> +    iova_tree_destroy(iova_tree->iova_gpa_map);
>>>>        g_free(iova_tree);
>>>>    }
>>>>
>>>> @@ -71,7 +76,7 @@ const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *tree,
>>>>    }
>>>>
>>>>    /**
>>>> - * Allocate a new mapping
>>>> + * Allocate a new mapping in the IOVA->HVA tree
>>>>     *
>>>>     * @tree: The iova tree
>>>>     * @map: The iova map
>>>> @@ -108,3 +113,70 @@ void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map)
>>>>    {
>>>>        iova_tree_remove(iova_tree->iova_taddr_map, map);
>>>>    }
>>>> +
>>>> +/**
>>>> + * Find the IOVA address stored from a guest memory address
>>>> + *
>>>> + * @tree: The VhostIOVATree
>>>> + * @map: The map with the guest memory address
>>>> + *
>>>> + * Return the stored mapping, or NULL if not found.
>>>> + */
>>>> +const DMAMap *vhost_iova_gpa_tree_find_iova(const VhostIOVATree *tree,
>>>> +                                            const DMAMap *map)
>>>
>>> Nit: I'm not a native English speaker, but I think the vhost_iova_tree
>>> prefix should not be broken, for coherency with the rest of the
>>> functions. What about vhost_iova_tree_find_iova_gpa, i.e. a _gpa variant?
>>>
>>
>> Yea, I totally understand what you mean here and I have *no problem*
>> making it into vhost_iova_tree_find_iova_gpa.
>>
>> Just to add my two cents on this, for what it's worth, now that we have
>> both an IOVA->HVA tree and a IOVA->GPA tree, coming up with function
>> names that operate on this new tree while conforming to the
>> vhost_iova_tree convention and being descriptive in the naming is a bit
>> difficult.
>>
>> For example, to me, vhost_iova_tree_find_iova_gpa would seem a bit
>> misleading to me if I didn't know about it beforehand (or was just
>> seeing it for the first time). Like, are we finding the IOVA or GPA or
>> both? And what tree are we operating on?
>>
>> If this was some personal code I was writing and I had free rein over
>> it, I personally would go with a format like:
>>
>> vhost_<tree this function concerns>_tree_<action>
>>
>> So a name like vhost_iova_gpa_tree_find_iova communicates to me that
>> we're operating on the iova_gpa (IOVA->GPA) tree and our action is to
>> find_iova (find the IOVA).
>>
>> Similarly for something like vhost_iova_gpa_tree_remove or
>> vhost_iova_hva_tree_remove, etc.
>>
>> But obviously this is the complete opposite of personal code and
>> certainly not something that's needed so I'm totally okay with renaming
>> it to vhost_iova_tree_find_iova_gpa :)
>>
> 
> You're creating the patch and you (and everybody) can comment on it,
> of course :).
> 
> To me, the fact that the GPA is stored in a separate *tree* should be an
> implementation detail not exposed by the function names. Ideally, the
> user of the VhostIOVATree just knows that some entries are tagged with
> the GPA and others are not. Put another way, we could replace the GPA
> with another struct, or merge both trees, and the API would remain
> unchanged if we just add _gpa as a suffix.
> 
> Having said that, I'm happy with both names so feel free to keep yours.
> 
> 
> 

Ok! I'll change it to vhost_iova_tree_find_iova_gpa and maybe in the 
future if this kind of change is warranted then we can rename all 
concerned functions and not just this one :)

>>>> +{
>>>> +    return iova_tree_find_iova(tree->iova_gpa_map, map);
>>>> +}
>>>> +
>>>> +/**
>>>> + * Allocate new mappings in the IOVA->HVA & IOVA->GPA trees
>>>> + *
>>>> + * @tree: The VhostIOVATree
>>>> + * @map: The iova map
>>>> + * @gpa: The guest physical address (GPA)
>>>> + *
>>>> + * Returns:
>>>> + * - IOVA_OK if the map fits both containers
>>>> + * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
>>>> + * - IOVA_ERR_NOMEM if the IOVA->HVA tree cannot allocate more space
>>>> + *
>>>> + * It returns an assigned iova in map->iova if return value is IOVA_OK.
>>>> + */
>>>> +int vhost_iova_tree_map_alloc_gpa(VhostIOVATree *tree, DMAMap *map, hwaddr gpa)
>>>> +{
>>>> +    int ret;
>>>> +
>>>> +    /* Some vhost devices don't like addr 0. Skip first page */
>>>> +    hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size();
>>>> +
>>>> +    if (map->translated_addr + map->size < map->translated_addr ||
>>>> +        map->perm == IOMMU_NONE) {
>>>> +        return IOVA_ERR_INVALID;
>>>> +    }
>>>> +
>>>> +    /* Allocate a node in the IOVA->HVA tree */
>>>> +    ret = iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
>>>> +                              tree->iova_last);
>>>
>>> Why not call vhost_iova_tree_map_alloc instead of duplicating it here?
>>>
>>
>> Great question with no good answer! For some reason I decided against
>> putting it in there but will do that in the next series.
>>
>>>> +    if (unlikely(ret != IOVA_OK)) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    /* Insert a node in the IOVA->GPA tree */
>>>> +    map->translated_addr = gpa;
>>>> +    return iova_tree_insert(tree->iova_gpa_map, map);
>>>> +}
>>>> +
>>>> +/**
>>>> + * Remove existing mappings from the IOVA->HVA & IOVA->GPA trees
>>>> + *
>>>> + * @iova_tree: The VhostIOVATree
>>>> + * @map: The map to remove
>>>> + */
>>>> +void vhost_iova_tree_remove_gpa(VhostIOVATree *iova_tree, DMAMap map)
>>>> +{
>>>> +    /* Remove the existing mapping from the IOVA->GPA tree */
>>>> +    iova_tree_remove(iova_tree->iova_gpa_map, map);
>>>> +
>>>> +    /* Remove the corresponding mapping from the IOVA->HVA tree */
>>>> +    iova_tree_remove(iova_tree->iova_taddr_map, map);
>>>
>>> If we remove it blindly from both trees, we are keeping the bug, aren't we?
>>>
>>> I think the remove should receive the "gpa" as a parameter, same as
>>> alloc_gpa. After that, vhost_iova_tree_remove_gpa looks up the right
>>> iova in iova_gpa_map, and only after that does it remove that iova via
>>> iova_tree_remove.
>>>
>>> If it makes things easier it could receive (hwaddr gpa, size_t len) or
>>> all of the info in a DMAMap. What do you think?
>>>
>>
>> Initially that was my plan but this only gets called in
>> vhost_vdpa_listener_region_add/del and in both functions, by the time
>> this removal function is called, we already have the correct IOVA.
>>
>> In vhost_vdpa_listener_region_add, we just allocated that IOVA and saved
>> it in mem_region. In vhost_vdpa_listener_region_del, we already found
>> the IOVA via vhost_iova_gpa_tree_find_iova prior to calling the removal
>> function.
>>
>> But I could be misunderstanding something here. Let me know if I am.
>>
> 
> Ok I missed that. I think you're totally right.
> 
>>>> +}
>>>> diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
>>>> index 4adfd79ff0..511c6d18ae 100644
>>>> --- a/hw/virtio/vhost-iova-tree.h
>>>> +++ b/hw/virtio/vhost-iova-tree.h
>>>> @@ -23,5 +23,10 @@ const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
>>>>                                            const DMAMap *map);
>>>>    int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
>>>>    void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map);
>>>> +const DMAMap *vhost_iova_gpa_tree_find_iova(const VhostIOVATree *iova_tree,
>>>> +                                            const DMAMap *map);
>>>> +int vhost_iova_tree_map_alloc_gpa(VhostIOVATree *iova_tree, DMAMap *map,
>>>> +                                  hwaddr gpa);
>>>> +void vhost_iova_tree_remove_gpa(VhostIOVATree *iova_tree, DMAMap map);
>>>>
>>>>    #endif
>>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>>> index 3cdaa12ed5..591ff426e7 100644
>>>> --- a/hw/virtio/vhost-vdpa.c
>>>> +++ b/hw/virtio/vhost-vdpa.c
>>>> @@ -365,9 +365,16 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>>>>            mem_region.size = int128_get64(llsize) - 1,
>>>>            mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
>>>>
>>>> -        r = vhost_iova_tree_map_alloc(s->iova_tree, &mem_region);
>>>> +        r = vhost_iova_tree_map_alloc_gpa(s->iova_tree, &mem_region,
>>>> +                                          section->offset_within_address_space);
>>>>            if (unlikely(r != IOVA_OK)) {
>>>>                error_report("Can't allocate a mapping (%d)", r);
>>>> +
>>>> +            /* Insertion to IOVA->GPA tree failed */
>>>> +            if (mem_region.translated_addr ==
>>>> +                section->offset_within_address_space) {
>>>> +                goto fail_map;
>>>> +            }
>>>
>>> We can move this cleanup code into vhost_iova_tree_map_alloc_gpa, can't we?
>>>
>>
>> Sure can. We'd still need to check if vhost_iova_tree_map_alloc_gpa
>> returned IOVA_OK though and goto the fail label.
>>
> 
> Yes, right.
> 
> Thanks!
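A sketch of what folding the cleanup into the allocator could look like. It
only uses functions already present in this series (vhost_iova_tree_map_alloc,
iova_tree_insert, iova_tree_remove); the rollback itself is an assumption
about the next version, not existing code.

    int vhost_iova_tree_map_alloc_gpa(VhostIOVATree *tree, DMAMap *map,
                                      hwaddr gpa)
    {
        /* Allocate the IOVA range against the IOVA->HVA tree */
        int ret = vhost_iova_tree_map_alloc(tree, map);
        if (unlikely(ret != IOVA_OK)) {
            return ret;
        }

        /* Tag the same IOVA range with its GPA in the IOVA->GPA tree */
        map->translated_addr = gpa;
        ret = iova_tree_insert(tree->iova_gpa_map, map);
        if (unlikely(ret != IOVA_OK)) {
            /* Undo the IOVA->HVA allocation so the caller needs one check */
            iova_tree_remove(tree->iova_taddr_map, *map);
        }

        return ret;
    }

The caller in vhost_vdpa_listener_region_add() would then simply goto the
fail label on any non-IOVA_OK return, as noted above.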
> 
>>>>                goto fail;
>>>>            }
>>>>
>>>> @@ -386,7 +393,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>>>>
>>>>    fail_map:
>>>>        if (s->shadow_data) {
>>>> -        vhost_iova_tree_remove(s->iova_tree, mem_region);
>>>> +        vhost_iova_tree_remove_gpa(s->iova_tree, mem_region);
>>>>        }
>>>>
>>>>    fail:
>>>> @@ -440,21 +447,18 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
>>>>
>>>>        if (s->shadow_data) {
>>>>            const DMAMap *result;
>>>> -        const void *vaddr = memory_region_get_ram_ptr(section->mr) +
>>>> -            section->offset_within_region +
>>>> -            (iova - section->offset_within_address_space);
>>>>            DMAMap mem_region = {
>>>> -            .translated_addr = (hwaddr)(uintptr_t)vaddr,
>>>> +            .translated_addr = section->offset_within_address_space,
>>>>                .size = int128_get64(llsize) - 1,
>>>>            };
>>>>
>>>> -        result = vhost_iova_tree_find_iova(s->iova_tree, &mem_region);
>>>> +        result = vhost_iova_gpa_tree_find_iova(s->iova_tree, &mem_region);
>>>>            if (!result) {
>>>>                /* The memory listener map wasn't mapped */
>>>>                return;
>>>>            }
>>>>            iova = result->iova;
>>>> -        vhost_iova_tree_remove(s->iova_tree, *result);
>>>> +        vhost_iova_tree_remove_gpa(s->iova_tree, *result);
>>>>        }
>>>>        vhost_vdpa_iotlb_batch_begin_once(s);
>>>>        /*
>>>> --
>>>> 2.43.5
>>>>
>>>
>>
> 


Re: [RFC v2 1/2] vhost-vdpa: Implement IOVA->GPA tree
Posted by Eugenio Perez Martin 1 month, 2 weeks ago
On Mon, Oct 7, 2024 at 5:38 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
>
>
> On 10/7/24 9:51 AM, Eugenio Perez Martin wrote:
> > On Fri, Oct 4, 2024 at 8:48 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>
> >>
> >>
> >> On 10/4/24 11:17 AM, Eugenio Perez Martin wrote:
> >>> On Fri, Oct 4, 2024 at 2:45 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>>>
> >>>> Implements the IOVA->GPA tree for handling mapping, unmapping, and
> >>>> translations for guest memory regions.
> >>>>
> >>>> When the guest has overlapping memory regions, an HVA to IOVA translation
> >>>> may return an incorrect IOVA when searching the IOVA->HVA tree. This is
> >>>> due to one HVA range being contained (overlapping) in another HVA range
> >>>> in the IOVA->HVA tree. By creating an IOVA->GPA tree, we can use GPAs to
> >>>> translate and find the correct IOVA for guest memory regions.
> >>>>
> >>>
> >>> Yes, this first patch is super close to what I meant, just one issue
> >>> and a pair of nits here and there.
> >>>
> >>> I'd leave the second patch as an optimization on top, if the numbers
> >>> prove that adding the code is worth it.
> >>>
> >>
> >> Ah okay, gotcha. I also wasn't sure if what you mentioned below on the
> >> previous series you also wanted implemented or if these would also be
> >> optimizations on top.
> >>
> >> [Adding code to the vhost_iova_tree layer for handling multiple buffers
> >> returned from translation for the memory area where each iovec covers]:
> >> -----------------------------------------------------------------------
> >> "Let's say that SVQ wants to translate the HVA range
> >> 0xfeda0000-0xfedd0000. So it makes available for the device two
> >> chained buffers: One with addr=0x1000 len=0x20000 and the other one
> >> with addr=(0x20000c1000 len=0x10000).
> >>
> >> The VirtIO device should be able to translate these two buffers in
> >> isolation and chain them. Not optimal but it helps to keep QEMU source
> >> clean, as the device already must support it."
> >>
> >
> > This is 100% in the device and QEMU is already able to split the
> > buffers that way, so we don't need any change in QEMU.
> >
> >> [Adding a permission check to iova_tree_find_address_iterator and match
> >> the range by permissions]:
> >> -----------------------------------------------------------------------
> >> "About the permissions, maybe we can make the permissions to be part of
> >> the lookup? Instead of returning them at iova_tree_find_iova, make
> >> them match at iova_tree_find_address_iterator."
> >>
> >
> > Ouch, I forgot this part. This is a small change we also need; can you
> > add it for the next version? Thanks for reminding me!
> >
>
> Sure can!
>
> I apologize for my lack of understanding on this, but for example in
> vhost_svq_translate_addr, how do we know what the permissions are for
> the addresses we're translating?
>

No need to apologize :).

The calling function, vhost_svq_vring_write_descs, knows if they are
RW (write=true) or RO (write=false). We need to pass the parameter to
vhost_svq_translate_addr.

> I'm not sure if this is always true or not, but if the address that
> we're translating is backed by guest memory, then can we always say that
> the permission for the mapping would be IOMMU_RW (since the OS needs to
> be able to modify it)? Likewise, for addresses backed by host-only
> memory, can we always say that the permission for the mapping would be
> IOMMU_RO to avoid accidental modifications by the guest?
>

Not really, there are parts of the guest memory that are RO and the
SVQ's used ring in the host memory is RW.

It should be enough to check that the write bit is set in the mapping's
flags whenever a write is requested.

> If so, this would mean that these mappings would never have the
> IOMMU_NONE or IOMMU_WO permissions, right?
>

That's right.
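
Putting the two answers together, a sketch of how the permission could
become part of the lookup: the caller (e.g. vhost_svq_translate_addr() with
a new 'write' flag) sets needle.perm to the access it needs, and the
iterator skips mappings that do not grant a requested write. The body below
follows iova_tree_find_address_iterator() in util/iova-tree.c; treat the
added check as an illustration of the idea, not the final patch.

    static gboolean iova_tree_find_address_iterator(gpointer key,
                                                    gpointer value,
                                                    gpointer data)
    {
        const DMAMap *map = key;
        IOVATreeFindIOVAArgs *args = data;
        const DMAMap *needle = args->needle;

        g_assert(key == value);

        if (map->translated_addr + map->size < needle->translated_addr ||
            needle->translated_addr + needle->size < map->translated_addr) {
            return false;                    /* ranges do not overlap */
        }

        /* New: a requested write must be granted by the mapping */
        if ((needle->perm & IOMMU_WO) && !(map->perm & IOMMU_WO)) {
            return false;
        }

        args->result = map;
        return true;
    }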