[PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager

Chenyi Qiang posted 7 patches 5 months ago
There is a newer version of this series
[PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Chenyi Qiang 5 months ago
As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
uncoordinated discard") highlighted, some subsystems like VFIO might
disable ram block discard. However, guest_memfd relies on the discard
operation to perform page conversion between private and shared memory.
This can lead to stale IOMMU mapping issue when assigning a hardware
device to a confidential VM via shared memory (unprotected memory
pages). Blocking shared page discard can solve this problem, but it
could cause guests to consume twice the memory with VFIO, which is not
acceptable in some cases. An alternative solution is to notify other
systems such as VFIO to refresh their outdated IOMMU mappings.

RamDiscardManager is an existing concept (used by virtio-mem) to adjust
VFIO mappings in relation to VM page assignment. Effectively, a page
conversion is similar to hot-removing a page in one mode and adding it
back in the other, so the same work that happens in response to
virtio-mem changes needs to happen for page conversion events.
Introduce a RamDiscardManager for guest_memfd to achieve this.

However, guest_memfd is not an object so it cannot directly implement
the RamDiscardManager interface.

One solution is to implement the interface in HostMemoryBackend. Any
guest_memfd-backed host memory backend can register itself in the target
MemoryRegion. However, this solution doesn't cover the scenario where a
guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g.
the virtual BIOS MemoryRegion.

Thus, choose the second option, i.e. define an object type named
guest_memfd_manager with RamDiscardManager interface. Upon creation of
guest_memfd, a new guest_memfd_manager object can be instantiated and
registered to the managed guest_memfd MemoryRegion to handle the page
conversion events.
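
(For illustration only: a minimal sketch, not part of this patch, of how such
an object might be instantiated and attached to the guest_memfd MemoryRegion.
The actual hook point lives elsewhere in this series; the helper name below is
made up.)

static void guest_memfd_attach_manager(MemoryRegion *mr)
{
    GuestMemfdManager *gmm;

    gmm = GUEST_MEMFD_MANAGER(object_new(TYPE_GUEST_MEMFD_MANAGER));
    gmm->mr = mr;
    gmm->block_size = qemu_real_host_page_size();
    gmm->bitmap_size = memory_region_size(mr) / gmm->block_size;
    gmm->bitmap = bitmap_new(gmm->bitmap_size);

    /* Let the object handle populate/discard notifications for this region. */
    memory_region_set_ram_discard_manager(mr, RAM_DISCARD_MANAGER(gmm));
}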

In the context of guest_memfd, the discarded state signifies that the
page is private, while the populated state indicates that the page is
shared. The state of the memory is tracked at the granularity of the
host page size (i.e. block_size), as the minimum conversion size can be
one page per request.
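
(A minimal sketch, not part of this patch, of how the bitmap encodes that
state at block_size granularity; the helper names are illustrative.)

/* Shared (populated) iff every bit covering the range is set. */
static bool guest_memfd_is_range_shared(const GuestMemfdManager *gmm,
                                        uint64_t offset, uint64_t size)
{
    uint64_t first_bit = offset / gmm->block_size;
    uint64_t nbits = size / gmm->block_size;

    return find_next_zero_bit(gmm->bitmap, first_bit + nbits,
                              first_bit) >= first_bit + nbits;
}

/* Mark a range shared after a private->shared conversion. */
static void guest_memfd_mark_range_shared(GuestMemfdManager *gmm,
                                          uint64_t offset, uint64_t size)
{
    bitmap_set(gmm->bitmap, offset / gmm->block_size,
               size / gmm->block_size);
}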

In addition, VFIO expects the DMA mapping for a specific iova to be
unmapped with the same granularity with which it was mapped. However,
confidential VMs may do partial conversions, e.g. a conversion may happen
on a small region within a larger one. To prevent such invalid cases, and
until a potential optimization becomes available, all operations are
performed with 4K granularity.
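
(Sketch only: the kind of sanity check this 4K rule implies for an incoming
conversion request. The helper name is illustrative; the real state-change
handling is added elsewhere in this series.)

static bool guest_memfd_valid_conversion(const GuestMemfdManager *gmm,
                                         uint64_t offset, uint64_t size)
{
    /* Offset and size must be 4K (block_size) aligned and within the region. */
    return QEMU_IS_ALIGNED(offset, gmm->block_size) &&
           QEMU_IS_ALIGNED(size, gmm->block_size) &&
           offset + size <= memory_region_size(gmm->mr);
}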

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 include/sysemu/guest-memfd-manager.h |  46 +++++
 system/guest-memfd-manager.c         | 250 +++++++++++++++++++++++++++
 system/meson.build                   |   1 +
 3 files changed, 297 insertions(+)
 create mode 100644 include/sysemu/guest-memfd-manager.h
 create mode 100644 system/guest-memfd-manager.c

diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/guest-memfd-manager.h
new file mode 100644
index 0000000000..ba4a99b614
--- /dev/null
+++ b/include/sysemu/guest-memfd-manager.h
@@ -0,0 +1,46 @@
+/*
+ * QEMU guest memfd manager
+ *
+ * Copyright Intel
+ *
+ * Author:
+ *      Chenyi Qiang <chenyi.qiang@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory
+ *
+ */
+
+#ifndef SYSEMU_GUEST_MEMFD_MANAGER_H
+#define SYSEMU_GUEST_MEMFD_MANAGER_H
+
+#include "sysemu/hostmem.h"
+
+#define TYPE_GUEST_MEMFD_MANAGER "guest-memfd-manager"
+
+OBJECT_DECLARE_TYPE(GuestMemfdManager, GuestMemfdManagerClass, GUEST_MEMFD_MANAGER)
+
+struct GuestMemfdManager {
+    Object parent;
+
+    /* Managed memory region. */
+    MemoryRegion *mr;
+
+    /*
+     * 1-setting of the bit represents the memory is populated (shared).
+     */
+    int32_t bitmap_size;
+    unsigned long *bitmap;
+
+    /* block size and alignment */
+    uint64_t block_size;
+
+    /* listeners to notify on populate/discard activity. */
+    QLIST_HEAD(, RamDiscardListener) rdl_list;
+};
+
+struct GuestMemfdManagerClass {
+    ObjectClass parent_class;
+};
+
+#endif
diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
new file mode 100644
index 0000000000..d7e105fead
--- /dev/null
+++ b/system/guest-memfd-manager.c
@@ -0,0 +1,250 @@
+/*
+ * QEMU guest memfd manager
+ *
+ * Copyright Intel
+ *
+ * Author:
+ *      Chenyi Qiang <chenyi.qiang@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+#include "sysemu/guest-memfd-manager.h"
+
+OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(GuestMemfdManager,
+                                          guest_memfd_manager,
+                                          GUEST_MEMFD_MANAGER,
+                                          OBJECT,
+                                          { TYPE_RAM_DISCARD_MANAGER },
+                                          { })
+
+static bool guest_memfd_rdm_is_populated(const RamDiscardManager *rdm,
+                                         const MemoryRegionSection *section)
+{
+    const GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+    uint64_t first_bit = section->offset_within_region / gmm->block_size;
+    uint64_t last_bit = first_bit + int128_get64(section->size) / gmm->block_size - 1;
+    unsigned long first_discard_bit;
+
+    first_discard_bit = find_next_zero_bit(gmm->bitmap, last_bit + 1, first_bit);
+    return first_discard_bit > last_bit;
+}
+
+typedef int (*guest_memfd_section_cb)(MemoryRegionSection *s, void *arg);
+
+static int guest_memfd_notify_populate_cb(MemoryRegionSection *section, void *arg)
+{
+    RamDiscardListener *rdl = arg;
+
+    return rdl->notify_populate(rdl, section);
+}
+
+static int guest_memfd_notify_discard_cb(MemoryRegionSection *section, void *arg)
+{
+    RamDiscardListener *rdl = arg;
+
+    rdl->notify_discard(rdl, section);
+
+    return 0;
+}
+
+static int guest_memfd_for_each_populated_section(const GuestMemfdManager *gmm,
+                                                  MemoryRegionSection *section,
+                                                  void *arg,
+                                                  guest_memfd_section_cb cb)
+{
+    unsigned long first_one_bit, last_one_bit;
+    uint64_t offset, size;
+    int ret = 0;
+
+    first_one_bit = section->offset_within_region / gmm->block_size;
+    first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size, first_one_bit);
+
+    while (first_one_bit < gmm->bitmap_size) {
+        MemoryRegionSection tmp = *section;
+
+        offset = first_one_bit * gmm->block_size;
+        last_one_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
+                                          first_one_bit + 1) - 1;
+        size = (last_one_bit - first_one_bit + 1) * gmm->block_size;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            break;
+        }
+
+        ret = cb(&tmp, arg);
+        if (ret) {
+            break;
+        }
+
+        first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
+                                      last_one_bit + 2);
+    }
+
+    return ret;
+}
+
+static int guest_memfd_for_each_discarded_section(const GuestMemfdManager *gmm,
+                                                  MemoryRegionSection *section,
+                                                  void *arg,
+                                                  guest_memfd_section_cb cb)
+{
+    unsigned long first_zero_bit, last_zero_bit;
+    uint64_t offset, size;
+    int ret = 0;
+
+    first_zero_bit = section->offset_within_region / gmm->block_size;
+    first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
+                                        first_zero_bit);
+
+    while (first_zero_bit < gmm->bitmap_size) {
+        MemoryRegionSection tmp = *section;
+
+        offset = first_zero_bit * gmm->block_size;
+        last_zero_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
+                                      first_zero_bit + 1) - 1;
+        size = (last_zero_bit - first_zero_bit + 1) * gmm->block_size;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            break;
+        }
+
+        ret = cb(&tmp, arg);
+        if (ret) {
+            break;
+        }
+
+        first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
+                                            last_zero_bit + 2);
+    }
+
+    return ret;
+}
+
+static uint64_t guest_memfd_rdm_get_min_granularity(const RamDiscardManager *rdm,
+                                                    const MemoryRegion *mr)
+{
+    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+
+    g_assert(mr == gmm->mr);
+    return gmm->block_size;
+}
+
+static void guest_memfd_rdm_register_listener(RamDiscardManager *rdm,
+                                              RamDiscardListener *rdl,
+                                              MemoryRegionSection *section)
+{
+    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+    int ret;
+
+    g_assert(section->mr == gmm->mr);
+    rdl->section = memory_region_section_new_copy(section);
+
+    QLIST_INSERT_HEAD(&gmm->rdl_list, rdl, next);
+
+    ret = guest_memfd_for_each_populated_section(gmm, section, rdl,
+                                                 guest_memfd_notify_populate_cb);
+    if (ret) {
+        error_report("%s: Failed to register RAM discard listener: %s", __func__,
+                     strerror(-ret));
+    }
+}
+
+static void guest_memfd_rdm_unregister_listener(RamDiscardManager *rdm,
+                                                RamDiscardListener *rdl)
+{
+    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+    int ret;
+
+    g_assert(rdl->section);
+    g_assert(rdl->section->mr == gmm->mr);
+
+    ret = guest_memfd_for_each_populated_section(gmm, rdl->section, rdl,
+                                                 guest_memfd_notify_discard_cb);
+    if (ret) {
+        error_report("%s: Failed to unregister RAM discard listener: %s", __func__,
+                     strerror(-ret));
+    }
+
+    memory_region_section_free_copy(rdl->section);
+    rdl->section = NULL;
+    QLIST_REMOVE(rdl, next);
+
+}
+
+typedef struct GuestMemfdReplayData {
+    void *fn;
+    void *opaque;
+} GuestMemfdReplayData;
+
+static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, void *arg)
+{
+    struct GuestMemfdReplayData *data = arg;
+    ReplayRamPopulate replay_fn = data->fn;
+
+    return replay_fn(section, data->opaque);
+}
+
+static int guest_memfd_rdm_replay_populated(const RamDiscardManager *rdm,
+                                            MemoryRegionSection *section,
+                                            ReplayRamPopulate replay_fn,
+                                            void *opaque)
+{
+    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque };
+
+    g_assert(section->mr == gmm->mr);
+    return guest_memfd_for_each_populated_section(gmm, section, &data,
+                                                  guest_memfd_rdm_replay_populated_cb);
+}
+
+static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection *section, void *arg)
+{
+    struct GuestMemfdReplayData *data = arg;
+    ReplayRamDiscard replay_fn = data->fn;
+
+    replay_fn(section, data->opaque);
+
+    return 0;
+}
+
+static void guest_memfd_rdm_replay_discarded(const RamDiscardManager *rdm,
+                                             MemoryRegionSection *section,
+                                             ReplayRamDiscard replay_fn,
+                                             void *opaque)
+{
+    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque };
+
+    g_assert(section->mr == gmm->mr);
+    guest_memfd_for_each_discarded_section(gmm, section, &data,
+                                           guest_memfd_rdm_replay_discarded_cb);
+}
+
+static void guest_memfd_manager_init(Object *obj)
+{
+    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
+
+    QLIST_INIT(&gmm->rdl_list);
+}
+
+static void guest_memfd_manager_finalize(Object *obj)
+{
+    g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
+}
+
+static void guest_memfd_manager_class_init(ObjectClass *oc, void *data)
+{
+    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
+
+    rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
+    rdmc->register_listener = guest_memfd_rdm_register_listener;
+    rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
+    rdmc->is_populated = guest_memfd_rdm_is_populated;
+    rdmc->replay_populated = guest_memfd_rdm_replay_populated;
+    rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
+}
diff --git a/system/meson.build b/system/meson.build
index 4952f4b2c7..ed4e1137bd 100644
--- a/system/meson.build
+++ b/system/meson.build
@@ -15,6 +15,7 @@ system_ss.add(files(
   'dirtylimit.c',
   'dma-helpers.c',
   'globals.c',
+  'guest-memfd-manager.c',
   'memory_mapping.c',
   'qdev-monitor.c',
   'qtest.c',
-- 
2.43.5
Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Peter Xu 3 months, 3 weeks ago
Two trivial comments I spot:

On Fri, Dec 13, 2024 at 03:08:44PM +0800, Chenyi Qiang wrote:
> +struct GuestMemfdManager {
> +    Object parent;
> +
> +    /* Managed memory region. */
> +    MemoryRegion *mr;
> +
> +    /*
> +     * 1-setting of the bit represents the memory is populated (shared).
> +     */
> +    int32_t bitmap_size;
> +    unsigned long *bitmap;

Might be clearer to name the bitmap directly as what it represents.  E.g.,
shared_bitmap?

> +
> +    /* block size and alignment */
> +    uint64_t block_size;

Can we always fetch it from the MR/ramblock? If this is needed, better add
some comment explaining why.

> +
> +    /* listeners to notify on populate/discard activity. */
> +    QLIST_HEAD(, RamDiscardListener) rdl_list;
> +};

-- 
Peter Xu
Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Chenyi Qiang 3 months, 3 weeks ago
Thanks Peter for your review!

On 1/21/2025 2:09 AM, Peter Xu wrote:
> Two trivial comments I spot:
> 
> On Fri, Dec 13, 2024 at 03:08:44PM +0800, Chenyi Qiang wrote:
>> +struct GuestMemfdManager {
>> +    Object parent;
>> +
>> +    /* Managed memory region. */
>> +    MemoryRegion *mr;
>> +
>> +    /*
>> +     * 1-setting of the bit represents the memory is populated (shared).
>> +     */
>> +    int32_t bitmap_size;
>> +    unsigned long *bitmap;
> 
> Might be clearer to name the bitmap directly as what it represents.  E.g.,
> shared_bitmap?

Make sense.

> 
>> +
>> +    /* block size and alignment */
>> +    uint64_t block_size;
> 
> Can we always fetch it from the MR/ramblock? If this is needed, better add
> some comment explaining why.

The block_size is the granularity used to track the private/shared
attribute in the bitmap. It is currently hardcoded to 4K, as guest_memfd
performs page conversions with at least 4K size and alignment.
I think it is somewhat of a variable to cache the size and avoid many
getpagesize() calls.

> 
>> +
>> +    /* listeners to notify on populate/discard activity. */
>> +    QLIST_HEAD(, RamDiscardListener) rdl_list;
>> +};
>
Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Peter Xu 3 months, 3 weeks ago
On Tue, Jan 21, 2025 at 05:00:45PM +0800, Chenyi Qiang wrote:
> >> +
> >> +    /* block size and alignment */
> >> +    uint64_t block_size;
> > 
> > Can we always fetch it from the MR/ramblock? If this is needed, better add
> > some comment explaining why.
> 
> The block_size is the granularity used to track the private/shared
> attribute in the bitmap. It is currently hardcoded to 4K as guest_memfd
> may manipulate the page conversion in at least 4K size and alignment.
> I think It is somewhat a variable to cache the size and can avoid many
> getpagesize() calls.

Though qemu does it frequently.. e.g. qemu_real_host_page_size() wraps
that.  So IIUC that's not a major concern, and if it's a concern maybe we
can cache it globally instead.

OTOH, this is not a per-ramblock limitation either, IIUC.  So maybe instead
of caching it per manager, we could have memory_attr_manager_get_psize()
helper (or any better name..):

memory_attr_manager_get_psize(MemoryAttrManager *mgr)
{
        /* Due to limitation of ... always notify with host psize */
        return qemu_real_host_page_size();
}

Then in the future if necessary, switch to:

memory_attr_manager_get_psize(MemoryAttrManager *mgr)
{
        return mgr->mr->ramblock->pagesize;
}

Thanks,

-- 
Peter Xu
Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Chenyi Qiang 3 months, 2 weeks ago
Sorry I missed this mail.

On 1/21/2025 11:38 PM, Peter Xu wrote:
> On Tue, Jan 21, 2025 at 05:00:45PM +0800, Chenyi Qiang wrote:
>>>> +
>>>> +    /* block size and alignment */
>>>> +    uint64_t block_size;
>>>
>>> Can we always fetch it from the MR/ramblock? If this is needed, better add
>>> some comment explaining why.
>>
>> The block_size is the granularity used to track the private/shared
>> attribute in the bitmap. It is currently hardcoded to 4K as guest_memfd
>> may manipulate the page conversion in at least 4K size and alignment.
>> I think It is somewhat a variable to cache the size and can avoid many
>> getpagesize() calls.
> 
> Though qemu does it frequently.. e.g. qemu_real_host_page_size() wraps
> that.  So IIUC that's not a major concern, and if it's a concern maybe we
> can cache it globally instead.
> 
> OTOH, this is not a per-ramblock limitation either, IIUC.  So maybe instead
> of caching it per manager, we could have memory_attr_manager_get_psize()
> helper (or any better name..):
> 
> memory_attr_manager_get_psize(MemoryAttrManager *mgr)
> {
>         /* Due to limitation of ... always notify with host psize */
>         return qemu_real_host_page_size();
> }
> 
> Then in the future if necessary, switch to:
> 
> memory_attr_manager_get_psize(MemoryAttrManager *mgr)
> {
>         return mgr->mr->ramblock->pagesize;
> }

This looks good to me. I'll change in this way.

> 
> Thanks,
>
Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by David Hildenbrand 3 months, 3 weeks ago
On 21.01.25 10:00, Chenyi Qiang wrote:
> Thanks Peter for your review!
> 
> On 1/21/2025 2:09 AM, Peter Xu wrote:
>> Two trivial comments I spot:
>>
>> On Fri, Dec 13, 2024 at 03:08:44PM +0800, Chenyi Qiang wrote:
>>> +struct GuestMemfdManager {
>>> +    Object parent;
>>> +
>>> +    /* Managed memory region. */
>>> +    MemoryRegion *mr;
>>> +
>>> +    /*
>>> +     * 1-setting of the bit represents the memory is populated (shared).
>>> +     */
>>> +    int32_t bitmap_size;
>>> +    unsigned long *bitmap;
>>
>> Might be clearer to name the bitmap directly as what it represents.  E.g.,
>> shared_bitmap?
> 
> Make sense.
> 

BTW, I was wondering if this information should be stored/linked from 
the RAMBlock, where we already store the guest_memdfd "int guest_memfd;".

For example, having a "struct guest_memfd_state", and either embedding 
it in the RAMBlock or dynamically allocating and linking it.

Alternatively, it would be such an object that we would simply link from 
the RAMBlock. (depending on which object will implement the manager 
interface)

In any case, having all guest_memfd state that belongs to a RAMBlock at 
a single location might be cleanest.
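
For illustration, a rough sketch of what that could look like (all names
below are hypothetical):

/* Hypothetical per-RAMBlock guest_memfd state, grouped in one place. */
struct GuestMemfdState {
    unsigned long *shared_bitmap;   /* 1 == populated (shared) */
    int32_t shared_bitmap_size;
    QLIST_HEAD(, RamDiscardListener) rdl_list;
};

/*
 * In RAMBlock, next to the existing "int guest_memfd":
 *
 *     struct GuestMemfdState *guest_memfd_state;  (NULL if unused)
 */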

-- 
Cheers,

David / dhildenb
Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Chenyi Qiang 3 months, 3 weeks ago
On 1/21/2025 5:26 PM, David Hildenbrand wrote:
> On 21.01.25 10:00, Chenyi Qiang wrote:
>> Thanks Peter for your review!
>>
>> On 1/21/2025 2:09 AM, Peter Xu wrote:
>>> Two trivial comments I spot:
>>>
>>> On Fri, Dec 13, 2024 at 03:08:44PM +0800, Chenyi Qiang wrote:
>>>> +struct GuestMemfdManager {
>>>> +    Object parent;
>>>> +
>>>> +    /* Managed memory region. */
>>>> +    MemoryRegion *mr;
>>>> +
>>>> +    /*
>>>> +     * 1-setting of the bit represents the memory is populated
>>>> (shared).
>>>> +     */
>>>> +    int32_t bitmap_size;
>>>> +    unsigned long *bitmap;
>>>
>>> Might be clearer to name the bitmap directly as what it represents. 
>>> E.g.,
>>> shared_bitmap?
>>
>> Make sense.
>>
> 
> BTW, I was wondering if this information should be stored/linked from
> the RAMBlock, where we already store the guest_memdfd "int guest_memfd;".
> 
> For example, having a "struct guest_memfd_state", and either embedding
> it in the RAMBlock or dynamically allocating and linking it.
> 
> Alternatively, it would be such an object that we would simply link from
> the RAMBlock. (depending on which object will implement the manager
> interface)
> 
> In any case, having all guest_memfd state that belongs to a RAMBlock at
> a single location might be cleanest.

Good suggestion. Following the design of this series, we can add a link to
the guest_memfd_manager object in RAMBlock.

> 


Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by David Hildenbrand 3 months, 3 weeks ago
On 21.01.25 11:16, Chenyi Qiang wrote:
> 
> 
> On 1/21/2025 5:26 PM, David Hildenbrand wrote:
>> On 21.01.25 10:00, Chenyi Qiang wrote:
>>> Thanks Peter for your review!
>>>
>>> On 1/21/2025 2:09 AM, Peter Xu wrote:
>>>> Two trivial comments I spot:
>>>>
>>>> On Fri, Dec 13, 2024 at 03:08:44PM +0800, Chenyi Qiang wrote:
>>>>> +struct GuestMemfdManager {
>>>>> +    Object parent;
>>>>> +
>>>>> +    /* Managed memory region. */
>>>>> +    MemoryRegion *mr;
>>>>> +
>>>>> +    /*
>>>>> +     * 1-setting of the bit represents the memory is populated
>>>>> (shared).
>>>>> +     */
>>>>> +    int32_t bitmap_size;
>>>>> +    unsigned long *bitmap;
>>>>
>>>> Might be clearer to name the bitmap directly as what it represents.
>>>> E.g.,
>>>> shared_bitmap?
>>>
>>> Make sense.
>>>
>>
>> BTW, I was wondering if this information should be stored/linked from
>> the RAMBlock, where we already store the guest_memdfd "int guest_memfd;".
>>
>> For example, having a "struct guest_memfd_state", and either embedding
>> it in the RAMBlock or dynamically allocating and linking it.
>>
>> Alternatively, it would be such an object that we would simply link from
>> the RAMBlock. (depending on which object will implement the manager
>> interface)
>>
>> In any case, having all guest_memfd state that belongs to a RAMBlock at
>> a single location might be cleanest.
> 
> Good suggestion. Follow the design of this series, we can add link to
> the guest_memfd_manager object in RAMBlock.

Or we'll move / link that to the RAM memory region, because that's what 
the object actually controls.

It starts getting a bit blurry what should be part of the RAMBlock and 
what should be part of the "owning" RAM memory region :(

-- 
Cheers,

David / dhildenb


Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Chenyi Qiang 3 months, 3 weeks ago
On 1/21/2025 6:26 PM, David Hildenbrand wrote:
> On 21.01.25 11:16, Chenyi Qiang wrote:
>>
>>
>> On 1/21/2025 5:26 PM, David Hildenbrand wrote:
>>> On 21.01.25 10:00, Chenyi Qiang wrote:
>>>> Thanks Peter for your review!
>>>>
>>>> On 1/21/2025 2:09 AM, Peter Xu wrote:
>>>>> Two trivial comments I spot:
>>>>>
>>>>> On Fri, Dec 13, 2024 at 03:08:44PM +0800, Chenyi Qiang wrote:
>>>>>> +struct GuestMemfdManager {
>>>>>> +    Object parent;
>>>>>> +
>>>>>> +    /* Managed memory region. */
>>>>>> +    MemoryRegion *mr;
>>>>>> +
>>>>>> +    /*
>>>>>> +     * 1-setting of the bit represents the memory is populated
>>>>>> (shared).
>>>>>> +     */
>>>>>> +    int32_t bitmap_size;
>>>>>> +    unsigned long *bitmap;
>>>>>
>>>>> Might be clearer to name the bitmap directly as what it represents.
>>>>> E.g.,
>>>>> shared_bitmap?
>>>>
>>>> Make sense.
>>>>
>>>
>>> BTW, I was wondering if this information should be stored/linked from
>>> the RAMBlock, where we already store the guest_memdfd "int
>>> guest_memfd;".
>>>
>>> For example, having a "struct guest_memfd_state", and either embedding
>>> it in the RAMBlock or dynamically allocating and linking it.
>>>
>>> Alternatively, it would be such an object that we would simply link from
>>> the RAMBlock. (depending on which object will implement the manager
>>> interface)
>>>
>>> In any case, having all guest_memfd state that belongs to a RAMBlock at
>>> a single location might be cleanest.
>>
>> Good suggestion. Follow the design of this series, we can add link to
>> the guest_memfd_manager object in RAMBlock.
> 
> Or we'll move / link that to the RAM memory region, because that's what
> the object actually controls.
> 
> It starts getting a bit blury what should be part of the RAMBlock and
> what should be part of the "owning" RAM memory region :(

Maybe still part of RAMBlock. I think the guest_memfd state should go along
with "int guest_memfd", as it is only valid when guest_memfd > 0, and
guest_memfd is only valid for a RAM MemoryRegion.

> 


Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Alexey Kardashevskiy 4 months ago
On 13/12/24 18:08, Chenyi Qiang wrote:
> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
> uncoordinated discard") highlighted, some subsystems like VFIO might
> disable ram block discard. However, guest_memfd relies on the discard
> operation to perform page conversion between private and shared memory.
> This can lead to stale IOMMU mapping issue when assigning a hardware
> device to a confidential VM via shared memory (unprotected memory
> pages). Blocking shared page discard can solve this problem, but it
> could cause guests to consume twice the memory with VFIO, which is not
> acceptable in some cases. An alternative solution is to convey other
> systems like VFIO to refresh its outdated IOMMU mappings.
> 
> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
> VFIO mappings in relation to VM page assignment. Effectively page
> conversion is similar to hot-removing a page in one mode and adding it
> back in the other, so the similar work that needs to happen in response
> to virtio-mem changes needs to happen for page conversion events.
> Introduce the RamDiscardManager to guest_memfd to achieve it.
> 
> However, guest_memfd is not an object so it cannot directly implement
> the RamDiscardManager interface.
> 
> One solution is to implement the interface in HostMemoryBackend. Any

This sounds about right.

> guest_memfd-backed host memory backend can register itself in the target
> MemoryRegion. However, this solution doesn't cover the scenario where a
> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g.
> the virtual BIOS MemoryRegion.

What is this virtual BIOS MemoryRegion exactly? What does it look like 
in "info mtree -f"? Do we really want this memory to be DMAable?


> Thus, choose the second option, i.e. define an object type named
> guest_memfd_manager with RamDiscardManager interface. Upon creation of
> guest_memfd, a new guest_memfd_manager object can be instantiated and
> registered to the managed guest_memfd MemoryRegion to handle the page
> conversion events.
> 
> In the context of guest_memfd, the discarded state signifies that the
> page is private, while the populated state indicated that the page is
> shared. The state of the memory is tracked at the granularity of the
> host page size (i.e. block_size), as the minimum conversion size can be
> one page per request.
> 
> In addition, VFIO expects the DMA mapping for a specific iova to be
> mapped and unmapped with the same granularity. However, the confidential
> VMs may do partial conversion, e.g. conversion happens on a small region
> within a large region. To prevent such invalid cases and before any
> potential optimization comes out, all operations are performed with 4K
> granularity.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
>   include/sysemu/guest-memfd-manager.h |  46 +++++
>   system/guest-memfd-manager.c         | 250 +++++++++++++++++++++++++++
>   system/meson.build                   |   1 +
>   3 files changed, 297 insertions(+)
>   create mode 100644 include/sysemu/guest-memfd-manager.h
>   create mode 100644 system/guest-memfd-manager.c
> 
> diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/guest-memfd-manager.h
> new file mode 100644
> index 0000000000..ba4a99b614
> --- /dev/null
> +++ b/include/sysemu/guest-memfd-manager.h
> @@ -0,0 +1,46 @@
> +/*
> + * QEMU guest memfd manager
> + *
> + * Copyright Intel
> + *
> + * Author:
> + *      Chenyi Qiang <chenyi.qiang@intel.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory
> + *
> + */
> +
> +#ifndef SYSEMU_GUEST_MEMFD_MANAGER_H
> +#define SYSEMU_GUEST_MEMFD_MANAGER_H
> +
> +#include "sysemu/hostmem.h"
> +
> +#define TYPE_GUEST_MEMFD_MANAGER "guest-memfd-manager"
> +
> +OBJECT_DECLARE_TYPE(GuestMemfdManager, GuestMemfdManagerClass, GUEST_MEMFD_MANAGER)
> +
> +struct GuestMemfdManager {
> +    Object parent;
> +
> +    /* Managed memory region. */

Do not need this comment. And the period.

> +    MemoryRegion *mr;
> +
> +    /*
> +     * 1-setting of the bit represents the memory is populated (shared).
> +     */

Could be 1 line comment.

> +    int32_t bitmap_size;

int or unsigned

> +    unsigned long *bitmap;
> +
> +    /* block size and alignment */
> +    uint64_t block_size;

unsigned?

(u)int(32|64)_t makes sense for migration, which is not the case (yet?).
Thanks,

> +
> +    /* listeners to notify on populate/discard activity. */

Do not really need this comment either imho.

> +    QLIST_HEAD(, RamDiscardListener) rdl_list;
> +};
> +
> +struct GuestMemfdManagerClass {
> +    ObjectClass parent_class;
> +};
> +
> +#endif
> diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
> new file mode 100644
> index 0000000000..d7e105fead
> --- /dev/null
> +++ b/system/guest-memfd-manager.c
> @@ -0,0 +1,250 @@
> +/*
> + * QEMU guest memfd manager
> + *
> + * Copyright Intel
> + *
> + * Author:
> + *      Chenyi Qiang <chenyi.qiang@intel.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/error-report.h"
> +#include "sysemu/guest-memfd-manager.h"
> +
> +OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(GuestMemfdManager,
> +                                          guest_memfd_manager,
> +                                          GUEST_MEMFD_MANAGER,
> +                                          OBJECT,
> +                                          { TYPE_RAM_DISCARD_MANAGER },
> +                                          { })
> +
> +static bool guest_memfd_rdm_is_populated(const RamDiscardManager *rdm,
> +                                         const MemoryRegionSection *section)
> +{
> +    const GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
> +    uint64_t first_bit = section->offset_within_region / gmm->block_size;
> +    uint64_t last_bit = first_bit + int128_get64(section->size) / gmm->block_size - 1;
> +    unsigned long first_discard_bit;
> +
> +    first_discard_bit = find_next_zero_bit(gmm->bitmap, last_bit + 1, first_bit);
> +    return first_discard_bit > last_bit;
> +}
> +
> +typedef int (*guest_memfd_section_cb)(MemoryRegionSection *s, void *arg);
> +
> +static int guest_memfd_notify_populate_cb(MemoryRegionSection *section, void *arg)
> +{
> +    RamDiscardListener *rdl = arg;
> +
> +    return rdl->notify_populate(rdl, section);
> +}
> +
> +static int guest_memfd_notify_discard_cb(MemoryRegionSection *section, void *arg)
> +{
> +    RamDiscardListener *rdl = arg;
> +
> +    rdl->notify_discard(rdl, section);
> +
> +    return 0;
> +}
> +
> +static int guest_memfd_for_each_populated_section(const GuestMemfdManager *gmm,
> +                                                  MemoryRegionSection *section,
> +                                                  void *arg,
> +                                                  guest_memfd_section_cb cb)
> +{
> +    unsigned long first_one_bit, last_one_bit;
> +    uint64_t offset, size;
> +    int ret = 0;
> +
> +    first_one_bit = section->offset_within_region / gmm->block_size;
> +    first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size, first_one_bit);
> +
> +    while (first_one_bit < gmm->bitmap_size) {
> +        MemoryRegionSection tmp = *section;
> +
> +        offset = first_one_bit * gmm->block_size;
> +        last_one_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
> +                                          first_one_bit + 1) - 1;
> +        size = (last_one_bit - first_one_bit + 1) * gmm->block_size;

This tries calling cb() on bigger chunks even though we say from the 
beginning that only page size is supported?

Maybe simplify this for now and extend it if/when VFIO learns to split 
mappings, or just drop it when we get in-place page state conversion 
(which will make this all irrelevant)?


> +
> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> +            break;
> +        }
> +
> +        ret = cb(&tmp, arg);
> +        if (ret) {
> +            break;
> +        }
> +
> +        first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
> +                                      last_one_bit + 2);
> +    }
> +
> +    return ret;
> +}
> +
> +static int guest_memfd_for_each_discarded_section(const GuestMemfdManager *gmm,
> +                                                  MemoryRegionSection *section,
> +                                                  void *arg,
> +                                                  guest_memfd_section_cb cb)
> +{
> +    unsigned long first_zero_bit, last_zero_bit;
> +    uint64_t offset, size;
> +    int ret = 0;
> +
> +    first_zero_bit = section->offset_within_region / gmm->block_size;
> +    first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
> +                                        first_zero_bit);
> +
> +    while (first_zero_bit < gmm->bitmap_size) {
> +        MemoryRegionSection tmp = *section;
> +
> +        offset = first_zero_bit * gmm->block_size;
> +        last_zero_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
> +                                      first_zero_bit + 1) - 1;
> +        size = (last_zero_bit - first_zero_bit + 1) * gmm->block_size;
> +
> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> +            break;
> +        }
> +
> +        ret = cb(&tmp, arg);
> +        if (ret) {
> +            break;
> +        }
> +
> +        first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
> +                                            last_zero_bit + 2);
> +    }
> +
> +    return ret;
> +}
> +
> +static uint64_t guest_memfd_rdm_get_min_granularity(const RamDiscardManager *rdm,
> +                                                    const MemoryRegion *mr)
> +{
> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
> +
> +    g_assert(mr == gmm->mr);
> +    return gmm->block_size;
> +}
> +
> +static void guest_memfd_rdm_register_listener(RamDiscardManager *rdm,
> +                                              RamDiscardListener *rdl,
> +                                              MemoryRegionSection *section)
> +{
> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
> +    int ret;
> +
> +    g_assert(section->mr == gmm->mr);
> +    rdl->section = memory_region_section_new_copy(section);
> +
> +    QLIST_INSERT_HEAD(&gmm->rdl_list, rdl, next);
> +
> +    ret = guest_memfd_for_each_populated_section(gmm, section, rdl,
> +                                                 guest_memfd_notify_populate_cb);
> +    if (ret) {
> +        error_report("%s: Failed to register RAM discard listener: %s", __func__,
> +                     strerror(-ret));
> +    }
> +}
> +
> +static void guest_memfd_rdm_unregister_listener(RamDiscardManager *rdm,
> +                                                RamDiscardListener *rdl)
> +{
> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
> +    int ret;
> +
> +    g_assert(rdl->section);
> +    g_assert(rdl->section->mr == gmm->mr);
> +
> +    ret = guest_memfd_for_each_populated_section(gmm, rdl->section, rdl,
> +                                                 guest_memfd_notify_discard_cb);
> +    if (ret) {
> +        error_report("%s: Failed to unregister RAM discard listener: %s", __func__,
> +                     strerror(-ret));
> +    }
> +
> +    memory_region_section_free_copy(rdl->section);
> +    rdl->section = NULL;
> +    QLIST_REMOVE(rdl, next);
> +
> +}
> +
> +typedef struct GuestMemfdReplayData {
> +    void *fn;

s/void */ReplayRamPopulate/

> +    void *opaque;
> +} GuestMemfdReplayData;
> +
> +static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, void *arg)
> +{
> +    struct GuestMemfdReplayData *data = arg;

Drop "struct" here and below.

> +    ReplayRamPopulate replay_fn = data->fn;
> +
> +    return replay_fn(section, data->opaque);
> +}
> +
> +static int guest_memfd_rdm_replay_populated(const RamDiscardManager *rdm,
> +                                            MemoryRegionSection *section,
> +                                            ReplayRamPopulate replay_fn,
> +                                            void *opaque)
> +{
> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque };
> +
> +    g_assert(section->mr == gmm->mr);
> +    return guest_memfd_for_each_populated_section(gmm, section, &data,
> +                                                  guest_memfd_rdm_replay_populated_cb);
> +}
> +
> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection *section, void *arg)
> +{
> +    struct GuestMemfdReplayData *data = arg;
> +    ReplayRamDiscard replay_fn = data->fn;
> +
> +    replay_fn(section, data->opaque);


guest_memfd_rdm_replay_populated_cb() checks for errors though.

> +
> +    return 0;
> +}
> +
> +static void guest_memfd_rdm_replay_discarded(const RamDiscardManager *rdm,
> +                                             MemoryRegionSection *section,
> +                                             ReplayRamDiscard replay_fn,
> +                                             void *opaque)
> +{
> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque };
> +
> +    g_assert(section->mr == gmm->mr);
> +    guest_memfd_for_each_discarded_section(gmm, section, &data,
> +                                           guest_memfd_rdm_replay_discarded_cb);
> +}
> +
> +static void guest_memfd_manager_init(Object *obj)
> +{
> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
> +
> +    QLIST_INIT(&gmm->rdl_list);
> +}
> +
> +static void guest_memfd_manager_finalize(Object *obj)
> +{
> +    g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);


bitmap is not allocated though. And 5/7 removes this anyway. Thanks,


> +}
> +
> +static void guest_memfd_manager_class_init(ObjectClass *oc, void *data)
> +{
> +    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
> +
> +    rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
> +    rdmc->register_listener = guest_memfd_rdm_register_listener;
> +    rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
> +    rdmc->is_populated = guest_memfd_rdm_is_populated;
> +    rdmc->replay_populated = guest_memfd_rdm_replay_populated;
> +    rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
> +}
> diff --git a/system/meson.build b/system/meson.build
> index 4952f4b2c7..ed4e1137bd 100644
> --- a/system/meson.build
> +++ b/system/meson.build
> @@ -15,6 +15,7 @@ system_ss.add(files(
>     'dirtylimit.c',
>     'dma-helpers.c',
>     'globals.c',
> +  'guest-memfd-manager.c',
>     'memory_mapping.c',
>     'qdev-monitor.c',
>     'qtest.c',

-- 
Alexey
Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Chenyi Qiang 4 months ago
On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
> On 13/12/24 18:08, Chenyi Qiang wrote:
>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>> uncoordinated discard") highlighted, some subsystems like VFIO might
>> disable ram block discard. However, guest_memfd relies on the discard
>> operation to perform page conversion between private and shared memory.
>> This can lead to stale IOMMU mapping issue when assigning a hardware
>> device to a confidential VM via shared memory (unprotected memory
>> pages). Blocking shared page discard can solve this problem, but it
>> could cause guests to consume twice the memory with VFIO, which is not
>> acceptable in some cases. An alternative solution is to convey other
>> systems like VFIO to refresh its outdated IOMMU mappings.
>>
>> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
>> VFIO mappings in relation to VM page assignment. Effectively page
>> conversion is similar to hot-removing a page in one mode and adding it
>> back in the other, so the similar work that needs to happen in response
>> to virtio-mem changes needs to happen for page conversion events.
>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>
>> However, guest_memfd is not an object so it cannot directly implement
>> the RamDiscardManager interface.
>>
>> One solution is to implement the interface in HostMemoryBackend. Any
> 
> This sounds about right.
> 
>> guest_memfd-backed host memory backend can register itself in the target
>> MemoryRegion. However, this solution doesn't cover the scenario where a
>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g.
>> the virtual BIOS MemoryRegion.
> 
> What is this virtual BIOS MemoryRegion exactly? What does it look like
> in "info mtree -f"? Do we really want this memory to be DMAable?

The virtual BIOS shows up in a separate region:

 Root memory region: system
  0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
  ...
  00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
  0000000100000000-000000017fffffff (prio 0, ram): pc.ram
@0000000080000000 KVM

We also considered implementing the interface in HostMemoryBackend, but
implementing it with the guest_memfd region may be more general. We don't
know whether all DMAable memory will belong to a HostMemoryBackend, although
at present it does.

If it is more appropriate to implement it in HostMemoryBackend, I can
change to that approach.

> 
> 
>> Thus, choose the second option, i.e. define an object type named
>> guest_memfd_manager with RamDiscardManager interface. Upon creation of
>> guest_memfd, a new guest_memfd_manager object can be instantiated and
>> registered to the managed guest_memfd MemoryRegion to handle the page
>> conversion events.
>>
>> In the context of guest_memfd, the discarded state signifies that the
>> page is private, while the populated state indicated that the page is
>> shared. The state of the memory is tracked at the granularity of the
>> host page size (i.e. block_size), as the minimum conversion size can be
>> one page per request.
>>
>> In addition, VFIO expects the DMA mapping for a specific iova to be
>> mapped and unmapped with the same granularity. However, the confidential
>> VMs may do partial conversion, e.g. conversion happens on a small region
>> within a large region. To prevent such invalid cases and before any
>> potential optimization comes out, all operations are performed with 4K
>> granularity.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>>   include/sysemu/guest-memfd-manager.h |  46 +++++
>>   system/guest-memfd-manager.c         | 250 +++++++++++++++++++++++++++
>>   system/meson.build                   |   1 +
>>   3 files changed, 297 insertions(+)
>>   create mode 100644 include/sysemu/guest-memfd-manager.h
>>   create mode 100644 system/guest-memfd-manager.c
>>
>> diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/
>> guest-memfd-manager.h
>> new file mode 100644
>> index 0000000000..ba4a99b614
>> --- /dev/null
>> +++ b/include/sysemu/guest-memfd-manager.h
>> @@ -0,0 +1,46 @@
>> +/*
>> + * QEMU guest memfd manager
>> + *
>> + * Copyright Intel
>> + *
>> + * Author:
>> + *      Chenyi Qiang <chenyi.qiang@intel.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>> later.
>> + * See the COPYING file in the top-level directory
>> + *
>> + */
>> +
>> +#ifndef SYSEMU_GUEST_MEMFD_MANAGER_H
>> +#define SYSEMU_GUEST_MEMFD_MANAGER_H
>> +
>> +#include "sysemu/hostmem.h"
>> +
>> +#define TYPE_GUEST_MEMFD_MANAGER "guest-memfd-manager"
>> +
>> +OBJECT_DECLARE_TYPE(GuestMemfdManager, GuestMemfdManagerClass,
>> GUEST_MEMFD_MANAGER)
>> +
>> +struct GuestMemfdManager {
>> +    Object parent;
>> +
>> +    /* Managed memory region. */
> 
> Do not need this comment. And the period.

[...]

> 
>> +    MemoryRegion *mr;
>> +
>> +    /*
>> +     * 1-setting of the bit represents the memory is populated (shared).
>> +     */

Will fix it.

> 
> Could be 1 line comment.
> 
>> +    int32_t bitmap_size;
> 
> int or unsigned
> 
>> +    unsigned long *bitmap;
>> +
>> +    /* block size and alignment */
>> +    uint64_t block_size;
> 
> unsigned?
> 
> (u)int(32|64)_t make sense for migrations which is not the case (yet?).
> Thanks,

I think these fields would be helpful for future migration support.
Maybe defining them this way is more straightforward.

> 
>> +
>> +    /* listeners to notify on populate/discard activity. */
> 
> Do not really need this comment either imho.
> 

I prefer to provide a comment for each field as virtio-mem does. If they
are not necessary, I will remove the obvious ones.

>> +    QLIST_HEAD(, RamDiscardListener) rdl_list;
>> +};
>> +
>> +struct GuestMemfdManagerClass {
>> +    ObjectClass parent_class;
>> +};
>> +
>> +#endif

[...]

>> +                                                  void *arg,
>> +                                                  guest_memfd_section_cb cb)
>> +{
>> +    unsigned long first_one_bit, last_one_bit;
>> +    uint64_t offset, size;
>> +    int ret = 0;
>> +
>> +    first_one_bit = section->offset_within_region / gmm->block_size;
>> +    first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>> first_one_bit);
>> +
>> +    while (first_one_bit < gmm->bitmap_size) {
>> +        MemoryRegionSection tmp = *section;
>> +
>> +        offset = first_one_bit * gmm->block_size;
>> +        last_one_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
>> +                                          first_one_bit + 1) - 1;
>> +        size = (last_one_bit - first_one_bit + 1) * gmm->block_size;
> 
> This tries calling cb() on bigger chunks even though we say from the
> beginning that only page size is supported?
> 
> May be simplify this for now and extend if/when VFIO learns to split
> mappings,  or  just drop it when we get in-place page state convertion
> (which will make this all irrelevant)?

The cb() is called with big chunks, but it actually does the split with
block_size granularity inside the cb(). See
vfio_ram_discard_notify_populate(), which does the DMA_MAP with the
granularity size.
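
(A paraphrased sketch of that splitting loop, not the actual VFIO code;
"dma_map" here stands in for VFIO's internal mapping helper.)

typedef int (*dma_map_fn)(hwaddr iova, uint64_t size, void *vaddr,
                          bool readonly);

static int map_section_in_granularity(MemoryRegionSection *section,
                                      uint64_t granularity,
                                      dma_map_fn dma_map)
{
    const hwaddr end = section->offset_within_region +
                       int128_get64(section->size);
    hwaddr start, next, iova;
    int ret = 0;

    /* Map in granularity-sized pieces so they can be unmapped likewise. */
    for (start = section->offset_within_region; start < end; start = next) {
        next = MIN(ROUND_UP(start + 1, granularity), end);
        iova = section->offset_within_address_space +
               (start - section->offset_within_region);
        ret = dma_map(iova, next - start,
                      memory_region_get_ram_ptr(section->mr) + start,
                      section->readonly);
        if (ret) {
            break;
        }
    }
    return ret;
}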

> 
> 
>> +
>> +        if (!memory_region_section_intersect_range(&tmp, offset,
>> size)) {
>> +            break;
>> +        }
>> +
>> +        ret = cb(&tmp, arg);
>> +        if (ret) {
>> +            break;
>> +        }
>> +
>> +        first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>> +                                      last_one_bit + 2);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static int guest_memfd_for_each_discarded_section(const
>> GuestMemfdManager *gmm,
>> +                                                  MemoryRegionSection
>> *section,
>> +                                                  void *arg,
>> +                                                 
>> guest_memfd_section_cb cb)
>> +{
>> +    unsigned long first_zero_bit, last_zero_bit;
>> +    uint64_t offset, size;
>> +    int ret = 0;
>> +
>> +    first_zero_bit = section->offset_within_region / gmm->block_size;
>> +    first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
>> +                                        first_zero_bit);
>> +
>> +    while (first_zero_bit < gmm->bitmap_size) {
>> +        MemoryRegionSection tmp = *section;
>> +
>> +        offset = first_zero_bit * gmm->block_size;
>> +        last_zero_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>> +                                      first_zero_bit + 1) - 1;
>> +        size = (last_zero_bit - first_zero_bit + 1) * gmm->block_size;
>> +
>> +        if (!memory_region_section_intersect_range(&tmp, offset,
>> size)) {
>> +            break;
>> +        }
>> +
>> +        ret = cb(&tmp, arg);
>> +        if (ret) {
>> +            break;
>> +        }
>> +
>> +        first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm-
>> >bitmap_size,
>> +                                            last_zero_bit + 2);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static uint64_t guest_memfd_rdm_get_min_granularity(const
>> RamDiscardManager *rdm,
>> +                                                    const
>> MemoryRegion *mr)
>> +{
>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>> +
>> +    g_assert(mr == gmm->mr);
>> +    return gmm->block_size;
>> +}
>> +
>> +static void guest_memfd_rdm_register_listener(RamDiscardManager *rdm,
>> +                                              RamDiscardListener *rdl,
>> +                                              MemoryRegionSection
>> *section)
>> +{
>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>> +    int ret;
>> +
>> +    g_assert(section->mr == gmm->mr);
>> +    rdl->section = memory_region_section_new_copy(section);
>> +
>> +    QLIST_INSERT_HEAD(&gmm->rdl_list, rdl, next);
>> +
>> +    ret = guest_memfd_for_each_populated_section(gmm, section, rdl,
>> +                                                
>> guest_memfd_notify_populate_cb);
>> +    if (ret) {
>> +        error_report("%s: Failed to register RAM discard listener:
>> %s", __func__,
>> +                     strerror(-ret));
>> +    }
>> +}
>> +
>> +static void guest_memfd_rdm_unregister_listener(RamDiscardManager *rdm,
>> +                                                RamDiscardListener *rdl)
>> +{
>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>> +    int ret;
>> +
>> +    g_assert(rdl->section);
>> +    g_assert(rdl->section->mr == gmm->mr);
>> +
>> +    ret = guest_memfd_for_each_populated_section(gmm, rdl->section, rdl,
>> +                                                
>> guest_memfd_notify_discard_cb);
>> +    if (ret) {
>> +        error_report("%s: Failed to unregister RAM discard listener:
>> %s", __func__,
>> +                     strerror(-ret));
>> +    }
>> +
>> +    memory_region_section_free_copy(rdl->section);
>> +    rdl->section = NULL;
>> +    QLIST_REMOVE(rdl, next);
>> +
>> +}
>> +
>> +typedef struct GuestMemfdReplayData {
>> +    void *fn;
> 
> s/void */ReplayRamPopulate/

[...]

> 
>> +    void *opaque;
>> +} GuestMemfdReplayData;
>> +
>> +static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection
>> *section, void *arg)
>> +{
>> +    struct GuestMemfdReplayData *data = arg;
> 
> Drop "struct" here and below.

Fixed. Thanks!

> 
>> +    ReplayRamPopulate replay_fn = data->fn;
>> +
>> +    return replay_fn(section, data->opaque);
>> +}
>> +
>> +static int guest_memfd_rdm_replay_populated(const RamDiscardManager
>> *rdm,
>> +                                            MemoryRegionSection
>> *section,
>> +                                            ReplayRamPopulate replay_fn,
>> +                                            void *opaque)
>> +{
>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>> opaque };
>> +
>> +    g_assert(section->mr == gmm->mr);
>> +    return guest_memfd_for_each_populated_section(gmm, section, &data,
>> +                                                 
>> guest_memfd_rdm_replay_populated_cb);
>> +}
>> +
>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection
>> *section, void *arg)
>> +{
>> +    struct GuestMemfdReplayData *data = arg;
>> +    ReplayRamDiscard replay_fn = data->fn;
>> +
>> +    replay_fn(section, data->opaque);
> 
> 
> guest_memfd_rdm_replay_populated_cb() checks for errors though.

It follows the current definitions of ReplayRamDiscard() and
ReplayRamPopulate(), where replay_discard() doesn't return errors and
replay_populate() does.
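
(For reference, the prototypes in include/exec/memory.h are roughly:)

typedef int (*ReplayRamPopulate)(MemoryRegionSection *section, void *opaque);
typedef void (*ReplayRamDiscard)(MemoryRegionSection *section, void *opaque);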

> 
>> +
>> +    return 0;
>> +}
>> +
>> +static void guest_memfd_rdm_replay_discarded(const RamDiscardManager
>> *rdm,
>> +                                             MemoryRegionSection
>> *section,
>> +                                             ReplayRamDiscard replay_fn,
>> +                                             void *opaque)
>> +{
>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>> opaque };
>> +
>> +    g_assert(section->mr == gmm->mr);
>> +    guest_memfd_for_each_discarded_section(gmm, section, &data,
>> +                                          
>> guest_memfd_rdm_replay_discarded_cb);
>> +}
>> +
>> +static void guest_memfd_manager_init(Object *obj)
>> +{
>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
>> +
>> +    QLIST_INIT(&gmm->rdl_list);
>> +}
>> +
>> +static void guest_memfd_manager_finalize(Object *obj)
>> +{
>> +    g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
> 
> 
> bitmap is not allocated though. And 5/7 removes this anyway. Thanks,

Will remove it. Thanks.

> 
> 
>> +}
>> +
>> +static void guest_memfd_manager_class_init(ObjectClass *oc, void *data)
>> +{
>> +    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>> +
>> +    rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
>> +    rdmc->register_listener = guest_memfd_rdm_register_listener;
>> +    rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
>> +    rdmc->is_populated = guest_memfd_rdm_is_populated;
>> +    rdmc->replay_populated = guest_memfd_rdm_replay_populated;
>> +    rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
>> +}
>> diff --git a/system/meson.build b/system/meson.build
>> index 4952f4b2c7..ed4e1137bd 100644
>> --- a/system/meson.build
>> +++ b/system/meson.build
>> @@ -15,6 +15,7 @@ system_ss.add(files(
>>     'dirtylimit.c',
>>     'dma-helpers.c',
>>     'globals.c',
>> +  'guest-memfd-manager.c',
>>     'memory_mapping.c',
>>     'qdev-monitor.c',
>>     'qtest.c',
> 


Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by David Hildenbrand 4 months ago
On 08.01.25 11:56, Chenyi Qiang wrote:
> 
> 
> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>> disable ram block discard. However, guest_memfd relies on the discard
>>> operation to perform page conversion between private and shared memory.
>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>> device to a confidential VM via shared memory (unprotected memory
>>> pages). Blocking shared page discard can solve this problem, but it
>>> could cause guests to consume twice the memory with VFIO, which is not
>>> acceptable in some cases. An alternative solution is to convey other
>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>
>>> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
>>> VFIO mappings in relation to VM page assignment. Effectively page
>>> conversion is similar to hot-removing a page in one mode and adding it
>>> back in the other, so the similar work that needs to happen in response
>>> to virtio-mem changes needs to happen for page conversion events.
>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>
>>> However, guest_memfd is not an object so it cannot directly implement
>>> the RamDiscardManager interface.
>>>
>>> One solution is to implement the interface in HostMemoryBackend. Any
>>
>> This sounds about right.
>>
>>> guest_memfd-backed host memory backend can register itself in the target
>>> MemoryRegion. However, this solution doesn't cover the scenario where a
>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g.
>>> the virtual BIOS MemoryRegion.
>>
>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>> in "info mtree -f"? Do we really want this memory to be DMAable?
> 
> virtual BIOS shows in a separate region:
> 
>   Root memory region: system
>    0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>    ...
>    00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>    0000000100000000-000000017fffffff (prio 0, ram): pc.ram
> @0000000080000000 KVM
> 
> We also consider to implement the interface in HostMemoryBackend, but
> maybe implement with guest_memfd region is more general. We don't know
> if any DMAable memory would belong to HostMemoryBackend although at
> present it is.
> 
> If it is more appropriate to implement it with HostMemoryBackend, I can
> change to this way.

Not sure that's the right place. Isn't it the (cc) machine that controls 
the state?

It's not really the memory backend, that's just the memory provider.

-- 
Cheers,

David / dhildenb
Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Alexey Kardashevskiy 3 months, 4 weeks ago
On 13/1/25 21:54, David Hildenbrand wrote:
> On 08.01.25 11:56, Chenyi Qiang wrote:
>>
>>
>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>> operation to perform page conversion between private and shared memory.
>>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>>> device to a confidential VM via shared memory (unprotected memory
>>>> pages). Blocking shared page discard can solve this problem, but it
>>>> could cause guests to consume twice the memory with VFIO, which is not
>>>> acceptable in some cases. An alternative solution is to convey other
>>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>>
>>>> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>> conversion is similar to hot-removing a page in one mode and adding it
>>>> back in the other, so the similar work that needs to happen in response
>>>> to virtio-mem changes needs to happen for page conversion events.
>>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>
>>>> However, guest_memfd is not an object so it cannot directly implement
>>>> the RamDiscardManager interface.
>>>>
>>>> One solution is to implement the interface in HostMemoryBackend. Any
>>>
>>> This sounds about right.
>>>
>>>> guest_memfd-backed host memory backend can register itself in the 
>>>> target
>>>> MemoryRegion. However, this solution doesn't cover the scenario where a
>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g.
>>>> the virtual BIOS MemoryRegion.
>>>
>>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>
>> virtual BIOS shows in a separate region:
>>
>>   Root memory region: system
>>    0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>    ...
>>    00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>    0000000100000000-000000017fffffff (prio 0, ram): pc.ram
>> @0000000080000000 KVM
>>
>> We also consider to implement the interface in HostMemoryBackend, but
>> maybe implement with guest_memfd region is more general. We don't know
>> if any DMAable memory would belong to HostMemoryBackend although at
>> present it is.
>>
>> If it is more appropriate to implement it with HostMemoryBackend, I can
>> change to this way.
> 
> Not sure that's the right place. Isn't it the (cc) machine that controls 
> the state?

KVM does, via MemoryRegion->RAMBlock->guest_memfd.

> It's not really the memory backend, that's just the memory provider.

Sorry, but isn't "providing memory" the purpose of a "memory backend"? :)


-- 
Alexey


Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Chenyi Qiang 4 months ago
Thanks David for your review!

On 1/13/2025 6:54 PM, David Hildenbrand wrote:
> On 08.01.25 11:56, Chenyi Qiang wrote:
>>
>>
>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>> operation to perform page conversion between private and shared memory.
>>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>>> device to a confidential VM via shared memory (unprotected memory
>>>> pages). Blocking shared page discard can solve this problem, but it
>>>> could cause guests to consume twice the memory with VFIO, which is not
>>>> acceptable in some cases. An alternative solution is to convey other
>>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>>
>>>> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>> conversion is similar to hot-removing a page in one mode and adding it
>>>> back in the other, so the similar work that needs to happen in response
>>>> to virtio-mem changes needs to happen for page conversion events.
>>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>
>>>> However, guest_memfd is not an object so it cannot directly implement
>>>> the RamDiscardManager interface.
>>>>
>>>> One solution is to implement the interface in HostMemoryBackend. Any
>>>
>>> This sounds about right.
>>>
>>>> guest_memfd-backed host memory backend can register itself in the
>>>> target
>>>> MemoryRegion. However, this solution doesn't cover the scenario where a
>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g.
>>>> the virtual BIOS MemoryRegion.
>>>
>>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>
>> virtual BIOS shows in a separate region:
>>
>>   Root memory region: system
>>    0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>    ...
>>    00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>    0000000100000000-000000017fffffff (prio 0, ram): pc.ram
>> @0000000080000000 KVM
>>
>> We also consider to implement the interface in HostMemoryBackend, but
>> maybe implement with guest_memfd region is more general. We don't know
>> if any DMAable memory would belong to HostMemoryBackend although at
>> present it is.
>>
>> If it is more appropriate to implement it with HostMemoryBackend, I can
>> change to this way.
> 
> Not sure that's the right place. Isn't it the (cc) machine that controls
> the state?
> 
> It's not really the memory backend, that's just the memory provider.

Yes, the cc machine defines require_guest_memfd. Besides normal memory,
there are other memory regions that also require this state control. As
mentioned in another thread, private MMIO, for example, may require the
state change notification but does not belong to a memory backend. So I
think it is still better to define a specific object to control the state.
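
As a rough sketch of that idea (the realize-style helper below is
hypothetical; memory_region_set_ram_discard_manager() is the existing
hook), wiring the object to a guest_memfd-backed MR could look like:

    GuestMemfdManager *gmm;

    gmm = GUEST_MEMFD_MANAGER(object_new(TYPE_GUEST_MEMFD_MANAGER));
    /* hypothetical helper: remember the managed MR and pick a block size */
    guest_memfd_manager_realize(gmm, mr, qemu_real_host_page_size());
    /* let listeners (e.g. VFIO) find the manager through the MemoryRegion */
    memory_region_set_ram_discard_manager(mr, RAM_DISCARD_MANAGER(gmm));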

> 


Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Alexey Kardashevskiy 4 months ago
On 8/1/25 21:56, Chenyi Qiang wrote:
> 
> 
> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>> disable ram block discard. However, guest_memfd relies on the discard
>>> operation to perform page conversion between private and shared memory.
>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>> device to a confidential VM via shared memory (unprotected memory
>>> pages). Blocking shared page discard can solve this problem, but it
>>> could cause guests to consume twice the memory with VFIO, which is not
>>> acceptable in some cases. An alternative solution is to convey other
>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>
>>> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
>>> VFIO mappings in relation to VM page assignment. Effectively page
>>> conversion is similar to hot-removing a page in one mode and adding it
>>> back in the other, so the similar work that needs to happen in response
>>> to virtio-mem changes needs to happen for page conversion events.
>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>
>>> However, guest_memfd is not an object so it cannot directly implement
>>> the RamDiscardManager interface.
>>>
>>> One solution is to implement the interface in HostMemoryBackend. Any
>>
>> This sounds about right.
>>
>>> guest_memfd-backed host memory backend can register itself in the target
>>> MemoryRegion. However, this solution doesn't cover the scenario where a
>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g.
>>> the virtual BIOS MemoryRegion.
>>
>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>> in "info mtree -f"? Do we really want this memory to be DMAable?
> 
> virtual BIOS shows in a separate region:
> 
>   Root memory region: system
>    0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>    ...
>    00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM

Looks like a normal MR which can be backed by guest_memfd.

>    0000000100000000-000000017fffffff (prio 0, ram): pc.ram
> @0000000080000000 KVM

Anyway if there is no guest_memfd backing it and 
memory_region_has_ram_discard_manager() returns false, then the MR is 
just going to be mapped for VFIO as usual which seems... alright, right?
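
(For context, the VFIO memory listener branches on exactly that check; a
heavily simplified sketch from memory, with approximate names, not the
literal code:)

    /* vfio_listener_region_add(), simplified */
    if (memory_region_has_ram_discard_manager(section->mr)) {
        /* defer to the RamDiscardManager: only populated parts get mapped */
        vfio_register_ram_discard_listener(container, section);
        return;
    }
    /* no manager: DMA-map the whole section as usual */
    vfio_container_dma_map(container, iova, size, vaddr, section->readonly);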


> We also consider to implement the interface in HostMemoryBackend, but
> maybe implement with guest_memfd region is more general. We don't know
> if any DMAable memory would belong to HostMemoryBackend although at
> present it is.
> 
> If it is more appropriate to implement it with HostMemoryBackend, I can
> change to this way.

Seems cleaner imho.

>>
>>
>>> Thus, choose the second option, i.e. define an object type named
>>> guest_memfd_manager with RamDiscardManager interface. Upon creation of
>>> guest_memfd, a new guest_memfd_manager object can be instantiated and
>>> registered to the managed guest_memfd MemoryRegion to handle the page
>>> conversion events.
>>>
>>> In the context of guest_memfd, the discarded state signifies that the
>>> page is private, while the populated state indicated that the page is
>>> shared. The state of the memory is tracked at the granularity of the
>>> host page size (i.e. block_size), as the minimum conversion size can be
>>> one page per request.
>>>
>>> In addition, VFIO expects the DMA mapping for a specific iova to be
>>> mapped and unmapped with the same granularity. However, the confidential
>>> VMs may do partial conversion, e.g. conversion happens on a small region
>>> within a large region. To prevent such invalid cases and before any
>>> potential optimization comes out, all operations are performed with 4K
>>> granularity.
>>>
>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>> ---
>>>    include/sysemu/guest-memfd-manager.h |  46 +++++
>>>    system/guest-memfd-manager.c         | 250 +++++++++++++++++++++++++++
>>>    system/meson.build                   |   1 +
>>>    3 files changed, 297 insertions(+)
>>>    create mode 100644 include/sysemu/guest-memfd-manager.h
>>>    create mode 100644 system/guest-memfd-manager.c
>>>
>>> diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/
>>> guest-memfd-manager.h
>>> new file mode 100644
>>> index 0000000000..ba4a99b614
>>> --- /dev/null
>>> +++ b/include/sysemu/guest-memfd-manager.h
>>> @@ -0,0 +1,46 @@
>>> +/*
>>> + * QEMU guest memfd manager
>>> + *
>>> + * Copyright Intel
>>> + *
>>> + * Author:
>>> + *      Chenyi Qiang <chenyi.qiang@intel.com>
>>> + *
>>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>>> later.
>>> + * See the COPYING file in the top-level directory
>>> + *
>>> + */
>>> +
>>> +#ifndef SYSEMU_GUEST_MEMFD_MANAGER_H
>>> +#define SYSEMU_GUEST_MEMFD_MANAGER_H
>>> +
>>> +#include "sysemu/hostmem.h"
>>> +
>>> +#define TYPE_GUEST_MEMFD_MANAGER "guest-memfd-manager"
>>> +
>>> +OBJECT_DECLARE_TYPE(GuestMemfdManager, GuestMemfdManagerClass,
>>> GUEST_MEMFD_MANAGER)
>>> +
>>> +struct GuestMemfdManager {
>>> +    Object parent;
>>> +
>>> +    /* Managed memory region. */
>>
>> Do not need this comment. And the period.
> 
> [...]
> 
>>
>>> +    MemoryRegion *mr;
>>> +
>>> +    /*
>>> +     * 1-setting of the bit represents the memory is populated (shared).
>>> +     */
> 
> Will fix it.
> 
>>
>> Could be 1 line comment.
>>
>>> +    int32_t bitmap_size;
>>
>> int or unsigned
>>
>>> +    unsigned long *bitmap;
>>> +
>>> +    /* block size and alignment */
>>> +    uint64_t block_size;
>>
>> unsigned?
>>
>> (u)int(32|64)_t make sense for migrations which is not the case (yet?).
>> Thanks,
> 
> I think these fields would be helpful for future migration support.
> Maybe defining as this way is more straightforward.
 >
>>
>>> +
>>> +    /* listeners to notify on populate/discard activity. */
>>
>> Do not really need this comment either imho.
>>
> 
> I prefer to provide the comment for each field as virtio-mem do. If it
> is not necessary, I would remove those obvious ones.

[bikeshedding on] But the "RamDiscardListener" name says that already,
why repeat it? :) A comment should add information, not duplicate it -
like the block_size comment, which mentions "alignment". [bikeshedding off]

>>> +    QLIST_HEAD(, RamDiscardListener) rdl_list;
>>> +};
>>> +
>>> +struct GuestMemfdManagerClass {
>>> +    ObjectClass parent_class;
>>> +};
>>> +
>>> +#endif
> 
> [...]
> 
>             void *arg,
>>> +
>>> guest_memfd_section_cb cb)
>>> +{
>>> +    unsigned long first_one_bit, last_one_bit;
>>> +    uint64_t offset, size;
>>> +    int ret = 0;
>>> +
>>> +    first_one_bit = section->offset_within_region / gmm->block_size;
>>> +    first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>>> first_one_bit);
>>> +
>>> +    while (first_one_bit < gmm->bitmap_size) {
>>> +        MemoryRegionSection tmp = *section;
>>> +
>>> +        offset = first_one_bit * gmm->block_size;
>>> +        last_one_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
>>> +                                          first_one_bit + 1) - 1;
>>> +        size = (last_one_bit - first_one_bit + 1) * gmm->block_size;
>>
>> This tries calling cb() on bigger chunks even though we say from the
>> beginning that only page size is supported?
>>
>> May be simplify this for now and extend if/when VFIO learns to split
>> mappings,  or  just drop it when we get in-place page state convertion
>> (which will make this all irrelevant)?
> 
> The cb() will call with big chunks but actually it do the split with the
> granularity of block_size in the cb(). See the
> vfio_ram_discard_notify_populate(), which do the DMA_MAP with
> granularity size.


Right, and this all happens inside QEMU - first the code finds bigger 
chunks and then it splits them anyway to call the VFIO driver. Seems 
pointless to bother about bigger chunks here.
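
In other words, the walk could simply visit one block_size chunk at a
time; a rough sketch of that shape, reusing the names from this patch:

    bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
                        section->offset_within_region / gmm->block_size);

    for (; bit < gmm->bitmap_size;
         bit = find_next_bit(gmm->bitmap, gmm->bitmap_size, bit + 1)) {
        MemoryRegionSection tmp = *section;

        if (!memory_region_section_intersect_range(&tmp, bit * gmm->block_size,
                                                    gmm->block_size)) {
            break;
        }
        ret = cb(&tmp, arg);
        if (ret) {
            break;
        }
    }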

> 
>>
>>
>>> +
>>> +        if (!memory_region_section_intersect_range(&tmp, offset,
>>> size)) {
>>> +            break;
>>> +        }
>>> +
>>> +        ret = cb(&tmp, arg);
>>> +        if (ret) {
>>> +            break;
>>> +        }
>>> +
>>> +        first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>>> +                                      last_one_bit + 2);
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static int guest_memfd_for_each_discarded_section(const
>>> GuestMemfdManager *gmm,
>>> +                                                  MemoryRegionSection
>>> *section,
>>> +                                                  void *arg,
>>> +
>>> guest_memfd_section_cb cb)
>>> +{
>>> +    unsigned long first_zero_bit, last_zero_bit;
>>> +    uint64_t offset, size;
>>> +    int ret = 0;
>>> +
>>> +    first_zero_bit = section->offset_within_region / gmm->block_size;
>>> +    first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
>>> +                                        first_zero_bit);
>>> +
>>> +    while (first_zero_bit < gmm->bitmap_size) {
>>> +        MemoryRegionSection tmp = *section;
>>> +
>>> +        offset = first_zero_bit * gmm->block_size;
>>> +        last_zero_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>>> +                                      first_zero_bit + 1) - 1;
>>> +        size = (last_zero_bit - first_zero_bit + 1) * gmm->block_size;
>>> +
>>> +        if (!memory_region_section_intersect_range(&tmp, offset,
>>> size)) {
>>> +            break;
>>> +        }
>>> +
>>> +        ret = cb(&tmp, arg);
>>> +        if (ret) {
>>> +            break;
>>> +        }
>>> +
>>> +        first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm-
>>>> bitmap_size,
>>> +                                            last_zero_bit + 2);
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static uint64_t guest_memfd_rdm_get_min_granularity(const
>>> RamDiscardManager *rdm,
>>> +                                                    const
>>> MemoryRegion *mr)
>>> +{
>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>> +
>>> +    g_assert(mr == gmm->mr);
>>> +    return gmm->block_size;
>>> +}
>>> +
>>> +static void guest_memfd_rdm_register_listener(RamDiscardManager *rdm,
>>> +                                              RamDiscardListener *rdl,
>>> +                                              MemoryRegionSection
>>> *section)
>>> +{
>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>> +    int ret;
>>> +
>>> +    g_assert(section->mr == gmm->mr);
>>> +    rdl->section = memory_region_section_new_copy(section);
>>> +
>>> +    QLIST_INSERT_HEAD(&gmm->rdl_list, rdl, next);
>>> +
>>> +    ret = guest_memfd_for_each_populated_section(gmm, section, rdl,
>>> +
>>> guest_memfd_notify_populate_cb);
>>> +    if (ret) {
>>> +        error_report("%s: Failed to register RAM discard listener:
>>> %s", __func__,
>>> +                     strerror(-ret));
>>> +    }
>>> +}
>>> +
>>> +static void guest_memfd_rdm_unregister_listener(RamDiscardManager *rdm,
>>> +                                                RamDiscardListener *rdl)
>>> +{
>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>> +    int ret;
>>> +
>>> +    g_assert(rdl->section);
>>> +    g_assert(rdl->section->mr == gmm->mr);
>>> +
>>> +    ret = guest_memfd_for_each_populated_section(gmm, rdl->section, rdl,
>>> +
>>> guest_memfd_notify_discard_cb);
>>> +    if (ret) {
>>> +        error_report("%s: Failed to unregister RAM discard listener:
>>> %s", __func__,
>>> +                     strerror(-ret));
>>> +    }
>>> +
>>> +    memory_region_section_free_copy(rdl->section);
>>> +    rdl->section = NULL;
>>> +    QLIST_REMOVE(rdl, next);
>>> +
>>> +}
>>> +
>>> +typedef struct GuestMemfdReplayData {
>>> +    void *fn;
>>
>> s/void */ReplayRamPopulate/
> 
> [...]
> 
>>
>>> +    void *opaque;
>>> +} GuestMemfdReplayData;
>>> +
>>> +static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection
>>> *section, void *arg)
>>> +{
>>> +    struct GuestMemfdReplayData *data = arg;
>>
>> Drop "struct" here and below.
> 
> Fixed. Thanks!
> 
>>
>>> +    ReplayRamPopulate replay_fn = data->fn;
>>> +
>>> +    return replay_fn(section, data->opaque);
>>> +}
>>> +
>>> +static int guest_memfd_rdm_replay_populated(const RamDiscardManager
>>> *rdm,
>>> +                                            MemoryRegionSection
>>> *section,
>>> +                                            ReplayRamPopulate replay_fn,
>>> +                                            void *opaque)
>>> +{
>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>> opaque };
>>> +
>>> +    g_assert(section->mr == gmm->mr);
>>> +    return guest_memfd_for_each_populated_section(gmm, section, &data,
>>> +
>>> guest_memfd_rdm_replay_populated_cb);
>>> +}
>>> +
>>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection
>>> *section, void *arg)
>>> +{
>>> +    struct GuestMemfdReplayData *data = arg;
>>> +    ReplayRamDiscard replay_fn = data->fn;
>>> +
>>> +    replay_fn(section, data->opaque);
>>
>>
>> guest_memfd_rdm_replay_populated_cb() checks for errors though.
> 
> It follows current definiton of ReplayRamDiscard() and
> ReplayRamPopulate() where replay_discard() doesn't return errors and
> replay_populate() returns errors.

A trace would be appropriate imho. Thanks,

>>
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static void guest_memfd_rdm_replay_discarded(const RamDiscardManager
>>> *rdm,
>>> +                                             MemoryRegionSection
>>> *section,
>>> +                                             ReplayRamDiscard replay_fn,
>>> +                                             void *opaque)
>>> +{
>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>> opaque };
>>> +
>>> +    g_assert(section->mr == gmm->mr);
>>> +    guest_memfd_for_each_discarded_section(gmm, section, &data,
>>> +
>>> guest_memfd_rdm_replay_discarded_cb);
>>> +}
>>> +
>>> +static void guest_memfd_manager_init(Object *obj)
>>> +{
>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
>>> +
>>> +    QLIST_INIT(&gmm->rdl_list);
>>> +}
>>> +
>>> +static void guest_memfd_manager_finalize(Object *obj)
>>> +{
>>> +    g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
>>
>>
>> bitmap is not allocated though. And 5/7 removes this anyway. Thanks,
> 
> Will remove it. Thanks.
> 
>>
>>
>>> +}
>>> +
>>> +static void guest_memfd_manager_class_init(ObjectClass *oc, void *data)
>>> +{
>>> +    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>>> +
>>> +    rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
>>> +    rdmc->register_listener = guest_memfd_rdm_register_listener;
>>> +    rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
>>> +    rdmc->is_populated = guest_memfd_rdm_is_populated;
>>> +    rdmc->replay_populated = guest_memfd_rdm_replay_populated;
>>> +    rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
>>> +}
>>> diff --git a/system/meson.build b/system/meson.build
>>> index 4952f4b2c7..ed4e1137bd 100644
>>> --- a/system/meson.build
>>> +++ b/system/meson.build
>>> @@ -15,6 +15,7 @@ system_ss.add(files(
>>>      'dirtylimit.c',
>>>      'dma-helpers.c',
>>>      'globals.c',
>>> +  'guest-memfd-manager.c',
>>>      'memory_mapping.c',
>>>      'qdev-monitor.c',
>>>      'qtest.c',
>>
> 

-- 
Alexey


Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Chenyi Qiang 4 months ago
On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
> 
> 
> On 8/1/25 21:56, Chenyi Qiang wrote:
>>
>>
>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>> operation to perform page conversion between private and shared memory.
>>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>>> device to a confidential VM via shared memory (unprotected memory
>>>> pages). Blocking shared page discard can solve this problem, but it
>>>> could cause guests to consume twice the memory with VFIO, which is not
>>>> acceptable in some cases. An alternative solution is to convey other
>>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>>
>>>> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>> conversion is similar to hot-removing a page in one mode and adding it
>>>> back in the other, so the similar work that needs to happen in response
>>>> to virtio-mem changes needs to happen for page conversion events.
>>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>
>>>> However, guest_memfd is not an object so it cannot directly implement
>>>> the RamDiscardManager interface.
>>>>
>>>> One solution is to implement the interface in HostMemoryBackend. Any
>>>
>>> This sounds about right.
>>>
>>>> guest_memfd-backed host memory backend can register itself in the
>>>> target
>>>> MemoryRegion. However, this solution doesn't cover the scenario where a
>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g.
>>>> the virtual BIOS MemoryRegion.
>>>
>>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>
>> virtual BIOS shows in a separate region:
>>
>>   Root memory region: system
>>    0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>    ...
>>    00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
> 
> Looks like a normal MR which can be backed by guest_memfd.

Yes, the virtual BIOS memory region is initialized by
memory_region_init_ram_guest_memfd(), so it is backed by a guest_memfd.

The tricky thing is, for Intel TDX (not sure about AMD SEV), the virtual
BIOS image will be loaded and then copied to the private region. After
that, the loaded image will be discarded and this region becomes
useless. So I feel like this virtual BIOS should not be backed by
guest_memfd?

> 
>>    0000000100000000-000000017fffffff (prio 0, ram): pc.ram
>> @0000000080000000 KVM
> 
> Anyway if there is no guest_memfd backing it and
> memory_region_has_ram_discard_manager() returns false, then the MR is
> just going to be mapped for VFIO as usual which seems... alright, right?

Correct. As the vBIOS is backed by guest_memfd and we implement the RDM
in guest_memfd_manager, the vBIOS MR won't be mapped by VFIO.

If we go with HostMemoryBackend instead of guest_memfd_manager, this MR
would be mapped by VFIO. We would either need to avoid such a vBIOS
mapping, or just ignore it since the MR is useless (which doesn't look
so good).

> 
> 
>> We also consider to implement the interface in HostMemoryBackend, but
>> maybe implement with guest_memfd region is more general. We don't know
>> if any DMAable memory would belong to HostMemoryBackend although at
>> present it is.
>>
>> If it is more appropriate to implement it with HostMemoryBackend, I can
>> change to this way.
> 
> Seems cleaner imho.

I can go this way.

> 
>>>
>>>
>>>> Thus, choose the second option, i.e. define an object type named
>>>> guest_memfd_manager with RamDiscardManager interface. Upon creation of
>>>> guest_memfd, a new guest_memfd_manager object can be instantiated and
>>>> registered to the managed guest_memfd MemoryRegion to handle the page
>>>> conversion events.
>>>>
>>>> In the context of guest_memfd, the discarded state signifies that the
>>>> page is private, while the populated state indicated that the page is
>>>> shared. The state of the memory is tracked at the granularity of the
>>>> host page size (i.e. block_size), as the minimum conversion size can be
>>>> one page per request.
>>>>
>>>> In addition, VFIO expects the DMA mapping for a specific iova to be
>>>> mapped and unmapped with the same granularity. However, the
>>>> confidential
>>>> VMs may do partial conversion, e.g. conversion happens on a small
>>>> region
>>>> within a large region. To prevent such invalid cases and before any
>>>> potential optimization comes out, all operations are performed with 4K
>>>> granularity.
>>>>
>>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>>> ---
>>>>    include/sysemu/guest-memfd-manager.h |  46 +++++
>>>>    system/guest-memfd-manager.c         | 250 ++++++++++++++++++++++
>>>> +++++
>>>>    system/meson.build                   |   1 +
>>>>    3 files changed, 297 insertions(+)
>>>>    create mode 100644 include/sysemu/guest-memfd-manager.h
>>>>    create mode 100644 system/guest-memfd-manager.c
>>>>
>>>> diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/
>>>> guest-memfd-manager.h
>>>> new file mode 100644
>>>> index 0000000000..ba4a99b614
>>>> --- /dev/null
>>>> +++ b/include/sysemu/guest-memfd-manager.h
>>>> @@ -0,0 +1,46 @@
>>>> +/*
>>>> + * QEMU guest memfd manager
>>>> + *
>>>> + * Copyright Intel
>>>> + *
>>>> + * Author:
>>>> + *      Chenyi Qiang <chenyi.qiang@intel.com>
>>>> + *
>>>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>>>> later.
>>>> + * See the COPYING file in the top-level directory
>>>> + *
>>>> + */
>>>> +
>>>> +#ifndef SYSEMU_GUEST_MEMFD_MANAGER_H
>>>> +#define SYSEMU_GUEST_MEMFD_MANAGER_H
>>>> +
>>>> +#include "sysemu/hostmem.h"
>>>> +
>>>> +#define TYPE_GUEST_MEMFD_MANAGER "guest-memfd-manager"
>>>> +
>>>> +OBJECT_DECLARE_TYPE(GuestMemfdManager, GuestMemfdManagerClass,
>>>> GUEST_MEMFD_MANAGER)
>>>> +
>>>> +struct GuestMemfdManager {
>>>> +    Object parent;
>>>> +
>>>> +    /* Managed memory region. */
>>>
>>> Do not need this comment. And the period.
>>
>> [...]
>>
>>>
>>>> +    MemoryRegion *mr;
>>>> +
>>>> +    /*
>>>> +     * 1-setting of the bit represents the memory is populated
>>>> (shared).
>>>> +     */
>>
>> Will fix it.
>>
>>>
>>> Could be 1 line comment.
>>>
>>>> +    int32_t bitmap_size;
>>>
>>> int or unsigned
>>>
>>>> +    unsigned long *bitmap;
>>>> +
>>>> +    /* block size and alignment */
>>>> +    uint64_t block_size;
>>>
>>> unsigned?
>>>
>>> (u)int(32|64)_t make sense for migrations which is not the case (yet?).
>>> Thanks,
>>
>> I think these fields would be helpful for future migration support.
>> Maybe defining as this way is more straightforward.
>>
>>>
>>>> +
>>>> +    /* listeners to notify on populate/discard activity. */
>>>
>>> Do not really need this comment either imho.
>>>
>>
>> I prefer to provide the comment for each field as virtio-mem do. If it
>> is not necessary, I would remove those obvious ones.
> 
> [bikeshedding on] But the "RamDiscardListener" word says that already,
> why repeating? :) It should add information, not duplicate. Like the
> block_size comment which mentions "alignment" [bikeshedding off]

Got it. Thanks!

> 
>>>> +    QLIST_HEAD(, RamDiscardListener) rdl_list;
>>>> +};
>>>> +
>>>> +struct GuestMemfdManagerClass {
>>>> +    ObjectClass parent_class;
>>>> +};
>>>> +
>>>> +#endif
>>
>> [...]
>>
>>             void *arg,
>>>> +
>>>> guest_memfd_section_cb cb)
>>>> +{
>>>> +    unsigned long first_one_bit, last_one_bit;
>>>> +    uint64_t offset, size;
>>>> +    int ret = 0;
>>>> +
>>>> +    first_one_bit = section->offset_within_region / gmm->block_size;
>>>> +    first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>>>> first_one_bit);
>>>> +
>>>> +    while (first_one_bit < gmm->bitmap_size) {
>>>> +        MemoryRegionSection tmp = *section;
>>>> +
>>>> +        offset = first_one_bit * gmm->block_size;
>>>> +        last_one_bit = find_next_zero_bit(gmm->bitmap, gmm-
>>>> >bitmap_size,
>>>> +                                          first_one_bit + 1) - 1;
>>>> +        size = (last_one_bit - first_one_bit + 1) * gmm->block_size;
>>>
>>> This tries calling cb() on bigger chunks even though we say from the
>>> beginning that only page size is supported?
>>>
>>> May be simplify this for now and extend if/when VFIO learns to split
>>> mappings,  or  just drop it when we get in-place page state convertion
>>> (which will make this all irrelevant)?
>>
>> The cb() will call with big chunks but actually it do the split with the
>> granularity of block_size in the cb(). See the
>> vfio_ram_discard_notify_populate(), which do the DMA_MAP with
>> granularity size.
> 
> 
> Right, and this all happens inside QEMU - first the code finds bigger
> chunks and then it splits them anyway to call the VFIO driver. Seems
> pointless to bother about bigger chunks here.
> 
>>
>>>
>>>
>>>> +
>>>> +        if (!memory_region_section_intersect_range(&tmp, offset,
>>>> size)) {
>>>> +            break;
>>>> +        }
>>>> +
>>>> +        ret = cb(&tmp, arg);
>>>> +        if (ret) {
>>>> +            break;
>>>> +        }
>>>> +
>>>> +        first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>>>> +                                      last_one_bit + 2);
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static int guest_memfd_for_each_discarded_section(const
>>>> GuestMemfdManager *gmm,
>>>> +                                                  MemoryRegionSection
>>>> *section,
>>>> +                                                  void *arg,
>>>> +
>>>> guest_memfd_section_cb cb)
>>>> +{
>>>> +    unsigned long first_zero_bit, last_zero_bit;
>>>> +    uint64_t offset, size;
>>>> +    int ret = 0;
>>>> +
>>>> +    first_zero_bit = section->offset_within_region / gmm->block_size;
>>>> +    first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
>>>> +                                        first_zero_bit);
>>>> +
>>>> +    while (first_zero_bit < gmm->bitmap_size) {
>>>> +        MemoryRegionSection tmp = *section;
>>>> +
>>>> +        offset = first_zero_bit * gmm->block_size;
>>>> +        last_zero_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>>>> +                                      first_zero_bit + 1) - 1;
>>>> +        size = (last_zero_bit - first_zero_bit + 1) * gmm->block_size;
>>>> +
>>>> +        if (!memory_region_section_intersect_range(&tmp, offset,
>>>> size)) {
>>>> +            break;
>>>> +        }
>>>> +
>>>> +        ret = cb(&tmp, arg);
>>>> +        if (ret) {
>>>> +            break;
>>>> +        }
>>>> +
>>>> +        first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm-
>>>>> bitmap_size,
>>>> +                                            last_zero_bit + 2);
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static uint64_t guest_memfd_rdm_get_min_granularity(const
>>>> RamDiscardManager *rdm,
>>>> +                                                    const
>>>> MemoryRegion *mr)
>>>> +{
>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>> +
>>>> +    g_assert(mr == gmm->mr);
>>>> +    return gmm->block_size;
>>>> +}
>>>> +
>>>> +static void guest_memfd_rdm_register_listener(RamDiscardManager *rdm,
>>>> +                                              RamDiscardListener *rdl,
>>>> +                                              MemoryRegionSection
>>>> *section)
>>>> +{
>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>> +    int ret;
>>>> +
>>>> +    g_assert(section->mr == gmm->mr);
>>>> +    rdl->section = memory_region_section_new_copy(section);
>>>> +
>>>> +    QLIST_INSERT_HEAD(&gmm->rdl_list, rdl, next);
>>>> +
>>>> +    ret = guest_memfd_for_each_populated_section(gmm, section, rdl,
>>>> +
>>>> guest_memfd_notify_populate_cb);
>>>> +    if (ret) {
>>>> +        error_report("%s: Failed to register RAM discard listener:
>>>> %s", __func__,
>>>> +                     strerror(-ret));
>>>> +    }
>>>> +}
>>>> +
>>>> +static void guest_memfd_rdm_unregister_listener(RamDiscardManager
>>>> *rdm,
>>>> +                                                RamDiscardListener
>>>> *rdl)
>>>> +{
>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>> +    int ret;
>>>> +
>>>> +    g_assert(rdl->section);
>>>> +    g_assert(rdl->section->mr == gmm->mr);
>>>> +
>>>> +    ret = guest_memfd_for_each_populated_section(gmm, rdl->section,
>>>> rdl,
>>>> +
>>>> guest_memfd_notify_discard_cb);
>>>> +    if (ret) {
>>>> +        error_report("%s: Failed to unregister RAM discard listener:
>>>> %s", __func__,
>>>> +                     strerror(-ret));
>>>> +    }
>>>> +
>>>> +    memory_region_section_free_copy(rdl->section);
>>>> +    rdl->section = NULL;
>>>> +    QLIST_REMOVE(rdl, next);
>>>> +
>>>> +}
>>>> +
>>>> +typedef struct GuestMemfdReplayData {
>>>> +    void *fn;
>>>
>>> s/void */ReplayRamPopulate/
>>
>> [...]
>>
>>>
>>>> +    void *opaque;
>>>> +} GuestMemfdReplayData;
>>>> +
>>>> +static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection
>>>> *section, void *arg)
>>>> +{
>>>> +    struct GuestMemfdReplayData *data = arg;
>>>
>>> Drop "struct" here and below.
>>
>> Fixed. Thanks!
>>
>>>
>>>> +    ReplayRamPopulate replay_fn = data->fn;
>>>> +
>>>> +    return replay_fn(section, data->opaque);
>>>> +}
>>>> +
>>>> +static int guest_memfd_rdm_replay_populated(const RamDiscardManager
>>>> *rdm,
>>>> +                                            MemoryRegionSection
>>>> *section,
>>>> +                                            ReplayRamPopulate
>>>> replay_fn,
>>>> +                                            void *opaque)
>>>> +{
>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>>> opaque };
>>>> +
>>>> +    g_assert(section->mr == gmm->mr);
>>>> +    return guest_memfd_for_each_populated_section(gmm, section, &data,
>>>> +
>>>> guest_memfd_rdm_replay_populated_cb);
>>>> +}
>>>> +
>>>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection
>>>> *section, void *arg)
>>>> +{
>>>> +    struct GuestMemfdReplayData *data = arg;
>>>> +    ReplayRamDiscard replay_fn = data->fn;
>>>> +
>>>> +    replay_fn(section, data->opaque);
>>>
>>>
>>> guest_memfd_rdm_replay_populated_cb() checks for errors though.
>>
>> It follows current definiton of ReplayRamDiscard() and
>> ReplayRamPopulate() where replay_discard() doesn't return errors and
>> replay_populate() returns errors.
> 
> A trace would be appropriate imho. Thanks,

Sorry, I can't quite follow. What kind of info should be traced? The
errors returned by replay_populate()?

> 
>>>
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static void guest_memfd_rdm_replay_discarded(const RamDiscardManager
>>>> *rdm,
>>>> +                                             MemoryRegionSection
>>>> *section,
>>>> +                                             ReplayRamDiscard
>>>> replay_fn,
>>>> +                                             void *opaque)
>>>> +{
>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>>> opaque };
>>>> +
>>>> +    g_assert(section->mr == gmm->mr);
>>>> +    guest_memfd_for_each_discarded_section(gmm, section, &data,
>>>> +
>>>> guest_memfd_rdm_replay_discarded_cb);
>>>> +}
>>>> +
>>>> +static void guest_memfd_manager_init(Object *obj)
>>>> +{
>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
>>>> +
>>>> +    QLIST_INIT(&gmm->rdl_list);
>>>> +}
>>>> +
>>>> +static void guest_memfd_manager_finalize(Object *obj)
>>>> +{
>>>> +    g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
>>>
>>>
>>> bitmap is not allocated though. And 5/7 removes this anyway. Thanks,
>>
>> Will remove it. Thanks.
>>
>>>
>>>
>>>> +}
>>>> +
>>>> +static void guest_memfd_manager_class_init(ObjectClass *oc, void
>>>> *data)
>>>> +{
>>>> +    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>>>> +
>>>> +    rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
>>>> +    rdmc->register_listener = guest_memfd_rdm_register_listener;
>>>> +    rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
>>>> +    rdmc->is_populated = guest_memfd_rdm_is_populated;
>>>> +    rdmc->replay_populated = guest_memfd_rdm_replay_populated;
>>>> +    rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
>>>> +}
>>>> diff --git a/system/meson.build b/system/meson.build
>>>> index 4952f4b2c7..ed4e1137bd 100644
>>>> --- a/system/meson.build
>>>> +++ b/system/meson.build
>>>> @@ -15,6 +15,7 @@ system_ss.add(files(
>>>>      'dirtylimit.c',
>>>>      'dma-helpers.c',
>>>>      'globals.c',
>>>> +  'guest-memfd-manager.c',
>>>>      'memory_mapping.c',
>>>>      'qdev-monitor.c',
>>>>      'qtest.c',
>>>
>>
> 


Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Alexey Kardashevskiy 4 months ago
On 9/1/25 13:11, Chenyi Qiang wrote:
> 
> 
> On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
>>
>>
>> On 8/1/25 21:56, Chenyi Qiang wrote:
>>>
>>>
>>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>>> operation to perform page conversion between private and shared memory.
>>>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>>>> device to a confidential VM via shared memory (unprotected memory
>>>>> pages). Blocking shared page discard can solve this problem, but it
>>>>> could cause guests to consume twice the memory with VFIO, which is not
>>>>> acceptable in some cases. An alternative solution is to convey other
>>>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>>>
>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
>>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>>> conversion is similar to hot-removing a page in one mode and adding it
>>>>> back in the other, so the similar work that needs to happen in response
>>>>> to virtio-mem changes needs to happen for page conversion events.
>>>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>>
>>>>> However, guest_memfd is not an object so it cannot directly implement
>>>>> the RamDiscardManager interface.
>>>>>
>>>>> One solution is to implement the interface in HostMemoryBackend. Any
>>>>
>>>> This sounds about right.
>>>>
>>>>> guest_memfd-backed host memory backend can register itself in the
>>>>> target
>>>>> MemoryRegion. However, this solution doesn't cover the scenario where a
>>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g.
>>>>> the virtual BIOS MemoryRegion.
>>>>
>>>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>>
>>> virtual BIOS shows in a separate region:
>>>
>>>    Root memory region: system
>>>     0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>>     ...
>>>     00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>
>> Looks like a normal MR which can be backed by guest_memfd.
> 
> Yes, virtual BIOS memory region is initialized by
> memory_region_init_ram_guest_memfd() which will be backed by a guest_memfd.
> 
> The tricky thing is, for Intel TDX (not sure about AMD SEV), the virtual
> BIOS image will be loaded and then copied to private region.
> After that,
> the loaded image will be discarded and this region become useless.

I'd think it is loaded as a "struct Rom" and then copied to the MR
created by memory_region_init_ram_guest_memfd(), which does not leave
the MR useless - we still see "pc.bios" in the list, so it is not
discarded. What piece of code are you referring to exactly?


> So I
> feel like this virtual BIOS should not be backed by guest_memfd?

From the above it sounds like the opposite, i.e. it should :)

>>
>>>     0000000100000000-000000017fffffff (prio 0, ram): pc.ram
>>> @0000000080000000 KVM
>>
>> Anyway if there is no guest_memfd backing it and
>> memory_region_has_ram_discard_manager() returns false, then the MR is
>> just going to be mapped for VFIO as usual which seems... alright, right?
> 
> Correct. As the vBIOS is backed by guest_memfd and we implement the RDM
> for guest_memfd_manager, the vBIOS MR won't be mapped by VFIO.
> 
> If we go with the HostMemoryBackend instead of guest_memfd_manager, this
> MR would be mapped by VFIO. Maybe need to avoid such vBIOS mapping, or
> just ignore it since the MR is useless (but looks not so good).

Sorry, I am missing the necessary details here; let's figure out the
above.

> 
>>
>>
>>> We also consider to implement the interface in HostMemoryBackend, but
>>> maybe implement with guest_memfd region is more general. We don't know
>>> if any DMAable memory would belong to HostMemoryBackend although at
>>> present it is.
>>>
>>> If it is more appropriate to implement it with HostMemoryBackend, I can
>>> change to this way.
>>
>> Seems cleaner imho.
> 
> I can go this way.
> 
>>
>>>>
>>>>
>>>>> Thus, choose the second option, i.e. define an object type named
>>>>> guest_memfd_manager with RamDiscardManager interface. Upon creation of
>>>>> guest_memfd, a new guest_memfd_manager object can be instantiated and
>>>>> registered to the managed guest_memfd MemoryRegion to handle the page
>>>>> conversion events.
>>>>>
>>>>> In the context of guest_memfd, the discarded state signifies that the
>>>>> page is private, while the populated state indicated that the page is
>>>>> shared. The state of the memory is tracked at the granularity of the
>>>>> host page size (i.e. block_size), as the minimum conversion size can be
>>>>> one page per request.
>>>>>
>>>>> In addition, VFIO expects the DMA mapping for a specific iova to be
>>>>> mapped and unmapped with the same granularity. However, the
>>>>> confidential
>>>>> VMs may do partial conversion, e.g. conversion happens on a small
>>>>> region
>>>>> within a large region. To prevent such invalid cases and before any
>>>>> potential optimization comes out, all operations are performed with 4K
>>>>> granularity.
>>>>>
>>>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>>>> ---
>>>>>     include/sysemu/guest-memfd-manager.h |  46 +++++
>>>>>     system/guest-memfd-manager.c         | 250 ++++++++++++++++++++++
>>>>> +++++
>>>>>     system/meson.build                   |   1 +
>>>>>     3 files changed, 297 insertions(+)
>>>>>     create mode 100644 include/sysemu/guest-memfd-manager.h
>>>>>     create mode 100644 system/guest-memfd-manager.c
>>>>>
>>>>> diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/
>>>>> guest-memfd-manager.h
>>>>> new file mode 100644
>>>>> index 0000000000..ba4a99b614
>>>>> --- /dev/null
>>>>> +++ b/include/sysemu/guest-memfd-manager.h
>>>>> @@ -0,0 +1,46 @@
>>>>> +/*
>>>>> + * QEMU guest memfd manager
>>>>> + *
>>>>> + * Copyright Intel
>>>>> + *
>>>>> + * Author:
>>>>> + *      Chenyi Qiang <chenyi.qiang@intel.com>
>>>>> + *
>>>>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>>>>> later.
>>>>> + * See the COPYING file in the top-level directory
>>>>> + *
>>>>> + */
>>>>> +
>>>>> +#ifndef SYSEMU_GUEST_MEMFD_MANAGER_H
>>>>> +#define SYSEMU_GUEST_MEMFD_MANAGER_H
>>>>> +
>>>>> +#include "sysemu/hostmem.h"
>>>>> +
>>>>> +#define TYPE_GUEST_MEMFD_MANAGER "guest-memfd-manager"
>>>>> +
>>>>> +OBJECT_DECLARE_TYPE(GuestMemfdManager, GuestMemfdManagerClass,
>>>>> GUEST_MEMFD_MANAGER)
>>>>> +
>>>>> +struct GuestMemfdManager {
>>>>> +    Object parent;
>>>>> +
>>>>> +    /* Managed memory region. */
>>>>
>>>> Do not need this comment. And the period.
>>>
>>> [...]
>>>
>>>>
>>>>> +    MemoryRegion *mr;
>>>>> +
>>>>> +    /*
>>>>> +     * 1-setting of the bit represents the memory is populated
>>>>> (shared).
>>>>> +     */
>>>
>>> Will fix it.
>>>
>>>>
>>>> Could be 1 line comment.
>>>>
>>>>> +    int32_t bitmap_size;
>>>>
>>>> int or unsigned
>>>>
>>>>> +    unsigned long *bitmap;
>>>>> +
>>>>> +    /* block size and alignment */
>>>>> +    uint64_t block_size;
>>>>
>>>> unsigned?
>>>>
>>>> (u)int(32|64)_t make sense for migrations which is not the case (yet?).
>>>> Thanks,
>>>
>>> I think these fields would be helpful for future migration support.
>>> Maybe defining as this way is more straightforward.
>>>
>>>>
>>>>> +
>>>>> +    /* listeners to notify on populate/discard activity. */
>>>>
>>>> Do not really need this comment either imho.
>>>>
>>>
>>> I prefer to provide the comment for each field as virtio-mem do. If it
>>> is not necessary, I would remove those obvious ones.
>>
>> [bikeshedding on] But the "RamDiscardListener" word says that already,
>> why repeating? :) It should add information, not duplicate. Like the
>> block_size comment which mentions "alignment" [bikeshedding off]
> 
> Got it. Thanks!
> 
>>
>>>>> +    QLIST_HEAD(, RamDiscardListener) rdl_list;
>>>>> +};
>>>>> +
>>>>> +struct GuestMemfdManagerClass {
>>>>> +    ObjectClass parent_class;
>>>>> +};
>>>>> +
>>>>> +#endif
>>>
>>> [...]
>>>
>>>              void *arg,
>>>>> +
>>>>> guest_memfd_section_cb cb)
>>>>> +{
>>>>> +    unsigned long first_one_bit, last_one_bit;
>>>>> +    uint64_t offset, size;
>>>>> +    int ret = 0;
>>>>> +
>>>>> +    first_one_bit = section->offset_within_region / gmm->block_size;
>>>>> +    first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>>>>> first_one_bit);
>>>>> +
>>>>> +    while (first_one_bit < gmm->bitmap_size) {
>>>>> +        MemoryRegionSection tmp = *section;
>>>>> +
>>>>> +        offset = first_one_bit * gmm->block_size;
>>>>> +        last_one_bit = find_next_zero_bit(gmm->bitmap, gmm-
>>>>>> bitmap_size,
>>>>> +                                          first_one_bit + 1) - 1;
>>>>> +        size = (last_one_bit - first_one_bit + 1) * gmm->block_size;
>>>>
>>>> This tries calling cb() on bigger chunks even though we say from the
>>>> beginning that only page size is supported?
>>>>
>>>> May be simplify this for now and extend if/when VFIO learns to split
>>>> mappings,  or  just drop it when we get in-place page state convertion
>>>> (which will make this all irrelevant)?
>>>
>>> The cb() will call with big chunks but actually it do the split with the
>>> granularity of block_size in the cb(). See the
>>> vfio_ram_discard_notify_populate(), which do the DMA_MAP with
>>> granularity size.
>>
>>
>> Right, and this all happens inside QEMU - first the code finds bigger
>> chunks and then it splits them anyway to call the VFIO driver. Seems
>> pointless to bother about bigger chunks here.
>>
>>>
>>>>
>>>>
>>>>> +
>>>>> +        if (!memory_region_section_intersect_range(&tmp, offset,
>>>>> size)) {
>>>>> +            break;
>>>>> +        }
>>>>> +
>>>>> +        ret = cb(&tmp, arg);
>>>>> +        if (ret) {
>>>>> +            break;
>>>>> +        }
>>>>> +
>>>>> +        first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>>>>> +                                      last_one_bit + 2);
>>>>> +    }
>>>>> +
>>>>> +    return ret;
>>>>> +}
>>>>> +
>>>>> +static int guest_memfd_for_each_discarded_section(const
>>>>> GuestMemfdManager *gmm,
>>>>> +                                                  MemoryRegionSection
>>>>> *section,
>>>>> +                                                  void *arg,
>>>>> +
>>>>> guest_memfd_section_cb cb)
>>>>> +{
>>>>> +    unsigned long first_zero_bit, last_zero_bit;
>>>>> +    uint64_t offset, size;
>>>>> +    int ret = 0;
>>>>> +
>>>>> +    first_zero_bit = section->offset_within_region / gmm->block_size;
>>>>> +    first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
>>>>> +                                        first_zero_bit);
>>>>> +
>>>>> +    while (first_zero_bit < gmm->bitmap_size) {
>>>>> +        MemoryRegionSection tmp = *section;
>>>>> +
>>>>> +        offset = first_zero_bit * gmm->block_size;
>>>>> +        last_zero_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>>>>> +                                      first_zero_bit + 1) - 1;
>>>>> +        size = (last_zero_bit - first_zero_bit + 1) * gmm->block_size;
>>>>> +
>>>>> +        if (!memory_region_section_intersect_range(&tmp, offset,
>>>>> size)) {
>>>>> +            break;
>>>>> +        }
>>>>> +
>>>>> +        ret = cb(&tmp, arg);
>>>>> +        if (ret) {
>>>>> +            break;
>>>>> +        }
>>>>> +
>>>>> +        first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm-
>>>>>> bitmap_size,
>>>>> +                                            last_zero_bit + 2);
>>>>> +    }
>>>>> +
>>>>> +    return ret;
>>>>> +}
>>>>> +
>>>>> +static uint64_t guest_memfd_rdm_get_min_granularity(const
>>>>> RamDiscardManager *rdm,
>>>>> +                                                    const
>>>>> MemoryRegion *mr)
>>>>> +{
>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>> +
>>>>> +    g_assert(mr == gmm->mr);
>>>>> +    return gmm->block_size;
>>>>> +}
>>>>> +
>>>>> +static void guest_memfd_rdm_register_listener(RamDiscardManager *rdm,
>>>>> +                                              RamDiscardListener *rdl,
>>>>> +                                              MemoryRegionSection
>>>>> *section)
>>>>> +{
>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>> +    int ret;
>>>>> +
>>>>> +    g_assert(section->mr == gmm->mr);
>>>>> +    rdl->section = memory_region_section_new_copy(section);
>>>>> +
>>>>> +    QLIST_INSERT_HEAD(&gmm->rdl_list, rdl, next);
>>>>> +
>>>>> +    ret = guest_memfd_for_each_populated_section(gmm, section, rdl,
>>>>> +
>>>>> guest_memfd_notify_populate_cb);
>>>>> +    if (ret) {
>>>>> +        error_report("%s: Failed to register RAM discard listener:
>>>>> %s", __func__,
>>>>> +                     strerror(-ret));
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +static void guest_memfd_rdm_unregister_listener(RamDiscardManager
>>>>> *rdm,
>>>>> +                                                RamDiscardListener
>>>>> *rdl)
>>>>> +{
>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>> +    int ret;
>>>>> +
>>>>> +    g_assert(rdl->section);
>>>>> +    g_assert(rdl->section->mr == gmm->mr);
>>>>> +
>>>>> +    ret = guest_memfd_for_each_populated_section(gmm, rdl->section,
>>>>> rdl,
>>>>> +
>>>>> guest_memfd_notify_discard_cb);
>>>>> +    if (ret) {
>>>>> +        error_report("%s: Failed to unregister RAM discard listener:
>>>>> %s", __func__,
>>>>> +                     strerror(-ret));
>>>>> +    }
>>>>> +
>>>>> +    memory_region_section_free_copy(rdl->section);
>>>>> +    rdl->section = NULL;
>>>>> +    QLIST_REMOVE(rdl, next);
>>>>> +
>>>>> +}
>>>>> +
>>>>> +typedef struct GuestMemfdReplayData {
>>>>> +    void *fn;
>>>>
>>>> s/void */ReplayRamPopulate/
>>>
>>> [...]
>>>
>>>>
>>>>> +    void *opaque;
>>>>> +} GuestMemfdReplayData;
>>>>> +
>>>>> +static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection
>>>>> *section, void *arg)
>>>>> +{
>>>>> +    struct GuestMemfdReplayData *data = arg;
>>>>
>>>> Drop "struct" here and below.
>>>
>>> Fixed. Thanks!
>>>
>>>>
>>>>> +    ReplayRamPopulate replay_fn = data->fn;
>>>>> +
>>>>> +    return replay_fn(section, data->opaque);
>>>>> +}
>>>>> +
>>>>> +static int guest_memfd_rdm_replay_populated(const RamDiscardManager
>>>>> *rdm,
>>>>> +                                            MemoryRegionSection
>>>>> *section,
>>>>> +                                            ReplayRamPopulate
>>>>> replay_fn,
>>>>> +                                            void *opaque)
>>>>> +{
>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>>>> opaque };
>>>>> +
>>>>> +    g_assert(section->mr == gmm->mr);
>>>>> +    return guest_memfd_for_each_populated_section(gmm, section, &data,
>>>>> +
>>>>> guest_memfd_rdm_replay_populated_cb);
>>>>> +}
>>>>> +
>>>>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection
>>>>> *section, void *arg)
>>>>> +{
>>>>> +    struct GuestMemfdReplayData *data = arg;
>>>>> +    ReplayRamDiscard replay_fn = data->fn;
>>>>> +
>>>>> +    replay_fn(section, data->opaque);
>>>>
>>>>
>>>> guest_memfd_rdm_replay_populated_cb() checks for errors though.
>>>
>>> It follows current definiton of ReplayRamDiscard() and
>>> ReplayRamPopulate() where replay_discard() doesn't return errors and
>>> replay_populate() returns errors.
>>
>> A trace would be appropriate imho. Thanks,
> 
> Sorry, can't catch you. What kind of info to be traced? The errors
> returned by replay_populate()?

Yeah. imho these are useful, as we expect this part to work in general
too, right? Thanks,
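
A minimal sketch of what that could look like (the trace event name
below is made up, just following the usual trace-events pattern):

    # system/trace-events
    guest_memfd_rdm_replay_populate_failed(uint64_t offset, int ret) "offset 0x%"PRIx64" ret %d"

    /* in guest_memfd_rdm_replay_populated_cb(), after including "trace.h" */
    ret = replay_fn(section, data->opaque);
    if (ret) {
        trace_guest_memfd_rdm_replay_populate_failed(section->offset_within_region, ret);
    }
    return ret;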

> 
>>
>>>>
>>>>> +
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +static void guest_memfd_rdm_replay_discarded(const RamDiscardManager
>>>>> *rdm,
>>>>> +                                             MemoryRegionSection
>>>>> *section,
>>>>> +                                             ReplayRamDiscard
>>>>> replay_fn,
>>>>> +                                             void *opaque)
>>>>> +{
>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>>>> opaque };
>>>>> +
>>>>> +    g_assert(section->mr == gmm->mr);
>>>>> +    guest_memfd_for_each_discarded_section(gmm, section, &data,
>>>>> +
>>>>> guest_memfd_rdm_replay_discarded_cb);
>>>>> +}
>>>>> +
>>>>> +static void guest_memfd_manager_init(Object *obj)
>>>>> +{
>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
>>>>> +
>>>>> +    QLIST_INIT(&gmm->rdl_list);
>>>>> +}
>>>>> +
>>>>> +static void guest_memfd_manager_finalize(Object *obj)
>>>>> +{
>>>>> +    g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
>>>>
>>>>
>>>> bitmap is not allocated though. And 5/7 removes this anyway. Thanks,
>>>
>>> Will remove it. Thanks.
>>>
>>>>
>>>>
>>>>> +}
>>>>> +
>>>>> +static void guest_memfd_manager_class_init(ObjectClass *oc, void
>>>>> *data)
>>>>> +{
>>>>> +    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>>>>> +
>>>>> +    rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
>>>>> +    rdmc->register_listener = guest_memfd_rdm_register_listener;
>>>>> +    rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
>>>>> +    rdmc->is_populated = guest_memfd_rdm_is_populated;
>>>>> +    rdmc->replay_populated = guest_memfd_rdm_replay_populated;
>>>>> +    rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
>>>>> +}
>>>>> diff --git a/system/meson.build b/system/meson.build
>>>>> index 4952f4b2c7..ed4e1137bd 100644
>>>>> --- a/system/meson.build
>>>>> +++ b/system/meson.build
>>>>> @@ -15,6 +15,7 @@ system_ss.add(files(
>>>>>       'dirtylimit.c',
>>>>>       'dma-helpers.c',
>>>>>       'globals.c',
>>>>> +  'guest-memfd-manager.c',
>>>>>       'memory_mapping.c',
>>>>>       'qdev-monitor.c',
>>>>>       'qtest.c',
>>>>
>>>
>>
> 

-- 
Alexey


Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Chenyi Qiang 4 months ago
On 1/9/2025 10:55 AM, Alexey Kardashevskiy wrote:
> 
> 
> On 9/1/25 13:11, Chenyi Qiang wrote:
>>
>>
>> On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 8/1/25 21:56, Chenyi Qiang wrote:
>>>>
>>>>
>>>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>>>> operation to perform page conversion between private and shared
>>>>>> memory.
>>>>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>>>>> device to a confidential VM via shared memory (unprotected memory
>>>>>> pages). Blocking shared page discard can solve this problem, but it
>>>>>> could cause guests to consume twice the memory with VFIO, which is
>>>>>> not
>>>>>> acceptable in some cases. An alternative solution is to convey other
>>>>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>>>>
>>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>>> adjust
>>>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>>>> conversion is similar to hot-removing a page in one mode and
>>>>>> adding it
>>>>>> back in the other, so the similar work that needs to happen in
>>>>>> response
>>>>>> to virtio-mem changes needs to happen for page conversion events.
>>>>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>>>
>>>>>> However, guest_memfd is not an object so it cannot directly implement
>>>>>> the RamDiscardManager interface.
>>>>>>
>>>>>> One solution is to implement the interface in HostMemoryBackend. Any
>>>>>
>>>>> This sounds about right.
>>>>>
>>>>>> guest_memfd-backed host memory backend can register itself in the
>>>>>> target
>>>>>> MemoryRegion. However, this solution doesn't cover the scenario
>>>>>> where a
>>>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend,
>>>>>> e.g.
>>>>>> the virtual BIOS MemoryRegion.
>>>>>
>>>>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>>>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>>>
>>>> virtual BIOS shows in a separate region:
>>>>
>>>>    Root memory region: system
>>>>     0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>>>     ...
>>>>     00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>>
>>> Looks like a normal MR which can be backed by guest_memfd.
>>
>> Yes, virtual BIOS memory region is initialized by
>> memory_region_init_ram_guest_memfd() which will be backed by a
>> guest_memfd.
>>
>> The tricky thing is, for Intel TDX (not sure about AMD SEV), the virtual
>> BIOS image will be loaded and then copied to private region.
>> After that,
>> the loaded image will be discarded and this region become useless.
> 
> I'd think it is loaded as "struct Rom" and then copied to the MR-
> ram_guest_memfd() which does not leave MR useless - we still see
> "pc.bios" in the list so it is not discarded. What piece of code are you
> referring to exactly?

Sorry for the confusion; the vBIOS handling may differ between TDX and
SEV-SNP.

In x86_bios_rom_init(), QEMU initializes a guest_memfd-backed MR and loads
the vBIOS image into the shared part of that guest_memfd MR. For TDX, it
then copies the image to a private region (not the private part of the
vBIOS guest_memfd MR) and discards the shared part. So, although the
memory region still exists, it seems useless.

Is it different for SEV-SNP, i.e. does SEV-SNP keep the vBIOS in the
private memory of the vBIOS guest_memfd?

> 
> 
>> So I
>> feel like this virtual BIOS should not be backed by guest_memfd?
> 
> From the above it sounds like the opposite, i.e. it should :)
> 
>>>
>>>>     0000000100000000-000000017fffffff (prio 0, ram): pc.ram
>>>> @0000000080000000 KVM
>>>
>>> Anyway if there is no guest_memfd backing it and
>>> memory_region_has_ram_discard_manager() returns false, then the MR is
>>> just going to be mapped for VFIO as usual which seems... alright, right?
>>
>> Correct. As the vBIOS is backed by guest_memfd and we implement the RDM
>> for guest_memfd_manager, the vBIOS MR won't be mapped by VFIO.
>>
>> If we go with the HostMemoryBackend instead of guest_memfd_manager, this
>> MR would be mapped by VFIO. Maybe need to avoid such vBIOS mapping, or
>> just ignore it since the MR is useless (but looks not so good).
> 
> Sorry I am missing necessary details here, let's figure out the above.
> 
>>
>>>
>>>
>>>> We also consider to implement the interface in HostMemoryBackend, but
>>>> maybe implement with guest_memfd region is more general. We don't know
>>>> if any DMAable memory would belong to HostMemoryBackend although at
>>>> present it is.
>>>>
>>>> If it is more appropriate to implement it with HostMemoryBackend, I can
>>>> change to this way.
>>>
>>> Seems cleaner imho.
>>
>> I can go this way.

[...]

>>>>>> +
>>>>>> +static int guest_memfd_rdm_replay_populated(const RamDiscardManager
>>>>>> *rdm,
>>>>>> +                                            MemoryRegionSection
>>>>>> *section,
>>>>>> +                                            ReplayRamPopulate
>>>>>> replay_fn,
>>>>>> +                                            void *opaque)
>>>>>> +{
>>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>>>>> opaque };
>>>>>> +
>>>>>> +    g_assert(section->mr == gmm->mr);
>>>>>> +    return guest_memfd_for_each_populated_section(gmm, section,
>>>>>> &data,
>>>>>> +
>>>>>> guest_memfd_rdm_replay_populated_cb);
>>>>>> +}
>>>>>> +
>>>>>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection
>>>>>> *section, void *arg)
>>>>>> +{
>>>>>> +    struct GuestMemfdReplayData *data = arg;
>>>>>> +    ReplayRamDiscard replay_fn = data->fn;
>>>>>> +
>>>>>> +    replay_fn(section, data->opaque);
>>>>>
>>>>>
>>>>> guest_memfd_rdm_replay_populated_cb() checks for errors though.
>>>>
>>>> It follows current definiton of ReplayRamDiscard() and
>>>> ReplayRamPopulate() where replay_discard() doesn't return errors and
>>>> replay_populate() returns errors.
>>>
>>> A trace would be appropriate imho. Thanks,
>>
>> Sorry, can't catch you. What kind of info to be traced? The errors
>> returned by replay_populate()?
> 
> Yeah. imho these are useful as we expect this part to work in general
> too, right? Thanks,

Something like this?

diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
index 6b3e1ee9d6..4440ac9e59 100644
--- a/system/guest-memfd-manager.c
+++ b/system/guest-memfd-manager.c
@@ -185,8 +185,14 @@ static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, void *arg)
 {
     struct GuestMemfdReplayData *data = arg;
     ReplayRamPopulate replay_fn = data->fn;
+    int ret;

-    return replay_fn(section, data->opaque);
+    ret = replay_fn(section, data->opaque);
+    if (ret) {
+        trace_guest_memfd_rdm_replay_populated_cb(ret);
+    }
+
+    return ret;
 }

How about just adding some error output in
guest_memfd_for_each_populated_section()/guest_memfd_for_each_discarded_section()
if the cb() (i.e. replay_populate()) returns an error?
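
For illustration, a minimal sketch of what that could look like inside the
iterator (the loop body of guest_memfd_for_each_populated_section() is not
quoted in this thread, so everything except the error_report() call is an
assumed placeholder, not the actual patch code):

    /*
     * Hypothetical inner loop of guest_memfd_for_each_populated_section();
     * only the error_report() is the actual proposal, the rest is assumed.
     */
    ret = cb(&tmp, arg);    /* tmp: the MemoryRegionSection being walked */
    if (ret) {
        error_report("%s: callback failed for section at 0x%" HWADDR_PRIx ": %s",
                     __func__, tmp.offset_within_address_space, strerror(-ret));
        break;
    }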

> 
>>
>>>
>>>>>
>>>>>> +
>>>>>> +    return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static void guest_memfd_rdm_replay_discarded(const RamDiscardManager
>>>>>> *rdm,
>>>>>> +                                             MemoryRegionSection
>>>>>> *section,
>>>>>> +                                             ReplayRamDiscard
>>>>>> replay_fn,
>>>>>> +                                             void *opaque)
>>>>>> +{
>>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>>>>> opaque };
>>>>>> +
>>>>>> +    g_assert(section->mr == gmm->mr);
>>>>>> +    guest_memfd_for_each_discarded_section(gmm, section, &data,
>>>>>> +
>>>>>> guest_memfd_rdm_replay_discarded_cb);
>>>>>> +}
>>>>>> +
>>>>>> +static void guest_memfd_manager_init(Object *obj)
>>>>>> +{
>>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
>>>>>> +
>>>>>> +    QLIST_INIT(&gmm->rdl_list);
>>>>>> +}
>>>>>> +
>>>>>> +static void guest_memfd_manager_finalize(Object *obj)
>>>>>> +{
>>>>>> +    g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
>>>>>
>>>>>
>>>>> bitmap is not allocated though. And 5/7 removes this anyway. Thanks,
>>>>
>>>> Will remove it. Thanks.
>>>>
>>>>>
>>>>>
>>>>>> +}
>>>>>> +
>>>>>> +static void guest_memfd_manager_class_init(ObjectClass *oc, void
>>>>>> *data)
>>>>>> +{
>>>>>> +    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>>>>>> +
>>>>>> +    rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
>>>>>> +    rdmc->register_listener = guest_memfd_rdm_register_listener;
>>>>>> +    rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
>>>>>> +    rdmc->is_populated = guest_memfd_rdm_is_populated;
>>>>>> +    rdmc->replay_populated = guest_memfd_rdm_replay_populated;
>>>>>> +    rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
>>>>>> +}
>>>>>> diff --git a/system/meson.build b/system/meson.build
>>>>>> index 4952f4b2c7..ed4e1137bd 100644
>>>>>> --- a/system/meson.build
>>>>>> +++ b/system/meson.build
>>>>>> @@ -15,6 +15,7 @@ system_ss.add(files(
>>>>>>       'dirtylimit.c',
>>>>>>       'dma-helpers.c',
>>>>>>       'globals.c',
>>>>>> +  'guest-memfd-manager.c',
>>>>>>       'memory_mapping.c',
>>>>>>       'qdev-monitor.c',
>>>>>>       'qtest.c',
>>>>>
>>>>
>>>
>>
> 


Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Chenyi Qiang 4 months ago
On 1/9/2025 12:29 PM, Chenyi Qiang wrote:
> 
> 
> On 1/9/2025 10:55 AM, Alexey Kardashevskiy wrote:
>>
>>
>> On 9/1/25 13:11, Chenyi Qiang wrote:
>>>
>>>
>>> On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 8/1/25 21:56, Chenyi Qiang wrote:
>>>>>
>>>>>
>>>>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>>>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>>>>> operation to perform page conversion between private and shared
>>>>>>> memory.
>>>>>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>>>>>> device to a confidential VM via shared memory (unprotected memory
>>>>>>> pages). Blocking shared page discard can solve this problem, but it
>>>>>>> could cause guests to consume twice the memory with VFIO, which is
>>>>>>> not
>>>>>>> acceptable in some cases. An alternative solution is to convey other
>>>>>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>>>>>
>>>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>>>> adjust
>>>>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>>>>> conversion is similar to hot-removing a page in one mode and
>>>>>>> adding it
>>>>>>> back in the other, so the similar work that needs to happen in
>>>>>>> response
>>>>>>> to virtio-mem changes needs to happen for page conversion events.
>>>>>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>>>>
>>>>>>> However, guest_memfd is not an object so it cannot directly implement
>>>>>>> the RamDiscardManager interface.
>>>>>>>
>>>>>>> One solution is to implement the interface in HostMemoryBackend. Any
>>>>>>
>>>>>> This sounds about right.
>>>>>>
>>>>>>> guest_memfd-backed host memory backend can register itself in the
>>>>>>> target
>>>>>>> MemoryRegion. However, this solution doesn't cover the scenario
>>>>>>> where a
>>>>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend,
>>>>>>> e.g.
>>>>>>> the virtual BIOS MemoryRegion.
>>>>>>
>>>>>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>>>>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>>>>
>>>>> virtual BIOS shows in a separate region:
>>>>>
>>>>>    Root memory region: system
>>>>>     0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>>>>     ...
>>>>>     00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>>>
>>>> Looks like a normal MR which can be backed by guest_memfd.
>>>
>>> Yes, virtual BIOS memory region is initialized by
>>> memory_region_init_ram_guest_memfd() which will be backed by a
>>> guest_memfd.
>>>
>>> The tricky thing is, for Intel TDX (not sure about AMD SEV), the virtual
>>> BIOS image will be loaded and then copied to private region.
>>> After that,
>>> the loaded image will be discarded and this region become useless.
>>
>> I'd think it is loaded as "struct Rom" and then copied to the MR-
>> ram_guest_memfd() which does not leave MR useless - we still see
>> "pc.bios" in the list so it is not discarded. What piece of code are you
>> referring to exactly?
> 
> Sorry for confusion, maybe it is different between TDX and SEV-SNP for
> the vBIOS handling.
> 
> In x86_bios_rom_init(), it initializes a guest_memfd-backed MR and loads
> the vBIOS image to the shared part of the guest_memfd MR. For TDX, it
> will copy the image to private region (not the vBIOS guest_memfd MR
> private part) and discard the shared part. So, although the memory
> region still exists, it seems useless.

Correcting myself: after some internal discussion, I found I had
misunderstood the vBIOS handling in TDX. The memory region is valid: the
image is copied into the private part of the vBIOS guest_memfd region.
Sorry for the confusion.

> 
> It is different for SEV-SNP, correct? Does SEV-SNP manage the vBIOS in
> vBIOS guest_memfd private memory?
> 



Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Alexey Kardashevskiy 4 months ago
On 9/1/25 15:29, Chenyi Qiang wrote:
> 
> 
> On 1/9/2025 10:55 AM, Alexey Kardashevskiy wrote:
>>
>>
>> On 9/1/25 13:11, Chenyi Qiang wrote:
>>>
>>>
>>> On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 8/1/25 21:56, Chenyi Qiang wrote:
>>>>>
>>>>>
>>>>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>>>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>>>>> operation to perform page conversion between private and shared
>>>>>>> memory.
>>>>>>> This can lead to stale IOMMU mapping issue when assigning a hardware
>>>>>>> device to a confidential VM via shared memory (unprotected memory
>>>>>>> pages). Blocking shared page discard can solve this problem, but it
>>>>>>> could cause guests to consume twice the memory with VFIO, which is
>>>>>>> not
>>>>>>> acceptable in some cases. An alternative solution is to convey other
>>>>>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>>>>>
>>>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>>>> adjust
>>>>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>>>>> conversion is similar to hot-removing a page in one mode and
>>>>>>> adding it
>>>>>>> back in the other, so the similar work that needs to happen in
>>>>>>> response
>>>>>>> to virtio-mem changes needs to happen for page conversion events.
>>>>>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>>>>
>>>>>>> However, guest_memfd is not an object so it cannot directly implement
>>>>>>> the RamDiscardManager interface.
>>>>>>>
>>>>>>> One solution is to implement the interface in HostMemoryBackend. Any
>>>>>>
>>>>>> This sounds about right.

btw I have been using this for ages:

https://github.com/aik/qemu/commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46

but I am not sure this ever saw the light of day, did it?
(ironically I am using it as a base for encrypted DMA :) )

>>>>>>
>>>>>>> guest_memfd-backed host memory backend can register itself in the
>>>>>>> target
>>>>>>> MemoryRegion. However, this solution doesn't cover the scenario
>>>>>>> where a
>>>>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend,
>>>>>>> e.g.
>>>>>>> the virtual BIOS MemoryRegion.
>>>>>>
>>>>>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>>>>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>>>>
>>>>> virtual BIOS shows in a separate region:
>>>>>
>>>>>     Root memory region: system
>>>>>      0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>>>>      ...
>>>>>      00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>>>
>>>> Looks like a normal MR which can be backed by guest_memfd.
>>>
>>> Yes, virtual BIOS memory region is initialized by
>>> memory_region_init_ram_guest_memfd() which will be backed by a
>>> guest_memfd.
>>>
>>> The tricky thing is, for Intel TDX (not sure about AMD SEV), the virtual
>>> BIOS image will be loaded and then copied to private region.
>>> After that,
>>> the loaded image will be discarded and this region become useless.
>>
>> I'd think it is loaded as "struct Rom" and then copied to the MR-
>> ram_guest_memfd() which does not leave MR useless - we still see
>> "pc.bios" in the list so it is not discarded. What piece of code are you
>> referring to exactly?
> 
> Sorry for confusion, maybe it is different between TDX and SEV-SNP for
> the vBIOS handling.
> 
> In x86_bios_rom_init(), it initializes a guest_memfd-backed MR and loads
> the vBIOS image to the shared part of the guest_memfd MR.
> For TDX, it
> will copy the image to private region (not the vBIOS guest_memfd MR
> private part) and discard the shared part. So, although the memory
> region still exists, it seems useless.
> It is different for SEV-SNP, correct? Does SEV-SNP manage the vBIOS in
> vBIOS guest_memfd private memory?

This is what it looks like on my SNP VM (which, I suspect, is the same
as yours, since hw/i386/pc.c does not distinguish Intel/AMD in this respect):

  Root memory region: system
   0000000000000000-00000000000bffff (prio 0, ram): ram1 KVM gmemfd=20
   00000000000c0000-00000000000dffff (prio 1, ram): pc.rom KVM gmemfd=27
   00000000000e0000-000000001fffffff (prio 0, ram): ram1 @00000000000e0000 KVM gmemfd=20
...
   00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM gmemfd=26

So the pc.bios MR exists and is in use (hence its appearance in "info
mtree -f").


I added the gmemfd dumping:

--- a/system/memory.c
+++ b/system/memory.c
@@ -3446,6 +3446,9 @@ static void mtree_print_flatview(gpointer key, gpointer value,
                  }
              }
          }
+        if (mr->ram_block && mr->ram_block->guest_memfd >= 0) {
+            qemu_printf(" gmemfd=%d", mr->ram_block->guest_memfd);
+        }


>>
>>
>>> So I
>>> feel like this virtual BIOS should not be backed by guest_memfd?
>>
>>  From the above it sounds like the opposite, i.e. it should :)
>>
>>>>
>>>>>      0000000100000000-000000017fffffff (prio 0, ram): pc.ram
>>>>> @0000000080000000 KVM
>>>>
>>>> Anyway if there is no guest_memfd backing it and
>>>> memory_region_has_ram_discard_manager() returns false, then the MR is
>>>> just going to be mapped for VFIO as usual which seems... alright, right?
>>>
>>> Correct. As the vBIOS is backed by guest_memfd and we implement the RDM
>>> for guest_memfd_manager, the vBIOS MR won't be mapped by VFIO.
>>>
>>> If we go with the HostMemoryBackend instead of guest_memfd_manager, this
>>> MR would be mapped by VFIO. Maybe need to avoid such vBIOS mapping, or
>>> just ignore it since the MR is useless (but looks not so good).
>>
>> Sorry I am missing necessary details here, let's figure out the above.
>>
>>>
>>>>
>>>>
>>>>> We also consider to implement the interface in HostMemoryBackend, but
>>>>> maybe implement with guest_memfd region is more general. We don't know
>>>>> if any DMAable memory would belong to HostMemoryBackend although at
>>>>> present it is.
>>>>>
>>>>> If it is more appropriate to implement it with HostMemoryBackend, I can
>>>>> change to this way.
>>>>
>>>> Seems cleaner imho.
>>>
>>> I can go this way.
> 
> [...]
> 
>>>>>>> +
>>>>>>> +static int guest_memfd_rdm_replay_populated(const RamDiscardManager
>>>>>>> *rdm,
>>>>>>> +                                            MemoryRegionSection
>>>>>>> *section,
>>>>>>> +                                            ReplayRamPopulate
>>>>>>> replay_fn,
>>>>>>> +                                            void *opaque)
>>>>>>> +{
>>>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>>>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>>>>>> opaque };
>>>>>>> +
>>>>>>> +    g_assert(section->mr == gmm->mr);
>>>>>>> +    return guest_memfd_for_each_populated_section(gmm, section,
>>>>>>> &data,
>>>>>>> +
>>>>>>> guest_memfd_rdm_replay_populated_cb);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection
>>>>>>> *section, void *arg)
>>>>>>> +{
>>>>>>> +    struct GuestMemfdReplayData *data = arg;
>>>>>>> +    ReplayRamDiscard replay_fn = data->fn;
>>>>>>> +
>>>>>>> +    replay_fn(section, data->opaque);
>>>>>>
>>>>>>
>>>>>> guest_memfd_rdm_replay_populated_cb() checks for errors though.
>>>>>
>>>>> It follows current definiton of ReplayRamDiscard() and
>>>>> ReplayRamPopulate() where replay_discard() doesn't return errors and
>>>>> replay_populate() returns errors.
>>>>
>>>> A trace would be appropriate imho. Thanks,
>>>
>>> Sorry, can't catch you. What kind of info to be traced? The errors
>>> returned by replay_populate()?
>>
>> Yeah. imho these are useful as we expect this part to work in general
>> too, right? Thanks,
> 
> Something like?
> 
> diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
> index 6b3e1ee9d6..4440ac9e59 100644
> --- a/system/guest-memfd-manager.c
> +++ b/system/guest-memfd-manager.c
> @@ -185,8 +185,14 @@ static int
> guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, voi
>   {
>       struct GuestMemfdReplayData *data = arg;
>       ReplayRamPopulate replay_fn = data->fn;
> +    int ret;
> 
> -    return replay_fn(section, data->opaque);
> +    ret = replay_fn(section, data->opaque);
> +    if (ret) {
> +        trace_guest_memfd_rdm_replay_populated_cb(ret);
> +    }
> +
> +    return ret;
>   }
> 
> How about just adding some error output in
> guest_memfd_for_each_populated_section()/guest_memfd_for_each_discarded_section()
> if the cb() (i.e. replay_populate()) returns error?

this will do too, yes. Thanks,


> 
>>
>>>
>>>>
>>>>>>
>>>>>>> +
>>>>>>> +    return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void guest_memfd_rdm_replay_discarded(const RamDiscardManager
>>>>>>> *rdm,
>>>>>>> +                                             MemoryRegionSection
>>>>>>> *section,
>>>>>>> +                                             ReplayRamDiscard
>>>>>>> replay_fn,
>>>>>>> +                                             void *opaque)
>>>>>>> +{
>>>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>>>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>>>>>> opaque };
>>>>>>> +
>>>>>>> +    g_assert(section->mr == gmm->mr);
>>>>>>> +    guest_memfd_for_each_discarded_section(gmm, section, &data,
>>>>>>> +
>>>>>>> guest_memfd_rdm_replay_discarded_cb);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void guest_memfd_manager_init(Object *obj)
>>>>>>> +{
>>>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
>>>>>>> +
>>>>>>> +    QLIST_INIT(&gmm->rdl_list);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void guest_memfd_manager_finalize(Object *obj)
>>>>>>> +{
>>>>>>> +    g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
>>>>>>
>>>>>>
>>>>>> bitmap is not allocated though. And 5/7 removes this anyway. Thanks,
>>>>>
>>>>> Will remove it. Thanks.
>>>>>
>>>>>>
>>>>>>
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void guest_memfd_manager_class_init(ObjectClass *oc, void
>>>>>>> *data)
>>>>>>> +{
>>>>>>> +    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>>>>>>> +
>>>>>>> +    rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
>>>>>>> +    rdmc->register_listener = guest_memfd_rdm_register_listener;
>>>>>>> +    rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
>>>>>>> +    rdmc->is_populated = guest_memfd_rdm_is_populated;
>>>>>>> +    rdmc->replay_populated = guest_memfd_rdm_replay_populated;
>>>>>>> +    rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
>>>>>>> +}
>>>>>>> diff --git a/system/meson.build b/system/meson.build
>>>>>>> index 4952f4b2c7..ed4e1137bd 100644
>>>>>>> --- a/system/meson.build
>>>>>>> +++ b/system/meson.build
>>>>>>> @@ -15,6 +15,7 @@ system_ss.add(files(
>>>>>>>        'dirtylimit.c',
>>>>>>>        'dma-helpers.c',
>>>>>>>        'globals.c',
>>>>>>> +  'guest-memfd-manager.c',
>>>>>>>        'memory_mapping.c',
>>>>>>>        'qdev-monitor.c',
>>>>>>>        'qtest.c',
>>>>>>
>>>>>
>>>>
>>>
>>
> 

-- 
Alexey


Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Chenyi Qiang 4 months ago
On 1/10/2025 8:58 AM, Alexey Kardashevskiy wrote:
> 
> 
> On 9/1/25 15:29, Chenyi Qiang wrote:
>>
>>
>> On 1/9/2025 10:55 AM, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 9/1/25 13:11, Chenyi Qiang wrote:
>>>>
>>>>
>>>> On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
>>>>>
>>>>>
>>>>> On 8/1/25 21:56, Chenyi Qiang wrote:
>>>>>>
>>>>>>
>>>>>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>>>> uncoordinated discard") highlighted, some subsystems like VFIO
>>>>>>>> might
>>>>>>>> disable ram block discard. However, guest_memfd relies on the
>>>>>>>> discard
>>>>>>>> operation to perform page conversion between private and shared
>>>>>>>> memory.
>>>>>>>> This can lead to stale IOMMU mapping issue when assigning a
>>>>>>>> hardware
>>>>>>>> device to a confidential VM via shared memory (unprotected memory
>>>>>>>> pages). Blocking shared page discard can solve this problem, but it
>>>>>>>> could cause guests to consume twice the memory with VFIO, which is
>>>>>>>> not
>>>>>>>> acceptable in some cases. An alternative solution is to convey
>>>>>>>> other
>>>>>>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>>>>>>
>>>>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>>>>> adjust
>>>>>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>>>>>> conversion is similar to hot-removing a page in one mode and
>>>>>>>> adding it
>>>>>>>> back in the other, so the similar work that needs to happen in
>>>>>>>> response
>>>>>>>> to virtio-mem changes needs to happen for page conversion events.
>>>>>>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>>>>>
>>>>>>>> However, guest_memfd is not an object so it cannot directly
>>>>>>>> implement
>>>>>>>> the RamDiscardManager interface.
>>>>>>>>
>>>>>>>> One solution is to implement the interface in HostMemoryBackend.
>>>>>>>> Any
>>>>>>>
>>>>>>> This sounds about right.
> 
> btw I am using this for ages:
> 
> https://github.com/aik/qemu/commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46
> 
> but I am not sure if this ever saw the light of the day, did not it?
> (ironically I am using it as a base for encrypted DMA :) )

Yeah, we are doing the same work. I saw a solution from Michael a long
time ago (when there was still a dedicated hostmem-memfd-private backend
for restrictedmem/gmem):
(https://github.com/AMDESE/qemu/commit/3bf5255fc48d648724d66410485081ace41d8ee6)

Your patch only implements the interface for HostMemoryBackendMemfd.
Maybe it is more appropriate to implement it for the parent object
HostMemoryBackend, because besides MEMORY_BACKEND_MEMFD, other backend
types like MEMORY_BACKEND_RAM and MEMORY_BACKEND_FILE can also be
guest_memfd-backed.

Thinking more about where to implement this interface, I am still
uncertain. As I mentioned in another mail, a ram device memory region
might be backed by guest_memfd if we support TEE IO iommufd MMIO in the
future; then a specific object would be more appropriate. What's your
opinion?
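
For reference, a rough sketch of what the HostMemoryBackend variant could
look like (the host_backend_rdm_*() callbacks and the trimmed TypeInfo are
placeholders for discussion, not existing code):

    static void host_memory_backend_class_init(ObjectClass *oc, void *data)
    {
        RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);

        /* ... existing HostMemoryBackend class setup kept as-is ... */

        rdmc->get_min_granularity = host_backend_rdm_get_min_granularity;
        rdmc->is_populated = host_backend_rdm_is_populated;
        rdmc->replay_populated = host_backend_rdm_replay_populated;
        rdmc->replay_discarded = host_backend_rdm_replay_discarded;
        rdmc->register_listener = host_backend_rdm_register_listener;
        rdmc->unregister_listener = host_backend_rdm_unregister_listener;
    }

    static const TypeInfo host_memory_backend_info = {
        .name = TYPE_MEMORY_BACKEND,
        .parent = TYPE_OBJECT,
        /* ... */
        .interfaces = (InterfaceInfo[]) {
            { TYPE_USER_CREATABLE },
            { TYPE_RAM_DISCARD_MANAGER },   /* the added interface */
            { }
        },
    };

That would cover all guest_memfd-capable backends, but still not
guest_memfd MRs without a backend (pc.bios/pc.rom), which is where a
standalone object has an edge.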

> 
>>>>>>>
>>>>>>>> guest_memfd-backed host memory backend can register itself in the
>>>>>>>> target
>>>>>>>> MemoryRegion. However, this solution doesn't cover the scenario
>>>>>>>> where a
>>>>>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend,
>>>>>>>> e.g.
>>>>>>>> the virtual BIOS MemoryRegion.
>>>>>>>
>>>>>>> What is this virtual BIOS MemoryRegion exactly? What does it look
>>>>>>> like
>>>>>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>>>>>
>>>>>> virtual BIOS shows in a separate region:
>>>>>>
>>>>>>     Root memory region: system
>>>>>>      0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>>>>>      ...
>>>>>>      00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>>>>
>>>>> Looks like a normal MR which can be backed by guest_memfd.
>>>>
>>>> Yes, virtual BIOS memory region is initialized by
>>>> memory_region_init_ram_guest_memfd() which will be backed by a
>>>> guest_memfd.
>>>>
>>>> The tricky thing is, for Intel TDX (not sure about AMD SEV), the
>>>> virtual
>>>> BIOS image will be loaded and then copied to private region.
>>>> After that,
>>>> the loaded image will be discarded and this region become useless.
>>>
>>> I'd think it is loaded as "struct Rom" and then copied to the MR-
>>> ram_guest_memfd() which does not leave MR useless - we still see
>>> "pc.bios" in the list so it is not discarded. What piece of code are you
>>> referring to exactly?
>>
>> Sorry for confusion, maybe it is different between TDX and SEV-SNP for
>> the vBIOS handling.
>>
>> In x86_bios_rom_init(), it initializes a guest_memfd-backed MR and loads
>> the vBIOS image to the shared part of the guest_memfd MR.
>> For TDX, it
>> will copy the image to private region (not the vBIOS guest_memfd MR
>> private part) and discard the shared part. So, although the memory
>> region still exists, it seems useless.
>> It is different for SEV-SNP, correct? Does SEV-SNP manage the vBIOS in
>> vBIOS guest_memfd private memory?
> 
> This is what it looks like on my SNP VM (which, I suspect, is the same
> as yours as hw/i386/pc.c does not distinguish Intel/AMD for this matter):

Yes, the memory region object is created on both TDX and SEV-SNP.

> 
>  Root memory region: system
>   0000000000000000-00000000000bffff (prio 0, ram): ram1 KVM gmemfd=20
>   00000000000c0000-00000000000dffff (prio 1, ram): pc.rom KVM gmemfd=27
>   00000000000e0000-000000001fffffff (prio 0, ram): ram1
> @00000000000e0000 KVM gmemfd=20
> ...
>   00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM gmemfd=26
> 
> So the pc.bios MR exists and in use (hence its appearance in "info mtree
> -f").
> 
> 
> I added the gmemfd dumping:
> 
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -3446,6 +3446,9 @@ static void mtree_print_flatview(gpointer key,
> gpointer value,
>                  }
>              }
>          }
> +        if (mr->ram_block && mr->ram_block->guest_memfd >= 0) {
> +            qemu_printf(" gmemfd=%d", mr->ram_block->guest_memfd);
> +        }
> 

Then I think the virtual BIOS is another case that does not belong to a
HostMemoryBackend, which convinces us to implement the interface in a
specific object, no?

> 
>>>
>>>
>>>> So I
>>>> feel like this virtual BIOS should not be backed by guest_memfd?
>>>
>>>  From the above it sounds like the opposite, i.e. it should :)
>>>
>>>>>
>>>>>>      0000000100000000-000000017fffffff (prio 0, ram): pc.ram
>>>>>> @0000000080000000 KVM
>>>>>
>>>>> Anyway if there is no guest_memfd backing it and
>>>>> memory_region_has_ram_discard_manager() returns false, then the MR is
>>>>> just going to be mapped for VFIO as usual which seems... alright,
>>>>> right?
>>>>
>>>> Correct. As the vBIOS is backed by guest_memfd and we implement the RDM
>>>> for guest_memfd_manager, the vBIOS MR won't be mapped by VFIO.
>>>>
>>>> If we go with the HostMemoryBackend instead of guest_memfd_manager,
>>>> this
>>>> MR would be mapped by VFIO. Maybe need to avoid such vBIOS mapping, or
>>>> just ignore it since the MR is useless (but looks not so good).
>>>
>>> Sorry I am missing necessary details here, let's figure out the above.
>>>
>>>>
>>>>>
>>>>>
>>>>>> We also consider to implement the interface in HostMemoryBackend, but
>>>>>> maybe implement with guest_memfd region is more general. We don't
>>>>>> know
>>>>>> if any DMAable memory would belong to HostMemoryBackend although at
>>>>>> present it is.
>>>>>>
>>>>>> If it is more appropriate to implement it with HostMemoryBackend,
>>>>>> I can
>>>>>> change to this way.
>>>>>
>>>>> Seems cleaner imho.
>>>>
>>>> I can go this way.
>>
>> [...]
>>
>>>>>>>> +
>>>>>>>> +static int guest_memfd_rdm_replay_populated(const
>>>>>>>> RamDiscardManager
>>>>>>>> *rdm,
>>>>>>>> +                                            MemoryRegionSection
>>>>>>>> *section,
>>>>>>>> +                                            ReplayRamPopulate
>>>>>>>> replay_fn,
>>>>>>>> +                                            void *opaque)
>>>>>>>> +{
>>>>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>>>>> +    struct GuestMemfdReplayData data = { .fn =
>>>>>>>> replay_fn, .opaque =
>>>>>>>> opaque };
>>>>>>>> +
>>>>>>>> +    g_assert(section->mr == gmm->mr);
>>>>>>>> +    return guest_memfd_for_each_populated_section(gmm, section,
>>>>>>>> &data,
>>>>>>>> +
>>>>>>>> guest_memfd_rdm_replay_populated_cb);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection
>>>>>>>> *section, void *arg)
>>>>>>>> +{
>>>>>>>> +    struct GuestMemfdReplayData *data = arg;
>>>>>>>> +    ReplayRamDiscard replay_fn = data->fn;
>>>>>>>> +
>>>>>>>> +    replay_fn(section, data->opaque);
>>>>>>>
>>>>>>>
>>>>>>> guest_memfd_rdm_replay_populated_cb() checks for errors though.
>>>>>>
>>>>>> It follows current definiton of ReplayRamDiscard() and
>>>>>> ReplayRamPopulate() where replay_discard() doesn't return errors and
>>>>>> replay_populate() returns errors.
>>>>>
>>>>> A trace would be appropriate imho. Thanks,
>>>>
>>>> Sorry, can't catch you. What kind of info to be traced? The errors
>>>> returned by replay_populate()?
>>>
>>> Yeah. imho these are useful as we expect this part to work in general
>>> too, right? Thanks,
>>
>> Something like?
>>
>> diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
>> index 6b3e1ee9d6..4440ac9e59 100644
>> --- a/system/guest-memfd-manager.c
>> +++ b/system/guest-memfd-manager.c
>> @@ -185,8 +185,14 @@ static int
>> guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, voi
>>   {
>>       struct GuestMemfdReplayData *data = arg;
>>       ReplayRamPopulate replay_fn = data->fn;
>> +    int ret;
>>
>> -    return replay_fn(section, data->opaque);
>> +    ret = replay_fn(section, data->opaque);
>> +    if (ret) {
>> +        trace_guest_memfd_rdm_replay_populated_cb(ret);
>> +    }
>> +
>> +    return ret;
>>   }
>>
>> How about just adding some error output in
>> guest_memfd_for_each_populated_section()/
>> guest_memfd_for_each_discarded_section()
>> if the cb() (i.e. replay_populate()) returns error?
> 
> this will do too, yes. Thanks,
> 



Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Alexey Kardashevskiy 3 months, 4 weeks ago
On 10/1/25 17:38, Chenyi Qiang wrote:
> 
> 
> On 1/10/2025 8:58 AM, Alexey Kardashevskiy wrote:
>>
>>
>> On 9/1/25 15:29, Chenyi Qiang wrote:
>>>
>>>
>>> On 1/9/2025 10:55 AM, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 9/1/25 13:11, Chenyi Qiang wrote:
>>>>>
>>>>>
>>>>> On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
>>>>>>
>>>>>>
>>>>>> On 8/1/25 21:56, Chenyi Qiang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>>>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>>>>> uncoordinated discard") highlighted, some subsystems like VFIO
>>>>>>>>> might
>>>>>>>>> disable ram block discard. However, guest_memfd relies on the
>>>>>>>>> discard
>>>>>>>>> operation to perform page conversion between private and shared
>>>>>>>>> memory.
>>>>>>>>> This can lead to stale IOMMU mapping issue when assigning a
>>>>>>>>> hardware
>>>>>>>>> device to a confidential VM via shared memory (unprotected memory
>>>>>>>>> pages). Blocking shared page discard can solve this problem, but it
>>>>>>>>> could cause guests to consume twice the memory with VFIO, which is
>>>>>>>>> not
>>>>>>>>> acceptable in some cases. An alternative solution is to convey
>>>>>>>>> other
>>>>>>>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>>>>>>>
>>>>>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>>>>>> adjust
>>>>>>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>>>>>>> conversion is similar to hot-removing a page in one mode and
>>>>>>>>> adding it
>>>>>>>>> back in the other, so the similar work that needs to happen in
>>>>>>>>> response
>>>>>>>>> to virtio-mem changes needs to happen for page conversion events.
>>>>>>>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>>>>>>
>>>>>>>>> However, guest_memfd is not an object so it cannot directly
>>>>>>>>> implement
>>>>>>>>> the RamDiscardManager interface.
>>>>>>>>>
>>>>>>>>> One solution is to implement the interface in HostMemoryBackend.
>>>>>>>>> Any
>>>>>>>>
>>>>>>>> This sounds about right.
>>
>> btw I am using this for ages:
>>
>> https://github.com/aik/qemu/commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46
>>
>> but I am not sure if this ever saw the light of the day, did not it?
>> (ironically I am using it as a base for encrypted DMA :) )
> 
> Yeah, we are doing the same work. I saw a solution from Michael long
> time ago (when there was still
> a dedicated hostmem-memfd-private backend for restrictedmem/gmem)
> (https://github.com/AMDESE/qemu/commit/3bf5255fc48d648724d66410485081ace41d8ee6)
> 
> For your patch, it only implement the interface for
> HostMemoryBackendMemfd. Maybe it is more appropriate to implement it for
> the parent object HostMemoryBackend, because besides the
> MEMORY_BACKEND_MEMFD, other backend types like MEMORY_BACKEND_RAM and
> MEMORY_BACKEND_FILE can also be guest_memfd-backed.
> 
> Think more about where to implement this interface. It is still
> uncertain to me. As I mentioned in another mail, maybe ram device memory
> region would be backed by guest_memfd if we support TEE IO iommufd MMIO
> in future. Then a specific object is more appropriate. What's your opinion?

I do not know about this. Unlike RAM, MMIO can only do "in-place
conversion", the interface to do so is not straightforward, and VFIO owns
MMIO anyway, so the uAPI will be in iommufd; here is a gist of it:

https://github.com/aik/linux/commit/89e45c0404fa5006b2a4de33a4d582adf1ba9831

A "guest request" is a communication channel from the VM to the secure FW
(AMD's "PSP") used to make the MMIO allow encrypted access.


>>
>>>>>>>>
>>>>>>>>> guest_memfd-backed host memory backend can register itself in the
>>>>>>>>> target
>>>>>>>>> MemoryRegion. However, this solution doesn't cover the scenario
>>>>>>>>> where a
>>>>>>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend,
>>>>>>>>> e.g.
>>>>>>>>> the virtual BIOS MemoryRegion.
>>>>>>>>
>>>>>>>> What is this virtual BIOS MemoryRegion exactly? What does it look
>>>>>>>> like
>>>>>>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>>>>>>
>>>>>>> virtual BIOS shows in a separate region:
>>>>>>>
>>>>>>>      Root memory region: system
>>>>>>>       0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>>>>>>       ...
>>>>>>>       00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>>>>>
>>>>>> Looks like a normal MR which can be backed by guest_memfd.
>>>>>
>>>>> Yes, virtual BIOS memory region is initialized by
>>>>> memory_region_init_ram_guest_memfd() which will be backed by a
>>>>> guest_memfd.
>>>>>
>>>>> The tricky thing is, for Intel TDX (not sure about AMD SEV), the
>>>>> virtual
>>>>> BIOS image will be loaded and then copied to private region.
>>>>> After that,
>>>>> the loaded image will be discarded and this region become useless.
>>>>
>>>> I'd think it is loaded as "struct Rom" and then copied to the MR-
>>>> ram_guest_memfd() which does not leave MR useless - we still see
>>>> "pc.bios" in the list so it is not discarded. What piece of code are you
>>>> referring to exactly?
>>>
>>> Sorry for confusion, maybe it is different between TDX and SEV-SNP for
>>> the vBIOS handling.
>>>
>>> In x86_bios_rom_init(), it initializes a guest_memfd-backed MR and loads
>>> the vBIOS image to the shared part of the guest_memfd MR.
>>> For TDX, it
>>> will copy the image to private region (not the vBIOS guest_memfd MR
>>> private part) and discard the shared part. So, although the memory
>>> region still exists, it seems useless.
>>> It is different for SEV-SNP, correct? Does SEV-SNP manage the vBIOS in
>>> vBIOS guest_memfd private memory?
>>
>> This is what it looks like on my SNP VM (which, I suspect, is the same
>> as yours as hw/i386/pc.c does not distinguish Intel/AMD for this matter):
> 
> Yes, the memory region object is created on both TDX and SEV-SNP.
> 
>>
>>   Root memory region: system
>>    0000000000000000-00000000000bffff (prio 0, ram): ram1 KVM gmemfd=20
>>    00000000000c0000-00000000000dffff (prio 1, ram): pc.rom KVM gmemfd=27
>>    00000000000e0000-000000001fffffff (prio 0, ram): ram1
>> @00000000000e0000 KVM gmemfd=20
>> ...
>>    00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM gmemfd=26
>>
>> So the pc.bios MR exists and in use (hence its appearance in "info mtree
>> -f").
>>
>>
>> I added the gmemfd dumping:
>>
>> --- a/system/memory.c
>> +++ b/system/memory.c
>> @@ -3446,6 +3446,9 @@ static void mtree_print_flatview(gpointer key,
>> gpointer value,
>>                   }
>>               }
>>           }
>> +        if (mr->ram_block && mr->ram_block->guest_memfd >= 0) {
>> +            qemu_printf(" gmemfd=%d", mr->ram_block->guest_memfd);
>> +        }
>>
> 
> Then I think the virtual BIOS is another case not belonging to
> HostMemoryBackend which convince us to implement the interface in a
> specific object, no?

TBH I have no idea why pc.rom and pc.bios are separate memory regions,
but in any case why would these two areas need to be treated any
differently from the rest of RAM? Thanks,


>>
>>>>
>>>>
>>>>> So I
>>>>> feel like this virtual BIOS should not be backed by guest_memfd?
>>>>
>>>>   From the above it sounds like the opposite, i.e. it should :)
>>>>
>>>>>>
>>>>>>>       0000000100000000-000000017fffffff (prio 0, ram): pc.ram
>>>>>>> @0000000080000000 KVM
>>>>>>
>>>>>> Anyway if there is no guest_memfd backing it and
>>>>>> memory_region_has_ram_discard_manager() returns false, then the MR is
>>>>>> just going to be mapped for VFIO as usual which seems... alright,
>>>>>> right?
>>>>>
>>>>> Correct. As the vBIOS is backed by guest_memfd and we implement the RDM
>>>>> for guest_memfd_manager, the vBIOS MR won't be mapped by VFIO.
>>>>>
>>>>> If we go with the HostMemoryBackend instead of guest_memfd_manager,
>>>>> this
>>>>> MR would be mapped by VFIO. Maybe need to avoid such vBIOS mapping, or
>>>>> just ignore it since the MR is useless (but looks not so good).
>>>>
>>>> Sorry I am missing necessary details here, let's figure out the above.
>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>> We also consider to implement the interface in HostMemoryBackend, but
>>>>>>> maybe implement with guest_memfd region is more general. We don't
>>>>>>> know
>>>>>>> if any DMAable memory would belong to HostMemoryBackend although at
>>>>>>> present it is.
>>>>>>>
>>>>>>> If it is more appropriate to implement it with HostMemoryBackend,
>>>>>>> I can
>>>>>>> change to this way.
>>>>>>
>>>>>> Seems cleaner imho.
>>>>>
>>>>> I can go this way.
>>>
>>> [...]
>>>
>>>>>>>>> +
>>>>>>>>> +static int guest_memfd_rdm_replay_populated(const
>>>>>>>>> RamDiscardManager
>>>>>>>>> *rdm,
>>>>>>>>> +                                            MemoryRegionSection
>>>>>>>>> *section,
>>>>>>>>> +                                            ReplayRamPopulate
>>>>>>>>> replay_fn,
>>>>>>>>> +                                            void *opaque)
>>>>>>>>> +{
>>>>>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>>>>>> +    struct GuestMemfdReplayData data = { .fn =
>>>>>>>>> replay_fn, .opaque =
>>>>>>>>> opaque };
>>>>>>>>> +
>>>>>>>>> +    g_assert(section->mr == gmm->mr);
>>>>>>>>> +    return guest_memfd_for_each_populated_section(gmm, section,
>>>>>>>>> &data,
>>>>>>>>> +
>>>>>>>>> guest_memfd_rdm_replay_populated_cb);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection
>>>>>>>>> *section, void *arg)
>>>>>>>>> +{
>>>>>>>>> +    struct GuestMemfdReplayData *data = arg;
>>>>>>>>> +    ReplayRamDiscard replay_fn = data->fn;
>>>>>>>>> +
>>>>>>>>> +    replay_fn(section, data->opaque);
>>>>>>>>
>>>>>>>>
>>>>>>>> guest_memfd_rdm_replay_populated_cb() checks for errors though.
>>>>>>>
>>>>>>> It follows current definiton of ReplayRamDiscard() and
>>>>>>> ReplayRamPopulate() where replay_discard() doesn't return errors and
>>>>>>> replay_populate() returns errors.
>>>>>>
>>>>>> A trace would be appropriate imho. Thanks,
>>>>>
>>>>> Sorry, can't catch you. What kind of info to be traced? The errors
>>>>> returned by replay_populate()?
>>>>
>>>> Yeah. imho these are useful as we expect this part to work in general
>>>> too, right? Thanks,
>>>
>>> Something like?
>>>
>>> diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
>>> index 6b3e1ee9d6..4440ac9e59 100644
>>> --- a/system/guest-memfd-manager.c
>>> +++ b/system/guest-memfd-manager.c
>>> @@ -185,8 +185,14 @@ static int
>>> guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, voi
>>>    {
>>>        struct GuestMemfdReplayData *data = arg;
>>>        ReplayRamPopulate replay_fn = data->fn;
>>> +    int ret;
>>>
>>> -    return replay_fn(section, data->opaque);
>>> +    ret = replay_fn(section, data->opaque);
>>> +    if (ret) {
>>> +        trace_guest_memfd_rdm_replay_populated_cb(ret);
>>> +    }
>>> +
>>> +    return ret;
>>>    }
>>>
>>> How about just adding some error output in
>>> guest_memfd_for_each_populated_section()/
>>> guest_memfd_for_each_discarded_section()
>>> if the cb() (i.e. replay_populate()) returns error?
>>
>> this will do too, yes. Thanks,
>>
> 
> 
> 

-- 
Alexey


Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Chenyi Qiang 3 months, 4 weeks ago
On 1/15/2025 12:06 PM, Alexey Kardashevskiy wrote:
> On 10/1/25 17:38, Chenyi Qiang wrote:
>>
>>
>> On 1/10/2025 8:58 AM, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 9/1/25 15:29, Chenyi Qiang wrote:
>>>>
>>>>
>>>> On 1/9/2025 10:55 AM, Alexey Kardashevskiy wrote:
>>>>>
>>>>>
>>>>> On 9/1/25 13:11, Chenyi Qiang wrote:
>>>>>>
>>>>>>
>>>>>> On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 8/1/25 21:56, Chenyi Qiang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>>>>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>>>>>> uncoordinated discard") highlighted, some subsystems like VFIO
>>>>>>>>>> might
>>>>>>>>>> disable ram block discard. However, guest_memfd relies on the
>>>>>>>>>> discard
>>>>>>>>>> operation to perform page conversion between private and shared
>>>>>>>>>> memory.
>>>>>>>>>> This can lead to stale IOMMU mapping issue when assigning a
>>>>>>>>>> hardware
>>>>>>>>>> device to a confidential VM via shared memory (unprotected memory
>>>>>>>>>> pages). Blocking shared page discard can solve this problem,
>>>>>>>>>> but it
>>>>>>>>>> could cause guests to consume twice the memory with VFIO,
>>>>>>>>>> which is
>>>>>>>>>> not
>>>>>>>>>> acceptable in some cases. An alternative solution is to convey
>>>>>>>>>> other
>>>>>>>>>> systems like VFIO to refresh its outdated IOMMU mappings.
>>>>>>>>>>
>>>>>>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>>>>>>> adjust
>>>>>>>>>> VFIO mappings in relation to VM page assignment. Effectively page
>>>>>>>>>> conversion is similar to hot-removing a page in one mode and
>>>>>>>>>> adding it
>>>>>>>>>> back in the other, so the similar work that needs to happen in
>>>>>>>>>> response
>>>>>>>>>> to virtio-mem changes needs to happen for page conversion events.
>>>>>>>>>> Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>>>>>>>
>>>>>>>>>> However, guest_memfd is not an object so it cannot directly
>>>>>>>>>> implement
>>>>>>>>>> the RamDiscardManager interface.
>>>>>>>>>>
>>>>>>>>>> One solution is to implement the interface in HostMemoryBackend.
>>>>>>>>>> Any
>>>>>>>>>
>>>>>>>>> This sounds about right.
>>>
>>> btw I am using this for ages:
>>>
>>> https://github.com/aik/qemu/
>>> commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46
>>>
>>> but I am not sure if this ever saw the light of the day, did not it?
>>> (ironically I am using it as a base for encrypted DMA :) )
>>
>> Yeah, we are doing the same work. I saw a solution from Michael long
>> time ago (when there was still
>> a dedicated hostmem-memfd-private backend for restrictedmem/gmem)
>> (https://github.com/AMDESE/qemu/
>> commit/3bf5255fc48d648724d66410485081ace41d8ee6)
>>
>> For your patch, it only implement the interface for
>> HostMemoryBackendMemfd. Maybe it is more appropriate to implement it for
>> the parent object HostMemoryBackend, because besides the
>> MEMORY_BACKEND_MEMFD, other backend types like MEMORY_BACKEND_RAM and
>> MEMORY_BACKEND_FILE can also be guest_memfd-backed.
>>
>> Think more about where to implement this interface. It is still
>> uncertain to me. As I mentioned in another mail, maybe ram device memory
>> region would be backed by guest_memfd if we support TEE IO iommufd MMIO
>> in future. Then a specific object is more appropriate. What's your
>> opinion?
> 
> I do not know about this. Unlike RAM, MMIO can only do "in-place
> conversion" and the interface to do so is not straight forward and VFIO
> owns MMIO anyway so the uAPI will be in iommufd, here is a gist of it:
> 
> https://github.com/aik/linux/
> commit/89e45c0404fa5006b2a4de33a4d582adf1ba9831
> 
> "guest request" is a communication channel from the VM to the secure FW
> (AMD's "PSP") to make MMIO allow encrypted access.

It is still uncertain how private MMIO would be implemented. Our
assumption is that private MMIO would also create a memory region with a
guest_memfd-like backend. Its mr->ram would be true, and it should be
managed by a RamDiscardManager so that VFIO's region_add listener can
skip the eager DMA_MAP, roughly as in the sketch below.
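
For context, the existing VFIO listener already behaves roughly like this
for RDM-managed regions (paraphrased and heavily trimmed from
hw/vfio/common.c, so take it as a sketch rather than the exact code):

    static void vfio_listener_region_add(MemoryListener *listener,
                                         MemoryRegionSection *section)
    {
        /* ... VFIOContainer lookup, checks, iova/len translation elided ... */

        /*
         * Regions governed by a RamDiscardManager are not DMA-mapped eagerly.
         * VFIO registers a RamDiscardListener instead and maps/unmaps in the
         * notify_populate()/notify_discard() callbacks.
         */
        if (memory_region_has_ram_discard_manager(section->mr)) {
            vfio_register_ram_discard_listener(container, section);
            return;
        }

        /* ... otherwise fall through to the usual vfio_dma_map() path ... */
    }

So as long as the private MMIO MR reports a RamDiscardManager, it would be
skipped there without extra VFIO changes.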

> 
> 
>>>
>>>>>>>>>
>>>>>>>>>> guest_memfd-backed host memory backend can register itself in the
>>>>>>>>>> target
>>>>>>>>>> MemoryRegion. However, this solution doesn't cover the scenario
>>>>>>>>>> where a
>>>>>>>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend,
>>>>>>>>>> e.g.
>>>>>>>>>> the virtual BIOS MemoryRegion.
>>>>>>>>>
>>>>>>>>> What is this virtual BIOS MemoryRegion exactly? What does it look
>>>>>>>>> like
>>>>>>>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>>>>>>>
>>>>>>>> virtual BIOS shows in a separate region:
>>>>>>>>
>>>>>>>>      Root memory region: system
>>>>>>>>       0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>>>>>>>       ...
>>>>>>>>       00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>>>>>>
>>>>>>> Looks like a normal MR which can be backed by guest_memfd.
>>>>>>
>>>>>> Yes, virtual BIOS memory region is initialized by
>>>>>> memory_region_init_ram_guest_memfd() which will be backed by a
>>>>>> guest_memfd.
>>>>>>
>>>>>> The tricky thing is, for Intel TDX (not sure about AMD SEV), the
>>>>>> virtual
>>>>>> BIOS image will be loaded and then copied to private region.
>>>>>> After that,
>>>>>> the loaded image will be discarded and this region become useless.
>>>>>
>>>>> I'd think it is loaded as "struct Rom" and then copied to the MR-
>>>>> ram_guest_memfd() which does not leave MR useless - we still see
>>>>> "pc.bios" in the list so it is not discarded. What piece of code
>>>>> are you
>>>>> referring to exactly?
>>>>
>>>> Sorry for confusion, maybe it is different between TDX and SEV-SNP for
>>>> the vBIOS handling.
>>>>
>>>> In x86_bios_rom_init(), it initializes a guest_memfd-backed MR and
>>>> loads
>>>> the vBIOS image to the shared part of the guest_memfd MR.
>>>> For TDX, it
>>>> will copy the image to private region (not the vBIOS guest_memfd MR
>>>> private part) and discard the shared part. So, although the memory
>>>> region still exists, it seems useless.
>>>> It is different for SEV-SNP, correct? Does SEV-SNP manage the vBIOS in
>>>> vBIOS guest_memfd private memory?
>>>
>>> This is what it looks like on my SNP VM (which, I suspect, is the same
>>> as yours as hw/i386/pc.c does not distinguish Intel/AMD for this
>>> matter):
>>
>> Yes, the memory region object is created on both TDX and SEV-SNP.
>>
>>>
>>>   Root memory region: system
>>>    0000000000000000-00000000000bffff (prio 0, ram): ram1 KVM gmemfd=20
>>>    00000000000c0000-00000000000dffff (prio 1, ram): pc.rom KVM gmemfd=27
>>>    00000000000e0000-000000001fffffff (prio 0, ram): ram1
>>> @00000000000e0000 KVM gmemfd=20
>>> ...
>>>    00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>> gmemfd=26
>>>
>>> So the pc.bios MR exists and in use (hence its appearance in "info mtree
>>> -f").
>>>
>>>
>>> I added the gmemfd dumping:
>>>
>>> --- a/system/memory.c
>>> +++ b/system/memory.c
>>> @@ -3446,6 +3446,9 @@ static void mtree_print_flatview(gpointer key,
>>> gpointer value,
>>>                   }
>>>               }
>>>           }
>>> +        if (mr->ram_block && mr->ram_block->guest_memfd >= 0) {
>>> +            qemu_printf(" gmemfd=%d", mr->ram_block->guest_memfd);
>>> +        }
>>>
>>
>> Then I think the virtual BIOS is another case not belonging to
>> HostMemoryBackend which convince us to implement the interface in a
>> specific object, no?
> 
> TBH I have no idea why pc.rom and pc.bios are separate memory regions
> but in any case why do these 2 areas need to be treated any different
> than the rest of RAM? Thanks,

I think there is no difference. That's why I suggest implementing the RDM
interface in a specific object that covers both, instead of only in
HostMemoryBackend.
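
To make that concrete, here is a minimal sketch of how such an object could be
attached to any guest_memfd-backed MemoryRegion, regardless of who owns it.
guest_memfd_manager_attach() is a hypothetical helper, not part of this patch,
and the real initialization in the series may look different:

static void guest_memfd_manager_attach(MemoryRegion *mr)
{
    GuestMemfdManager *gmm =
        GUEST_MEMFD_MANAGER(object_new(TYPE_GUEST_MEMFD_MANAGER));

    gmm->mr = mr;
    /* Track shared/private state at host-page granularity. */
    gmm->block_size = qemu_real_host_page_size();
    gmm->bitmap_size = memory_region_size(mr) / gmm->block_size;
    gmm->bitmap = bitmap_new(gmm->bitmap_size);

    /* Works for pc.bios/pc.rom just as well as for HostMemoryBackend RAM. */
    memory_region_set_ram_discard_manager(mr, RAM_DISCARD_MANAGER(gmm));
}

The caller could be memory_region_init_ram_guest_memfd() (or whichever place
creates the guest_memfd), so that every guest_memfd-backed region gets a
manager, including the virtual BIOS one.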

> 
> 

Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Xu Yilun 4 months ago
> > 
> > https://github.com/aik/qemu/commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46
> > 
> > but I am not sure if this ever saw the light of the day, did not it?
> > (ironically I am using it as a base for encrypted DMA :) )
> 
> Yeah, we are doing the same work. I saw a solution from Michael long
> time ago (when there was still
> a dedicated hostmem-memfd-private backend for restrictedmem/gmem)
> (https://github.com/AMDESE/qemu/commit/3bf5255fc48d648724d66410485081ace41d8ee6)
> 
> For your patch, it only implement the interface for
> HostMemoryBackendMemfd. Maybe it is more appropriate to implement it for
> the parent object HostMemoryBackend, because besides the
> MEMORY_BACKEND_MEMFD, other backend types like MEMORY_BACKEND_RAM and
> MEMORY_BACKEND_FILE can also be guest_memfd-backed.
> 
> Think more about where to implement this interface. It is still
> uncertain to me. As I mentioned in another mail, maybe ram device memory
> region would be backed by guest_memfd if we support TEE IO iommufd MMIO

It is unlikely that an assigned MMIO region would be backed by guest_memfd or
implemented as part of HostMemoryBackend. Nowadays, assigned MMIO resources
are owned by VFIO types, and I assume that is still true for private MMIO.

But I think with TIO, MMIO regions also need conversion. So I support an
object, but maybe not guest_memfd_manager.

Thanks,
Yilun

> in future. Then a specific object is more appropriate. What's your opinion?
>
Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Xu Yilun 4 months ago
On Fri, Jan 10, 2025 at 05:00:22AM +0800, Xu Yilun wrote:
> > > 
> > > https://github.com/aik/qemu/commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46
> > > 
> > > but I am not sure if this ever saw the light of the day, did not it?
> > > (ironically I am using it as a base for encrypted DMA :) )
> > 
> > Yeah, we are doing the same work. I saw a solution from Michael long
> > time ago (when there was still
> > a dedicated hostmem-memfd-private backend for restrictedmem/gmem)
> > (https://github.com/AMDESE/qemu/commit/3bf5255fc48d648724d66410485081ace41d8ee6)
> > 
> > For your patch, it only implement the interface for
> > HostMemoryBackendMemfd. Maybe it is more appropriate to implement it for
> > the parent object HostMemoryBackend, because besides the
> > MEMORY_BACKEND_MEMFD, other backend types like MEMORY_BACKEND_RAM and
> > MEMORY_BACKEND_FILE can also be guest_memfd-backed.
> > 
> > Think more about where to implement this interface. It is still
> > uncertain to me. As I mentioned in another mail, maybe ram device memory
> > region would be backed by guest_memfd if we support TEE IO iommufd MMIO
> 
> It is unlikely an assigned MMIO region would be backed by guest_memfd or be
> implemented as part of HostMemoryBackend. Nowadays assigned MMIO resource is
> owned by VFIO types, and I assume it is still true for private MMIO.
> 
> But I think with TIO, MMIO regions also need conversion. So I support an
> object, but maybe not guest_memfd_manager.

Sorry, I mean the name only covers private memory, but not private MMIO.

> 
> Thanks,
> Yilun
> 
> > in future. Then a specific object is more appropriate. What's your opinion?
> > 
>
Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Chenyi Qiang 4 months ago
On 1/10/2025 5:50 AM, Xu Yilun wrote:
> On Fri, Jan 10, 2025 at 05:00:22AM +0800, Xu Yilun wrote:
>>>>
>>>> https://github.com/aik/qemu/commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46
>>>>
>>>> but I am not sure if this ever saw the light of the day, did not it?
>>>> (ironically I am using it as a base for encrypted DMA :) )
>>>
>>> Yeah, we are doing the same work. I saw a solution from Michael long
>>> time ago (when there was still
>>> a dedicated hostmem-memfd-private backend for restrictedmem/gmem)
>>> (https://github.com/AMDESE/qemu/commit/3bf5255fc48d648724d66410485081ace41d8ee6)
>>>
>>> For your patch, it only implement the interface for
>>> HostMemoryBackendMemfd. Maybe it is more appropriate to implement it for
>>> the parent object HostMemoryBackend, because besides the
>>> MEMORY_BACKEND_MEMFD, other backend types like MEMORY_BACKEND_RAM and
>>> MEMORY_BACKEND_FILE can also be guest_memfd-backed.
>>>
>>> Think more about where to implement this interface. It is still
>>> uncertain to me. As I mentioned in another mail, maybe ram device memory
>>> region would be backed by guest_memfd if we support TEE IO iommufd MMIO
>>
>> It is unlikely an assigned MMIO region would be backed by guest_memfd or be
>> implemented as part of HostMemoryBackend. Nowadays assigned MMIO resource is
>> owned by VFIO types, and I assume it is still true for private MMIO.
>>
>> But I think with TIO, MMIO regions also need conversion. So I support an
>> object, but maybe not guest_memfd_manager.
> 
> Sorry, I mean the name only covers private memory, but not private MMIO.

So you suggest renaming the object to cover the private MMIO. Then how
about page_conversion_manager, or page_attribute_manager?

> 
>>
>> Thanks,
>> Yilun
>>
>>> in future. Then a specific object is more appropriate. What's your opinion?
>>>
>>
Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Xu Yilun 4 months ago
On Mon, Jan 13, 2025 at 11:34:44AM +0800, Chenyi Qiang wrote:
> 
> 
> On 1/10/2025 5:50 AM, Xu Yilun wrote:
> > On Fri, Jan 10, 2025 at 05:00:22AM +0800, Xu Yilun wrote:
> >>>>
> >>>> https://github.com/aik/qemu/commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46
> >>>>
> >>>> but I am not sure if this ever saw the light of the day, did not it?
> >>>> (ironically I am using it as a base for encrypted DMA :) )
> >>>
> >>> Yeah, we are doing the same work. I saw a solution from Michael long
> >>> time ago (when there was still
> >>> a dedicated hostmem-memfd-private backend for restrictedmem/gmem)
> >>> (https://github.com/AMDESE/qemu/commit/3bf5255fc48d648724d66410485081ace41d8ee6)
> >>>
> >>> For your patch, it only implement the interface for
> >>> HostMemoryBackendMemfd. Maybe it is more appropriate to implement it for
> >>> the parent object HostMemoryBackend, because besides the
> >>> MEMORY_BACKEND_MEMFD, other backend types like MEMORY_BACKEND_RAM and
> >>> MEMORY_BACKEND_FILE can also be guest_memfd-backed.
> >>>
> >>> Think more about where to implement this interface. It is still
> >>> uncertain to me. As I mentioned in another mail, maybe ram device memory
> >>> region would be backed by guest_memfd if we support TEE IO iommufd MMIO
> >>
> >> It is unlikely an assigned MMIO region would be backed by guest_memfd or be
> >> implemented as part of HostMemoryBackend. Nowadays assigned MMIO resource is
> >> owned by VFIO types, and I assume it is still true for private MMIO.
> >>
> >> But I think with TIO, MMIO regions also need conversion. So I support an
> >> object, but maybe not guest_memfd_manager.
> > 
> > Sorry, I mean the name only covers private memory, but not private MMIO.
> 
> So you suggest renaming the object to cover the private MMIO. Then how

Yes.

> about page_conversion_manager, or page_attribute_manager?

Maybe memory_attribute_manager? Strictly speaking, MMIO resources are not
backed by pages.

Thanks,
Yilun

> 
> > 
> >>
> >> Thanks,
> >> Yilun
> >>
> >>> in future. Then a specific object is more appropriate. What's your opinion?
> >>>
> >>
>
Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Chenyi Qiang 4 months ago
On 1/13/2025 6:23 AM, Xu Yilun wrote:
> On Mon, Jan 13, 2025 at 11:34:44AM +0800, Chenyi Qiang wrote:
>>
>>
>> On 1/10/2025 5:50 AM, Xu Yilun wrote:
>>> On Fri, Jan 10, 2025 at 05:00:22AM +0800, Xu Yilun wrote:
>>>>>>
>>>>>> https://github.com/aik/qemu/commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46
>>>>>>
>>>>>> but I am not sure if this ever saw the light of the day, did not it?
>>>>>> (ironically I am using it as a base for encrypted DMA :) )
>>>>>
>>>>> Yeah, we are doing the same work. I saw a solution from Michael long
>>>>> time ago (when there was still
>>>>> a dedicated hostmem-memfd-private backend for restrictedmem/gmem)
>>>>> (https://github.com/AMDESE/qemu/commit/3bf5255fc48d648724d66410485081ace41d8ee6)
>>>>>
>>>>> For your patch, it only implement the interface for
>>>>> HostMemoryBackendMemfd. Maybe it is more appropriate to implement it for
>>>>> the parent object HostMemoryBackend, because besides the
>>>>> MEMORY_BACKEND_MEMFD, other backend types like MEMORY_BACKEND_RAM and
>>>>> MEMORY_BACKEND_FILE can also be guest_memfd-backed.
>>>>>
>>>>> Think more about where to implement this interface. It is still
>>>>> uncertain to me. As I mentioned in another mail, maybe ram device memory
>>>>> region would be backed by guest_memfd if we support TEE IO iommufd MMIO
>>>>
>>>> It is unlikely an assigned MMIO region would be backed by guest_memfd or be
>>>> implemented as part of HostMemoryBackend. Nowadays assigned MMIO resource is
>>>> owned by VFIO types, and I assume it is still true for private MMIO.
>>>>
>>>> But I think with TIO, MMIO regions also need conversion. So I support an
>>>> object, but maybe not guest_memfd_manager.
>>>
>>> Sorry, I mean the name only covers private memory, but not private MMIO.
>>
>> So you suggest renaming the object to cover the private MMIO. Then how
> 
> Yes.
> 
>> about page_conversion_manager, or page_attribute_manager?
> 
> Maybe memory_attribute_manager? Strictly speaking MMIO resource is not
> backed by pages.

Looks good to me. Thanks!

> 
> Thanks,
> Yilun
> 
>>
>>>
>>>>
>>>> Thanks,
>>>> Yilun
>>>>
>>>>> in future. Then a specific object is more appropriate. What's your opinion?
>>>>>
>>>>
>>
>
Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Posted by Chenyi Qiang 4 months, 3 weeks ago
On 12/13/2024 3:08 PM, Chenyi Qiang wrote:
> [... commit message and diffstat snipped ...]
> 
> diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/guest-memfd-manager.h
> new file mode 100644
> index 0000000000..ba4a99b614
> --- /dev/null
> +++ b/include/sysemu/guest-memfd-manager.h
> @@ -0,0 +1,46 @@
> +/*
> + * QEMU guest memfd manager
> + *
> + * Copyright Intel
> + *
> + * Author:
> + *      Chenyi Qiang <chenyi.qiang@intel.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory
> + *
> + */
> +
> +#ifndef SYSEMU_GUEST_MEMFD_MANAGER_H
> +#define SYSEMU_GUEST_MEMFD_MANAGER_H
> +
> +#include "sysemu/hostmem.h"
> +
> +#define TYPE_GUEST_MEMFD_MANAGER "guest-memfd-manager"
> +
> +OBJECT_DECLARE_TYPE(GuestMemfdManager, GuestMemfdManagerClass, GUEST_MEMFD_MANAGER)
> +
> +struct GuestMemfdManager {
> +    Object parent;
> +
> +    /* Managed memory region. */
> +    MemoryRegion *mr;
> +
> +    /*
> +     * 1-setting of the bit represents the memory is populated (shared).
> +     */
> +    int32_t bitmap_size;
> +    unsigned long *bitmap;
> +
> +    /* block size and alignment */
> +    uint64_t block_size;
> +
> +    /* listeners to notify on populate/discard activity. */
> +    QLIST_HEAD(, RamDiscardListener) rdl_list;
> +};
> +
> +struct GuestMemfdManagerClass {
> +    ObjectClass parent_class;
> +};
> +
> +#endif
> diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
> new file mode 100644
> index 0000000000..d7e105fead
> --- /dev/null
> +++ b/system/guest-memfd-manager.c
> @@ -0,0 +1,250 @@
> +/*
> + * QEMU guest memfd manager
> + *
> + * Copyright Intel
> + *
> + * Author:
> + *      Chenyi Qiang <chenyi.qiang@intel.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/error-report.h"
> +#include "sysemu/guest-memfd-manager.h"
> +
> +OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(GuestMemfdManager,
> +                                          guest_memfd_manager,
> +                                          GUEST_MEMFD_MANAGER,
> +                                          OBJECT,
> +                                          { TYPE_RAM_DISCARD_MANAGER },
> +                                          { })
> +

Fixup: Use OBJECT_DEFINE_TYPE_WITH_INTERFACES() instead of
OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(), since we define a class struct
(GuestMemfdManagerClass).
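
For context (my reading of include/qom/object.h, worth double-checking there):
the SIMPLE variant registers the type with class_size = 0, while the
non-SIMPLE one sizes the class from the declared class struct. Roughly, the
macro used after this fixup sets up something like the following (simplified
sketch, not the actual macro body; the init/finalize/class_init hooks that the
macro also wires up are omitted):

/* Simplified illustration of what OBJECT_DEFINE_TYPE_WITH_INTERFACES() registers */
static const TypeInfo guest_memfd_manager_info = {
    .name          = TYPE_GUEST_MEMFD_MANAGER,
    .parent        = TYPE_OBJECT,
    .instance_size = sizeof(GuestMemfdManager),
    .class_size    = sizeof(GuestMemfdManagerClass), /* 0 with the SIMPLE variant */
    .interfaces    = (InterfaceInfo[]) {
        { TYPE_RAM_DISCARD_MANAGER },
        { }
    },
};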

diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
index 50802b34d7..f7dc93071a 100644
--- a/system/guest-memfd-manager.c
+++ b/system/guest-memfd-manager.c
@@ -15,12 +15,12 @@
 #include "qemu/error-report.h"
 #include "sysemu/guest-memfd-manager.h"

-OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(GuestMemfdManager,
-                                          guest_memfd_manager,
-                                          GUEST_MEMFD_MANAGER,
-                                          OBJECT,
-                                          { TYPE_RAM_DISCARD_MANAGER },
-                                          { })
+OBJECT_DEFINE_TYPE_WITH_INTERFACES(GuestMemfdManager,
+                                   guest_memfd_manager,
+                                   GUEST_MEMFD_MANAGER,
+                                   OBJECT,
+                                   { TYPE_RAM_DISCARD_MANAGER },
+                                   { })

 static bool guest_memfd_rdm_is_populated(const RamDiscardManager *rdm,