This is the v6 series of the shared device assignment support.
Compared with the last version [1], this series retains the basic support
and removes the additional complex error handling, which can be added
back when necessary. The patchset has also been re-organized for
clarity.
Overview of this series:
- Patches 1-3: Preparation patches. These include function exposure and
some function prototype changes.
- Patch 4: Introduce a new object that implements the RamDiscardManager
interface, plus a helper to notify shared/private state changes.
- Patch 5: Enable coordinated discarding of RAM with guest_memfd through
the RamDiscardManager interface.
More small changes or details can be found in the individual patches.
---
Background
==========
Confidential VMs have two classes of memory: shared and private memory.
Shared memory is accessible from the host/VMM while private memory is
not. Confidential VMs can decide which memory is shared/private and
convert memory between shared and private at runtime.
"guest_memfd" is a new kind of fd whose primary goal is to serve guest
private memory. In the current implementation, shared memory is allocated
with normal methods (e.g. mmap or fallocate) while private memory is
allocated from guest_memfd. When a VM performs a memory conversion, QEMU
frees pages on one side, via madvise() or via PUNCH_HOLE on the memfd or
guest_memfd, and allocates new pages from the other side. This causes the
stale IOMMU mapping issue described in [2] when we try to enable shared
device assignment in confidential VMs.
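As a rough illustration, the discard half of a conversion could look
like the following standalone C sketch (not QEMU code; the helper name
and the fallback from madvise() to PUNCH_HOLE are illustrative). Any
IOMMU mapping previously created for this range now points at freed
pages:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <fcntl.h>

    /* Free the shared copy of a range being converted to private. */
    static int discard_shared_range(int memfd, void *hva,
                                    off_t offset, size_t len)
    {
        /* Either drop the pages through the shared mapping ... */
        if (madvise(hva, len, MADV_REMOVE) == 0) {
            return 0;
        }
        /* ... or punch a hole directly in the backing memfd. */
        return fallocate(memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         offset, len);
    }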
Solution
========
The key to enabling shared device assignment is to update the IOMMU
mappings on page conversion. RamDiscardManager, an existing interface
currently utilized by virtio-mem, offers a means to modify IOMMU mappings
in accordance with VM page assignment. A page conversion is similar to
hot-removing a page in one mode and adding it back in the other.
This series implements a RamDiscardManager for confidential VMs and
utilizes its infrastructure to notify VFIO of page conversions.
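For reference, a VFIO-style consumer hooks into this interface through
QEMU's RamDiscardListener. The sketch below uses the existing listener
API; the callback bodies are illustrative, and rdl, rdm and section are
assumed to be set up by the caller:

    /* Callbacks invoked when a range changes state. */
    static int my_notify_populate(RamDiscardListener *rdl,
                                  MemoryRegionSection *section)
    {
        /* Range became accessible (e.g. converted to shared): map for DMA. */
        return 0;
    }

    static void my_notify_discard(RamDiscardListener *rdl,
                                  MemoryRegionSection *section)
    {
        /* Range was discarded (e.g. converted to private): unmap it. */
    }

    /* Register against the memory region's RamDiscardManager. */
    ram_discard_listener_init(&rdl, my_notify_populate,
                              my_notify_discard, true);
    ram_discard_manager_register_listener(rdm, &rdl, &section);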
Limitation and future extension
===============================
This series only supports the basic shared device assignment functionality.
There are still some limitations and areas that can be extended and
optimized in the future.
Relationship with in-place conversion
-------------------------------------
In-place page conversion is the ongoing work to allow mmap() of
guest_memfd to userspace so that both private and shared memory can use
the same physical memory as the backend. This new design eliminates the
need to discard pages during shared/private conversions. When it is
ready, shared device assignment will need to be adjusted to follow an
unmap-before-conversion-to-private and map-after-conversion-to-shared
sequence to stay compatible with the change, as sketched below.
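A purely illustrative ordering sketch; every helper name here is
hypothetical, not an existing QEMU API:

    /* Unmap from the IOMMU *before* the range becomes private ... */
    static int convert_to_private(uint64_t gpa, uint64_t size)
    {
        unmap_from_iommu(gpa, size);               /* hypothetical */
        return set_range_private(gpa, size);       /* hypothetical */
    }

    /* ... and map it *after* the range is accessible again. */
    static int convert_to_shared(uint64_t gpa, uint64_t size)
    {
        int ret = set_range_shared(gpa, size);     /* hypothetical */
        if (!ret) {
            ret = map_into_iommu(gpa, size);       /* hypothetical */
        }
        return ret;
    }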
Partial unmap limitation
------------------------
VFIO expects the DMA mapping for a specific IOVA to be mapped and
unmapped with the same granularity. The guest may perform partial
conversion, such as converting a small region within a larger one. To
prevent such invalid cases, all operations are currently performed at 4K
granularity (see the sketch below). This could be optimized once the DMA
mapping cut operation [3] is introduced in the future: we can always
perform a split-before-unmap when a partial conversion happens. If the
split succeeds, the unmap will succeed and be atomic; if the split fails,
the unmap fails as well.
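A minimal sketch of the 4K-granularity mapping, assuming a hypothetical
vfio_dma_map_page() helper (not a real VFIO call): mapping page by page
guarantees that any later partial unmap matches the granularity of the
original mapping:

    #define CONVERSION_GRANULARITY 4096

    static int map_shared_range_4k(uint64_t iova, uint64_t size)
    {
        for (uint64_t off = 0; off < size; off += CONVERSION_GRANULARITY) {
            int ret = vfio_dma_map_page(iova + off, CONVERSION_GRANULARITY);
            if (ret) {
                return ret; /* caller unwinds the pages mapped so far */
            }
        }
        return 0;
    }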
More attributes management
--------------------------
The current RamDiscardManager can only manage a pair of opposite states,
such as populated/discarded or shared/private. If more states need to be
considered, for example to support virtio-mem in confidential VMs, three
states would be possible (shared populated/private populated/discarded).
The current framework cannot handle such a scenario, and a new framework
will need to be designed at that point [4].
Memory overhead optimization
----------------------------
A comment from Baolu [5] suggests considering a Maple Tree or a generic
interval tree to manage the private/shared state instead of a bitmap,
which can reduce memory consumption (a 4K-granularity bitmap costs one
bit per page regardless of fragmentation, e.g. 512 KiB for a 16 GiB
guest). This optimization can also be considered for other bitmap use
cases, such as dirty bitmaps for guest RAM.
Testing
=======
This patch series is tested on the mainline kernel, since TDX base
support has been merged. The QEMU repo is available at:
https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2025-05-30-v2
To test shared device assignment with a NIC, use the legacy type1 VFIO
backend with the QEMU command:
qemu-system-x86_64 [...]
-device vfio-pci,host=XX:XX.X
The dma_entry_limit parameter needs to be raised, since every page is
mapped individually at 4K granularity. For example, a 16GB guest needs
vfio_iommu_type1.dma_entry_limit=4194304 (16 GiB / 4 KiB = 4194304
entries).
To use the iommufd-backed VFIO instead, use the QEMU command:
qemu-system-x86_64 [...]
-object iommufd,id=iommufd0 \
-device vfio-pci,host=XX:XX.X,iommufd=iommufd0
Because new features such as the cut_mapping operation will only be
supported in iommufd, the iommufd-backed VFIO is recommended.
Related link
============
[1] https://lore.kernel.org/qemu-devel/20250520102856.132417-1-chenyi.qiang@intel.com/
[2] https://lore.kernel.org/qemu-devel/20240423150951.41600-54-pbonzini@redhat.com/
[3] https://lore.kernel.org/linux-iommu/0-v2-5c26bde5c22d+58b-iommu_pt_jgg@nvidia.com/
[4] https://lore.kernel.org/qemu-devel/d1a71e00-243b-4751-ab73-c05a4e090d58@redhat.com/
[5] https://lore.kernel.org/qemu-devel/013b36a9-9310-4073-b54c-9c511f23decf@linux.intel.com/
Chenyi Qiang (5):
memory: Export a helper to get intersection of a MemoryRegionSection
with a given range
memory: Change memory_region_set_ram_discard_manager() to return the
result
memory: Unify the definition of ReplayRamPopulate() and
ReplayRamDiscard()
ram-block-attributes: Introduce RamBlockAttributes to manage RAMBlock
with guest_memfd
physmem: Support coordinated discarding of RAM with guest_memfd
MAINTAINERS | 1 +
accel/kvm/kvm-all.c | 9 +
hw/virtio/virtio-mem.c | 83 +++---
include/system/memory.h | 100 +++++--
include/system/ramblock.h | 22 ++
migration/ram.c | 5 +-
system/memory.c | 22 +-
system/meson.build | 1 +
system/physmem.c | 18 +-
system/ram-block-attributes.c | 480 ++++++++++++++++++++++++++++++++++
system/trace-events | 3 +
11 files changed, 660 insertions(+), 84 deletions(-)
create mode 100644 system/ram-block-attributes.c
--
2.43.5
Hi Paolo,
Since this series has received Reviewed-by/Acked-by on all patches, apart
from some coding style comments from Alexey and David's suggestion to
document the bitmap consistency in patch #4: are there any other comments?
Otherwise I will send the next version to resolve them.
Thanks
Chenyi