Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
discard") effectively disables device assignment when using guest_memfd.
This poses a significant challenge as guest_memfd is essential for
confidential guests, thereby blocking device assignment to these VMs.
The initial rationale for disabling device assignment was due to stale
IOMMU mappings (see Problem section) and the assumption that TEE I/O
(SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-assignment
problem for confidential guests [1]. However, this assumption has proven
to be incorrect. TEE I/O relies on the ability to operate devices against
"shared" or untrusted memory, which is crucial for device initialization
and error recovery scenarios. As a result, the current implementation does
not adequately support device assignment for confidential guests,
necessitating a reevaluation of the approach.

This series enables shared device assignment by notifying VFIO of page
conversions using an existing framework named RamDiscardListener.
Additionally, there is an ongoing patch set [2] that aims to add 1G page
support for guest_memfd. That series introduces in-place page conversion,
where private and shared memory share the same physical pages as the
backend. This development may impact our solution.

We presented our solution in the guest_memfd meeting to discuss its
compatibility with the new changes and potential future directions (see [3]
for more details). The conclusion was that, although our solution may not
be the most elegant (see the Limitation section), it is sufficient for now
and can be easily adapted to future changes.

We are re-posting the patch series with some cleanup and have removed the
RFC label for the main enabling patches (1-6). The newly-added patch 7 is
still marked as RFC as it tries to resolve some extension concerns related
to RamDiscardManager for future usage.
The overview of the patches:
- Patch 1: Export a helper to get the intersection of a
  MemoryRegionSection with a given range.
- Patches 2-6: Introduce a new object to manage the guest-memfd with
  RamDiscardManager, and notify the shared/private state change during
  conversion.
- Patch 7: Try to resolve a semantics concern related to
  RamDiscardManager, i.e. RamDiscardManager is used to manage memory
  plug/unplug state instead of shared/private state. It would affect
  future users of RamDiscardManager in confidential VMs. It is attached
  at the end as an RFC patch [4].

Changes since last version:
- Add a patch to export some generic helper functions from virtio-mem
  code.
- Change the bitmap in guest_memfd_manager from default shared to default
  private. This keeps alignment with virtio-mem, where a 1 in the bitmap
  represents the populated state, and may help to export more generic
  code if necessary.
- Add helpers to initialize/uninitialize the guest_memfd_manager instance
  to make it clearer.
- Add a patch to distinguish between the shared/private state change and
  the memory plug/unplug state change in RamDiscardManager.
- RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-chenyi.qiang@intel.com/

---

Background
==========
Confidential VMs have two classes of memory: shared and private memory.
Shared memory is accessible from the host/VMM while private memory is
not. Confidential VMs can decide which memory is shared/private and
convert memory between shared/private at runtime.

"guest_memfd" is a new kind of fd whose primary goal is to serve guest
private memory. The key differences between guest_memfd and normal memfd
are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM,
and cannot be mapped, read, or written by userspace. In QEMU's
implementation, shared memory is allocated with normal methods (e.g. mmap
or fallocate) while private memory is allocated from guest_memfd.
When a VM performs memory conversions, QEMU frees pages via madvise() or
via PUNCH_HOLE on the memfd or guest_memfd on one side and allocates new
pages from the other side.

Problem
=======
Device assignment in QEMU is implemented via the VFIO subsystem. In a
normal VM, VM memory is pinned up front by VFIO. In a confidential VM,
the VM can convert memory, and when that happens nothing currently tells
VFIO that its mappings are stale. This means that page conversion leaks
memory and leaves stale IOMMU mappings. For example, a sequence like the
following can result in stale IOMMU mappings:

1. allocate shared page
2. convert page shared->private
3. discard shared page
4. convert page private->shared
5. allocate shared page
6. issue DMA operations against that shared page

After step 3, VFIO is still pinning the page. However, DMA operations in
step 6 will hit the old mapping that was created in step 1, which causes
the device to access invalid data.

Solution
========
The key to enabling shared device assignment is to update the IOMMU
mappings on page conversion.

Given the constraints and assumptions, here is a solution that satisfies
the use cases. RamDiscardManager, an existing interface currently
utilized by virtio-mem, offers a means to modify IOMMU mappings in
accordance with VM page assignment. Page conversion is similar to
hot-removing a page in one mode and adding it back in the other.

This series implements a RamDiscardManager for confidential VMs and
utilizes its infrastructure to notify VFIO of page conversions.

Another possible attempt [5] was to not discard shared pages in step 3
above. This was an incomplete band-aid because guests would consume twice
the memory, since shared pages wouldn't be freed even after they were
converted to private.
w/ in-place page conversion
===========================
To support 1G pages for guest_memfd, the current direction is to allow
mmap() of guest_memfd to userspace so that both private and shared memory
can use the same physical pages as the backend. This in-place page
conversion design eliminates the need to discard pages during
shared/private conversions. However, device assignment will still be
blocked, because in-place page conversion will reject the conversion when
the page is pinned by VFIO.

To address this, the key difference lies in the sequence of VFIO
map/unmap operations relative to the page conversion. This series can be
adjusted to achieve unmap-before-conversion-to-private and
map-after-conversion-to-shared, ensuring compatibility with guest_memfd.

Additionally, with in-place page conversion, the previously mentioned
workaround of not discarding shared pages is not feasible, because shared
and private memory share the same backend and no discard operation is
performed. Retaining the old mappings in the IOMMU would result in unsafe
DMA access to protected memory.

Limitation
==========
One limitation (also discussed in the guest_memfd meeting) is that VFIO
expects the DMA mapping for a specific IOVA to be mapped and unmapped
with the same granularity. The guest may perform partial conversions,
such as converting a small region within a larger region. To prevent such
invalid cases, all operations are performed with 4K granularity. The
possible solutions we can think of are either to enable VFIO to support
partial unmap or to implement an enlightened guest to avoid partial
conversion. The former requires complex changes in VFIO, while the latter
requires the page conversion to be a guest-enlightened behavior. It is
still uncertain which option is preferred.
Testing
=======
This patch series is tested with the following KVM/QEMU branches:
KVM: https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-11-20
QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2024-12-13

To facilitate shared device assignment with the NIC, employ legacy type1
VFIO with the QEMU command:

qemu-system-x86_64 [...]
 -device vfio-pci,host=XX:XX.X

The dma_entry_limit module parameter needs to be adjusted. For example, a
16GB guest needs something like
vfio_iommu_type1.dma_entry_limit=4194304.

If using iommufd-backed VFIO with the QEMU command:

qemu-system-x86_64 [...]
 -object iommufd,id=iommufd0 \
 -device vfio-pci,host=XX:XX.X,iommufd=iommufd0

no additional adjustment is required.

Following the bootup of the TD guest, the guest's IP address becomes
visible, and iperf is able to successfully send and receive data.

Related link
============
[1] https://lore.kernel.org/all/d6acfbef-96a1-42bc-8866-c12a4de8c57c@redhat.com/
[2] https://lore.kernel.org/lkml/cover.1726009989.git.ackerleytng@google.com/
[3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?tab=t.0#heading=h.jr4csfgw1uql
[4] https://lore.kernel.org/qemu-devel/d299bbad-81bc-462e-91b5-a6d9c27ffe3a@redhat.com/
[5] https://lore.kernel.org/all/20240320083945.991426-20-michael.roth@amd.com/

Chenyi Qiang (7):
  memory: Export a helper to get intersection of a MemoryRegionSection
    with a given range
  guest_memfd: Introduce an object to manage the guest-memfd with
    RamDiscardManager
  guest_memfd: Introduce a callback to notify the shared/private state
    change
  KVM: Notify the state change event during shared/private conversion
  memory: Register the RamDiscardManager instance upon guest_memfd
    creation
  RAMBlock: make guest_memfd require coordinated discard
  memory: Add a new argument to indicate the request attribute in
    RamDiscardManager helpers

 accel/kvm/kvm-all.c                  |   4 +
 hw/vfio/common.c                     |  22 +-
 hw/virtio/virtio-mem.c               |  55 ++--
 include/exec/memory.h                |  36 ++-
 include/sysemu/guest-memfd-manager.h |  91 ++++++
 migration/ram.c                      |  14 +-
 system/guest-memfd-manager.c         | 456 +++++++++++++++++++++++++++
 system/memory.c                      |  30 +-
 system/memory_mapping.c              |   4 +-
 system/meson.build                   |   1 +
 system/physmem.c                     |   9 +-
 11 files changed, 659 insertions(+), 63 deletions(-)
 create mode 100644 include/sysemu/guest-memfd-manager.h
 create mode 100644 system/guest-memfd-manager.c
--
2.43.5
On 13/12/24 18:08, Chenyi Qiang wrote:
> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
> discard") effectively disables device assignment when using guest_memfd.
> This poses a significant challenge as guest_memfd is essential for
> confidential guests, thereby blocking device assignment to these VMs.
> The initial rationale for disabling device assignment was due to stale
> IOMMU mappings (see Problem section) and the assumption that TEE I/O
> (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-assignment
> problem for confidential guests [1]. However, this assumption has proven
> to be incorrect. TEE I/O relies on the ability to operate devices against
> "shared" or untrusted memory, which is crucial for device initialization
> and error recovery scenarios. As a result, the current implementation does
> not adequately support device assignment for confidential guests, necessitating
> a reevaluation of the approach to ensure compatibility and functionality.
>
> This series enables shared device assignment by notifying VFIO of page
> conversions using an existing framework named RamDiscardListener.
> Additionally, there is an ongoing patch set [2] that aims to add 1G page
> support for guest_memfd. This patch set introduces in-place page conversion,
> where private and shared memory share the same physical pages as the backend.
> This development may impact our solution.
>
> We presented our solution in the guest_memfd meeting to discuss its
> compatibility with the new changes and potential future directions (see [3]
> for more details). The conclusion was that, although our solution may not be
> the most elegant (see the Limitation section), it is sufficient for now and
> can be easily adapted to future changes.
>
> We are re-posting the patch series with some cleanup and have removed the RFC
> label for the main enabling patches (1-6). The newly-added patch 7 is still
> marked as RFC as it tries to resolve some extension concerns related to
> RamDiscardManager for future usage.
>
> The overview of the patches:
> - Patch 1: Export a helper to get intersection of a MemoryRegionSection
>   with a given range.
> - Patch 2-6: Introduce a new object to manage the guest-memfd with
>   RamDiscardManager, and notify the shared/private state change during
>   conversion.
> - Patch 7: Try to resolve a semantics concern related to RamDiscardManager
>   i.e. RamDiscardManager is used to manage memory plug/unplug state
>   instead of shared/private state. It would affect future users of
>   RamDiscardManger in confidential VMs. Attach it behind as a RFC patch[4].
>
> Changes since last version:
> - Add a patch to export some generic helper functions from virtio-mem code.
> - Change the bitmap in guest_memfd_manager from default shared to default
>   private. This keeps alignment with virtio-mem that 1-setting in bitmap
>   represents the populated state and may help to export more generic code
>   if necessary.
> - Add the helpers to initialize/uninitialize the guest_memfd_manager instance
>   to make it more clear.
> - Add a patch to distinguish between the shared/private state change and
>   the memory plug/unplug state change in RamDiscardManager.
> - RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-chenyi.qiang@intel.com/
>
> ---
>
> Background
> ==========
> Confidential VMs have two classes of memory: shared and private memory.
> Shared memory is accessible from the host/VMM while private memory is
> not. Confidential VMs can decide which memory is shared/private and
> convert memory between shared/private at runtime.
>
> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
> private memory. The key differences between guest_memfd and normal memfd
> are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and
> cannot be mapped, read or written by userspace.

The "cannot be mapped" seems to be not true soon anymore (if not already).

https://lore.kernel.org/all/20240801090117.3841080-1-tabba@google.com/T/

>
> In QEMU's implementation, shared memory is allocated with normal methods
> (e.g. mmap or fallocate) while private memory is allocated from
> guest_memfd. When a VM performs memory conversions, QEMU frees pages via
> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
> allocates new pages from the other side.
>
> Problem
> =======
> Device assignment in QEMU is implemented via VFIO system. In the normal
> VM, VM memory is pinned at the beginning of time by VFIO. In the
> confidential VM, the VM can convert memory and when that happens
> nothing currently tells VFIO that its mappings are stale. This means
> that page conversion leaks memory and leaves stale IOMMU mappings. For
> example, sequence like the following can result in stale IOMMU mappings:
>
> 1. allocate shared page
> 2. convert page shared->private
> 3. discard shared page
> 4. convert page private->shared
> 5. allocate shared page
> 6. issue DMA operations against that shared page
>
> After step 3, VFIO is still pinning the page. However, DMA operations in
> step 6 will hit the old mapping that was allocated in step 1, which
> causes the device to access the invalid data.
>
> Solution
> ========
> The key to enable shared device assignment is to update the IOMMU mappings
> on page conversion.
>
> Given the constraints and assumptions here is a solution that satisfied
> the use cases. RamDiscardManager, an existing interface currently
> utilized by virtio-mem, offers a means to modify IOMMU mappings in
> accordance with VM page assignment. Page conversion is similar to
> hot-removing a page in one mode and adding it back in the other.
>
> This series implements a RamDiscardManager for confidential VMs and
> utilizes its infrastructure to notify VFIO of page conversions.
>
> Another possible attempt [5] was to not discard shared pages in step 3
> above. This was an incomplete band-aid because guests would consume
> twice the memory since shared pages wouldn't be freed even after they
> were converted to private.
>
> w/ in-place page conversion
> ===========================
> To support 1G page support for guest_memfd, the current direction is to
> allow mmap() of guest_memfd to userspace so that both private and shared
> memory can use the same physical pages as the backend. This in-place page
> conversion design eliminates the need to discard pages during shared/private
> conversions. However, device assignment will still be blocked because the
> in-place page conversion will reject the conversion when the page is pinned
> by VFIO.
>
> To address this, the key difference lies in the sequence of VFIO map/unmap
> operations and the page conversion. This series can be adjusted to achieve
> unmap-before-conversion-to-private and map-after-conversion-to-shared,
> ensuring compatibility with guest_memfd.
>
> Additionally, with in-place page conversion, the previously mentioned
> solution to disable the discard of shared pages is not feasible because
> shared and private memory share the same backend, and no discard operation
> is performed. Retaining the old mappings in the IOMMU would result in
> unsafe DMA access to protected memory.
>
> Limitation
> ==========
>
> One limitation (also discussed in the guest_memfd meeting) is that VFIO
> expects the DMA mapping for a specific IOVA to be mapped and unmapped with
> the same granularity. The guest may perform partial conversions, such as
> converting a small region within a larger region. To prevent such invalid
> cases, all operations are performed with 4K granularity. The possible
> solutions we can think of are either to enable VFIO to support partial unmap
> or to implement an enlightened guest to avoid partial conversion. The former
> requires complex changes in VFIO, while the latter requires the page
> conversion to be a guest-enlightened behavior. It is still uncertain which
> option is a preferred one.

in-place memory conversion is :)

>
> Testing
> =======
> This patch series is tested with the KVM/QEMU branch:
> KVM: https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-11-20
> QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2024-12-13

The branch is gone now? tdx-upstream-snapshot-2024-12-18 seems to have
these though. Thanks,

>
> To facilitate shared device assignment with the NIC, employ the legacy
> type1 VFIO with the QEMU command:
>
> qemu-system-x86_64 [...]
>  -device vfio-pci,host=XX:XX.X
>
> The parameter of dma_entry_limit needs to be adjusted. For example, a
> 16GB guest needs to adjust the parameter like
> vfio_iommu_type1.dma_entry_limit=4194304.
>
> If use the iommufd-backed VFIO with the qemu command:
>
> qemu-system-x86_64 [...]
>  -object iommufd,id=iommufd0 \
>  -device vfio-pci,host=XX:XX.X,iommufd=iommufd0
>
> No additional adjustment required.
>
> Following the bootup of the TD guest, the guest's IP address becomes
> visible, and iperf is able to successfully send and receive data.
>
> Related link
> ============
> [1] https://lore.kernel.org/all/d6acfbef-96a1-42bc-8866-c12a4de8c57c@redhat.com/
> [2] https://lore.kernel.org/lkml/cover.1726009989.git.ackerleytng@google.com/
> [3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?tab=t.0#heading=h.jr4csfgw1uql
> [4] https://lore.kernel.org/qemu-devel/d299bbad-81bc-462e-91b5-a6d9c27ffe3a@redhat.com/
> [5] https://lore.kernel.org/all/20240320083945.991426-20-michael.roth@amd.com/
>
> Chenyi Qiang (7):
>   memory: Export a helper to get intersection of a MemoryRegionSection
>     with a given range
>   guest_memfd: Introduce an object to manage the guest-memfd with
>     RamDiscardManager
>   guest_memfd: Introduce a callback to notify the shared/private state
>     change
>   KVM: Notify the state change event during shared/private conversion
>   memory: Register the RamDiscardManager instance upon guest_memfd
>     creation
>   RAMBlock: make guest_memfd require coordinate discard
>   memory: Add a new argument to indicate the request attribute in
>     RamDismcardManager helpers
>
>  accel/kvm/kvm-all.c                  |   4 +
>  hw/vfio/common.c                     |  22 +-
>  hw/virtio/virtio-mem.c               |  55 ++--
>  include/exec/memory.h                |  36 ++-
>  include/sysemu/guest-memfd-manager.h |  91 ++++++
>  migration/ram.c                      |  14 +-
>  system/guest-memfd-manager.c         | 456 +++++++++++++++++++++++++++
>  system/memory.c                      |  30 +-
>  system/memory_mapping.c              |   4 +-
>  system/meson.build                   |   1 +
>  system/physmem.c                     |   9 +-
>  11 files changed, 659 insertions(+), 63 deletions(-)
>  create mode 100644 include/sysemu/guest-memfd-manager.h
>  create mode 100644 system/guest-memfd-manager.c
>

--
Alexey
Thanks Alexey for your review!

On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
> On 13/12/24 18:08, Chenyi Qiang wrote:
>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>> discard") effectively disables device assignment when using guest_memfd.
>> This poses a significant challenge as guest_memfd is essential for
>> confidential guests, thereby blocking device assignment to these VMs.
>> The initial rationale for disabling device assignment was due to stale
>> IOMMU mappings (see Problem section) and the assumption that TEE I/O
>> (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-assignment
>> problem for confidential guests [1]. However, this assumption has proven
>> to be incorrect. TEE I/O relies on the ability to operate devices against
>> "shared" or untrusted memory, which is crucial for device initialization
>> and error recovery scenarios. As a result, the current implementation
>> does
>> not adequately support device assignment for confidential guests,
>> necessitating
>> a reevaluation of the approach to ensure compatibility and functionality.
>>
>> This series enables shared device assignment by notifying VFIO of page
>> conversions using an existing framework named RamDiscardListener.
>> Additionally, there is an ongoing patch set [2] that aims to add 1G page
>> support for guest_memfd. This patch set introduces in-place page
>> conversion,
>> where private and shared memory share the same physical pages as the
>> backend.
>> This development may impact our solution.
>>
>> We presented our solution in the guest_memfd meeting to discuss its
>> compatibility with the new changes and potential future directions
>> (see [3]
>> for more details). The conclusion was that, although our solution may
>> not be
>> the most elegant (see the Limitation section), it is sufficient for
>> now and
>> can be easily adapted to future changes.
>>
>> We are re-posting the patch series with some cleanup and have removed
>> the RFC
>> label for the main enabling patches (1-6). The newly-added patch 7 is
>> still
>> marked as RFC as it tries to resolve some extension concerns related to
>> RamDiscardManager for future usage.
>>
>> The overview of the patches:
>> - Patch 1: Export a helper to get intersection of a MemoryRegionSection
>>   with a given range.
>> - Patch 2-6: Introduce a new object to manage the guest-memfd with
>>   RamDiscardManager, and notify the shared/private state change during
>>   conversion.
>> - Patch 7: Try to resolve a semantics concern related to
>> RamDiscardManager
>>   i.e. RamDiscardManager is used to manage memory plug/unplug state
>>   instead of shared/private state. It would affect future users of
>>   RamDiscardManger in confidential VMs. Attach it behind as a RFC
>> patch[4].
>>
>> Changes since last version:
>> - Add a patch to export some generic helper functions from virtio-mem
>> code.
>> - Change the bitmap in guest_memfd_manager from default shared to default
>>   private. This keeps alignment with virtio-mem that 1-setting in bitmap
>>   represents the populated state and may help to export more generic
>> code
>>   if necessary.
>> - Add the helpers to initialize/uninitialize the guest_memfd_manager
>> instance
>>   to make it more clear.
>> - Add a patch to distinguish between the shared/private state change and
>>   the memory plug/unplug state change in RamDiscardManager.
>> - RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-
>> chenyi.qiang@intel.com/
>>
>> ---
>>
>> Background
>> ==========
>> Confidential VMs have two classes of memory: shared and private memory.
>> Shared memory is accessible from the host/VMM while private memory is
>> not. Confidential VMs can decide which memory is shared/private and
>> convert memory between shared/private at runtime.
>>
>> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
>> private memory. The key differences between guest_memfd and normal memfd
>> are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and
>> cannot be mapped, read or written by userspace.
>
> The "cannot be mapped" seems to be not true soon anymore (if not already).
>
> https://lore.kernel.org/all/20240801090117.3841080-1-tabba@google.com/T/

Exactly, allowing guest_memfd to do mmap is the direction. I mentioned
it below with in-place page conversion. Maybe I would move it here to
make it more clear.

>
>
>>
>> In QEMU's implementation, shared memory is allocated with normal methods
>> (e.g. mmap or fallocate) while private memory is allocated from
>> guest_memfd. When a VM performs memory conversions, QEMU frees pages via
>> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
>> allocates new pages from the other side.
>>

[...]

>>
>> One limitation (also discussed in the guest_memfd meeting) is that VFIO
>> expects the DMA mapping for a specific IOVA to be mapped and unmapped
>> with
>> the same granularity. The guest may perform partial conversions, such as
>> converting a small region within a larger region. To prevent such invalid
>> cases, all operations are performed with 4K granularity. The possible
>> solutions we can think of are either to enable VFIO to support partial
>> unmap
>> or to implement an enlightened guest to avoid partial conversion. The
>> former
>> requires complex changes in VFIO, while the latter requires the page
>> conversion to be a guest-enlightened behavior. It is still uncertain
>> which
>> option is a preferred one.
>
> in-place memory conversion is :)
>
>>
>> Testing
>> =======
>> This patch series is tested with the KVM/QEMU branch:
>> KVM: https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-11-20
>> QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-
>> snapshot-2024-12-13
>
> The branch is gone now? tdx-upstream-snapshot-2024-12-18 seems to have
> these though. Thanks,

Thanks for pointing it out. You're right, tdx-upstream-snapshot-2024-12-18
is the latest branch. I added the fixup for patch 1 and forgot to update
the change here.

>
>>
>> To facilitate shared device assignment with the NIC, employ the legacy
>> type1 VFIO with the QEMU command:
>>
>> qemu-system-x86_64 [...]
>>  -device vfio-pci,host=XX:XX.X
>>
>> The parameter of dma_entry_limit needs to be adjusted. For example, a
>> 16GB guest needs to adjust the parameter like
>> vfio_iommu_type1.dma_entry_limit=4194304.
>>
>> If use the iommufd-backed VFIO with the qemu command:
>>
>> qemu-system-x86_64 [...]
>>  -object iommufd,id=iommufd0 \
>>  -device vfio-pci,host=XX:XX.X,iommufd=iommufd0
>>
>> No additional adjustment required.
>>
>> Following the bootup of the TD guest, the guest's IP address becomes
>> visible, and iperf is able to successfully send and receive data.
>
On 8/1/25 17:28, Chenyi Qiang wrote:
> Thanks Alexey for your review!
>
> On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>>> discard") effectively disables device assignment when using guest_memfd.
>>> This poses a significant challenge as guest_memfd is essential for
>>> confidential guests, thereby blocking device assignment to these VMs.
>>> The initial rationale for disabling device assignment was due to stale
>>> IOMMU mappings (see Problem section) and the assumption that TEE I/O
>>> (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-assignment
>>> problem for confidential guests [1]. However, this assumption has proven
>>> to be incorrect. TEE I/O relies on the ability to operate devices against
>>> "shared" or untrusted memory, which is crucial for device initialization
>>> and error recovery scenarios. As a result, the current implementation
>>> does
>>> not adequately support device assignment for confidential guests,
>>> necessitating
>>> a reevaluation of the approach to ensure compatibility and functionality.
>>>
>>> This series enables shared device assignment by notifying VFIO of page
>>> conversions using an existing framework named RamDiscardListener.
>>> Additionally, there is an ongoing patch set [2] that aims to add 1G page
>>> support for guest_memfd. This patch set introduces in-place page
>>> conversion,
>>> where private and shared memory share the same physical pages as the
>>> backend.
>>> This development may impact our solution.
>>>
>>> We presented our solution in the guest_memfd meeting to discuss its
>>> compatibility with the new changes and potential future directions
>>> (see [3]
>>> for more details). The conclusion was that, although our solution may
>>> not be
>>> the most elegant (see the Limitation section), it is sufficient for
>>> now and
>>> can be easily adapted to future changes.
>>>
>>> We are re-posting the patch series with some cleanup and have removed
>>> the RFC
>>> label for the main enabling patches (1-6). The newly-added patch 7 is
>>> still
>>> marked as RFC as it tries to resolve some extension concerns related to
>>> RamDiscardManager for future usage.
>>>
>>> The overview of the patches:
>>> - Patch 1: Export a helper to get intersection of a MemoryRegionSection
>>>   with a given range.
>>> - Patch 2-6: Introduce a new object to manage the guest-memfd with
>>>   RamDiscardManager, and notify the shared/private state change during
>>>   conversion.
>>> - Patch 7: Try to resolve a semantics concern related to
>>> RamDiscardManager
>>>   i.e. RamDiscardManager is used to manage memory plug/unplug state
>>>   instead of shared/private state. It would affect future users of
>>>   RamDiscardManger in confidential VMs. Attach it behind as a RFC
>>> patch[4].
>>>
>>> Changes since last version:
>>> - Add a patch to export some generic helper functions from virtio-mem
>>> code.
>>> - Change the bitmap in guest_memfd_manager from default shared to default
>>>   private. This keeps alignment with virtio-mem that 1-setting in bitmap
>>>   represents the populated state and may help to export more generic
>>> code
>>>   if necessary.
>>> - Add the helpers to initialize/uninitialize the guest_memfd_manager
>>> instance
>>>   to make it more clear.
>>> - Add a patch to distinguish between the shared/private state change and
>>>   the memory plug/unplug state change in RamDiscardManager.
>>> - RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-
>>> chenyi.qiang@intel.com/
>>>
>>> ---
>>>
>>> Background
>>> ==========
>>> Confidential VMs have two classes of memory: shared and private memory.
>>> Shared memory is accessible from the host/VMM while private memory is
>>> not. Confidential VMs can decide which memory is shared/private and
>>> convert memory between shared/private at runtime.
>>>
>>> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
>>> private memory. The key differences between guest_memfd and normal memfd
>>> are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and
>>> cannot be mapped, read or written by userspace.
>>
>> The "cannot be mapped" seems to be not true soon anymore (if not already).
>>
>> https://lore.kernel.org/all/20240801090117.3841080-1-tabba@google.com/T/
>
> Exactly, allowing guest_memfd to do mmap is the direction. I mentioned
> it below with in-place page conversion. Maybe I would move it here to
> make it more clear.
>
>>
>>
>>>
>>> In QEMU's implementation, shared memory is allocated with normal methods
>>> (e.g. mmap or fallocate) while private memory is allocated from
>>> guest_memfd. When a VM performs memory conversions, QEMU frees pages via
>>> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
>>> allocates new pages from the other side.
>>>
>
> [...]
>
>>>
>>> One limitation (also discussed in the guest_memfd meeting) is that VFIO
>>> expects the DMA mapping for a specific IOVA to be mapped and unmapped
>>> with
>>> the same granularity. The guest may perform partial conversions, such as
>>> converting a small region within a larger region. To prevent such invalid
>>> cases, all operations are performed with 4K granularity. The possible
>>> solutions we can think of are either to enable VFIO to support partial
>>> unmap

btw the old VFIO does not split mappings but iommufd seems to be capable
of it - there is iopt_area_split(). What happens if you try unmapping a
smaller chunk that does not exactly match any mapped chunk? thanks,

--
Alexey
On 1/8/2025 7:38 PM, Alexey Kardashevskiy wrote:

[...]

>>>> One limitation (also discussed in the guest_memfd meeting) is that
>>>> VFIO expects the DMA mapping for a specific IOVA to be mapped and
>>>> unmapped with the same granularity. The guest may perform partial
>>>> conversions, such as converting a small region within a larger
>>>> region. To prevent such invalid cases, all operations are performed
>>>> with 4K granularity. The possible solutions we can think of are
>>>> either to enable VFIO to support partial unmap
>
> btw the old VFIO does not split mappings, but iommufd seems to be
> capable of it - there is iopt_area_split(). What happens if you try
> unmapping a smaller chunk that does not exactly match any mapped chunk?

iopt_cut_iova() happens in iommufd vfio_compat.c, which exists to make
iommufd compatible with the old VFIO_TYPE1. IIUC, it happens with
disable_large_page=true. That means large IOPTEs are also disabled in
the IOMMU, so it can do the split easily. See the comment in
iommufd_vfio_set_iommu().

iommufd VFIO compatible mode is a transition path from legacy VFIO to
iommufd. For normal iommufd, the iova/length of an unmap must be a
superset of a previously mapped range; if it does not match, the unmap
returns an error.
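As an illustration of the semantics described above, here is a simplified model (invented names; the real iommufd interval-tree code is considerably more involved): normal iommufd rejects an unmap whose range does not exactly cover whole previously mapped areas, while the VFIO-compat path can first split an area at the cut point, in the spirit of iopt_area_split().

```python
# Simplified model of mapped-area tracking (invented names, not iommufd code).
class IoasModel:
    def __init__(self):
        self.areas = []  # list of (start, end) mapped areas

    def map(self, iova, length):
        self.areas.append((iova, iova + length))

    def split(self, cut):
        # Analogous in spirit to iopt_area_split(): divide one area at `cut`.
        for i, (s, e) in enumerate(self.areas):
            if s < cut < e:
                self.areas[i:i + 1] = [(s, cut), (cut, e)]

    def unmap(self, iova, length):
        end = iova + length
        # Any area that straddles the requested range makes the unmap invalid.
        partial = [(s, e) for (s, e) in self.areas
                   if s < end and e > iova and (s < iova or e > end)]
        if partial:
            return -1  # EINVAL: range is not a superset of mapped areas
        for a in [(s, e) for (s, e) in self.areas if s >= iova and e <= end]:
            self.areas.remove(a)
        return 0

ioas = IoasModel()
ioas.map(0x0, 0x200000)                # one 2M mapping
print(ioas.unmap(0x100000, 0x100000))  # -1: partial unmap rejected
ioas.split(0x100000)                   # compat path: split the area first
print(ioas.unmap(0x100000, 0x100000))  # 0: now the chunk matches exactly
```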
On 9/1/25 18:52, Chenyi Qiang wrote:

[...]

> iopt_cut_iova() happens in iommufd vfio_compat.c, which exists to make
> iommufd compatible with the old VFIO_TYPE1. IIUC, it happens with
> disable_large_page=true. That means large IOPTEs are also disabled in
> the IOMMU, so it can do the split easily. See the comment in
> iommufd_vfio_set_iommu().
>
> iommufd VFIO compatible mode is a transition path from legacy VFIO to
> iommufd. For normal iommufd, the iova/length of an unmap must be a
> superset of a previously mapped range; if it does not match, the unmap
> returns an error.

This is all true, but it also means that "The former requires complex
changes in VFIO" is not entirely true - some code is already there.
Thanks,

--
Alexey
On 1/9/2025 4:18 PM, Alexey Kardashevskiy wrote:

[...]

> This is all true, but it also means that "The former requires complex
> changes in VFIO" is not entirely true - some code is already there.

Hmm, my statement was a little confusing. The bottleneck is that the
IOMMU driver doesn't support splitting large pages. So if we want to
enable large pages and also do partial unmaps, it requires complex
changes.
On 9/1/25 19:49, Chenyi Qiang wrote:

[...]

> Hmm, my statement was a little confusing. The bottleneck is that the
> IOMMU driver doesn't support splitting large pages. So if we want to
> enable large pages and also do partial unmaps, it requires complex
> changes.

We won't need to split large pages (if we stick to 4K for now); we need
to split large mappings (not large pages) to allow partial unmapping,
and iopt_area_split() seems to be doing this. Thanks,

--
Alexey
On 1/10/2025 9:42 AM, Alexey Kardashevskiy wrote:

[...]

> We won't need to split large pages (if we stick to 4K for now); we need
> to split large mappings (not large pages) to allow partial unmapping,
> and iopt_area_split() seems to be doing this. Thanks,

You mean we can disable large pages in iommufd and then VFIO will be
able to do partial unmaps. Yes, I think that is doable, and it would
avoid a lot of ioctl context-switch overhead.
On 10.01.25 08:06, Chenyi Qiang wrote:
> On 1/10/2025 9:42 AM, Alexey Kardashevskiy wrote:
>> On 9/1/25 19:49, Chenyi Qiang wrote:
>>> On 1/9/2025 4:18 PM, Alexey Kardashevskiy wrote:
>>>> On 9/1/25 18:52, Chenyi Qiang wrote:
>>>>> On 1/8/2025 7:38 PM, Alexey Kardashevskiy wrote:
>>>>>> On 8/1/25 17:28, Chenyi Qiang wrote:
>>>>>>> Thanks Alexey for your review!
>>>>>>>
>>>>>>> On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
>>>>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>>>>> [...]
>>>>>>>>>
>>>>>>>>> "guest_memfd" is a new kind of fd whose primary goal is to serve
>>>>>>>>> guest private memory. The key differences between guest_memfd
>>>>>>>>> and normal memfd are that guest_memfd is spawned by a KVM ioctl,
>>>>>>>>> bound to its owner VM and cannot be mapped, read or written by
>>>>>>>>> userspace.
>>>>>>>>
>>>>>>>> The "cannot be mapped" seems to be not true soon anymore (if not
>>>>>>>> already).
>>>>>>>>
>>>>>>>> https://lore.kernel.org/all/20240801090117.3841080-1-tabba@google.com/T/
>>>>>>>
>>>>>>> Exactly, allowing guest_memfd to do mmap is the direction. I
>>>>>>> mentioned it below with in-place page conversion. Maybe I should
>>>>>>> move it here to make it clearer.
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>>>> One limitation (also discussed in the guest_memfd meeting) is
>>>>>>>>> that VFIO expects the DMA mapping for a specific IOVA to be
>>>>>>>>> mapped and unmapped with the same granularity. The guest may
>>>>>>>>> perform partial conversions, such as converting a small region
>>>>>>>>> within a larger region. To prevent such invalid cases, all
>>>>>>>>> operations are performed with 4K granularity. The possible
>>>>>>>>> solutions we can think of are either to enable VFIO to support
>>>>>>>>> partial unmap
>>>>>>
>>>>>> btw the old VFIO does not split mappings but iommufd seems to be
>>>>>> capable of it - there is iopt_area_split(). What happens if you try
>>>>>> unmapping a smaller chunk that does not exactly match any mapped
>>>>>> chunk? thanks,
>>>>>
>>>>> iopt_cut_iova() happens in iommufd vfio_compat.c, which is there to
>>>>> make iommufd compatible with old VFIO_TYPE1. IIUC, it happens with
>>>>> disable_large_page=true. That means large IOPTEs are also disabled
>>>>> in the IOMMU, so it can do the split easily. See the comment in
>>>>> iommufd_vfio_set_iommu().
>>>>>
>>>>> iommufd VFIO compatible mode is a transition from legacy VFIO to
>>>>> iommufd. For normal iommufd, the iova/length must be a superset of a
>>>>> previously mapped range; if it does not match, an error is returned.
>>>>
>>>> This is all true but this also means that "The former requires
>>>> complex changes in VFIO" is not entirely true - some code is already
>>>> there. Thanks,
>>>
>>> Hmm, my statement was a little confusing. The bottleneck is that the
>>> IOMMU driver doesn't support large page split. So if we want to
>>> enable large pages and do partial unmap, it requires complex changes.
>>
>> We won't need to split large pages (if we stick to 4K for now), we
>> need to split large mappings (not large pages) to allow partial
>> unmapping, and iopt_area_split() seems to be doing this. Thanks,
>
> You mean we can disable large pages in iommufd and then VFIO will be
> able to do partial unmap. Yes, I think it is doable, and we can avoid a
> lot of ioctl context-switch overhead.

So, do I understand this correctly: disable_large_pages=true implies
that we never have PMD mappings, such that we can atomically poke a hole
in a mapping without temporarily having to remove a PMD mapping in the
IOMMU table to insert a PTE table?

batch_iommu_map_small() seems to document that behavior.

It's interesting that the comment points out that this is purely "VFIO
compatibility", and that it otherwise violates the iommufd invariant of
pairing map/unmap. So, it is against the real iommufd design ...

Back when working on virtio-mem support (RAMDiscardManager), I thought
there was no way to reliably do atomic partial unmappings.

--
Cheers,

David / dhildenb
On Fri, Jan 10, 2025 at 09:26:02AM +0100, David Hildenbrand wrote:

>>>>>>>>>> One limitation (also discussed in the guest_memfd meeting) is
>>>>>>>>>> that VFIO expects the DMA mapping for a specific IOVA to be
>>>>>>>>>> mapped and unmapped with the same granularity.

Not just the same granularity: whatever you map you have to unmap in
whole. map/unmap must be perfectly paired by userspace.

>>>>>>>>>> such as converting a small region within a larger region. To
>>>>>>>>>> prevent such invalid cases, all operations are performed with
>>>>>>>>>> 4K granularity. The possible solutions we can think of are
>>>>>>>>>> either to enable VFIO to support partial unmap

Yes, you can do that, but it is awful for performance everywhere.

>>>>>> iopt_cut_iova() happens in iommufd vfio_compat.c, which is there
>>>>>> to make iommufd compatible with old VFIO_TYPE1. IIUC, it happens
>>>>>> with disable_large_page=true. That means large IOPTEs are also
>>>>>> disabled in the IOMMU, so it can do the split easily. See the
>>>>>> comment in iommufd_vfio_set_iommu().

Yes. But I am working on a project to make this more general purpose and
not have the 4k limitation. There are now several use cases for this
kind of cut feature.

https://lore.kernel.org/linux-iommu/7-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com/

>>>>> This is all true but this also means that "The former requires
>>>>> complex changes in VFIO" is not entirely true - some code is
>>>>> already there.

Well, to do it without forcing 4k requires complex changes.

>>>> Hmm, my statement was a little confusing. The bottleneck is that
>>>> the IOMMU driver doesn't support large page split. So if we want to
>>>> enable large pages and do partial unmap, it requires complex
>>>> changes.

Yes, this is what I'm working on.

>>> We won't need to split large pages (if we stick to 4K for now), we
>>> need to split large mappings (not large pages) to allow partial
>>> unmapping, and iopt_area_split() seems to be doing this. Thanks,

Correct.

>> You mean we can disable large pages in iommufd and then VFIO will be
>> able to do partial unmap. Yes, I think it is doable, and we can avoid
>> a lot of ioctl context-switch overhead.

Right.

> So, do I understand this correctly: disable_large_pages=true implies
> that we never have PMD mappings, such that we can atomically poke a
> hole in a mapping without temporarily having to remove a PMD mapping
> in the IOMMU table to insert a PTE table?

Yes.

> batch_iommu_map_small() seems to document that behavior.

Yes.

> It's interesting that the comment points out that this is purely
> "VFIO compatibility", and that it otherwise violates the iommufd
> invariant of pairing map/unmap. So, it is against the real iommufd
> design ...

IIRC you can only trigger split using the VFIO type 1 legacy API. We
would need to formalize split as an IOMMUFD native ioctl. Nobody should
use this stuff through the legacy type 1 API!!!!

> Back when working on virtio-mem support (RAMDiscardManager), I thought
> there was no way to reliably do atomic partial unmappings.

Correct.

Jason
On 10.01.25 14:20, Jason Gunthorpe wrote:

Thanks for your reply, I knew CCing you would be very helpful :)

> On Fri, Jan 10, 2025 at 09:26:02AM +0100, David Hildenbrand wrote:
>>>>>>>>>>> One limitation (also discussed in the guest_memfd meeting)
>>>>>>>>>>> is that VFIO expects the DMA mapping for a specific IOVA to
>>>>>>>>>>> be mapped and unmapped with the same granularity.
>
> Not just the same granularity: whatever you map you have to unmap in
> whole. map/unmap must be perfectly paired by userspace.

Right, that's what virtio-mem ends up doing by mapping each memory block
(e.g., 2 MiB) separately, so that each one can be unmapped separately.
It adds "overhead", but at least you don't run into "no, you cannot
split this region because you would be out of memory/slots", or, in the
past, issues with concurrent ongoing DMA.

>>>>>>>>>>> such as converting a small region within a larger region. To
>>>>>>>>>>> prevent such invalid cases, all operations are performed
>>>>>>>>>>> with 4K granularity. The possible solutions we can think of
>>>>>>>>>>> are either to enable VFIO to support partial unmap
>
> Yes, you can do that, but it is awful for performance everywhere.

Absolutely.

In your commit I read:

"Implement the cut operation to be hitless, changes to the page table
during cutting must cause zero disruption to any ongoing DMA. This is
the expectation of the VFIO type 1 uAPI. Hitless requires HW support, it
is incompatible with HW requiring break-before-make."

So I guess that would mean that, depending on HW support, one could
avoid disabling large pages and still allow for atomic cuts / partial
unmaps that don't affect concurrent DMA.

What would be your suggestion here to avoid the "map each 4k page
individually so we can unmap it individually" approach? I didn't
completely grasp that, sorry.

From "IIRC you can only trigger split using the VFIO type 1 legacy API.
We would need to formalize split as an IOMMUFD native ioctl. Nobody
should use this stuff through the legacy type 1 API!!!!"

I assume you mean that we can only avoid the 4k map/unmap if we add
proper support via an IOMMUFD native ioctl, and not try making it fly
somehow with the legacy type 1 API?

--
Cheers,

David / dhildenb
On Fri, Jan 10, 2025 at 02:45:39PM +0100, David Hildenbrand wrote:

> In your commit I read:
>
> "Implement the cut operation to be hitless, changes to the page table
> during cutting must cause zero disruption to any ongoing DMA. This is
> the expectation of the VFIO type 1 uAPI. Hitless requires HW support,
> it is incompatible with HW requiring break-before-make."
>
> So I guess that would mean that, depending on HW support, one could
> avoid disabling large pages and still allow for atomic cuts / partial
> unmaps that don't affect concurrent DMA.

Yes. Most x86 server HW will do this, though ARM support is a bit
newish.

> What would be your suggestion here to avoid the "map each 4k page
> individually so we can unmap it individually" approach? I didn't
> completely grasp that, sorry.

Map in large ranges in the VMM, let's say 1G of shared memory as a
single mapping (called an iommufd area).

When the guest makes a 2M chunk of it private, you do an ioctl to
iommufd to split the area into three, leaving the 2M chunk as a
separate area.

The new iommufd ioctl to split areas will go down into the iommu driver
and atomically cut the 1G PTEs into smaller PTEs as necessary, so that
no PTE spans the edges of the 2M area.

Then userspace can unmap the 2M area and leave the remainder of the 1G
area mapped.

All of this would be fully hitless to ongoing DMA.

The iommufd code is there to do this assuming the areas are mapped at
4k; what is missing is the iommu driver side to atomically resize large
PTEs.

> From "IIRC you can only trigger split using the VFIO type 1 legacy
> API. We would need to formalize split as an IOMMUFD native ioctl.
> Nobody should use this stuff through the legacy type 1 API!!!!"
>
> I assume you mean that we can only avoid the 4k map/unmap if we add
> proper support via an IOMMUFD native ioctl, and not try making it fly
> somehow with the legacy type 1 API?

The thread was talking about the built-in support in iommufd to split
mappings. That built-in support is only accessible through legacy APIs
and should never be used in new qemu code. To use that built-in support
in new code we need to build new APIs. The advantage of the built-in
support is that qemu can map in large regions (which is more efficient)
and the kernel will break it down to 4k for the iommu driver.

Mapping 4k at a time through the uAPI would be outrageously
inefficient.

Jason
On 11/1/25 01:14, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 02:45:39PM +0100, David Hildenbrand wrote:
>> [...]
>
> The thread was talking about the built-in support in iommufd to split
> mappings.

Just to clarify - I am talking about splitting only "iommufd areas",
not large pages. If all IOMMU PTEs are 4k and areas are bigger than 4K,
then hw support is not needed to allow splitting. The comments above
and below seem to confuse large pages with large areas (well, I am
confused, at least).

> That built-in support is only accessible through legacy APIs and
> should never be used in new qemu code. To use that built-in support in
> new code we need to build new APIs.

Why would not the IOMMU_IOAS_MAP/UNMAP uAPI work? Thanks,

> The advantage of the built-in support is that qemu can map in large
> regions (which is more efficient) and the kernel will break it down to
> 4k for the iommu driver.
>
> Mapping 4k at a time through the uAPI would be outrageously
> inefficient.

--
Alexey
On Wed, Jan 15, 2025 at 02:39:55PM +1100, Alexey Kardashevskiy wrote:

>> The thread was talking about the built-in support in iommufd to split
>> mappings.
>
> Just to clarify - I am talking about splitting only "iommufd areas",
> not large pages.

In generality it is the same thing, as you cannot generally guarantee
that an area split doesn't also cross a large page.

> If all IOMMU PTEs are 4k and areas are bigger than 4K, then hw support
> is not needed to allow splitting. The comments above and below seem to
> confuse large pages with large areas (well, I am confused, at least).

Yes, in that special case, yes.

>> That built-in support is only accessible through legacy APIs and
>> should never be used in new qemu code. To use that built-in support
>> in new code we need to build new APIs.
>
> Why would not the IOMMU_IOAS_MAP/UNMAP uAPI work? Thanks,

I don't want to overload those APIs; I prefer to see a new API that is
just about splitting areas. Splitting is a special operation that can
fail depending on driver support.

Jason
On 10.01.25 15:14, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 02:45:39PM +0100, David Hildenbrand wrote:
>> [...]
>
> The thread was talking about the built-in support in iommufd to split
> mappings. That built-in support is only accessible through legacy APIs
> and should never be used in new qemu code. To use that built-in
> support in new code we need to build new APIs. The advantage of the
> built-in support is that qemu can map in large regions (which is more
> efficient) and the kernel will break it down to 4k for the iommu
> driver.
>
> Mapping 4k at a time through the uAPI would be outrageously
> inefficient.

Got it, makes all sense, thanks!

--
Cheers,

David / dhildenb