Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
discard") effectively disables device assignment when using guest_memfd.
This poses a significant challenge as guest_memfd is essential for
confidential guests, thereby blocking device assignment to these VMs.
The initial rationale for disabling device assignment was due to stale
IOMMU mappings (see Problem section) and the assumption that TEE I/O
(SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-assignment
problem for confidential guests [1]. However, this assumption has proven
to be incorrect. TEE I/O relies on the ability to operate devices against
"shared" or untrusted memory, which is crucial for device initialization
and error recovery scenarios. As a result, the current implementation does
not adequately support device assignment for confidential guests,
necessitating a reevaluation of the approach.

This series enables shared device assignment by notifying VFIO of page
conversions using an existing framework named RamDiscardListener.
Additionally, there is an ongoing patch set [2] that aims to add 1G page
support for guest_memfd. That series introduces in-place page conversion,
where private and shared memory share the same physical pages as the
backend. This development may impact our solution.

We presented our solution in the guest_memfd meeting to discuss its
compatibility with the new changes and potential future directions (see [3]
for more details). The conclusion was that, although our solution may not
be the most elegant (see the Limitation section), it is sufficient for now
and can be easily adapted to future changes.

We are re-posting the patch series with some cleanup and have removed the
RFC label for the main enabling patches (1-6). The newly-added patch 7 is
still marked as RFC as it tries to resolve some extension concerns related
to RamDiscardManager for future usage.
The overview of the patches:
- Patch 1: Export a helper to get the intersection of a
  MemoryRegionSection with a given range.
- Patches 2-6: Introduce a new object to manage the guest-memfd with
  RamDiscardManager, and notify the shared/private state change during
  conversion.
- Patch 7: Try to resolve a semantics concern related to
  RamDiscardManager, i.e. RamDiscardManager is used to manage memory
  plug/unplug state instead of shared/private state. It would affect
  future users of RamDiscardManager in confidential VMs. It is attached
  at the end as an RFC patch [4].

Changes since last version:
- Add a patch to export some generic helper functions from virtio-mem
  code.
- Change the bitmap in guest_memfd_manager from default shared to default
  private. This keeps alignment with virtio-mem, where a 1 in the bitmap
  represents the populated state, and may help to export more generic
  code if necessary.
- Add helpers to initialize/uninitialize the guest_memfd_manager instance
  to make it clearer.
- Add a patch to distinguish between the shared/private state change and
  the memory plug/unplug state change in RamDiscardManager.
- RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-chenyi.qiang@intel.com/

---

Background
==========
Confidential VMs have two classes of memory: shared and private memory.
Shared memory is accessible from the host/VMM while private memory is
not. Confidential VMs can decide which memory is shared/private and
convert memory between shared/private at runtime.

"guest_memfd" is a new kind of fd whose primary goal is to serve guest
private memory. The key differences between guest_memfd and normal memfd
are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM,
and cannot be mapped, read, or written by userspace. In QEMU's
implementation, shared memory is allocated with normal methods (e.g. mmap
or fallocate) while private memory is allocated from guest_memfd.
When a VM performs memory conversions, QEMU frees pages via madvise() or
via PUNCH_HOLE on the memfd or guest_memfd on one side and allocates new
pages from the other side.

Problem
=======
Device assignment in QEMU is implemented via the VFIO subsystem. In a
normal VM, VM memory is pinned up front by VFIO. In a confidential VM,
the VM can convert memory, and when that happens nothing currently tells
VFIO that its mappings are stale. This means that page conversion leaks
memory and leaves stale IOMMU mappings. For example, a sequence like the
following can result in stale IOMMU mappings:

1. allocate shared page
2. convert page shared->private
3. discard shared page
4. convert page private->shared
5. allocate shared page
6. issue DMA operations against that shared page

After step 3, VFIO is still pinning the page. However, DMA operations in
step 6 will hit the old mapping that was created in step 1, which causes
the device to access invalid data.

Solution
========
The key to enabling shared device assignment is to update the IOMMU
mappings on page conversion.

Given the constraints and assumptions, here is a solution that satisfies
the use cases. RamDiscardManager, an existing interface currently
utilized by virtio-mem, offers a means to modify IOMMU mappings in
accordance with VM page assignment. Page conversion is similar to
hot-removing a page in one mode and adding it back in the other.

This series implements a RamDiscardManager for confidential VMs and
utilizes its infrastructure to notify VFIO of page conversions.

Another possible attempt [5] was to not discard shared pages in step 3
above. This was an incomplete band-aid because guests would consume twice
the memory, since shared pages wouldn't be freed even after they were
converted to private.
w/ in-place page conversion
===========================
To support 1G pages for guest_memfd, the current direction is to allow
mmap() of guest_memfd to userspace so that both private and shared memory
can use the same physical pages as the backend. This in-place page
conversion design eliminates the need to discard pages during
shared/private conversions. However, device assignment will still be
blocked, because in-place page conversion will reject the conversion when
the page is pinned by VFIO.

To address this, the key difference lies in the sequence of VFIO
map/unmap operations relative to the page conversion. This series can be
adjusted to achieve unmap-before-conversion-to-private and
map-after-conversion-to-shared, ensuring compatibility with guest_memfd.

Additionally, with in-place page conversion, the previously mentioned
workaround of not discarding shared pages is not feasible, because shared
and private memory share the same backend and no discard operation is
performed. Retaining the old mappings in the IOMMU would result in unsafe
DMA access to protected memory.

Limitation
==========
One limitation (also discussed in the guest_memfd meeting) is that VFIO
expects the DMA mapping for a specific IOVA to be mapped and unmapped
with the same granularity. The guest may perform partial conversions,
such as converting a small region within a larger region. To prevent such
invalid cases, all operations are performed with 4K granularity. The
possible solutions we can think of are either to enable VFIO to support
partial unmap or to implement an enlightened guest to avoid partial
conversion. The former requires complex changes in VFIO, while the latter
requires the page conversion to be a guest-enlightened behavior. It is
still uncertain which option is preferred.
Testing
=======
This patch series is tested with the following KVM/QEMU branches:
KVM: https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-11-20
QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2024-12-13

To facilitate shared device assignment with the NIC, employ legacy type1
VFIO with the QEMU command:

qemu-system-x86_64 [...]
 -device vfio-pci,host=XX:XX.X

The dma_entry_limit module parameter needs to be adjusted. For example, a
16GB guest needs something like
vfio_iommu_type1.dma_entry_limit=4194304.

If using iommufd-backed VFIO with the QEMU command:

qemu-system-x86_64 [...]
 -object iommufd,id=iommufd0 \
 -device vfio-pci,host=XX:XX.X,iommufd=iommufd0

no additional adjustment is required.

Following the bootup of the TD guest, the guest's IP address becomes
visible, and iperf is able to successfully send and receive data.

Related link
============
[1] https://lore.kernel.org/all/d6acfbef-96a1-42bc-8866-c12a4de8c57c@redhat.com/
[2] https://lore.kernel.org/lkml/cover.1726009989.git.ackerleytng@google.com/
[3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?tab=t.0#heading=h.jr4csfgw1uql
[4] https://lore.kernel.org/qemu-devel/d299bbad-81bc-462e-91b5-a6d9c27ffe3a@redhat.com/
[5] https://lore.kernel.org/all/20240320083945.991426-20-michael.roth@amd.com/

Chenyi Qiang (7):
  memory: Export a helper to get intersection of a MemoryRegionSection
    with a given range
  guest_memfd: Introduce an object to manage the guest-memfd with
    RamDiscardManager
  guest_memfd: Introduce a callback to notify the shared/private state
    change
  KVM: Notify the state change event during shared/private conversion
  memory: Register the RamDiscardManager instance upon guest_memfd
    creation
  RAMBlock: make guest_memfd require coordinated discard
  memory: Add a new argument to indicate the request attribute in
    RamDiscardManager helpers

 accel/kvm/kvm-all.c                  |   4 +
 hw/vfio/common.c                     |  22 +-
 hw/virtio/virtio-mem.c               |  55 ++--
 include/exec/memory.h                |  36 ++-
 include/sysemu/guest-memfd-manager.h |  91 ++++++
 migration/ram.c                      |  14 +-
 system/guest-memfd-manager.c         | 456 +++++++++++++++++++++++++++
 system/memory.c                      |  30 +-
 system/memory_mapping.c              |   4 +-
 system/meson.build                   |   1 +
 system/physmem.c                     |   9 +-
 11 files changed, 659 insertions(+), 63 deletions(-)
 create mode 100644 include/sysemu/guest-memfd-manager.h
 create mode 100644 system/guest-memfd-manager.c
--
2.43.5
On 13/12/24 18:08, Chenyi Qiang wrote:
> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
> discard") effectively disables device assignment when using guest_memfd.
> This poses a significant challenge as guest_memfd is essential for
> confidential guests, thereby blocking device assignment to these VMs.
> The initial rationale for disabling device assignment was due to stale
> IOMMU mappings (see Problem section) and the assumption that TEE I/O
> (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-assignment
> problem for confidential guests [1]. However, this assumption has proven
> to be incorrect. TEE I/O relies on the ability to operate devices against
> "shared" or untrusted memory, which is crucial for device initialization
> and error recovery scenarios. As a result, the current implementation does
> not adequately support device assignment for confidential guests, necessitating
> a reevaluation of the approach to ensure compatibility and functionality.
>
> This series enables shared device assignment by notifying VFIO of page
> conversions using an existing framework named RamDiscardListener.
> Additionally, there is an ongoing patch set [2] that aims to add 1G page
> support for guest_memfd. This patch set introduces in-place page conversion,
> where private and shared memory share the same physical pages as the backend.
> This development may impact our solution.
>
> We presented our solution in the guest_memfd meeting to discuss its
> compatibility with the new changes and potential future directions (see [3]
> for more details). The conclusion was that, although our solution may not be
> the most elegant (see the Limitation section), it is sufficient for now and
> can be easily adapted to future changes.
>
> We are re-posting the patch series with some cleanup and have removed the RFC
> label for the main enabling patches (1-6). The newly-added patch 7 is still
> marked as RFC as it tries to resolve some extension concerns related to
> RamDiscardManager for future usage.
>
> The overview of the patches:
> - Patch 1: Export a helper to get intersection of a MemoryRegionSection
>   with a given range.
> - Patch 2-6: Introduce a new object to manage the guest-memfd with
>   RamDiscardManager, and notify the shared/private state change during
>   conversion.
> - Patch 7: Try to resolve a semantics concern related to RamDiscardManager
>   i.e. RamDiscardManager is used to manage memory plug/unplug state
>   instead of shared/private state. It would affect future users of
>   RamDiscardManger in confidential VMs. Attach it behind as a RFC patch[4].
>
> Changes since last version:
> - Add a patch to export some generic helper functions from virtio-mem code.
> - Change the bitmap in guest_memfd_manager from default shared to default
>   private. This keeps alignment with virtio-mem that 1-setting in bitmap
>   represents the populated state and may help to export more generic code
>   if necessary.
> - Add the helpers to initialize/uninitialize the guest_memfd_manager instance
>   to make it more clear.
> - Add a patch to distinguish between the shared/private state change and
>   the memory plug/unplug state change in RamDiscardManager.
> - RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-chenyi.qiang@intel.com/
>
> ---
>
> Background
> ==========
> Confidential VMs have two classes of memory: shared and private memory.
> Shared memory is accessible from the host/VMM while private memory is
> not. Confidential VMs can decide which memory is shared/private and
> convert memory between shared/private at runtime.
>
> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
> private memory. The key differences between guest_memfd and normal memfd
> are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and
> cannot be mapped, read or written by userspace.

The "cannot be mapped" seems to be not true soon anymore (if not already).

https://lore.kernel.org/all/20240801090117.3841080-1-tabba@google.com/T/

>
> In QEMU's implementation, shared memory is allocated with normal methods
> (e.g. mmap or fallocate) while private memory is allocated from
> guest_memfd. When a VM performs memory conversions, QEMU frees pages via
> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
> allocates new pages from the other side.
>
> Problem
> =======
> Device assignment in QEMU is implemented via VFIO system. In the normal
> VM, VM memory is pinned at the beginning of time by VFIO. In the
> confidential VM, the VM can convert memory and when that happens
> nothing currently tells VFIO that its mappings are stale. This means
> that page conversion leaks memory and leaves stale IOMMU mappings. For
> example, sequence like the following can result in stale IOMMU mappings:
>
> 1. allocate shared page
> 2. convert page shared->private
> 3. discard shared page
> 4. convert page private->shared
> 5. allocate shared page
> 6. issue DMA operations against that shared page
>
> After step 3, VFIO is still pinning the page. However, DMA operations in
> step 6 will hit the old mapping that was allocated in step 1, which
> causes the device to access the invalid data.
>
> Solution
> ========
> The key to enable shared device assignment is to update the IOMMU mappings
> on page conversion.
>
> Given the constraints and assumptions here is a solution that satisfied
> the use cases. RamDiscardManager, an existing interface currently
> utilized by virtio-mem, offers a means to modify IOMMU mappings in
> accordance with VM page assignment. Page conversion is similar to
> hot-removing a page in one mode and adding it back in the other.
>
> This series implements a RamDiscardManager for confidential VMs and
> utilizes its infrastructure to notify VFIO of page conversions.
>
> Another possible attempt [5] was to not discard shared pages in step 3
> above. This was an incomplete band-aid because guests would consume
> twice the memory since shared pages wouldn't be freed even after they
> were converted to private.
>
> w/ in-place page conversion
> ===========================
> To support 1G page support for guest_memfd, the current direction is to
> allow mmap() of guest_memfd to userspace so that both private and shared
> memory can use the same physical pages as the backend. This in-place page
> conversion design eliminates the need to discard pages during shared/private
> conversions. However, device assignment will still be blocked because the
> in-place page conversion will reject the conversion when the page is pinned
> by VFIO.
>
> To address this, the key difference lies in the sequence of VFIO map/unmap
> operations and the page conversion. This series can be adjusted to achieve
> unmap-before-conversion-to-private and map-after-conversion-to-shared,
> ensuring compatibility with guest_memfd.
>
> Additionally, with in-place page conversion, the previously mentioned
> solution to disable the discard of shared pages is not feasible because
> shared and private memory share the same backend, and no discard operation
> is performed. Retaining the old mappings in the IOMMU would result in
> unsafe DMA access to protected memory.
>
> Limitation
> ==========
>
> One limitation (also discussed in the guest_memfd meeting) is that VFIO
> expects the DMA mapping for a specific IOVA to be mapped and unmapped with
> the same granularity. The guest may perform partial conversions, such as
> converting a small region within a larger region. To prevent such invalid
> cases, all operations are performed with 4K granularity. The possible
> solutions we can think of are either to enable VFIO to support partial unmap
> or to implement an enlightened guest to avoid partial conversion. The former
> requires complex changes in VFIO, while the latter requires the page
> conversion to be a guest-enlightened behavior. It is still uncertain which
> option is a preferred one.

in-place memory conversion is :)

>
> Testing
> =======
> This patch series is tested with the KVM/QEMU branch:
> KVM: https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-11-20
> QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2024-12-13

The branch is gone now? tdx-upstream-snapshot-2024-12-18 seems to have
these though. Thanks,

>
> To facilitate shared device assignment with the NIC, employ the legacy
> type1 VFIO with the QEMU command:
>
> qemu-system-x86_64 [...]
>  -device vfio-pci,host=XX:XX.X
>
> The parameter of dma_entry_limit needs to be adjusted. For example, a
> 16GB guest needs to adjust the parameter like
> vfio_iommu_type1.dma_entry_limit=4194304.
>
> If use the iommufd-backed VFIO with the qemu command:
>
> qemu-system-x86_64 [...]
>  -object iommufd,id=iommufd0 \
>  -device vfio-pci,host=XX:XX.X,iommufd=iommufd0
>
> No additional adjustment required.
>
> Following the bootup of the TD guest, the guest's IP address becomes
> visible, and iperf is able to successfully send and receive data.
>
> Related link
> ============
> [1] https://lore.kernel.org/all/d6acfbef-96a1-42bc-8866-c12a4de8c57c@redhat.com/
> [2] https://lore.kernel.org/lkml/cover.1726009989.git.ackerleytng@google.com/
> [3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?tab=t.0#heading=h.jr4csfgw1uql
> [4] https://lore.kernel.org/qemu-devel/d299bbad-81bc-462e-91b5-a6d9c27ffe3a@redhat.com/
> [5] https://lore.kernel.org/all/20240320083945.991426-20-michael.roth@amd.com/
>
> Chenyi Qiang (7):
>   memory: Export a helper to get intersection of a MemoryRegionSection
>     with a given range
>   guest_memfd: Introduce an object to manage the guest-memfd with
>     RamDiscardManager
>   guest_memfd: Introduce a callback to notify the shared/private state
>     change
>   KVM: Notify the state change event during shared/private conversion
>   memory: Register the RamDiscardManager instance upon guest_memfd
>     creation
>   RAMBlock: make guest_memfd require coordinate discard
>   memory: Add a new argument to indicate the request attribute in
>     RamDismcardManager helpers
>
>  accel/kvm/kvm-all.c                  |   4 +
>  hw/vfio/common.c                     |  22 +-
>  hw/virtio/virtio-mem.c               |  55 ++--
>  include/exec/memory.h                |  36 ++-
>  include/sysemu/guest-memfd-manager.h |  91 ++++++
>  migration/ram.c                      |  14 +-
>  system/guest-memfd-manager.c         | 456 +++++++++++++++++++++++++++
>  system/memory.c                      |  30 +-
>  system/memory_mapping.c              |   4 +-
>  system/meson.build                   |   1 +
>  system/physmem.c                     |   9 +-
>  11 files changed, 659 insertions(+), 63 deletions(-)
>  create mode 100644 include/sysemu/guest-memfd-manager.h
>  create mode 100644 system/guest-memfd-manager.c
>

--
Alexey
Thanks Alexey for your review!

On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
> On 13/12/24 18:08, Chenyi Qiang wrote:
>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>> discard") effectively disables device assignment when using guest_memfd.
>> This poses a significant challenge as guest_memfd is essential for
>> confidential guests, thereby blocking device assignment to these VMs.
>> The initial rationale for disabling device assignment was due to stale
>> IOMMU mappings (see Problem section) and the assumption that TEE I/O
>> (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-assignment
>> problem for confidential guests [1]. However, this assumption has proven
>> to be incorrect. TEE I/O relies on the ability to operate devices against
>> "shared" or untrusted memory, which is crucial for device initialization
>> and error recovery scenarios. As a result, the current implementation
>> does
>> not adequately support device assignment for confidential guests,
>> necessitating
>> a reevaluation of the approach to ensure compatibility and functionality.
>>
>> This series enables shared device assignment by notifying VFIO of page
>> conversions using an existing framework named RamDiscardListener.
>> Additionally, there is an ongoing patch set [2] that aims to add 1G page
>> support for guest_memfd. This patch set introduces in-place page
>> conversion,
>> where private and shared memory share the same physical pages as the
>> backend.
>> This development may impact our solution.
>>
>> We presented our solution in the guest_memfd meeting to discuss its
>> compatibility with the new changes and potential future directions
>> (see [3]
>> for more details). The conclusion was that, although our solution may
>> not be
>> the most elegant (see the Limitation section), it is sufficient for
>> now and
>> can be easily adapted to future changes.
>>
>> We are re-posting the patch series with some cleanup and have removed
>> the RFC
>> label for the main enabling patches (1-6). The newly-added patch 7 is
>> still
>> marked as RFC as it tries to resolve some extension concerns related to
>> RamDiscardManager for future usage.
>>
>> The overview of the patches:
>> - Patch 1: Export a helper to get intersection of a MemoryRegionSection
>>   with a given range.
>> - Patch 2-6: Introduce a new object to manage the guest-memfd with
>>   RamDiscardManager, and notify the shared/private state change during
>>   conversion.
>> - Patch 7: Try to resolve a semantics concern related to
>> RamDiscardManager
>>   i.e. RamDiscardManager is used to manage memory plug/unplug state
>>   instead of shared/private state. It would affect future users of
>>   RamDiscardManger in confidential VMs. Attach it behind as a RFC
>> patch[4].
>>
>> Changes since last version:
>> - Add a patch to export some generic helper functions from virtio-mem
>> code.
>> - Change the bitmap in guest_memfd_manager from default shared to default
>>   private. This keeps alignment with virtio-mem that 1-setting in bitmap
>>   represents the populated state and may help to export more generic
>> code
>>   if necessary.
>> - Add the helpers to initialize/uninitialize the guest_memfd_manager
>> instance
>>   to make it more clear.
>> - Add a patch to distinguish between the shared/private state change and
>>   the memory plug/unplug state change in RamDiscardManager.
>> - RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-
>> chenyi.qiang@intel.com/
>>
>> ---
>>
>> Background
>> ==========
>> Confidential VMs have two classes of memory: shared and private memory.
>> Shared memory is accessible from the host/VMM while private memory is
>> not. Confidential VMs can decide which memory is shared/private and
>> convert memory between shared/private at runtime.
>>
>> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
>> private memory. The key differences between guest_memfd and normal memfd
>> are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and
>> cannot be mapped, read or written by userspace.
>
> The "cannot be mapped" seems to be not true soon anymore (if not already).
>
> https://lore.kernel.org/all/20240801090117.3841080-1-tabba@google.com/T/

Exactly, allowing guest_memfd to do mmap is the direction. I mentioned
it below with in-place page conversion. Maybe I would move it here to
make it more clear.

>
>
>>
>> In QEMU's implementation, shared memory is allocated with normal methods
>> (e.g. mmap or fallocate) while private memory is allocated from
>> guest_memfd. When a VM performs memory conversions, QEMU frees pages via
>> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
>> allocates new pages from the other side.
>>

[...]

>>
>> One limitation (also discussed in the guest_memfd meeting) is that VFIO
>> expects the DMA mapping for a specific IOVA to be mapped and unmapped
>> with
>> the same granularity. The guest may perform partial conversions, such as
>> converting a small region within a larger region. To prevent such invalid
>> cases, all operations are performed with 4K granularity. The possible
>> solutions we can think of are either to enable VFIO to support partial
>> unmap
>> or to implement an enlightened guest to avoid partial conversion. The
>> former
>> requires complex changes in VFIO, while the latter requires the page
>> conversion to be a guest-enlightened behavior. It is still uncertain
>> which
>> option is a preferred one.
>
> in-place memory conversion is :)
>
>>
>> Testing
>> =======
>> This patch series is tested with the KVM/QEMU branch:
>> KVM: https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-11-20
>> QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-
>> snapshot-2024-12-13
>
> The branch is gone now? tdx-upstream-snapshot-2024-12-18 seems to have
> these though. Thanks,

Thanks for pointing it out. You're right, tdx-upstream-snapshot-2024-12-18
is the latest branch. I added the fixup for patch 1 and forgot to update
the change here.

>
>>
>> To facilitate shared device assignment with the NIC, employ the legacy
>> type1 VFIO with the QEMU command:
>>
>> qemu-system-x86_64 [...]
>>  -device vfio-pci,host=XX:XX.X
>>
>> The parameter of dma_entry_limit needs to be adjusted. For example, a
>> 16GB guest needs to adjust the parameter like
>> vfio_iommu_type1.dma_entry_limit=4194304.
>>
>> If use the iommufd-backed VFIO with the qemu command:
>>
>> qemu-system-x86_64 [...]
>>  -object iommufd,id=iommufd0 \
>>  -device vfio-pci,host=XX:XX.X,iommufd=iommufd0
>>
>> No additional adjustment required.
>>
>> Following the bootup of the TD guest, the guest's IP address becomes
>> visible, and iperf is able to successfully send and receive data.
>
On 8/1/25 17:28, Chenyi Qiang wrote:
> Thanks Alexey for your review!
>
> On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>>> discard") effectively disables device assignment when using guest_memfd.
>>> This poses a significant challenge as guest_memfd is essential for
>>> confidential guests, thereby blocking device assignment to these VMs.
>>> The initial rationale for disabling device assignment was due to stale
>>> IOMMU mappings (see Problem section) and the assumption that TEE I/O
>>> (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-assignment
>>> problem for confidential guests [1]. However, this assumption has proven
>>> to be incorrect. TEE I/O relies on the ability to operate devices against
>>> "shared" or untrusted memory, which is crucial for device initialization
>>> and error recovery scenarios. As a result, the current implementation
>>> does
>>> not adequately support device assignment for confidential guests,
>>> necessitating
>>> a reevaluation of the approach to ensure compatibility and functionality.
>>>
>>> This series enables shared device assignment by notifying VFIO of page
>>> conversions using an existing framework named RamDiscardListener.
>>> Additionally, there is an ongoing patch set [2] that aims to add 1G page
>>> support for guest_memfd. This patch set introduces in-place page
>>> conversion,
>>> where private and shared memory share the same physical pages as the
>>> backend.
>>> This development may impact our solution.
>>>
>>> We presented our solution in the guest_memfd meeting to discuss its
>>> compatibility with the new changes and potential future directions
>>> (see [3]
>>> for more details). The conclusion was that, although our solution may
>>> not be
>>> the most elegant (see the Limitation section), it is sufficient for
>>> now and
>>> can be easily adapted to future changes.
>>>
>>> We are re-posting the patch series with some cleanup and have removed
>>> the RFC
>>> label for the main enabling patches (1-6). The newly-added patch 7 is
>>> still
>>> marked as RFC as it tries to resolve some extension concerns related to
>>> RamDiscardManager for future usage.
>>>
>>> The overview of the patches:
>>> - Patch 1: Export a helper to get intersection of a MemoryRegionSection
>>>   with a given range.
>>> - Patch 2-6: Introduce a new object to manage the guest-memfd with
>>>   RamDiscardManager, and notify the shared/private state change during
>>>   conversion.
>>> - Patch 7: Try to resolve a semantics concern related to
>>> RamDiscardManager
>>>   i.e. RamDiscardManager is used to manage memory plug/unplug state
>>>   instead of shared/private state. It would affect future users of
>>>   RamDiscardManger in confidential VMs. Attach it behind as a RFC
>>> patch[4].
>>>
>>> Changes since last version:
>>> - Add a patch to export some generic helper functions from virtio-mem
>>> code.
>>> - Change the bitmap in guest_memfd_manager from default shared to default
>>>   private. This keeps alignment with virtio-mem that 1-setting in bitmap
>>>   represents the populated state and may help to export more generic
>>> code
>>>   if necessary.
>>> - Add the helpers to initialize/uninitialize the guest_memfd_manager
>>> instance
>>>   to make it more clear.
>>> - Add a patch to distinguish between the shared/private state change and
>>>   the memory plug/unplug state change in RamDiscardManager.
>>> - RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-
>>> chenyi.qiang@intel.com/
>>>
>>> ---
>>>
>>> Background
>>> ==========
>>> Confidential VMs have two classes of memory: shared and private memory.
>>> Shared memory is accessible from the host/VMM while private memory is
>>> not. Confidential VMs can decide which memory is shared/private and
>>> convert memory between shared/private at runtime.
>>>
>>> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
>>> private memory. The key differences between guest_memfd and normal memfd
>>> are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and
>>> cannot be mapped, read or written by userspace.
>>
>> The "cannot be mapped" seems to be not true soon anymore (if not already).
>>
>> https://lore.kernel.org/all/20240801090117.3841080-1-tabba@google.com/T/
>
> Exactly, allowing guest_memfd to do mmap is the direction. I mentioned
> it below with in-place page conversion. Maybe I would move it here to
> make it more clear.
>
>>
>>
>>>
>>> In QEMU's implementation, shared memory is allocated with normal methods
>>> (e.g. mmap or fallocate) while private memory is allocated from
>>> guest_memfd. When a VM performs memory conversions, QEMU frees pages via
>>> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
>>> allocates new pages from the other side.
>>>
>
> [...]
>
>>>
>>> One limitation (also discussed in the guest_memfd meeting) is that VFIO
>>> expects the DMA mapping for a specific IOVA to be mapped and unmapped
>>> with
>>> the same granularity. The guest may perform partial conversions, such as
>>> converting a small region within a larger region. To prevent such invalid
>>> cases, all operations are performed with 4K granularity. The possible
>>> solutions we can think of are either to enable VFIO to support partial
>>> unmap

btw the old VFIO does not split mappings but iommufd seems to be capable
of it - there is iopt_area_split(). What happens if you try unmapping a
smaller chunk that does not exactly match any mapped chunk? thanks,

--
Alexey
On 1/8/2025 7:38 PM, Alexey Kardashevskiy wrote:

[...]

>>>> One limitation (also discussed in the guest_memfd meeting) is that
>>>> VFIO expects the DMA mapping for a specific IOVA to be mapped and
>>>> unmapped with the same granularity. The guest may perform partial
>>>> conversions, such as converting a small region within a larger
>>>> region. To prevent such invalid cases, all operations are performed
>>>> with 4K granularity. The possible solutions we can think of are
>>>> either to enable VFIO to support partial unmap
>
> btw the old VFIO does not split mappings, but iommufd seems to be
> capable of it - there is iopt_area_split(). What happens if you try
> unmapping a smaller chunk that does not exactly match any mapped chunk?

iopt_cut_iova() happens in iommufd vfio_compat.c, which exists to make
iommufd compatible with the old VFIO_TYPE1. IIUC, it happens with
disable_large_page=true. That means large IOPTEs are also disabled in
the IOMMU, so it can do the split easily. See the comment in
iommufd_vfio_set_iommu().

iommufd VFIO compatible mode is a transition path from legacy VFIO to
iommufd. For normal iommufd, the iova/length of an unmap must be a
superset of a previously mapped range; if it does not match, the unmap
returns an error.
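As an illustration of the semantics described above, here is a simplified model (invented names; the real iommufd interval-tree code is considerably more involved): normal iommufd rejects an unmap whose range does not exactly cover whole previously mapped areas, while the VFIO-compat path can first split an area at the cut point, in the spirit of iopt_area_split().

```python
# Simplified model of mapped-area tracking (invented names, not iommufd code).
class IoasModel:
    def __init__(self):
        self.areas = []  # list of (start, end) mapped areas

    def map(self, iova, length):
        self.areas.append((iova, iova + length))

    def split(self, cut):
        # Analogous in spirit to iopt_area_split(): divide one area at `cut`.
        for i, (s, e) in enumerate(self.areas):
            if s < cut < e:
                self.areas[i:i + 1] = [(s, cut), (cut, e)]

    def unmap(self, iova, length):
        end = iova + length
        # Any area that straddles the requested range makes the unmap invalid.
        partial = [(s, e) for (s, e) in self.areas
                   if s < end and e > iova and (s < iova or e > end)]
        if partial:
            return -1  # EINVAL: range is not a superset of mapped areas
        for a in [(s, e) for (s, e) in self.areas if s >= iova and e <= end]:
            self.areas.remove(a)
        return 0

ioas = IoasModel()
ioas.map(0x0, 0x200000)                # one 2M mapping
print(ioas.unmap(0x100000, 0x100000))  # -1: partial unmap rejected
ioas.split(0x100000)                   # compat path: split the area first
print(ioas.unmap(0x100000, 0x100000))  # 0: now the chunk matches exactly
```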
On 9/1/25 18:52, Chenyi Qiang wrote:

[...]

> iopt_cut_iova() happens in iommufd vfio_compat.c, which exists to make
> iommufd compatible with the old VFIO_TYPE1. IIUC, it happens with
> disable_large_page=true. That means large IOPTEs are also disabled in
> the IOMMU, so it can do the split easily. See the comment in
> iommufd_vfio_set_iommu().
>
> iommufd VFIO compatible mode is a transition path from legacy VFIO to
> iommufd. For normal iommufd, the iova/length of an unmap must be a
> superset of a previously mapped range; if it does not match, the unmap
> returns an error.

This is all true, but it also means that "The former requires complex
changes in VFIO" is not entirely true - some code is already there.
Thanks,

--
Alexey
On 1/9/2025 4:18 PM, Alexey Kardashevskiy wrote:

[...]

> This is all true, but it also means that "The former requires complex
> changes in VFIO" is not entirely true - some code is already there.

Hmm, my statement was a little confusing. The bottleneck is that the
IOMMU driver doesn't support splitting large pages. So if we want to
enable large pages and also do partial unmaps, it requires complex
changes.
On 9/1/25 19:49, Chenyi Qiang wrote:

[...]

> Hmm, my statement was a little confusing. The bottleneck is that the
> IOMMU driver doesn't support splitting large pages. So if we want to
> enable large pages and also do partial unmaps, it requires complex
> changes.

We won't need to split large pages (if we stick to 4K for now); we need
to split large mappings (not large pages) to allow partial unmapping,
and iopt_area_split() seems to be doing this. Thanks,

--
Alexey
On 1/10/2025 9:42 AM, Alexey Kardashevskiy wrote:

[...]

> We won't need to split large pages (if we stick to 4K for now); we need
> to split large mappings (not large pages) to allow partial unmapping,
> and iopt_area_split() seems to be doing this. Thanks,

You mean we can disable large pages in iommufd and then VFIO will be
able to do partial unmaps. Yes, I think that is doable, and it would
avoid a lot of ioctl context-switch overhead.
On 10.01.25 08:06, Chenyi Qiang wrote:
> On 1/10/2025 9:42 AM, Alexey Kardashevskiy wrote:
>> On 9/1/25 19:49, Chenyi Qiang wrote:
>>> On 1/9/2025 4:18 PM, Alexey Kardashevskiy wrote:
>>>> On 9/1/25 18:52, Chenyi Qiang wrote:
>>>>> On 1/8/2025 7:38 PM, Alexey Kardashevskiy wrote:
>>>>>> On 8/1/25 17:28, Chenyi Qiang wrote:
>>>>>>> Thanks Alexey for your review!
>>>>>>>
>>>>>>> On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
>>>>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>>>>> [...]
>>>>>>>>>
>>>>>>>>> "guest_memfd" is a new kind of fd whose primary goal is to serve
>>>>>>>>> guest private memory. The key differences between guest_memfd
>>>>>>>>> and normal memfd are that guest_memfd is spawned by a KVM ioctl,
>>>>>>>>> bound to its owner VM and cannot be mapped, read or written by
>>>>>>>>> userspace.
>>>>>>>>
>>>>>>>> The "cannot be mapped" seems to be not true soon anymore (if not
>>>>>>>> already).
>>>>>>>>
>>>>>>>> https://lore.kernel.org/all/20240801090117.3841080-1-tabba@google.com/T/
>>>>>>>
>>>>>>> Exactly, allowing guest_memfd to do mmap is the direction. I
>>>>>>> mentioned it below with in-place page conversion. Maybe I should
>>>>>>> move it here to make it clearer.
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>>>> One limitation (also discussed in the guest_memfd meeting) is
>>>>>>>>> that VFIO expects the DMA mapping for a specific IOVA to be
>>>>>>>>> mapped and unmapped with the same granularity. The guest may
>>>>>>>>> perform partial conversions, such as converting a small region
>>>>>>>>> within a larger region. To prevent such invalid cases, all
>>>>>>>>> operations are performed with 4K granularity. The possible
>>>>>>>>> solutions we can think of are either to enable VFIO to support
>>>>>>>>> partial unmap
>>>>>>
>>>>>> btw the old VFIO does not split mappings but iommufd seems to be
>>>>>> capable of it - there is iopt_area_split(). What happens if you try
>>>>>> unmapping a smaller chunk that does not exactly match any mapped
>>>>>> chunk? thanks,
>>>>>
>>>>> iopt_cut_iova() happens in iommufd vfio_compat.c, which is there to
>>>>> make iommufd compatible with old VFIO_TYPE1. IIUC, it happens with
>>>>> disable_large_page=true. That means large IOPTEs are also disabled
>>>>> in the IOMMU, so it can do the split easily. See the comment in
>>>>> iommufd_vfio_set_iommu().
>>>>>
>>>>> iommufd VFIO compatible mode is a transition from legacy VFIO to
>>>>> iommufd. For normal iommufd, the iova/length must be a superset of a
>>>>> previously mapped range; if it does not match, an error is returned.
>>>>
>>>> This is all true but this also means that "The former requires
>>>> complex changes in VFIO" is not entirely true - some code is already
>>>> there. Thanks,
>>>
>>> Hmm, my statement was a little confusing. The bottleneck is that the
>>> IOMMU driver doesn't support large page split. So if we want to
>>> enable large pages and do partial unmap, it requires complex changes.
>>
>> We won't need to split large pages (if we stick to 4K for now), we
>> need to split large mappings (not large pages) to allow partial
>> unmapping, and iopt_area_split() seems to be doing this. Thanks,
>
> You mean we can disable large pages in iommufd and then VFIO will be
> able to do partial unmap. Yes, I think it is doable, and we can avoid a
> lot of ioctl context-switch overhead.

So, do I understand this correctly: disable_large_pages=true implies
that we never have PMD mappings, such that we can atomically poke a hole
in a mapping without temporarily having to remove a PMD mapping in the
IOMMU table to insert a PTE table?

batch_iommu_map_small() seems to document that behavior.

It's interesting that the comment points out that this is purely "VFIO
compatibility", and that it otherwise violates the iommufd invariant of
pairing map/unmap. So, it is against the real iommufd design ...

Back when working on virtio-mem support (RAMDiscardManager), I thought
there was no way to reliably do atomic partial unmappings.

--
Cheers,

David / dhildenb
On Fri, Jan 10, 2025 at 09:26:02AM +0100, David Hildenbrand wrote:

>>>>>>>>>> One limitation (also discussed in the guest_memfd meeting) is
>>>>>>>>>> that VFIO expects the DMA mapping for a specific IOVA to be
>>>>>>>>>> mapped and unmapped with the same granularity.

Not just the same granularity: whatever you map you have to unmap in
whole. map/unmap must be perfectly paired by userspace.

>>>>>>>>>> such as converting a small region within a larger region. To
>>>>>>>>>> prevent such invalid cases, all operations are performed with
>>>>>>>>>> 4K granularity. The possible solutions we can think of are
>>>>>>>>>> either to enable VFIO to support partial unmap

Yes, you can do that, but it is awful for performance everywhere.

>>>>>> iopt_cut_iova() happens in iommufd vfio_compat.c, which is there
>>>>>> to make iommufd compatible with old VFIO_TYPE1. IIUC, it happens
>>>>>> with disable_large_page=true. That means large IOPTEs are also
>>>>>> disabled in the IOMMU, so it can do the split easily. See the
>>>>>> comment in iommufd_vfio_set_iommu().

Yes. But I am working on a project to make this more general purpose and
not have the 4k limitation. There are now several use cases for this
kind of cut feature.

https://lore.kernel.org/linux-iommu/7-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com/

>>>>> This is all true but this also means that "The former requires
>>>>> complex changes in VFIO" is not entirely true - some code is
>>>>> already there.

Well, to do it without forcing 4k requires complex changes.

>>>> Hmm, my statement was a little confusing. The bottleneck is that
>>>> the IOMMU driver doesn't support large page split. So if we want to
>>>> enable large pages and do partial unmap, it requires complex
>>>> changes.

Yes, this is what I'm working on.

>>> We won't need to split large pages (if we stick to 4K for now), we
>>> need to split large mappings (not large pages) to allow partial
>>> unmapping, and iopt_area_split() seems to be doing this. Thanks,

Correct.

>> You mean we can disable large pages in iommufd and then VFIO will be
>> able to do partial unmap. Yes, I think it is doable, and we can avoid
>> a lot of ioctl context-switch overhead.

Right.

> So, do I understand this correctly: disable_large_pages=true implies
> that we never have PMD mappings, such that we can atomically poke a
> hole in a mapping without temporarily having to remove a PMD mapping
> in the IOMMU table to insert a PTE table?

Yes.

> batch_iommu_map_small() seems to document that behavior.

Yes.

> It's interesting that the comment points out that this is purely
> "VFIO compatibility", and that it otherwise violates the iommufd
> invariant of pairing map/unmap. So, it is against the real iommufd
> design ...

IIRC you can only trigger split using the VFIO type 1 legacy API. We
would need to formalize split as an IOMMUFD native ioctl. Nobody should
use this stuff through the legacy type 1 API!!!!

> Back when working on virtio-mem support (RAMDiscardManager), I thought
> there was no way to reliably do atomic partial unmappings.

Correct.

Jason
On 10.01.25 14:20, Jason Gunthorpe wrote:

Thanks for your reply, I knew CCing you would be very helpful :)

> On Fri, Jan 10, 2025 at 09:26:02AM +0100, David Hildenbrand wrote:
>>>>>>>>>>> One limitation (also discussed in the guest_memfd meeting)
>>>>>>>>>>> is that VFIO expects the DMA mapping for a specific IOVA to
>>>>>>>>>>> be mapped and unmapped with the same granularity.
>
> Not just the same granularity: whatever you map you have to unmap in
> whole. map/unmap must be perfectly paired by userspace.

Right, that's what virtio-mem ends up doing by mapping each memory block
(e.g., 2 MiB) separately, so that each one can be unmapped separately.
It adds "overhead", but at least you don't run into "no, you cannot
split this region because you would be out of memory/slots", or, in the
past, issues with concurrent ongoing DMA.

>>>>>>>>>>> such as converting a small region within a larger region. To
>>>>>>>>>>> prevent such invalid cases, all operations are performed
>>>>>>>>>>> with 4K granularity. The possible solutions we can think of
>>>>>>>>>>> are either to enable VFIO to support partial unmap
>
> Yes, you can do that, but it is awful for performance everywhere.

Absolutely.

In your commit I read:

"Implement the cut operation to be hitless, changes to the page table
during cutting must cause zero disruption to any ongoing DMA. This is
the expectation of the VFIO type 1 uAPI. Hitless requires HW support, it
is incompatible with HW requiring break-before-make."

So I guess that would mean that, depending on HW support, one could
avoid disabling large pages and still allow for atomic cuts / partial
unmaps that don't affect concurrent DMA.

What would be your suggestion here to avoid the "map each 4k page
individually so we can unmap it individually" approach? I didn't
completely grasp that, sorry.

From "IIRC you can only trigger split using the VFIO type 1 legacy API.
We would need to formalize split as an IOMMUFD native ioctl. Nobody
should use this stuff through the legacy type 1 API!!!!"

I assume you mean that we can only avoid the 4k map/unmap if we add
proper support via an IOMMUFD native ioctl, and not try making it fly
somehow with the legacy type 1 API?

--
Cheers,

David / dhildenb
On Fri, Jan 10, 2025 at 02:45:39PM +0100, David Hildenbrand wrote:

> In your commit I read:
>
> "Implement the cut operation to be hitless, changes to the page table
> during cutting must cause zero disruption to any ongoing DMA. This is
> the expectation of the VFIO type 1 uAPI. Hitless requires HW support,
> it is incompatible with HW requiring break-before-make."
>
> So I guess that would mean that, depending on HW support, one could
> avoid disabling large pages and still allow for atomic cuts / partial
> unmaps that don't affect concurrent DMA.

Yes. Most x86 server HW will do this, though ARM support is a bit
newish.

> What would be your suggestion here to avoid the "map each 4k page
> individually so we can unmap it individually" approach? I didn't
> completely grasp that, sorry.

Map in large ranges in the VMM, let's say 1G of shared memory as a
single mapping (called an iommufd area).

When the guest makes a 2M chunk of it private, you do an ioctl to
iommufd to split the area into three, leaving the 2M chunk as a
separate area.

The new iommufd ioctl to split areas will go down into the iommu driver
and atomically cut the 1G PTEs into smaller PTEs as necessary, so that
no PTE spans the edges of the 2M area.

Then userspace can unmap the 2M area and leave the remainder of the 1G
area mapped.

All of this would be fully hitless to ongoing DMA.

The iommufd code is there to do this assuming the areas are mapped at
4k; what is missing is the iommu driver side to atomically resize large
PTEs.

> From "IIRC you can only trigger split using the VFIO type 1 legacy
> API. We would need to formalize split as an IOMMUFD native ioctl.
> Nobody should use this stuff through the legacy type 1 API!!!!"
>
> I assume you mean that we can only avoid the 4k map/unmap if we add
> proper support via an IOMMUFD native ioctl, and not try making it fly
> somehow with the legacy type 1 API?

The thread was talking about the built-in support in iommufd to split
mappings. That built-in support is only accessible through legacy APIs
and should never be used in new qemu code. To use that built-in support
in new code we need to build new APIs. The advantage of the built-in
support is that qemu can map in large regions (which is more efficient)
and the kernel will break it down to 4k for the iommu driver.

Mapping 4k at a time through the uAPI would be outrageously
inefficient.

Jason
On 11/1/25 01:14, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 02:45:39PM +0100, David Hildenbrand wrote:
>> [...]
>
> The thread was talking about the built-in support in iommufd to split
> mappings.

Just to clarify - I am talking about splitting only "iommufd areas",
not large pages. If all IOMMU PTEs are 4k and areas are bigger than 4K,
then hw support is not needed to allow splitting. The comments above
and below seem to confuse large pages with large areas (well, I am
confused, at least).

> That built-in support is only accessible through legacy APIs and
> should never be used in new qemu code. To use that built-in support in
> new code we need to build new APIs.

Why would not the IOMMU_IOAS_MAP/UNMAP uAPI work? Thanks,

> The advantage of the built-in support is that qemu can map in large
> regions (which is more efficient) and the kernel will break it down to
> 4k for the iommu driver.
>
> Mapping 4k at a time through the uAPI would be outrageously
> inefficient.

--
Alexey
On Wed, Jan 15, 2025 at 02:39:55PM +1100, Alexey Kardashevskiy wrote:

>> The thread was talking about the built-in support in iommufd to split
>> mappings.
>
> Just to clarify - I am talking about splitting only "iommufd areas",
> not large pages.

In generality it is the same thing, as you cannot generally guarantee
that an area split doesn't also cross a large page.

> If all IOMMU PTEs are 4k and areas are bigger than 4K, then hw support
> is not needed to allow splitting. The comments above and below seem to
> confuse large pages with large areas (well, I am confused, at least).

Yes, in that special case, yes.

>> That built-in support is only accessible through legacy APIs and
>> should never be used in new qemu code. To use that built-in support
>> in new code we need to build new APIs.
>
> Why would not the IOMMU_IOAS_MAP/UNMAP uAPI work? Thanks,

I don't want to overload those APIs; I prefer to see a new API that is
just about splitting areas. Splitting is a special operation that can
fail depending on driver support.

Jason
On 10.01.25 15:14, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 02:45:39PM +0100, David Hildenbrand wrote:
>> [...]
>
> The thread was talking about the built-in support in iommufd to split
> mappings. That built-in support is only accessible through legacy APIs
> and should never be used in new qemu code. To use that built-in
> support in new code we need to build new APIs. The advantage of the
> built-in support is that qemu can map in large regions (which is more
> efficient) and the kernel will break it down to 4k for the iommu
> driver.
>
> Mapping 4k at a time through the uAPI would be outrageously
> inefficient.

Got it, makes all sense, thanks!

--
Cheers,

David / dhildenb