This patchset is also available at:
https://github.com/AMDESE/linux/tree/gmem-populate-rework-v2
and is based on top of kvm/next (e0c26d47def7)
Overview
--------
Yan previously posted a series[1] that reworked kvm_gmem_populate() to deal
with potential locking issues that might arise once in-place conversion
support[2] is added for guest_memfd. To quote Yan's original summary of the
issues:
(1)
In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
invalidation lock for protecting its preparedness tracking. Similarly, the
in-place conversion guest_memfd series by Ackerley also requires
kvm_gmem_get_pfn() to acquire the filemap invalidation lock [5].
kvm_gmem_get_pfn
  filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
filemap invalidation lock.
(2)
Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
resulting in the following lock sequence in tdx_vcpu_init_mem_region():
- filemap invalidation lock --> mm->mmap_lock
However, in future code, the shared filemap invalidation lock will be held
in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
- mm->mmap_lock --> filemap invalidation lock
This creates an AB-BA deadlock issue.
Sean has since addressed (1) with his series[3], which avoids calling
kvm_gmem_get_pfn() from the TDX post-populate callback to re-fetch the PFN
that was already passed to it.
This series aims to address (2), which is still outstanding. It is based
heavily on Sean's suggested approach[4] of hoisting the get_user_pages_fast()
out of the TDX post-populate callback so that it can be called prior to taking
the filemap invalidate lock, making the ABBA deadlock no longer possible.
As preparation for this change, all the partial enablement for hugepages in
the kvm_gmem_populate() path is stripped out so that it can be better
considered once hugepage support is actually in place, keeping the code/design
simpler in the meantime.
It additionally removes 'preparation' tracking from guest_memfd, which would
similarly complicate locking considerations in the context of in-place
conversion (and even more so in the context of hugepage support). This has
been discussed during both the guest_memfd calls and PUCK calls, and so far
no strong objections have been given, so hopefully that particular change
isn't too controversial.
Some items worth noting/discussing
----------------------------------
(A) While one of the aims of this rework is to implement things such that
a separate source address can still be passed to kvm_gmem_populate()
even though the gmem pages can be populated in-place from userspace
beforehand, issues still arise if the source address itself has the
KVM_MEMORY_ATTRIBUTE_PRIVATE attribute set, e.g. if source/target
            addresses are the same page. One line of reasoning is that
            KVM_MEMORY_ATTRIBUTE_PRIVATE implies the memory cannot be used
            as the source of a GUP/copy_from_user(), so cases like
            source==target are naturally disallowed. Userspace then has no
            choice but to populate pages in-place *prior* to setting the
            KVM_MEMORY_ATTRIBUTE_PRIVATE attribute (as kvm_gmem_populate()
            requires), passing in NULL for the source so that the GUP can
            be skipped (otherwise it would trigger the shared memory fault
            path, which would then SIGBUS upon seeing that it is faulting
            in pages for which KVM_MEMORY_ATTRIBUTE_PRIVATE is set).
While workable, this would at the very least involve documentation
updates to KVM_TDX_INIT_MEM_REGION/KVM_SEV_SNP_LAUNCH_UPDATE to cover
these soon-to-be-possible scenarios. Ira posted a patch separately
that demonstrates how a NULL source could be safely handled within
the TDX post-populate callback[5].
Known issues / TODO
-------------------
- Compile-tested only for the TDX bits (testing/feedback welcome!)
Changes since RFC v1
--------------------
 - add a prep patch to remove partial hugepage enablement in
   kvm_gmem_populate() to simplify things until a hugepage implementation
   can actually make use of it (Yan, Ira, Vishal, Sean)
 - begin retroactively enforcing that source pages must be page-aligned
   so that kvm_gmem_populate() callbacks can be simplified, and add a
   patch to update the SNP user-facing documentation to mention this
 - drop handling for GUP'ing multiple pages before issuing callbacks.
   This will potentially only be needed for hugepages, and it is much
   simpler to handle per-page in the meantime. (Yan, Vishal)
- make sure TDX actually builds (Ira, Yan)
Thanks,
Mike
[1] https://lore.kernel.org/kvm/20250703062641.3247-1-yan.y.zhao@intel.com/
[2] https://lore.kernel.org/kvm/cover.1760731772.git.ackerleytng@google.com/
[3] https://lore.kernel.org/kvm/20251030200951.3402865-1-seanjc@google.com/
[4] https://lore.kernel.org/kvm/aHEwT4X0RcfZzHlt@google.com/
[5] https://lore.kernel.org/kvm/20251105-tdx-init-in-place-v1-1-1196b67d0423@intel.com/
----------------------------------------------------------------
Michael Roth (5):
KVM: guest_memfd: Remove partial hugepage handling from kvm_gmem_populate()
KVM: guest_memfd: Remove preparation tracking
KVM: SEV: Document/enforce page-alignment for KVM_SEV_SNP_LAUNCH_UPDATE
KVM: TDX: Document alignment requirements for KVM_TDX_INIT_MEM_REGION
KVM: guest_memfd: GUP source pages prior to populating guest memory
.../virt/kvm/x86/amd-memory-encryption.rst | 2 +-
Documentation/virt/kvm/x86/intel-tdx.rst | 2 +-
arch/x86/kvm/svm/sev.c | 108 +++++++---------
arch/x86/kvm/vmx/tdx.c | 15 +--
include/linux/kvm_host.h | 4 +-
virt/kvm/guest_memfd.c | 140 +++++++++++----------
6 files changed, 129 insertions(+), 142 deletions(-)