Hello,
This patchset builds on discussions at LPC 2024 and in many guest_memfd
upstream calls to provide 1G page support for guest_memfd by taking
pages from HugeTLB.
This patchset is based on Linux v6.15-rc6, and requires the mmap support
for guest_memfd patchset (Thanks Fuad!) [1].
For ease of testing, this series is also available, stitched together,
at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
This patchset can be divided into two sections:
(a) Patches from the beginning up to and including "KVM: selftests:
Update script to map shared memory from guest_memfd" are a modified
version of "conversion support for guest_memfd", which Fuad is
managing [2].
(b) Patches after "KVM: selftests: Update script to map shared memory
from guest_memfd" till the end are patches that actually bring in 1G
page support for guest_memfd.
These are the significant differences between (a) and [2]:
+ [2] uses an xarray to track shareability, but I used a maple tree
because, for 1G pages, iterating pagewise to update shareability was
prohibitively slow even for testing. I chose among multi-index xarrays,
interval trees and maple trees [3], and picked maple trees because (a
minimal sketch follows this list)
+ Maple trees were easier to figure out: I didn't have to compute the
correct multi-index order or handle edge cases when the converted range
wasn't a neat power of 2.
+ Maple trees were also simpler than updating parts of a multi-index
xarray.
+ Maple trees had an easier API to use than interval trees.
+ [2] doesn't yet have a conversion ioctl, but I needed one to test 1G
support end-to-end.
+ (a) removes guest_memfd folios from participating in the LRU, which I
needed to get conversion selftests to work as expected, since LRU
participation was causing unexpected refcounts on folios that blocked
conversions.
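
Here's a minimal sketch of how a maple tree can track shareability per
index range, replacing pagewise updates with a single range store. The
helper names and the default for untracked indices are assumptions for
illustration, not the exact code in this series:

#include <linux/gfp.h>
#include <linux/maple_tree.h>
#include <linux/types.h>
#include <linux/xarray.h>	/* xa_mk_value()/xa_to_value() */

enum shareability {
	SHAREABILITY_GUEST = 1,	/* guest-only: host may not fault this index */
	SHAREABILITY_ALL,	/* host is allowed to fault this index */
};

/* Update a whole [start, last] index range in one store, no pagewise loop. */
static int gmem_set_shareability(struct maple_tree *mt, pgoff_t start,
				 pgoff_t last, enum shareability s)
{
	return mtree_store_range(mt, start, last, xa_mk_value(s), GFP_KERNEL);
}

/* Look up the shareability covering a single index. */
static enum shareability gmem_get_shareability(struct maple_tree *mt,
					       pgoff_t index)
{
	void *entry = mtree_load(mt, index);

	/* Treating untracked indices as guest-only is an assumption here. */
	return entry ? xa_to_value(entry) : SHAREABILITY_GUEST;
}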
I am sending (a) in emails as well, as opposed to just leaving it on
GitHub, so that we can discuss by commenting inline on emails. If you'd
like to just look at 1G page support, here are some key takeaways from
the first section (a):
+ If GUEST_MEMFD_FLAG_SUPPORT_SHARED is requested during guest_memfd
creation, guest_memfd will
+ Track shareability (whether an index in the inode is guest-only or
whether the host is allowed to fault memory at a given index).
+ Always be used for guest faults - specifically, kvm_gmem_get_pfn()
will be used to provide pages for the guest.
+ Always be used by KVM to check private/shared status of a gfn.
+ guest_memfd now has conversion ioctls, allowing conversion between
private and shared (a rough sketch of the flow follows this list).
+ Conversion can fail if there are unexpected refcounts on any
folios in the range.
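
To make the conversion bullets more concrete, here is a rough, hedged
sketch of the convert-to-private flow: unmap from the host, fail on
unexpected folio refcounts, then update shareability. The helper name,
the exact expected refcount, and the shareability store are assumptions
for illustration (reusing the shareability enum from the earlier
sketch), not the actual implementation:

#include <linux/mm.h>		/* unmap_mapping_pages() */
#include <linux/pagemap.h>	/* filemap_get_folio() */

/* Sketch only: convert [start, start + nr) of a guest_memfd inode to private. */
static int gmem_convert_range_to_private(struct inode *inode,
					 struct maple_tree *shareability,
					 pgoff_t start, pgoff_t nr)
{
	struct address_space *mapping = inode->i_mapping;
	pgoff_t index = start;

	/* Drop any host (shared) mappings of the range first. */
	unmap_mapping_pages(mapping, start, nr, false);

	while (index < start + nr) {
		struct folio *folio = filemap_get_folio(mapping, index);

		if (IS_ERR(folio)) {	/* nothing allocated here yet */
			index++;
			continue;
		}

		/*
		 * The expected count (the filemap's references plus the one
		 * filemap_get_folio() just took) is an assumption; the point
		 * is that any extra reference fails the conversion.
		 */
		if (folio_ref_count(folio) > folio_nr_pages(folio) + 1) {
			folio_put(folio);
			return -EAGAIN;
		}

		index = folio_next_index(folio);
		folio_put(folio);
	}

	/* No other users of the range: mark it guest-only. */
	return mtree_store_range(shareability, start, start + nr - 1,
				 xa_mk_value(SHAREABILITY_GUEST), GFP_KERNEL);
}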
Focusing on (b) 1G page support, here's an overview:
1. A bunch of refactoring patches for HugeTLB that isolate the
allocation of a HugeTLB folio from other HugeTLB concepts such as
VMA-level reservations, and from HugeTLBfs-specific concepts such as
where memory policy is stored in the VMA, or where the subpool is
stored on the inode.
2. A few patches that add a guestmem_hugetlb allocator within mm/. The
guestmem_hugetlb allocator is a wrapper around HugeTLB that modularizes
the memory management functions and handles cleanup, so that folio
cleanup can happen after the guest_memfd inode (and even KVM) goes away
(a sketch of such an interface follows this list).
3. Some updates to guest_memfd to use the guestmem_hugetlb allocator.
4. Selftests for 1G page support.
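
As an illustration of item 2, here is a sketch of the general shape
such a custom allocator interface could take; the struct and callback
names are assumptions and need not match what include/linux/guestmem.h
actually defines:

#include <linux/types.h>

struct folio;

/*
 * Sketch of a guest_memfd custom allocator interface (names assumed).
 * guest_memfd would call these instead of hard-coding HugeTLB details,
 * and the guestmem_hugetlb module would provide the implementation.
 */
struct guestmem_allocator_ops {
	/* Set up per-inode allocator state, e.g. a subpool sized to @size. */
	void *(*inode_setup)(size_t size, u64 flags);
	void (*inode_teardown)(void *priv);

	/* Hand out one allocator-sized (e.g. 1G) folio for this inode. */
	struct folio *(*alloc_folio)(void *priv);

	/*
	 * Called from the folio_put() path so pages can be returned to
	 * HugeTLB even after the guest_memfd inode (and even KVM) are gone.
	 */
	void (*free_folio)(struct folio *folio);

	/* Allocation granularity in bytes, for truncation and splitting. */
	size_t (*granularity)(void *priv);
};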
Here are some remaining issues/TODOs:
1. Memory error handling, such as for machine check errors, has not
been implemented.
2. I've not looked into preparedness of pages; only zeroing has been
considered.
3. When allocating HugeTLB pages, if two threads allocate indices
mapping to the same huge page, the utilization in the guest_memfd
inode's subpool may momentarily go over the subpool limit (the
requested size of the inode at guest_memfd creation time), causing one
of the two threads to get -ENOMEM. Suggestions to solve this are
appreciated!
4. The max_usage_in_bytes statistic (cgroups v1) for guest_memfd
HugeTLB pages should be correct, but it hasn't been tested and could be
wrong.
5. memcg charging (charge_memcg()) of guest_memfd HugeTLB pages after
splitting (cgroups v2) should be correct, but it hasn't been tested and
could be wrong.
6. Page cache accounting: when a HugeTLB page is split, guest_memfd
pages will be counted in both the NR_HUGETLB stat (counted at HugeTLB
allocation time) and the NR_FILE_PAGES stat (counted when split pages
are added to the filemap). Is this aligned with what people expect?
Here are some optimizations that could be explored in future series:
1. Pages could be split from 1G to 2M first and only split to 4K if
necessary.
2. Zeroing could be skipped for CoCo VMs if hardware already zeroes the
pages.
Here's RFC v1 [4] if you're interested in the motivation behind choosing
HugeTLB, or the history of this patch series.
[1] https://lore.kernel.org/all/20250513163438.3942405-11-tabba@google.com/T/
[2] https://lore.kernel.org/all/20250328153133.3504118-1-tabba@google.com/T/
[3] https://lore.kernel.org/all/diqzzfih8q7r.fsf@ackerleytng-ctop.c.googlers.com/
[4] https://lore.kernel.org/all/cover.1726009989.git.ackerleytng@google.com/T/
---
Ackerley Tng (49):
KVM: guest_memfd: Make guest mem use guest mem inodes instead of
anonymous inodes
KVM: guest_memfd: Introduce and use shareability to guard faulting
KVM: selftests: Update guest_memfd_test for INIT_PRIVATE flag
KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
KVM: guest_memfd: Skip LRU for guest_memfd folios
KVM: Query guest_memfd for private/shared status
KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION
KVM: selftests: Test flag validity after guest_memfd supports
conversions
KVM: selftests: Test faulting with respect to
GUEST_MEMFD_FLAG_INIT_PRIVATE
KVM: selftests: Refactor vm_mem_add to be more flexible
KVM: selftests: Allow cleanup of ucall_pool from host
KVM: selftests: Test conversion flows for guest_memfd
KVM: selftests: Add script to exercise private_mem_conversions_test
KVM: selftests: Update private_mem_conversions_test to mmap
guest_memfd
KVM: selftests: Update script to map shared memory from guest_memfd
mm: hugetlb: Consolidate interpretation of gbl_chg within
alloc_hugetlb_folio()
mm: hugetlb: Cleanup interpretation of gbl_chg in
alloc_hugetlb_folio()
mm: hugetlb: Cleanup interpretation of map_chg_state within
alloc_hugetlb_folio()
mm: hugetlb: Rename alloc_surplus_hugetlb_folio
mm: mempolicy: Refactor out policy_node_nodemask()
mm: hugetlb: Inline huge_node() into callers
mm: hugetlb: Refactor hugetlb allocation functions
mm: hugetlb: Refactor out hugetlb_alloc_folio()
mm: hugetlb: Add option to create new subpool without using surplus
mm: truncate: Expose preparation steps for truncate_inode_pages_final
mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages()
mm: Introduce guestmem_hugetlb to support folio_put() handling of
guestmem pages
mm: guestmem_hugetlb: Wrap HugeTLB as an allocator for guest_memfd
mm: truncate: Expose truncate_inode_folio()
KVM: x86: Set disallow_lpage on base_gfn and guest_memfd pgoff
misalignment
KVM: guest_memfd: Support guestmem_hugetlb as custom allocator
KVM: guest_memfd: Allocate and truncate from custom allocator
mm: hugetlb: Add functions to add/delete folio from hugetlb lists
mm: guestmem_hugetlb: Add support for splitting and merging pages
mm: Convert split_folio() macro to function
KVM: guest_memfd: Split allocator pages for guest_memfd use
KVM: guest_memfd: Merge and truncate on fallocate(PUNCH_HOLE)
KVM: guest_memfd: Update kvm_gmem_mapping_order to account for page
status
KVM: Add CAP to indicate support for HugeTLB as custom allocator
KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd
KVM: selftests: Update conversion flows test for HugeTLB
KVM: selftests: Test truncation paths of guest_memfd
KVM: selftests: Test allocation and conversion of subfolios
KVM: selftests: Test that guest_memfd usage is reported via hugetlb
KVM: selftests: Support various types of backing sources for private
memory
KVM: selftests: Update test for various private memory backing source
types
KVM: selftests: Update private_mem_conversions_test.sh to test with
HugeTLB pages
KVM: selftests: Add script to test HugeTLB statistics
KVM: selftests: Test guest_memfd for accuracy of st_blocks
Elliot Berman (1):
filemap: Pass address_space mapping to ->free_folio()
Fuad Tabba (1):
mm: Consolidate freeing of typed folios on final folio_put()
Documentation/filesystems/locking.rst | 2 +-
Documentation/filesystems/vfs.rst | 15 +-
Documentation/virt/kvm/api.rst | 5 +
arch/arm64/include/asm/kvm_host.h | 5 -
arch/x86/include/asm/kvm_host.h | 10 -
arch/x86/kvm/x86.c | 53 +-
fs/hugetlbfs/inode.c | 2 +-
fs/nfs/dir.c | 9 +-
fs/orangefs/inode.c | 3 +-
include/linux/fs.h | 2 +-
include/linux/guestmem.h | 23 +
include/linux/huge_mm.h | 6 +-
include/linux/hugetlb.h | 19 +-
include/linux/kvm_host.h | 32 +-
include/linux/mempolicy.h | 11 +-
include/linux/mm.h | 2 +
include/linux/page-flags.h | 32 +
include/uapi/linux/guestmem.h | 29 +
include/uapi/linux/kvm.h | 16 +
include/uapi/linux/magic.h | 1 +
mm/Kconfig | 13 +
mm/Makefile | 1 +
mm/debug.c | 1 +
mm/filemap.c | 12 +-
mm/guestmem_hugetlb.c | 512 +++++
mm/guestmem_hugetlb.h | 9 +
mm/hugetlb.c | 488 ++---
mm/internal.h | 1 -
mm/memcontrol.c | 2 +
mm/memory.c | 1 +
mm/mempolicy.c | 44 +-
mm/secretmem.c | 3 +-
mm/swap.c | 32 +-
mm/truncate.c | 27 +-
mm/vmscan.c | 4 +-
tools/testing/selftests/kvm/Makefile.kvm | 2 +
.../kvm/guest_memfd_conversions_test.c | 797 ++++++++
.../kvm/guest_memfd_hugetlb_reporting_test.c | 384 ++++
...uest_memfd_provide_hugetlb_cgroup_mount.sh | 36 +
.../testing/selftests/kvm/guest_memfd_test.c | 293 ++-
...memfd_wrap_test_check_hugetlb_reporting.sh | 95 +
.../testing/selftests/kvm/include/kvm_util.h | 104 +-
.../testing/selftests/kvm/include/test_util.h | 20 +-
.../selftests/kvm/include/ucall_common.h | 1 +
tools/testing/selftests/kvm/lib/kvm_util.c | 465 +++--
tools/testing/selftests/kvm/lib/test_util.c | 102 +
.../testing/selftests/kvm/lib/ucall_common.c | 16 +-
.../kvm/x86/private_mem_conversions_test.c | 195 +-
.../kvm/x86/private_mem_conversions_test.sh | 100 +
virt/kvm/Kconfig | 5 +
virt/kvm/guest_memfd.c | 1655 ++++++++++++++++-
virt/kvm/kvm_main.c | 14 +-
virt/kvm/kvm_mm.h | 9 +-
53 files changed, 5080 insertions(+), 640 deletions(-)
create mode 100644 include/linux/guestmem.h
create mode 100644 include/uapi/linux/guestmem.h
create mode 100644 mm/guestmem_hugetlb.c
create mode 100644 mm/guestmem_hugetlb.h
create mode 100644 tools/testing/selftests/kvm/guest_memfd_conversions_test.c
create mode 100644 tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
create mode 100755 tools/testing/selftests/kvm/guest_memfd_provide_hugetlb_cgroup_mount.sh
create mode 100755 tools/testing/selftests/kvm/guest_memfd_wrap_test_check_hugetlb_reporting.sh
create mode 100755 tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh
--
2.49.0.1045.g170613ef41-goog
On Wed, 2025-05-14 at 16:41 -0700, Ackerley Tng wrote:
> Hello,
>
> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> upstream calls to provide 1G page support for guest_memfd by taking
> pages from HugeTLB.

Do you have any more concrete numbers on benefits of 1GB huge pages for
guestmemfd/coco VMs? I saw in the LPC talk it has the benefits as:
- Increase TLB hit rate and reduce page walks on TLB miss
- Improved IO performance
- Memory savings of ~1.6% from HugeTLB Vmemmap Optimization (HVO)
- Bring guest_memfd to parity with existing VMs that use HugeTLB pages for
  backing memory

Do you know how often the 1GB TDP mappings get shattered by shared pages?

Thinking from the TDX perspective, we might have bigger fish to fry than 1.6%
memory savings (for example dynamic PAMT), and the rest of the benefits don't
have numbers. How much are we getting for all the complexity, over say buddy
allocated 2MB pages?
On Thu, May 15, 2025 at 11:03 AM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > > On Wed, 2025-05-14 at 16:41 -0700, Ackerley Tng wrote: > > Hello, > > > > This patchset builds upon discussion at LPC 2024 and many guest_memfd > > upstream calls to provide 1G page support for guest_memfd by taking > > pages from HugeTLB. > > Do you have any more concrete numbers on benefits of 1GB huge pages for > guestmemfd/coco VMs? I saw in the LPC talk it has the benefits as: > - Increase TLB hit rate and reduce page walks on TLB miss > - Improved IO performance > - Memory savings of ~1.6% from HugeTLB Vmemmap Optimization (HVO) > - Bring guest_memfd to parity with existing VMs that use HugeTLB pages for > backing memory > > Do you know how often the 1GB TDP mappings get shattered by shared pages? > > Thinking from the TDX perspective, we might have bigger fish to fry than 1.6% > memory savings (for example dynamic PAMT), and the rest of the benefits don't > have numbers. How much are we getting for all the complexity, over say buddy > allocated 2MB pages? This series should work for any page sizes backed by hugetlb memory. Non-CoCo VMs, pKVM and Confidential VMs all need hugepages that are essential for certain workloads and will emerge as guest_memfd users. Features like KHO/memory persistence in addition also depend on hugepage support in guest_memfd. This series takes strides towards making guest_memfd compatible with usecases where 1G pages are essential and non-confidential VMs are already exercising them. I think the main complexity here lies in supporting in-place conversion which applies to any huge page size even for buddy allocated 2MB pages or THP. This complexity arises because page structs work at a fixed granularity, future roadmap towards not having page structs for guest memory (at least private memory to begin with) should help towards greatly reducing this complexity. That being said, DPAMT and huge page EPT mappings for TDX VMs remain essential and complement this series well for better memory footprint and overall performance of TDX VMs.
On Thu, 2025-05-15 at 11:42 -0700, Vishal Annapurve wrote: > On Thu, May 15, 2025 at 11:03 AM Edgecombe, Rick P > <rick.p.edgecombe@intel.com> wrote: > > > > On Wed, 2025-05-14 at 16:41 -0700, Ackerley Tng wrote: > > > Hello, > > > > > > This patchset builds upon discussion at LPC 2024 and many guest_memfd > > > upstream calls to provide 1G page support for guest_memfd by taking > > > pages from HugeTLB. > > > > Do you have any more concrete numbers on benefits of 1GB huge pages for > > guestmemfd/coco VMs? I saw in the LPC talk it has the benefits as: > > - Increase TLB hit rate and reduce page walks on TLB miss > > - Improved IO performance > > - Memory savings of ~1.6% from HugeTLB Vmemmap Optimization (HVO) > > - Bring guest_memfd to parity with existing VMs that use HugeTLB pages for > > backing memory > > > > Do you know how often the 1GB TDP mappings get shattered by shared pages? > > > > Thinking from the TDX perspective, we might have bigger fish to fry than 1.6% > > memory savings (for example dynamic PAMT), and the rest of the benefits don't > > have numbers. How much are we getting for all the complexity, over say buddy > > allocated 2MB pages? > > This series should work for any page sizes backed by hugetlb memory. > Non-CoCo VMs, pKVM and Confidential VMs all need hugepages that are > essential for certain workloads and will emerge as guest_memfd users. > Features like KHO/memory persistence in addition also depend on > hugepage support in guest_memfd. > > This series takes strides towards making guest_memfd compatible with > usecases where 1G pages are essential and non-confidential VMs are > already exercising them. > > I think the main complexity here lies in supporting in-place > conversion which applies to any huge page size even for buddy > allocated 2MB pages or THP. > > This complexity arises because page structs work at a fixed > granularity, future roadmap towards not having page structs for guest > memory (at least private memory to begin with) should help towards > greatly reducing this complexity. > > That being said, DPAMT and huge page EPT mappings for TDX VMs remain > essential and complement this series well for better memory footprint > and overall performance of TDX VMs. Hmm, this didn't really answer my questions about the concrete benefits. I think it would help to include this kind of justification for the 1GB guestmemfd pages. "essential for certain workloads and will emerge" is a bit hard to review against... I think one of the challenges with coco is that it's almost like a sprint to reimplement virtualization. But enough things are changing at once that not all of the normal assumptions hold, so it can't copy all the same solutions. The recent example was that for TDX huge pages we found that normal promotion paths weren't actually yielding any benefit for surprising TDX specific reasons. On the TDX side we are also, at least currently, unmapping private pages while they are mapped shared, so any 1GB pages would get split to 2MB if there are any shared pages in them. I wonder how many 1GB pages there would be after all the shared pages are converted. At smaller TD sizes, it could be not much. So for TDX in isolation, it seems like jumping out too far ahead to effectively consider the value. But presumably you guys are testing this on SEV or something? Have you measured any performance improvement? For what kind of applications? Or is the idea to basically to make guestmemfd work like however Google does guest memory?
On Thu, May 15, 2025, Rick P Edgecombe wrote: > On Thu, 2025-05-15 at 11:42 -0700, Vishal Annapurve wrote: > > On Thu, May 15, 2025 at 11:03 AM Edgecombe, Rick P > > <rick.p.edgecombe@intel.com> wrote: > > > > > > On Wed, 2025-05-14 at 16:41 -0700, Ackerley Tng wrote: > > > > Hello, > > > > > > > > This patchset builds upon discussion at LPC 2024 and many guest_memfd > > > > upstream calls to provide 1G page support for guest_memfd by taking > > > > pages from HugeTLB. > > > > > > Do you have any more concrete numbers on benefits of 1GB huge pages for > > > guestmemfd/coco VMs? I saw in the LPC talk it has the benefits as: > > > - Increase TLB hit rate and reduce page walks on TLB miss > > > - Improved IO performance > > > - Memory savings of ~1.6% from HugeTLB Vmemmap Optimization (HVO) > > > - Bring guest_memfd to parity with existing VMs that use HugeTLB pages for > > > backing memory > > > > > > Do you know how often the 1GB TDP mappings get shattered by shared pages? > > > > > > Thinking from the TDX perspective, we might have bigger fish to fry than 1.6% > > > memory savings (for example dynamic PAMT), and the rest of the benefits don't > > > have numbers. How much are we getting for all the complexity, over say buddy > > > allocated 2MB pages? TDX may have bigger fish to fry, but some of us have bigger fish to fry than TDX :-) > > This series should work for any page sizes backed by hugetlb memory. > > Non-CoCo VMs, pKVM and Confidential VMs all need hugepages that are > > essential for certain workloads and will emerge as guest_memfd users. > > Features like KHO/memory persistence in addition also depend on > > hugepage support in guest_memfd. > > > > This series takes strides towards making guest_memfd compatible with > > usecases where 1G pages are essential and non-confidential VMs are > > already exercising them. > > > > I think the main complexity here lies in supporting in-place > > conversion which applies to any huge page size even for buddy > > allocated 2MB pages or THP. > > > > This complexity arises because page structs work at a fixed > > granularity, future roadmap towards not having page structs for guest > > memory (at least private memory to begin with) should help towards > > greatly reducing this complexity. > > > > That being said, DPAMT and huge page EPT mappings for TDX VMs remain > > essential and complement this series well for better memory footprint > > and overall performance of TDX VMs. > > Hmm, this didn't really answer my questions about the concrete benefits. > > I think it would help to include this kind of justification for the 1GB > guestmemfd pages. "essential for certain workloads and will emerge" is a bit > hard to review against... > > I think one of the challenges with coco is that it's almost like a sprint to > reimplement virtualization. But enough things are changing at once that not all > of the normal assumptions hold, so it can't copy all the same solutions. The > recent example was that for TDX huge pages we found that normal promotion paths > weren't actually yielding any benefit for surprising TDX specific reasons. > > On the TDX side we are also, at least currently, unmapping private pages while > they are mapped shared, so any 1GB pages would get split to 2MB if there are any > shared pages in them. I wonder how many 1GB pages there would be after all the > shared pages are converted. At smaller TD sizes, it could be not much. You're conflating two different things. 
guest_memfd allocating and managing 1GiB physical pages, and KVM mapping memory
into the guest at 1GiB/2MiB granularity. Allocating memory in 1GiB chunks is
useful even if KVM can only map memory into the guest using 4KiB pages.

> So for TDX in isolation, it seems like jumping out too far ahead to effectively
> consider the value. But presumably you guys are testing this on SEV or
> something? Have you measured any performance improvement? For what kind of
> applications? Or is the idea to basically to make guestmemfd work like however
> Google does guest memory?

The longer term goal of guest_memfd is to make it suitable for backing all VMs,
hence Vishal's "Non-CoCo VMs" comment. Yes, some of this is useful for TDX, but
we (and others) want to use guest_memfd for far more than just CoCo VMs. And
for non-CoCo VMs, 1GiB hugepages are mandatory for various workloads.
On Thu, May 15, 2025 at 05:57:57PM -0700, Sean Christopherson wrote:
> You're conflating two different things. guest_memfd allocating and managing
> 1GiB physical pages, and KVM mapping memory into the guest at 1GiB/2MiB
> granularity. Allocating memory in 1GiB chunks is useful even if KVM can only
> map memory into the guest using 4KiB pages.

Even if KVM is limited to 4K the IOMMU might not be - a lot of these
workloads have a heavy IO component and we need the iommu to perform
well too.

Frankly, I don't think there should be objection to making memory more
contiguous. There is a lot of data that this always brings wins
somewhere for someone.

> The longer term goal of guest_memfd is to make it suitable for backing all VMs,
> hence Vishal's "Non-CoCo VMs" comment. Yes, some of this is useful for TDX, but
> we (and others) want to use guest_memfd for far more than just CoCo VMs. And
> for non-CoCo VMs, 1GiB hugepages are mandatory for various workloads.

Yes, even from an iommu perspective with 2D translation we need to
have the 1G pages from the S2 resident in the IOTLB or performance
falls off a cliff.

Jason
On Fri, 2025-05-16 at 10:09 -0300, Jason Gunthorpe wrote: > > You're conflating two different things. guest_memfd allocating and managing > > 1GiB physical pages, and KVM mapping memory into the guest at 1GiB/2MiB > > granularity. Allocating memory in 1GiB chunks is useful even if KVM can > > only > > map memory into the guest using 4KiB pages. > > Even if KVM is limited to 4K the IOMMU might not be - alot of these > workloads have a heavy IO component and we need the iommu to perform > well too. Oh, interesting point. > > Frankly, I don't think there should be objection to making memory more > contiguous. No objections from me to anything except the lack of concrete justification. > There is alot of data that this always brings wins > somewhere for someone. For the direct map huge page benchmarking, they saw that sometimes 1GB pages helped, but also sometimes 2MB pages helped. That 1GB will help *some* workload doesn't seem surprising. > > > The longer term goal of guest_memfd is to make it suitable for backing all > > VMs, > > hence Vishal's "Non-CoCo VMs" comment. Yes, some of this is useful for TDX, > > but > > we (and others) want to use guest_memfd for far more than just CoCo VMs. > > And > > for non-CoCo VMs, 1GiB hugepages are mandatory for various workloads. > > Yes, even from an iommu perspective with 2D translation we need to > have the 1G pages from the S2 resident in the IOTLB or performance > falls off a cliff. "falls off a cliff" is the level of detail and the direction of hand waving I have been hearing. But it also seems modern CPUs are quite good at hiding the cost of walks with caches etc. Like how 5 level paging was made unconditional. I didn't think about IOTLB though. Thanks for mentioning it.
On Thu, 2025-05-15 at 17:57 -0700, Sean Christopherson wrote: > > > > Thinking from the TDX perspective, we might have bigger fish to fry than > > > > 1.6% memory savings (for example dynamic PAMT), and the rest of the > > > > benefits don't have numbers. How much are we getting for all the > > > > complexity, over say buddy allocated 2MB pages? > > TDX may have bigger fish to fry, but some of us have bigger fish to fry than > TDX :-) Fair enough. But TDX is on the "roadmap". So it helps to say what the target of this series is. > > > > This series should work for any page sizes backed by hugetlb memory. > > > Non-CoCo VMs, pKVM and Confidential VMs all need hugepages that are > > > essential for certain workloads and will emerge as guest_memfd users. > > > Features like KHO/memory persistence in addition also depend on > > > hugepage support in guest_memfd. > > > > > > This series takes strides towards making guest_memfd compatible with > > > usecases where 1G pages are essential and non-confidential VMs are > > > already exercising them. > > > > > > I think the main complexity here lies in supporting in-place > > > conversion which applies to any huge page size even for buddy > > > allocated 2MB pages or THP. > > > > > > This complexity arises because page structs work at a fixed > > > granularity, future roadmap towards not having page structs for guest > > > memory (at least private memory to begin with) should help towards > > > greatly reducing this complexity. > > > > > > That being said, DPAMT and huge page EPT mappings for TDX VMs remain > > > essential and complement this series well for better memory footprint > > > and overall performance of TDX VMs. > > > > Hmm, this didn't really answer my questions about the concrete benefits. > > > > I think it would help to include this kind of justification for the 1GB > > guestmemfd pages. "essential for certain workloads and will emerge" is a bit > > hard to review against... > > > > I think one of the challenges with coco is that it's almost like a sprint to > > reimplement virtualization. But enough things are changing at once that not > > all of the normal assumptions hold, so it can't copy all the same solutions. > > The recent example was that for TDX huge pages we found that normal > > promotion paths weren't actually yielding any benefit for surprising TDX > > specific reasons. > > > > On the TDX side we are also, at least currently, unmapping private pages > > while they are mapped shared, so any 1GB pages would get split to 2MB if > > there are any shared pages in them. I wonder how many 1GB pages there would > > be after all the shared pages are converted. At smaller TD sizes, it could > > be not much. > > You're conflating two different things. guest_memfd allocating and managing > 1GiB physical pages, and KVM mapping memory into the guest at 1GiB/2MiB > granularity. Allocating memory in 1GiB chunks is useful even if KVM can only > map memory into the guest using 4KiB pages. I'm aware of the 1.6% vmemmap benefits from the LPC talk. Is there more? The list quoted there was more about guest performance. Or maybe the clever page table walkers that find contiguous small mappings could benefit guest performance too? It's the kind of thing I'd like to see at least broadly called out. I'm thinking that Google must have a ridiculous amount of learnings about VM memory management. And this is probably designed around those learnings. But reviewers can't really evaluate it if they don't know the reasons and tradeoffs taken. 
If it's going upstream, I think it should have at least the high level reasoning explained. I don't mean to harp on the point so hard, but I didn't expect it to be controversial either. > > > So for TDX in isolation, it seems like jumping out too far ahead to > > effectively consider the value. But presumably you guys are testing this on > > SEV or something? Have you measured any performance improvement? For what > > kind of applications? Or is the idea to basically to make guestmemfd work > > like however Google does guest memory? > > The longer term goal of guest_memfd is to make it suitable for backing all > VMs, hence Vishal's "Non-CoCo VMs" comment. Oh, I actually wasn't aware of this. Or maybe I remember now. I thought he was talking about pKVM. > Yes, some of this is useful for TDX, but we (and others) want to use > guest_memfd for far more than just CoCo VMs. > And for non-CoCo VMs, 1GiB hugepages are mandatory for various workloads. I've heard this a lot. It must be true, but I've never seen the actual numbers. For a long time people believed 1GB huge pages on the direct map were critical, but then benchmarking on a contemporary CPU couldn't find much difference between 2MB and 1GB. I'd expect TDP huge pages to be different than that because the combined walks are huge, iTLB, etc, but I'd love to see a real number.
On Thu, May 15, 2025 at 7:12 PM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > > On Thu, 2025-05-15 at 17:57 -0700, Sean Christopherson wrote: > > > > > Thinking from the TDX perspective, we might have bigger fish to fry than > > > > > 1.6% memory savings (for example dynamic PAMT), and the rest of the > > > > > benefits don't have numbers. How much are we getting for all the > > > > > complexity, over say buddy allocated 2MB pages? > > > > TDX may have bigger fish to fry, but some of us have bigger fish to fry than > > TDX :-) > > Fair enough. But TDX is on the "roadmap". So it helps to say what the target of > this series is. > > > > > > > This series should work for any page sizes backed by hugetlb memory. > > > > Non-CoCo VMs, pKVM and Confidential VMs all need hugepages that are > > > > essential for certain workloads and will emerge as guest_memfd users. > > > > Features like KHO/memory persistence in addition also depend on > > > > hugepage support in guest_memfd. > > > > > > > > This series takes strides towards making guest_memfd compatible with > > > > usecases where 1G pages are essential and non-confidential VMs are > > > > already exercising them. > > > > > > > > I think the main complexity here lies in supporting in-place > > > > conversion which applies to any huge page size even for buddy > > > > allocated 2MB pages or THP. > > > > > > > > This complexity arises because page structs work at a fixed > > > > granularity, future roadmap towards not having page structs for guest > > > > memory (at least private memory to begin with) should help towards > > > > greatly reducing this complexity. > > > > > > > > That being said, DPAMT and huge page EPT mappings for TDX VMs remain > > > > essential and complement this series well for better memory footprint > > > > and overall performance of TDX VMs. > > > > > > Hmm, this didn't really answer my questions about the concrete benefits. > > > > > > I think it would help to include this kind of justification for the 1GB > > > guestmemfd pages. "essential for certain workloads and will emerge" is a bit > > > hard to review against... > > > > > > I think one of the challenges with coco is that it's almost like a sprint to > > > reimplement virtualization. But enough things are changing at once that not > > > all of the normal assumptions hold, so it can't copy all the same solutions. > > > The recent example was that for TDX huge pages we found that normal > > > promotion paths weren't actually yielding any benefit for surprising TDX > > > specific reasons. > > > > > > On the TDX side we are also, at least currently, unmapping private pages > > > while they are mapped shared, so any 1GB pages would get split to 2MB if > > > there are any shared pages in them. I wonder how many 1GB pages there would > > > be after all the shared pages are converted. At smaller TD sizes, it could > > > be not much. > > > > You're conflating two different things. guest_memfd allocating and managing > > 1GiB physical pages, and KVM mapping memory into the guest at 1GiB/2MiB > > granularity. Allocating memory in 1GiB chunks is useful even if KVM can only > > map memory into the guest using 4KiB pages. > > I'm aware of the 1.6% vmemmap benefits from the LPC talk. Is there more? The > list quoted there was more about guest performance. Or maybe the clever page > table walkers that find contiguous small mappings could benefit guest > performance too? It's the kind of thing I'd like to see at least broadly called > out. 
The crux of this series really is hugetlb backing support for
guest_memfd and handling CoCo VMs irrespective of the page size as I
suggested earlier, so 2M page sizes will need to handle similar
complexity of in-place conversion.

Google internally uses 1G hugetlb pages to achieve high bandwidth IO,
lower memory footprint using HVO and lower MMU/IOMMU page table memory
footprint among other improvements. These percentages carry a
substantial impact when working at the scale of large fleets of hosts
each carrying significant memory capacity.

guest_memfd hugepage support + hugepage EPT mapping support for TDX VMs
significantly help:
1) ~70% decrease in TDX VM boot up time
2) ~65% decrease in TDX VM shutdown time
3) ~90% decrease in TDX VM PAMT memory overhead
4) Improvement in TDX SEPT memory overhead

And we believe this combination should also help achieve better
performance with TDX connect in future.

Hugetlb huge pages are preferred as they are statically carved out at
boot and so provide much better guarantees of availability. Once the
pages are carved out, any VMs scheduled on such a host will need to
work with the same hugetlb memory sizes. This series attempts to use
hugetlb pages with in-place conversion, avoiding the double allocation
problem that otherwise results in significant memory overheads for
CoCo VMs.

>
> I'm thinking that Google must have a ridiculous amount of learnings about VM
> memory management. And this is probably designed around those learnings. But
> reviewers can't really evaluate it if they don't know the reasons and tradeoffs
> taken. If it's going upstream, I think it should have at least the high level
> reasoning explained.
>
> I don't mean to harp on the point so hard, but I didn't expect it to be
> controversial either.
>
> >
> > > So for TDX in isolation, it seems like jumping out too far ahead to
> > > effectively consider the value. But presumably you guys are testing this on
> > > SEV or something? Have you measured any performance improvement? For what
> > > kind of applications? Or is the idea to basically to make guestmemfd work
> > > like however Google does guest memory?
> >
> > The longer term goal of guest_memfd is to make it suitable for backing all
> > VMs, hence Vishal's "Non-CoCo VMs" comment.
>
> Oh, I actually wasn't aware of this. Or maybe I remember now. I thought he was
> talking about pKVM.
>
> > Yes, some of this is useful for TDX, but we (and others) want to use
> > guest_memfd for far more than just CoCo VMs.
> >
> > And for non-CoCo VMs, 1GiB hugepages are mandatory for various workloads.
> I've heard this a lot. It must be true, but I've never seen the actual numbers.
> For a long time people believed 1GB huge pages on the direct map were critical,
> but then benchmarking on a contemporary CPU couldn't find much difference
> between 2MB and 1GB. I'd expect TDP huge pages to be different than that because
> the combined walks are huge, iTLB, etc, but I'd love to see a real number.
On Fri, May 16, 2025, Vishal Annapurve wrote: > On Thu, May 15, 2025 at 7:12 PM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > > On Thu, 2025-05-15 at 17:57 -0700, Sean Christopherson wrote: > > > You're conflating two different things. guest_memfd allocating and managing > > > 1GiB physical pages, and KVM mapping memory into the guest at 1GiB/2MiB > > > granularity. Allocating memory in 1GiB chunks is useful even if KVM can only > > > map memory into the guest using 4KiB pages. > > > > I'm aware of the 1.6% vmemmap benefits from the LPC talk. Is there more? The > > list quoted there was more about guest performance. Or maybe the clever page > > table walkers that find contiguous small mappings could benefit guest > > performance too? It's the kind of thing I'd like to see at least broadly called > > out. > > The crux of this series really is hugetlb backing support for guest_memfd and > handling CoCo VMs irrespective of the page size as I suggested earlier, so 2M > page sizes will need to handle similar complexity of in-place conversion. > > Google internally uses 1G hugetlb pages to achieve high bandwidth IO, E.g. hitting target networking line rates is only possible with 1GiB mappings, otherwise TLB pressure gets in the way. > lower memory footprint using HVO and lower MMU/IOMMU page table memory > footprint among other improvements. These percentages carry a substantial > impact when working at the scale of large fleets of hosts each carrying > significant memory capacity. Yeah, 1.6% might sound small, but over however many bytes of RAM there are in the fleet, it's a huge (lol) amount of memory saved. > > > Yes, some of this is useful for TDX, but we (and others) want to use > > > guest_memfd for far more than just CoCo VMs. > > > > > > > And for non-CoCo VMs, 1GiB hugepages are mandatory for various workloads. > > I've heard this a lot. It must be true, but I've never seen the actual numbers. > > For a long time people believed 1GB huge pages on the direct map were critical, > > but then benchmarking on a contemporary CPU couldn't find much difference > > between 2MB and 1GB. I'd expect TDP huge pages to be different than that because > > the combined walks are huge, iTLB, etc, but I'd love to see a real number. The direct map is very, very different than userspace and thus guest mappings. Software (hopefully) isn't using the direct map to index multi-TiB databases, or to transfer GiBs of data over the network. The amount of memory the kernel is regularly accessing is an order or two magnitude smaller than single process use cases. A few examples from a quick search: http://pvk.ca/Blog/2014/02/18/how-bad-can-1gb-pages-be https://www.percona.com/blog/benchmark-postgresql-with-linux-hugepages/
On Fri, 2025-05-16 at 06:11 -0700, Vishal Annapurve wrote: > The crux of this series really is hugetlb backing support for > guest_memfd and handling CoCo VMs irrespective of the page size as I > suggested earlier, so 2M page sizes will need to handle similar > complexity of in-place conversion. I assumed this part was added 1GB complexity: mm/hugetlb.c | 488 ++--- I'll dig into the series and try to understand the point better. > > Google internally uses 1G hugetlb pages to achieve high bandwidth IO, > lower memory footprint using HVO and lower MMU/IOMMU page table memory > footprint among other improvements. These percentages carry a > substantial impact when working at the scale of large fleets of hosts > each carrying significant memory capacity. There must have been a lot of measuring involved in that. But the numbers I was hoping for were how much does *this* series help upstream. > > guest_memfd hugepage support + hugepage EPT mapping support for TDX > VMs significantly help: > 1) ~70% decrease in TDX VM boot up time > 2) ~65% decrease in TDX VM shutdown time > 3) ~90% decrease in TDX VM PAMT memory overhead > 4) Improvement in TDX SEPT memory overhead Thanks. It is the difference between 4k mappings and 2MB mappings I guess? Or are you saying this is the difference between 1GB contiguous pages for TDX at 2MB mapping, and 2MB contiguous pages at TDX 2MB mappings? The 1GB part is the one I was curious about. > > And we believe this combination should also help achieve better > performance with TDX connect in future. Please don't take this query as an objection that the series doesn't help TDX enough or something like that. If it doesn't help TDX at all (not the case), that is fine. The objection is only that the specific benefits and tradeoffs around 1GB pages are not clear in the upstream posting. > > Hugetlb huge pages are preferred as they are statically carved out at > boot and so provide much better guarantees of availability. > Reserved memory can provide physically contiguous pages more frequently. Seems not surprising at all, and something that could have a number. > Once the > pages are carved out, any VMs scheduled on such a host will need to > work with the same hugetlb memory sizes. This series attempts to use > hugetlb pages with in-place conversion, avoiding the double allocation > problem that otherwise results in significant memory overheads for > CoCo VMs. I asked this question assuming there were some measurements for the 1GB part of this series. It sounds like the reasoning is instead that this is how Google does things, which is backed by way more benchmarking than kernel patches are used to getting. So it can just be reasonable assumed to be helpful. But for upstream code, I'd expect there to be a bit more concrete than "we believe" and "substantial impact". It seems like I'm in the minority here though. So if no one else wants to pressure test the thinking in the usual way, I guess I'll just have to wonder.
On Fri, May 16, 2025, Rick P Edgecombe wrote:
> On Fri, 2025-05-16 at 06:11 -0700, Vishal Annapurve wrote:
> > Google internally uses 1G hugetlb pages to achieve high bandwidth IO,
> > lower memory footprint using HVO and lower MMU/IOMMU page table memory
> > footprint among other improvements. These percentages carry a
> > substantial impact when working at the scale of large fleets of hosts
> > each carrying significant memory capacity.
>
> There must have been a lot of measuring involved in that. But the numbers I was
> hoping for were how much does *this* series help upstream.

...

> I asked this question assuming there were some measurements for the 1GB part of
> this series. It sounds like the reasoning is instead that this is how Google
> does things, which is backed by way more benchmarking than kernel patches are
> used to getting. So it can just be reasonable assumed to be helpful.
>
> But for upstream code, I'd expect there to be a bit more concrete than "we
> believe" and "substantial impact". It seems like I'm in the minority here
> though. So if no one else wants to pressure test the thinking in the usual way,
> I guess I'll just have to wonder.

From my perspective, 1GiB hugepage support in guest_memfd isn't about improving
CoCo performance, it's about achieving feature parity on guest_memfd with respect
to existing backing stores so that it's possible to use guest_memfd to back all
VM shapes in a fleet.

Let's assume there is significant value in backing non-CoCo VMs with 1GiB pages,
unless you want to re-litigate the existence of 1GiB support in HugeTLBFS.

If we assume 1GiB support is mandatory for non-CoCo VMs, then it becomes mandatory
for CoCo VMs as well, because it's the only realistic way to run CoCo VMs and
non-CoCo VMs on a single host. Mixing 1GiB HugeTLBFS with any other backing store
for VMs simply isn't tenable due to the nature of 1GiB allocations. E.g. grabbing
sub-1GiB chunks of memory for CoCo VMs quickly fragments memory to the point where
HugeTLBFS can't allocate memory for non-CoCo VMs.

Teaching HugeTLBFS to play nice with TDX and SNP isn't happening, which leaves
adding 1GiB support to guest_memfd as the only way forward.

Any boost to TDX (or SNP) performance is purely a bonus.
On Fri, 2025-05-16 at 10:51 -0700, Sean Christopherson wrote: > From my perspective, 1GiB hugepage support in guest_memfd isn't about improving > CoCo performance, it's about achieving feature parity on guest_memfd with respect > to existing backing stores so that it's possible to use guest_memfd to back all > VM shapes in a fleet. > > Let's assume there is significant value in backing non-CoCo VMs with 1GiB pages, > unless you want to re-litigate the existence of 1GiB support in HugeTLBFS. I didn't expect to go in that direction when I first asked. But everyone says huge, but no one knows the numbers. It can be a sign of things. Meanwhile I'm watching patches to make 5 level paging walks unconditional fly by because people couldn't find a cost to the extra level of walk. So re-litigate, no. But I'll probably remain quietly suspicious of the exact cost/value. At least on the CPU side, I totally missed the IOTLB side at first, sorry. > > If we assume 1GiB support is mandatory for non-CoCo VMs, then it becomes mandatory > for CoCo VMs as well, because it's the only realistic way to run CoCo VMs and > non-CoCo VMs on a single host. Mixing 1GiB HugeTLBFS with any other backing store > for VMs simply isn't tenable due to the nature of 1GiB allocations. E.g. grabbing > sub-1GiB chunks of memory for CoCo VMs quickly fragments memory to the point where > HugeTLBFS can't allocate memory for non-CoCo VMs. It makes sense that there would be a difference in how many huge pages the non- coco guests would get. Where I start to lose you is when you guys talk about "mandatory" or similar. If you want upstream review, it would help to have more numbers on the "why" question. At least for us folks outside the hyperscalars where such things are not as obvious. > > Teaching HugeTLBFS to play nice with TDX and SNP isn't happening, which leaves > adding 1GiB support to guest_memfd as the only way forward. > > Any boost to TDX (or SNP) performance is purely a bonus. Most of the bullets in the talk were about mapping sizes AFAICT, so this is the kind of reasoning I was hoping for. Thanks for elaborating on it, even though still no one has any numbers besides the vmemmap savings.
On 5/16/25 12:14, Edgecombe, Rick P wrote:
> Meanwhile I'm watching patches to make 5 level paging walks unconditional fly by
> because people couldn't find a cost to the extra level of walk. So re-litigate,
> no. But I'll probably remain quietly suspicious of the exact cost/value. At
> least on the CPU side, I totally missed the IOTLB side at first, sorry.

It's a little more complicated than just the depth of the worst-case walk.

In practice, many page walks can use the mid-level paging structure
caches because the mappings aren't sparse.

With 5-level paging in particular, userspace doesn't actually change
much at all. Its layout is pretty much the same unless folks are opting
in to the higher (5-level only) address space. So userspace isn't
sparse, at least at the scale of what 5-level paging is capable of.

For the kernel, things are a bit more spread out than they were before.
For instance, the direct map and vmalloc() are in separate p4d pages
when they used to be nestled together in the same half of one pgd.

But, again, they're not *that* sparse. The direct map, for example,
doesn't become more sparse, it just moves to a lower virtual address.
Ditto for vmalloc(). Just because 5-level paging has a massive
vmalloc() area doesn't mean we use it.

Basically, 5-level paging adds a level to the top of the page walk, and
we're really good at caching those when they're not accessed sparsely.

CPUs are not as good at caching the leaf side of the page walk. There
are tricks like AMD's TLB coalescing that help. But, generally, each
walk on the leaf end of the walks eats a TLB entry. Those just don't
cache as well as the top of the tree.

That's why we need to be more maniacal about reducing leaf levels than
the levels toward the root.
On Fri, 2025-05-16 at 13:25 -0700, Dave Hansen wrote: > It's a little more complicated than just the depth of the worst-case walk. > > In practice, many page walks can use the mid-level paging structure > caches because the mappings aren't sparse. > > With 5-level paging in particular, userspace doesn't actually change > much at all. Its layout is pretty much the same unless folks are opting > in to the higher (5-level only) address space. So userspace isn't > sparse, at least at the scale of what 5-level paging is capable of. > > For the kernel, things are a bit more spread out than they were before. > For instance, the direct map and vmalloc() are in separate p4d pages > when they used to be nestled together in the same half of one pgd. > > But, again, they're not *that* sparse. The direct map, for example, > doesn't become more sparse, it just moves to a lower virtual address. > Ditto for vmalloc(). Just because 5-level paging has a massive > vmalloc() area doesn't mean we use it. > > Basically, 5-level paging adds a level to the top of the page walk, and > we're really good at caching those when they're not accessed sparsely. > > CPUs are not as good at caching the leaf side of the page walk. There > are tricks like AMD's TLB coalescing that help. But, generally, each > walk on the leaf end of the walks eats a TLB entry. Those just don't > cache as well as the top of the tree. > > That's why we need to be more maniacal about reducing leaf levels than > the levels toward the root. Makes sense. For what is easy for the CPU to cache, it can be more about the address space layout then the length of the walk. Going off topic from this patchset... I have a possibly fun related anecdote. A while ago when I was doing the KVM XO stuff, I was trying to test how much worse the performance was from caches being forced to deal with the sparser GPA accesses. The test was to modify the guest to force all the executable GVA mappings to go on the XO alias. I was confused to find that KVM XO was faster than the normal layout by a small, but consistent amount. It had me scratching my head. It turned out that the NX huge page mitigation was able to maintain large pages for the data accesses because all the executable accesses were moved off of the main GPA alias. My takeaway was that the real world implementations can interact in surprising ways, and for at least my ability to reason about it, it's good to verify with a test when possible.
Ackerley Tng <ackerleytng@google.com> writes:
> Hello,
>
> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> upstream calls to provide 1G page support for guest_memfd by taking
> pages from HugeTLB.
>
> [...]

At the guest_memfd upstream call today (2025-06-26), we talked about
when to merge folios with respect to conversions. Just want to call out
that in this RFCv2, we managed to get conversions working with merges
happening as soon as possible.

"As soon as possible" means merges happen as long as shareability is
all private (or all meaningless) within an aligned hugepage range. We
try to merge after every conversion request and on truncation. On
truncation, shareability becomes meaningless.

On explicit truncation (e.g. fallocate(PUNCH_HOLE)), truncation can
fail if there are unexpected refcounts (because we can't merge with
unexpected refcounts). Explicit truncation will succeed only if
refcounts are expected, and merge is performed before finally removing
from filemap.

On truncation caused by file close or inode release, guest_memfd may
not hold the last refcount on the folio. Only in this case, we defer
merging to the folio_put() callback, and because the callback can be
called from atomic context, the merge is further deferred to be
performed by a kernel worker.

Deferment of merging is already minimized so that most of the
restructuring is synchronous with some userspace-initiated action
(conversion or explicit truncation). The only deferred merge is when
the file is closed, and in that case there's no way to reject/fail this
file close. (There are possible optimizations here - Yan suggested [1]
checking if the folio_put() was called from interrupt context - I have
not tried implementing that yet)

I did propose an explicit guest_memfd merge ioctl, but since RFCv2
works, I was thinking to have the merge ioctl be a separate
optimization/project/patch series if it turns out that merging
as-soon-as-possible is an inefficient strategy, or if some VM use cases
prefer to have an explicit merge ioctl.

During the call, Michael also brought up that SNP adds some constraints
with respect to guest accepting pages/levels. Could you please expand
on that?

Suppose for an SNP guest,

1. Guest accepted a page at 2M level
2. Guest converts a 4K sub page to shared
3. guest_memfd requests unmapping of the guest-requested 4K range (the
   rest of the 2M remains mapped into stage 2 page tables)
4. guest_memfd splits the huge page to 4K pages (the 4K is set to
   SHAREABILITY_ALL, the rest of the 2M is still SHAREABILITY_GUEST)

Can the SNP guest continue to use the rest of the 2M page or must it
re-accept all the pages at 4K?

And for the reverse:

1. Guest accepted a 2M range at 4K
2. guest_memfd merges the full 2M range to a single 2M page

Must the SNP guest re-accept at 2M for the guest to continue
functioning, or will the SNP guest continue to work (just with poorer
performance than if the memory was accepted at 2M)?

[1] https://lore.kernel.org/all/aDfT35EsYP%2FByf7Z@yzhao56-desk.sh.intel.com/
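
For illustration, here is a sketch of the kind of merge-eligibility
check described above, reusing the shareability terms from the cover
letter: only merge an aligned range when every index in it is
guest-only (the all-meaningless-after-truncation case is omitted). The
helper name and the treatment of untracked gaps are assumptions, not
the actual implementation:

#include <linux/maple_tree.h>
#include <linux/rcupdate.h>
#include <linux/xarray.h>

/*
 * Sketch only: may the aligned range of 2^@order pages starting at @start
 * be merged back into one huge folio, as far as shareability is concerned?
 */
static bool gmem_range_is_all_private(struct maple_tree *shareability,
				      pgoff_t start, unsigned int order)
{
	pgoff_t last = start + (1UL << order) - 1;
	bool all_private = true;
	void *entry;

	MA_STATE(mas, shareability, start, last);

	rcu_read_lock();
	mas_for_each(&mas, entry, last) {
		/* Gaps (untracked indices) are treated as guest-only here. */
		if (xa_to_value(entry) != SHAREABILITY_GUEST) {
			all_private = false;
			break;
		}
	}
	rcu_read_unlock();

	return all_private;
}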
On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
> Hello,
>
> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> upstream calls to provide 1G page support for guest_memfd by taking
> pages from HugeTLB.
>
> This patchset is based on Linux v6.15-rc6, and requires the mmap support
> for guest_memfd patchset (Thanks Fuad!) [1].
>
> For ease of testing, this series is also available, stitched together,
> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2

Just to record a found issue -- not one that must be fixed.

In TDX, the initial memory region is added as private memory during TD's build
time, with its initial content copied from source pages in shared memory.
The copy operation requires simultaneous access to both shared source memory
and private target memory.

Therefore, userspace cannot store the initial content in shared memory at the
mmap-ed VA of a guest_memfd that performs in-place conversion between shared and
private memory. This is because the guest_memfd will first unmap a PFN in shared
page tables and then check for any extra refcount held for the shared PFN before
converting it to private.

Currently, we tested the initial memory region using the in-place conversion
version of guest_memfd as backend by modifying QEMU to add an extra anonymous
backend to hold the source initial content in shared memory. The extra anonymous
backend is freed after finishing adding the initial memory region.

This issue is benign for TDX, as the initial memory region can also utilize the
traditional guest_memfd, which only allows 4KB mappings. This is acceptable for
now, as the initial memory region typically involves a small amount of memory,
and we may not enable huge pages for ranges covered by the initial memory region
in the near future.
On 6/19/2025 4:13 PM, Yan Zhao wrote:
> On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
>> Hello,
>>
>> This patchset builds upon discussion at LPC 2024 and many guest_memfd
>> upstream calls to provide 1G page support for guest_memfd by taking
>> pages from HugeTLB.
>>
>> [...]
>
> Just to record a found issue -- not one that must be fixed.
>
> In TDX, the initial memory region is added as private memory during TD's build
> time, with its initial content copied from source pages in shared memory.
> The copy operation requires simultaneous access to both shared source memory
> and private target memory.
>
> Therefore, userspace cannot store the initial content in shared memory at the
> mmap-ed VA of a guest_memfd that performs in-place conversion between shared and
> private memory. This is because the guest_memfd will first unmap a PFN in shared
> page tables and then check for any extra refcount held for the shared PFN before
> converting it to private.

I have an idea.

If I understand correctly, KVM_GMEM_CONVERT_PRIVATE of in-place conversion
unmaps the PFN in the shared page tables while keeping the content of the
page unchanged, right?

So KVM_GMEM_CONVERT_PRIVATE can actually be used to initialize the private
memory for the non-CoCo case: userspace first mmap()s it, ensures it's
shared, and writes the initial content to it; after that, userspace
converts it to private with KVM_GMEM_CONVERT_PRIVATE.

For the CoCo case, like TDX, it can hook into KVM_GMEM_CONVERT_PRIVATE if
it wants the private memory to be initialized with initial content, and
just do an in-place TDH.PAGE.ADD in the hook.

> Currently, we tested the initial memory region using the in-place conversion
> version of guest_memfd as backend by modifying QEMU to add an extra anonymous
> backend to hold the source initial content in shared memory. The extra anonymous
> backend is freed after finishing adding the initial memory region.
>
> This issue is benign for TDX, as the initial memory region can also utilize the
> traditional guest_memfd, which only allows 4KB mappings. This is acceptable for
> now, as the initial memory region typically involves a small amount of memory,
> and we may not enable huge pages for ranges covered by the initial memory region
> in the near future.
On Thu, Jun 19, 2025 at 1:59 AM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>
> On 6/19/2025 4:13 PM, Yan Zhao wrote:
> > On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
> >> Hello,
> >>
> >> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> >> upstream calls to provide 1G page support for guest_memfd by taking
> >> pages from HugeTLB.
> >>
> >> This patchset is based on Linux v6.15-rc6, and requires the mmap support
> >> for guest_memfd patchset (Thanks Fuad!) [1].
> >>
> >> For ease of testing, this series is also available, stitched together,
> >> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> >
> > Just to record a found issue -- not one that must be fixed.
> >
> > In TDX, the initial memory region is added as private memory during TD's build
> > time, with its initial content copied from source pages in shared memory.
> > The copy operation requires simultaneous access to both shared source memory
> > and private target memory.
> >
> > Therefore, userspace cannot store the initial content in shared memory at the
> > mmap-ed VA of a guest_memfd that performs in-place conversion between shared and
> > private memory. This is because the guest_memfd will first unmap a PFN in shared
> > page tables and then check for any extra refcount held for the shared PFN before
> > converting it to private.
>
> I have an idea.
>
> If I understand correctly, the KVM_GMEM_CONVERT_PRIVATE of in-place
> conversion unmap the PFN in shared page tables while keeping the content
> of the page unchanged, right?
That's correct.
>
> So KVM_GMEM_CONVERT_PRIVATE can be used to initialize the private memory
> actually for non-CoCo case actually, that userspace first mmap() it and
> ensure it's shared and writes the initial content to it, after it
> userspace convert it to private with KVM_GMEM_CONVERT_PRIVATE.
I think by non-CoCo VMs that care about private memory you mean pKVM.
Yes, initial memory regions can start as shared which userspace can
populate and then convert the ranges to private.
>
> For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it
> wants the private memory to be initialized with initial content, and
> just do in-place TDH.PAGE.ADD in the hook.
I think this scheme will be cleaner:
1) Userspace marks the guest_memfd ranges corresponding to initial
payload as shared.
2) Userspace mmaps and populates the ranges.
3) Userspace converts those guest_memfd ranges to private.
4) For both SNP and TDX, userspace continues to invoke corresponding
initial payload preparation operations via existing KVM ioctls e.g.
KVM_SEV_SNP_LAUNCH_UPDATE/KVM_TDX_INIT_MEM_REGION.
- SNP/TDX KVM logic fetches the right pfns for the target gfns
using the normal paths supported by KVM and passes those pfns directly
to the right trusted module to initialize the "encrypted" memory
contents.
- Avoiding any GUP or memcpy from source addresses.
i.e. for TDX VMs, KVM_TDX_INIT_MEM_REGION still does the in-place TDH.PAGE.ADD.
Since we need to support VMs that will/won't use in-place conversion,
I think operations like KVM_TDX_INIT_MEM_REGION can introduce explicit
flags to allow userspace to indicate whether to assume in-place
conversion or not. Maybe
kvm_tdx_init_mem_region.source_addr/kvm_sev_snp_launch_update.uaddr
can be null in the scenarios where in-place conversion is used.
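A rough userspace sketch of the flow described above, assuming the in-place conversion ioctl proposed in this RFC (KVM_GMEM_CONVERT_PRIVATE); the argument struct below is a simplified guess at that interface, not a stable uAPI, and the final step just indicates where the existing KVM_TDX_INIT_MEM_REGION / KVM_SEV_SNP_LAUNCH_UPDATE call would go with no separate source buffer.

```c
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/types.h>

/* Hypothetical layout; KVM_GMEM_CONVERT_PRIVATE comes from the RFC's uapi headers. */
struct kvm_gmem_convert {
	__u64 offset;
	__u64 size;
};

static int provision_initial_payload(int gmem_fd, size_t size,
				     const void *payload)
{
	struct kvm_gmem_convert range = { .offset = 0, .size = size };
	void *va;

	/* 1) The range starts (or is converted to) shared, then mapped. */
	va = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, gmem_fd, 0);
	if (va == MAP_FAILED)
		return -1;

	/* 2) Populate the initial contents in place. */
	memcpy(va, payload, size);
	munmap(va, size);

	/* 3) Convert the populated range to private, contents preserved. */
	if (ioctl(gmem_fd, KVM_GMEM_CONVERT_PRIVATE, &range))
		return -1;

	/*
	 * 4) Invoke the existing payload-preparation ioctl, e.g.
	 *    KVM_TDX_INIT_MEM_REGION or KVM_SEV_SNP_LAUNCH_UPDATE, with a
	 *    NULL source address as suggested above -- no GUP or memcpy
	 *    from a separate shared buffer.
	 */
	return 0;
}
```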
On Sun, Jun 29, 2025 at 11:28:22AM -0700, Vishal Annapurve wrote:
> On Thu, Jun 19, 2025 at 1:59 AM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> >
> > [...]
> >
> > For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it
> > wants the private memory to be initialized with initial content, and
> > just do in-place TDH.PAGE.ADD in the hook.
>
> I think this scheme will be cleaner:
> 1) Userspace marks the guest_memfd ranges corresponding to initial
> payload as shared.
> 2) Userspace mmaps and populates the ranges.
> 3) Userspace converts those guest_memfd ranges to private.
> 4) For both SNP and TDX, userspace continues to invoke corresponding
> initial payload preparation operations via existing KVM ioctls e.g.
> KVM_SEV_SNP_LAUNCH_UPDATE/KVM_TDX_INIT_MEM_REGION.
>     - SNP/TDX KVM logic fetches the right pfns for the target gfns
> using the normal paths supported by KVM and passes those pfns directly
> to the right trusted module to initialize the "encrypted" memory
> contents.
>     - Avoiding any GUP or memcpy from source addresses.

One caveat:

when TDX populates the mirror root, kvm_gmem_get_pfn() is invoked.
Then kvm_gmem_prepare_folio() is further invoked to zero the folio.

> i.e. for TDX VMs, KVM_TDX_INIT_MEM_REGION still does the in-place TDH.PAGE.ADD.

So, at this point, the pages should not contain the original content?

> Since we need to support VMs that will/won't use in-place conversion,
> I think operations like KVM_TDX_INIT_MEM_REGION can introduce explicit
> flags to allow userspace to indicate whether to assume in-place
> conversion or not. Maybe
> kvm_tdx_init_mem_region.source_addr/kvm_sev_snp_launch_update.uaddr
> can be null in the scenarios where in-place conversion is used.
On Sun, Jun 29, 2025 at 8:17 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Sun, Jun 29, 2025 at 11:28:22AM -0700, Vishal Annapurve wrote:
> > [...]
> > I think this scheme will be cleaner:
> > 1) Userspace marks the guest_memfd ranges corresponding to initial
> > payload as shared.
> > 2) Userspace mmaps and populates the ranges.
> > 3) Userspace converts those guest_memfd ranges to private.
> > 4) For both SNP and TDX, userspace continues to invoke corresponding
> > initial payload preparation operations via existing KVM ioctls e.g.
> > KVM_SEV_SNP_LAUNCH_UPDATE/KVM_TDX_INIT_MEM_REGION.
> >     - SNP/TDX KVM logic fetches the right pfns for the target gfns
> > using the normal paths supported by KVM and passes those pfns directly
> > to the right trusted module to initialize the "encrypted" memory
> > contents.
> >     - Avoiding any GUP or memcpy from source addresses.
>
> One caveat:
>
> when TDX populates the mirror root, kvm_gmem_get_pfn() is invoked.
> Then kvm_gmem_prepare_folio() is further invoked to zero the folio.

Given that confidential VMs have their own way of initializing private
memory, I think zeroing makes sense only for shared memory ranges,
i.e. something like below:
1) Don't zero at allocation time.
2) If faulting in a shared page and it's not uptodate, then zero the
   page and set the page as uptodate.
3) Clear the uptodate flag on private to shared conversion.
4) For faults on private ranges, don't zero the memory.

There might be some other considerations here, e.g. pKVM needs a
non-destructive conversion operation, which might need a way to enable
zeroing at allocation time only.

On a TDX-specific note, IIUC, KVM TDX logic doesn't need to clear pages on
future platforms [1].

[1] https://lore.kernel.org/lkml/6de76911-5007-4170-bf74-e1d045c68465@intel.com/

> > i.e. for TDX VMs, KVM_TDX_INIT_MEM_REGION still does the in-place TDH.PAGE.ADD.
> So, upon here, the pages should not contain the original content?

Pages should contain the original content. Michael is already
experimenting with similar logic [2] for SNP.

[2] https://lore.kernel.org/lkml/20250613005400.3694904-6-michael.roth@amd.com/
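As a sketch of what step 2) of the scheme above could look like inside a guest_memfd shared-fault path: the surrounding handler is hypothetical and heavily simplified (no locking or hugepage handling), only the uptodate-gated zeroing is the point here.

```c
/*
 * Sketch of fault-time zeroing gated on the uptodate flag, per the scheme
 * above. Not the actual guest_memfd fault handler.
 */
static vm_fault_t gmem_fault_shared(struct vm_fault *vmf, struct folio *folio)
{
	/* 1) Allocation itself did not zero the folio. */

	/* 2) Zero only on first shared access, then mark uptodate. */
	if (!folio_test_uptodate(folio)) {
		folio_zero_range(folio, 0, folio_size(folio));
		folio_mark_uptodate(folio);
	}

	/*
	 * 3) A private -> shared conversion would clear the uptodate flag
	 *    elsewhere, forcing a re-zero on the next shared fault.
	 * 4) Private faults never reach this path, so private memory is
	 *    not zeroed by guest_memfd.
	 */
	vmf->page = folio_file_page(folio, vmf->pgoff);
	return 0;
}
```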
On Mon, Jun 30, 2025 at 07:14:07AM -0700, Vishal Annapurve wrote:
> On Sun, Jun 29, 2025 at 8:17 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > [...]
> >
> > One caveat:
> >
> > when TDX populates the mirror root, kvm_gmem_get_pfn() is invoked.
> > Then kvm_gmem_prepare_folio() is further invoked to zero the folio.
>
> Given that confidential VMs have their own way of initializing private
> memory, I think zeroing makes sense for only shared memory ranges.
> i.e. something like below:
> 1) Don't zero at allocation time.
> 2) If faulting in a shared page and its not uptodate, then zero the
> page and set the page as uptodate.
> 3) Clear uptodate flag on private to shared conversion.
> 4) For faults on private ranges, don't zero the memory.
>
> There might be some other considerations here e.g. pKVM needs
> non-destructive conversion operation, which might need a way to enable
> zeroing at allocation time only.
>
> On a TDX specific note, IIUC, KVM TDX logic doesn't need to clear
> pages on future platforms [1].

Yes, TDX does not need to clear pages on private page allocation.
But the current kvm_gmem_prepare_folio() clears private pages in the
common path for both TDX and SEV-SNP.

I just wanted to point out that it's a kind of obstacle that needs to be
removed to implement the proposed approach.

> [1] https://lore.kernel.org/lkml/6de76911-5007-4170-bf74-e1d045c68465@intel.com/
>
> > > i.e. for TDX VMs, KVM_TDX_INIT_MEM_REGION still does the in-place TDH.PAGE.ADD.
> > So, upon here, the pages should not contain the original content?
>
> Pages should contain the original content. Michael is already
> experimenting with similar logic [2] for SNP.
>
> [2] https://lore.kernel.org/lkml/20250613005400.3694904-6-michael.roth@amd.com/
On Mon, Jun 30, 2025 at 10:26 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Mon, Jun 30, 2025 at 07:14:07AM -0700, Vishal Annapurve wrote:
> > [...]
> > Given that confidential VMs have their own way of initializing private
> > memory, I think zeroing makes sense for only shared memory ranges.
> > i.e. something like below:
> > 1) Don't zero at allocation time.
> > 2) If faulting in a shared page and its not uptodate, then zero the
> > page and set the page as uptodate.
> > 3) Clear uptodate flag on private to shared conversion.
> > 4) For faults on private ranges, don't zero the memory.
> >
> > There might be some other considerations here e.g. pKVM needs
> > non-destructive conversion operation, which might need a way to enable
> > zeroing at allocation time only.
> >
> > On a TDX specific note, IIUC, KVM TDX logic doesn't need to clear
> > pages on future platforms [1].
>
> Yes, TDX does not need to clear pages on private page allocation.
> But current kvm_gmem_prepare_folio() clears private pages in the common path
> for both TDX and SEV-SNP.
>
> I just wanted to point out that it's a kind of obstacle that needs to be
> removed to implement the proposed approach.

The proposed approach will work with 4K pages without any additional
changes. For huge pages it's easy to prototype this approach by just
disabling the zeroing logic in guest_memfd on faulting and instead always
doing zeroing on allocation.

I would be curious to understand if we need zeroing on conversion for
confidential VMs. If not, then the simple rule of zeroing on allocation
only will work for all use cases.
On Tue, Jul 01, 2025, Vishal Annapurve wrote:
> I would be curious to understand if we need zeroing on conversion for
> Confidential VMs. If not, then the simple rule of zeroing on
> allocation only will work for all usecases.

Unless I'm misunderstanding what you're asking, pKVM very specifically
does NOT want zeroing on conversion, because one of its use cases is
in-place conversion, e.g. to fill a shared buffer and then convert it to
private so that the buffer can be processed in the TEE.

Some architectures, e.g. SNP and TDX, may effectively require zeroing on
conversion, but that's essentially a property of the architecture, i.e. an
arch/vendor specific detail.
On Mon, Jul 7, 2025 at 4:25 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jul 01, 2025, Vishal Annapurve wrote:
> > I would be curious to understand if we need zeroing on conversion for
> > Confidential VMs. If not, then the simple rule of zeroing on
> > allocation only will work for all usecases.
>
> Unless I'm misunderstanding what you're asking, pKVM very specifically
> does NOT want zeroing on conversion, because one of its use cases is
> in-place conversion, e.g. to fill a shared buffer and then convert it to
> private so that the buffer can be processed in the TEE.

Yeah, that makes sense. So a "just zero on allocation" (and no more
zeroing during conversion) policy will work for pKVM.

> Some architectures, e.g. SNP and TDX, may effectively require zeroing on
> conversion, but that's essentially a property of the architecture, i.e.
> an arch/vendor specific detail.

Conversion is a unique capability supported by guest_memfd files, so my
intention in bringing up zeroing was to better understand the need and
clarify the role of guest_memfd in handling zeroing during conversion.

Not sure if I am misinterpreting you, but treating "zeroing during
conversion" as the responsibility of an arch/vendor-specific
implementation outside of guest_memfd sounds good to me.
On Mon, 2025-07-07 at 17:14 -0700, Vishal Annapurve wrote:
> >
> > Some architectures, e.g. SNP and TDX, may effectively require zeroing on
> > conversion, but that's essentially a property of the architecture, i.e.
> > an arch/vendor specific detail.
>
> Conversion operation is a unique capability supported by guest_memfd
> files so my intention of bringing up zeroing was to better understand
> the need and clarify the role of guest_memfd in handling zeroing
> during conversion.
>
> Not sure if I am misinterpreting you, but treating "zeroing during
> conversion" as the responsibility of arch/vendor specific
> implementation outside of guest_memfd sounds good to me.

For TDX, if we don't zero on conversion from private->shared we will be
dependent on the behavior of the CPU when reading memory with keyid 0 that
was previously encrypted and has some protection bits set. I don't *think*
the behavior is architectural. So it might be prudent to either make it
so, or zero it in the kernel in order to not turn non-architectural
behavior into userspace ABI.

Up the thread Vishal says we need to support operations that use in-place
conversion (an overloaded term now, I think, btw). Why exactly is pKVM
using private/shared conversion for this private data provisioning,
instead of a special provisioning operation like the others? (Xiaoyao's
suggestion)
On Tue, Jul 08, 2025, Rick P Edgecombe wrote:
> On Mon, 2025-07-07 at 17:14 -0700, Vishal Annapurve wrote:
> > >
> > > Some architectures, e.g. SNP and TDX, may effectively require zeroing on
> > > conversion, but that's essentially a property of the architecture, i.e.
> > > an arch/vendor specific detail.
> >
> > Conversion operation is a unique capability supported by guest_memfd
> > files so my intention of bringing up zeroing was to better understand
> > the need and clarify the role of guest_memfd in handling zeroing
> > during conversion.
> >
> > Not sure if I am misinterpreting you, but treating "zeroing during
> > conversion" as the responsibility of arch/vendor specific
> > implementation outside of guest_memfd sounds good to me.
>
> For TDX if we don't zero on conversion from private->shared we will be dependent
> on behavior of the CPU when reading memory with keyid 0, which was previously
> encrypted and has some protection bits set. I don't *think* the behavior is
> architectural. So it might be prudent to either make it so, or zero it in the
> kernel in order to not make non-architectual behavior into userspace ABI.

Ya, by "vendor specific", I was also lumping in cases where the kernel
would need to zero memory in order to not end up with effectively
undefined behavior.

> Up the thread Vishal says we need to support operations that use in-place
> conversion (overloaded term now I think, btw). Why exactly is pKVM using
> private/shared conversion for this private data provisioning?

Because it's literally converting memory from shared to private? And
IIUC, it's not a one-time provisioning, e.g. memory can go:

  shared => fill => private => consume => shared => fill => private => consume

> Instead of a special provisioning operation like the others? (Xiaoyao's
> suggestion)

Are you referring to this suggestion?

 : And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
 : explicitly request that the page range is converted to private and the
 : content needs to be retained. So that TDX can identify which case needs
 : to call in-place TDH.PAGE.ADD.

If so, I agree with that idea, e.g. add a PRESERVE flag or whatever. That
way userspace has explicit control over what happens to the data during
conversion, and KVM can reject unsupported conversions, e.g. PRESERVE is
only allowed for shared => private and only for select VM types.
On Tue, 2025-07-08 at 07:20 -0700, Sean Christopherson wrote:
> > For TDX if we don't zero on conversion from private->shared we will be
> > dependent on behavior of the CPU when reading memory with keyid 0, which was
> > previously encrypted and has some protection bits set. I don't *think* the
> > behavior is architectural. So it might be prudent to either make it so, or
> > zero it in the kernel in order to not make non-architectual behavior into
> > userspace ABI.
>
> Ya, by "vendor specific", I was also lumping in cases where the kernel would
> need to zero memory in order to not end up with effectively undefined
> behavior.

Yea, that was more of an answer to Vishal's question about whether CC VMs
need zeroing. And the answer is sort of yes, even though TDX doesn't
require it. But we actually don't want to zero memory when reclaiming
memory. So TDX KVM code needs to know that the operation is a to-shared
conversion and not another type of private zap. Like a callback from gmem,
or maybe more simply a kernel-internal flag to set in gmem such that it
knows it should zero it.

> > Up the thread Vishal says we need to support operations that use in-place
> > conversion (overloaded term now I think, btw). Why exactly is pKVM using
> > private/shared conversion for this private data provisioning?
>
> Because it's literally converting memory from shared to private? And IIUC,
> it's not a one-time provisioning, e.g. memory can go:
>
> shared => fill => private => consume => shared => fill => private => consume
>
> > Instead of a special provisioning operation like the others? (Xiaoyao's
> > suggestion)
>
> Are you referring to this suggestion?

Yea, in general to make it a specific content-preserving operation.

> : And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
> : explicitly request that the page range is converted to private and the
> : content needs to be retained. So that TDX can identify which case needs
> : to call in-place TDH.PAGE.ADD.
>
> If so, I agree with that idea, e.g. add a PRESERVE flag or whatever. That way
> userspace has explicit control over what happens to the data during conversion,
> and KVM can reject unsupported conversions, e.g. PRESERVE is only allowed for
> shared => private and only for select VM types.

Ok, we should POC how it works with TDX.
On Tue, Jul 8, 2025 at 7:52 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2025-07-08 at 07:20 -0700, Sean Christopherson wrote:
> > [...]
> > Ya, by "vendor specific", I was also lumping in cases where the kernel would
> > need to zero memory in order to not end up with effectively undefined
> > behavior.
>
> Yea, more of an answer to Vishal's question about if CC VMs need zeroing. And
> the answer is sort of yes, even though TDX doesn't require it. But we actually
> don't want to zero memory when reclaiming memory. So TDX KVM code needs to know
> that the operation is a to-shared conversion and not another type of private
> zap. Like a callback from gmem, or maybe more simply a kernel internal flag to
> set in gmem such that it knows it should zero it.

If the answer is "always zero on private to shared conversions" for all
CC VMs, then does the scheme outlined in [1] make sense for handling the
private -> shared conversions? For pKVM, there can be a VM type check to
avoid the zeroing during conversions and instead just zero on allocations.
This allows delaying zeroing until fault time for CC VMs and can be done
in guest_memfd centrally. We will need more input from the SEV side for
this discussion.

[1] https://lore.kernel.org/lkml/CAGtprH-83EOz8rrUjE+O8m7nUDjt=THyXx=kfft1xQry65mtQg@mail.gmail.com/

> > [...]
> >
> > : And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
> > : explicitly request that the page range is converted to private and the
> > : content needs to be retained. So that TDX can identify which case needs
> > : to call in-place TDH.PAGE.ADD.
> >
> > If so, I agree with that idea, e.g. add a PRESERVE flag or whatever. That way
> > userspace has explicit control over what happens to the data during conversion,
> > and KVM can reject unsupported conversions, e.g. PRESERVE is only allowed for
> > shared => private and only for select VM types.
>
> Ok, we should POC how it works with TDX.

I don't think we need a flag to preserve memory, as I mentioned in [2]. IIUC,
1) Conversions are always content-preserving for pKVM.
2) Shared to private conversions are always content-preserving for all
   VMs as far as guest_memfd is concerned.
3) Private to shared conversions are not content-preserving for CC VMs
   as far as guest_memfd is concerned, subject to more discussions.

[2] https://lore.kernel.org/lkml/CAGtprH-Kzn2kOGZ4JuNtUT53Hugw64M-_XMmhz_gCiDS6BAFtQ@mail.gmail.com/
On Tue, Jul 08, 2025, Vishal Annapurve wrote:
> On Tue, Jul 8, 2025 at 7:52 AM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> > [...]
> > Yea, more of an answer to Vishal's question about if CC VMs need zeroing. And
> > the answer is sort of yes, even though TDX doesn't require it. But we actually
> > don't want to zero memory when reclaiming memory. So TDX KVM code needs to know
> > that the operation is a to-shared conversion and not another type of private
> > zap. Like a callback from gmem, or maybe more simply a kernel internal flag to
> > set in gmem such that it knows it should zero it.
>
> If the answer is that "always zero on private to shared conversions"
> for all CC VMs,

pKVM VMs *are* CoCo VMs. Just because pKVM doesn't rely on third-party
firmware to provide confidentiality and integrity doesn't make it any less
of a CoCo VM.

> > > : And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
> > > : explicitly request that the page range is converted to private and the
> > > : content needs to be retained. So that TDX can identify which case needs
> > > : to call in-place TDH.PAGE.ADD.
> > >
> > > If so, I agree with that idea, e.g. add a PRESERVE flag or whatever. That way
> > > userspace has explicit control over what happens to the data during conversion,
> > > and KVM can reject unsupported conversions, e.g. PRESERVE is only allowed for
> > > shared => private and only for select VM types.
> >
> > Ok, we should POC how it works with TDX.
>
> I don't think we need a flag to preserve memory as I mentioned in [2]. IIUC,
> 1) Conversions are always content-preserving for pKVM.

No? Preserving contents on private => shared is a security vulnerability
waiting to happen.

> 2) Shared to private conversions are always content-preserving for all
> VMs as far as guest_memfd is concerned.

There is no "as far as guest_memfd is concerned". Userspace doesn't care
whether code lives in guest_memfd.c versus arch/xxx/kvm; the only thing
that matters is the behavior that userspace sees. I don't want to end up
with userspace ABI that is vendor/VM specific.

> 3) Private to shared conversions are not content-preserving for CC VMs
> as far as guest_memfd is concerned, subject to more discussions.
>
> [2] https://lore.kernel.org/lkml/CAGtprH-Kzn2kOGZ4JuNtUT53Hugw64M-_XMmhz_gCiDS6BAFtQ@mail.gmail.com/
Hi Sean,

On Tue, 8 Jul 2025 at 16:39, Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jul 08, 2025, Vishal Annapurve wrote:
> > [...]
> > I don't think we need a flag to preserve memory as I mentioned in [2]. IIUC,
> > 1) Conversions are always content-preserving for pKVM.
>
> No? Preserving contents on private => shared is a security vulnerability
> waiting to happen.

Actually it is one of the requirements for pKVM as well as its current
behavior. We would like to preserve contents both ways, private <=>
shared, since it is required by some of the potential use cases (e.g.,
a guest handling video encoding/decoding).

To make it clear, I'm talking about explicit sharing from the guest,
not relinquishing memory back to the host. In the case of relinquishing
(and guest teardown), relinquished memory is poisoned (zeroed) in pKVM.

Cheers,
/fuad

> > 2) Shared to private conversions are always content-preserving for all
> > VMs as far as guest_memfd is concerned.
>
> There is no "as far as guest_memfd is concerned". Userspace doesn't care
> whether code lives in guest_memfd.c versus arch/xxx/kvm; the only thing
> that matters is the behavior that userspace sees. I don't want to end up
> with userspace ABI that is vendor/VM specific.
>
> > 3) Private to shared conversions are not content-preserving for CC VMs
> > as far as guest_memfd is concerned, subject to more discussions.
> >
> > [2] https://lore.kernel.org/lkml/CAGtprH-Kzn2kOGZ4JuNtUT53Hugw64M-_XMmhz_gCiDS6BAFtQ@mail.gmail.com/
On Tue, Jul 08, 2025, Fuad Tabba wrote:
> > > I don't think we need a flag to preserve memory as I mentioned in [2]. IIUC,
> > > 1) Conversions are always content-preserving for pKVM.
> >
> > No? Preserving contents on private => shared is a security vulnerability
> > waiting to happen.
>
> Actually it is one of the requirements for pKVM as well as its current
> behavior. We would like to preserve contents both ways, private <=>
> shared, since it is required by some of the potential use cases (e.g.,
> a guest handling video encoding/decoding).
>
> To make it clear, I'm talking about explicit sharing from the guest,
> not relinquishing memory back to the host. In the case of relinquishing
> (and guest teardown), relinquished memory is poisoned (zeroed) in pKVM.

I forget, what does the "explicit sharing" flow look like? E.g. how/when
does pKVM know it's ok to convert memory from private to shared? I think
we'd still want to make data preservation optional, e.g. to avoid
potential leakage with setups where memory is private by default, but a
flag in KVM's uAPI might not be a good fit since whether or not to
preserve data is more of a guest decision (or at least needs to be ok'd
by the guest).
On Tue, 8 Jul 2025 at 18:25, Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jul 08, 2025, Fuad Tabba wrote:
> > [...]
> > To make it clear, I'm talking about explicit sharing from the guest,
> > not relinquishing memory back to the host. In the case of relinquishing
> > (and guest teardown), relinquished memory is poisoned (zeroed) in pKVM.
>
> I forget, what does the "explicit sharing" flow look like? E.g. how/when
> does pKVM know it's ok to convert memory from private to shared? I think
> we'd still want to make data preservation optional, e.g. to avoid
> potential leakage with setups where memory is private by default, but a
> flag in KVM's uAPI might not be a good fit since whether or not to
> preserve data is more of a guest decision (or at least needs to be ok'd
> by the guest).

In pKVM all sharing and unsharing is triggered by the guest via
hypercalls. The host cannot unshare.

That said, making data preservation optional works for pKVM and is a good
idea, for the reasons that you've mentioned.

Cheers,
/fuad
Fuad Tabba <tabba@google.com> writes:

> On Tue, 8 Jul 2025 at 18:25, Sean Christopherson <seanjc@google.com> wrote:
> >
> > [...]
> > I forget, what does the "explicit sharing" flow look like? E.g. how/when
> > does pKVM know it's ok to convert memory from private to shared? I think
> > we'd still want to make data preservation optional, e.g. to avoid
> > potential leakage with setups where memory is private by default, but a
> > flag in KVM's uAPI might not be a good fit since whether or not to
> > preserve data is more of a guest decision (or at least needs to be ok'd
> > by the guest).
>
> In pKVM all sharing and unsharing is triggered by the guest via
> hypercalls. The host cannot unshare.

In pKVM's case, would the conversion ioctl be disabled completely, or
would the ioctl be allowed, but conversion always checks with pKVM to see
if the guest had previously requested an unshare?

> That said, making data preservation optional works for pKVM and is a good
> idea, for the reasons that you've mentioned.
>
> Cheers,
> /fuad
On Tue, 2025-07-08 at 08:07 -0700, Vishal Annapurve wrote:
> On Tue, Jul 8, 2025 at 7:52 AM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> > [...]
> > Yea, more of an answer to Vishal's question about if CC VMs need zeroing. And
> > the answer is sort of yes, even though TDX doesn't require it. But we actually
> > don't want to zero memory when reclaiming memory. So TDX KVM code needs to know
> > that the operation is a to-shared conversion and not another type of private
> > zap. Like a callback from gmem, or maybe more simply a kernel internal flag to
> > set in gmem such that it knows it should zero it.
>
> If the answer is that "always zero on private to shared conversions"
> for all CC VMs, then does the scheme outlined in [1] make sense for
> handling the private -> shared conversions? For pKVM, there can be a
> VM type check to avoid the zeroing during conversions and instead just
> zero on allocations. This allows delaying zeroing until the fault time
> for CC VMs and can be done in guest_memfd centrally. We will need more
> inputs from the SEV side for this discussion.
>
> [1] https://lore.kernel.org/lkml/CAGtprH-83EOz8rrUjE+O8m7nUDjt=THyXx=kfft1xQry65mtQg@mail.gmail.com/

It's nice that we don't double zero (since the TDX module will do it too)
for private allocation/mapping. Seems ok to me.

> > [...]
> >
> > : And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
> > : explicitly request that the page range is converted to private and the
> > : content needs to be retained. So that TDX can identify which case needs
> > : to call in-place TDH.PAGE.ADD.
> >
> > If so, I agree with that idea, e.g. add a PRESERVE flag or whatever. That way
> > userspace has explicit control over what happens to the data during conversion,
> > and KVM can reject unsupported conversions, e.g. PRESERVE is only allowed for
> > shared => private and only for select VM types.
> >
> > Ok, we should POC how it works with TDX.
>
> I don't think we need a flag to preserve memory as I mentioned in [2]. IIUC,
> 1) Conversions are always content-preserving for pKVM.
> 2) Shared to private conversions are always content-preserving for all
> VMs as far as guest_memfd is concerned.
> 3) Private to shared conversions are not content-preserving for CC VMs
> as far as guest_memfd is concerned, subject to more discussions.
>
> [2] https://lore.kernel.org/lkml/CAGtprH-Kzn2kOGZ4JuNtUT53Hugw64M-_XMmhz_gCiDS6BAFtQ@mail.gmail.com/

Right, I read that. I still don't see why pKVM needs to do normal
private/shared conversion for data provisioning, vs a dedicated
operation/flag to make it a special case.

I'm trying to suggest there could be a benefit to making all gmem VM types
behave the same. If conversions are always content-preserving for pKVM,
why can't userspace always use the operation that says "preserve content",
vs changing the behavior of the common operations?

So for all VM types, the user ABI would be:
private->shared - always zeroes the page
shared->private - always destructive
shared->private (w/flag) - always preserves data, or returns an error if
not possible

Do you see a problem?
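If that three-case ABI were adopted, the conversion entry point could validate the flag roughly as below. This is only a sketch: the flag name, the gmem_instance struct, and the supported_convert_flags mask are hypothetical, standing in for however a guest_memfd instance would record which flags its VM type can honor.

```c
#include <linux/errno.h>
#include <linux/types.h>

/* Hypothetical flag and per-instance state for illustration only. */
#define GUESTMEM_CONVERT_FLAG_PRESERVE	(1ULL << 0)

struct gmem_instance {
	u64 supported_convert_flags;	/* fixed per VM type / at creation */
};

static int gmem_check_convert_flags(struct gmem_instance *gmem,
				    bool to_private, u64 flags)
{
	/* Unknown flags are always rejected. */
	if (flags & ~GUESTMEM_CONVERT_FLAG_PRESERVE)
		return -EINVAL;

	/* PRESERVE only makes sense for shared => private. */
	if ((flags & GUESTMEM_CONVERT_FLAG_PRESERVE) && !to_private)
		return -EINVAL;

	/* Reject flags the backing VM type cannot honor. */
	if (flags & ~gmem->supported_convert_flags)
		return -EOPNOTSUPP;

	return 0;
}

/*
 * Resulting behavior, matching the table above:
 *   private->shared            : always zeroed
 *   shared->private            : destructive
 *   shared->private + PRESERVE : contents kept, or the conversion fails
 */
```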
On Tue, Jul 8, 2025 at 8:31 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2025-07-08 at 08:07 -0700, Vishal Annapurve wrote:
> > [...]
> > If the answer is that "always zero on private to shared conversions"
> > for all CC VMs, then does the scheme outlined in [1] make sense for
> > handling the private -> shared conversions? For pKVM, there can be a
> > VM type check to avoid the zeroing during conversions and instead just
> > zero on allocations. This allows delaying zeroing until the fault time
> > for CC VMs and can be done in guest_memfd centrally. We will need more
> > inputs from the SEV side for this discussion.
> >
> > [1] https://lore.kernel.org/lkml/CAGtprH-83EOz8rrUjE+O8m7nUDjt=THyXx=kfft1xQry65mtQg@mail.gmail.com/
>
> It's nice that we don't double zero (since the TDX module will do it too)
> for private allocation/mapping. Seems ok to me.
>
> > [...]
> > I don't think we need a flag to preserve memory as I mentioned in [2]. IIUC,
> > 1) Conversions are always content-preserving for pKVM.
> > 2) Shared to private conversions are always content-preserving for all
> > VMs as far as guest_memfd is concerned.
> > 3) Private to shared conversions are not content-preserving for CC VMs
> > as far as guest_memfd is concerned, subject to more discussions.
> >
> > [2] https://lore.kernel.org/lkml/CAGtprH-Kzn2kOGZ4JuNtUT53Hugw64M-_XMmhz_gCiDS6BAFtQ@mail.gmail.com/
>
> Right, I read that. I still don't see why pKVM needs to do normal
> private/shared conversion for data provisioning, vs a dedicated
> operation/flag to make it a special case.

It's dictated by pKVM use cases: memory contents need to be preserved
for every conversion, not just for initial payload population.

> I'm trying to suggest there could be a benefit to making all gmem VM types
> behave the same. If conversions are always content-preserving for pKVM,
> why can't userspace always use the operation that says "preserve content",
> vs changing the behavior of the common operations?

I don't see a benefit of userspace passing a flag that's effectively the
default for the VM type (assuming pKVM will use a special VM type).
Common operations in guest_memfd will need to check either the
userspace-passed flag or the VM type, so there is no major change in the
guest_memfd implementation for either mechanism.

> So for all VM types, the user ABI would be:
> private->shared - always zeroes the page
> shared->private - always destructive
> shared->private (w/flag) - always preserves data, or returns an error if
> not possible
>
> Do you see a problem?
On Tue, 2025-07-08 at 10:16 -0700, Vishal Annapurve wrote:
> > Right, I read that. I still don't see why pKVM needs to do normal
> > private/shared
> > conversion for data provisioning. Vs a dedicated operation/flag to make it a
> > special case.
>
> It's dictated by pKVM usecases, memory contents need to be preserved
> for every conversion not just for initial payload population.

We are weighing pros/cons between:
- Unifying this uABI across all gmemfd VM types
- Userspace for one VM type passing a flag for its special non-shared use case

I don't see how passing a flag or not is dictated by pKVM use case.

P.S. This doesn't really impact TDX I think. Except that TDX development needs
to work in the code without bumping anything. So just wishing to work in code
with less conditionals.

>
> >
> > I'm trying to suggest there could be a benefit to making all gmem VM types
> > behave the same. If conversions are always content preserving for pKVM, why
> > can't userspace always use the operation that says preserve content? Vs
> > changing the behavior of the common operations?
>
> I don't see a benefit of userspace passing a flag that's kind of
> default for the VM type (assuming pKVM will use a special VM type).

The benefit is that we don't need to have special VM default behavior for
gmemfd. Think about if some day (very hypothetical and made up) we want to add a
mode for TDX that adds new private data to a running guest (with special accept
on the guest side or something). Then we might want to add a flag to override
the default destructive behavior. Then maybe pKVM wants to add a "don't
preserve" operation and it adds a second flag to not destroy. Now gmemfd has
lots of VM specific flags. The point of this example is to show how unified uABI
can be helpful.

> Common operations in guest_memfd will need to either check for the
> userspace passed flag or the VM type, so no major change in
> guest_memfd implementation for either mechanism.

While we discuss ABI, we should allow ourselves to think ahead. So, is a gmemfd
fd tied to a VM? I think there is interest in de-coupling it? Is the VM type
sticky?

It seems the more they are separate, the better it will be to not have VM-aware
behavior living in gmem.
On Tue, Jul 08, 2025, Rick P Edgecombe wrote:
> On Tue, 2025-07-08 at 10:16 -0700, Vishal Annapurve wrote:
> > > Right, I read that. I still don't see why pKVM needs to do normal
> > > private/shared
> > > conversion for data provisioning. Vs a dedicated operation/flag to make it a
> > > special case.
> >
> > It's dictated by pKVM usecases, memory contents need to be preserved
> > for every conversion not just for initial payload population.
>
> We are weighing pros/cons between:
> - Unifying this uABI across all gmemfd VM types
> - Userspace for one VM type passing a flag for its special non-shared use case
>
> I don't see how passing a flag or not is dictated by pKVM use case.

Yep. Baking the behavior of a single usecase into the kernel's ABI is rarely a
good idea. Just because pKVM's current usecases always wants contents to be
preserved doesn't mean that pKVM will never change.

As a general rule, KVM should push policy to userspace whenever possible.

> P.S. This doesn't really impact TDX I think. Except that TDX development needs
> to work in the code without bumping anything. So just wishing to work in code
> with less conditionals.
>
> >
> > >
> > > I'm trying to suggest there could be a benefit to making all gmem VM types
> > > behave the same. If conversions are always content preserving for pKVM, why
> > > can't userspace always use the operation that says preserve content? Vs
> > > changing the behavior of the common operations?
> >
> > I don't see a benefit of userspace passing a flag that's kind of
> > default for the VM type (assuming pKVM will use a special VM type).
>
> The benefit is that we don't need to have special VM default behavior for
> gmemfd. Think about if some day (very hypothetical and made up) we want to add a
> mode for TDX that adds new private data to a running guest (with special accept
> on the guest side or something). Then we might want to add a flag to override
> the default destructive behavior. Then maybe pKVM wants to add a "don't
> preserve" operation and it adds a second flag to not destroy. Now gmemfd has
> lots of VM specific flags. The point of this example is to show how unified uABI
> can be helpful.

Yep again. Pivoting on the VM type would be completely inflexible. If pKVM gains
a usecase that wants to zero memory on conversions, we're hosed. If SNP or TDX
gains the ability to preserve data on conversions, we're hosed.

The VM type may restrict what is possible, but (a) that should be abstracted,
e.g. by defining the allowed flags during guest_memfd creation, and (b) the
capabilities of the guest_memfd instance need to be communicated to userspace.

> > Common operations in guest_memfd will need to either check for the
> > userspace passed flag or the VM type, so no major change in
> > guest_memfd implementation for either mechanism.
>
> While we discuss ABI, we should allow ourselves to think ahead. So, is a gmemfd
> fd tied to a VM?

Yes.

> I think there is interest in de-coupling it?

No? Even if we get to a point where multiple distinct VMs can bind to a single
guest_memfd, e.g. for inter-VM shared memory, there will still need to be a sole
owner of the memory. AFAICT, fully decoupling guest_memfd from a VM would add
non-trivial complexity for zero practical benefit.

> Is the VM type sticky?
>
> It seems the more they are separate, the better it will be to not have VM-aware
> behavior living in gmem.

Ya. A guest_memfd instance may have capabilities/features that are restricted
and/or defined based on the properties of the owning VM, but we should do our
best to make guest_memfd itself blissfully unaware of the VM type.
On Tue, Jul 8, 2025 at 11:03 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jul 08, 2025, Rick P Edgecombe wrote:
> > On Tue, 2025-07-08 at 10:16 -0700, Vishal Annapurve wrote:
> > > > Right, I read that. I still don't see why pKVM needs to do normal
> > > > private/shared
> > > > conversion for data provisioning. Vs a dedicated operation/flag to make it a
> > > > special case.
> > >
> > > It's dictated by pKVM usecases, memory contents need to be preserved
> > > for every conversion not just for initial payload population.
> >
> > We are weighing pros/cons between:
> > - Unifying this uABI across all gmemfd VM types
> > - Userspace for one VM type passing a flag for it's special non-shared use case
> >
> > I don't see how passing a flag or not is dictated by pKVM use case.
>
> Yep. Baking the behavior of a single usecase into the kernel's ABI is rarely a
> good idea. Just because pKVM's current usecases always wants contents to be
> preserved doesn't mean that pKVM will never change.
>
> As a general rule, KVM should push policy to userspace whenever possible.
>
> > P.S. This doesn't really impact TDX I think. Except that TDX development needs
> > to work in the code without bumping anything. So just wishing to work in code
> > with less conditionals.
> >
> > >
> > > >
> > > > I'm trying to suggest there could be a benefit to making all gmem VM types
> > > > behave the same. If conversions are always content preserving for pKVM, why
> > > > can't userspace always use the operation that says preserve content? Vs
> > > > changing the behavior of the common operations?
> > >
> > > I don't see a benefit of userspace passing a flag that's kind of
> > > default for the VM type (assuming pKVM will use a special VM type).
> >
> > The benefit is that we don't need to have special VM default behavior for
> > gmemfd. Think about if some day (very hypothetical and made up) we want to add a
> > mode for TDX that adds new private data to a running guest (with special accept
> > on the guest side or something). Then we might want to add a flag to override
> > the default destructive behavior. Then maybe pKVM wants to add a "don't
> > preserve" operation and it adds a second flag to not destroy. Now gmemfd has
> > lots of VM specific flags. The point of this example is to show how unified uABI
> > can he helpful.
>
> Yep again. Pivoting on the VM type would be completely inflexible. If pKVM gains
> a usecase that wants to zero memory on conversions, we're hosed. If SNP or TDX
> gains the ability to preserve data on conversions, we're hosed.
>
> The VM type may restrict what is possible, but (a) that should be abstracted,
> e.g. by defining the allowed flags during guest_memfd creation, and (b) the
> capabilities of the guest_memfd instance need to be communicated to userspace.
Ok, I concur with this: It's beneficial to keep a unified ABI that
allows guest_memfd to make runtime decisions without relying on VM
type as far as possible.
Few points that seem important here:
1) Userspace can and should be able to only dictate if memory contents
need to be preserved on shared to private conversion.
-> For SNP/TDX VMs:
* Only usecase for preserving contents is initial memory
population, which can be achieved by:
- Userspace converting the ranges to shared,
populating the contents, converting them back to private and then
calling SNP/TDX specific existing ABI functions.
* For runtime conversions, guest_memfd can't ensure memory
contents are preserved during shared to private conversions as the
architectures don't support that behavior.
* So IMO, this "preserve" flag doesn't make sense for SNP/TDX
VMs, even if we add this flag, today guest_memfd should effectively
mark this unsupported based on the backing architecture support.
2) For pKVM, if userspace wants to specify a "preserve" flag then this
flag can be allowed based on the known capabilities of the backing
architecture.
So this topic is still orthogonal to "zeroing on private to shared conversion".
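
A minimal sketch of how such capability gating might look, assuming a
hypothetical GMEM_CONVERT_FLAG_PRESERVE and a per-instance capability mask
fixed at guest_memfd creation; none of these names come from the series,
they only illustrate the shape of the check:

	/*
	 * Hypothetical sketch of capability-gated conversion flags.  Every
	 * name here (GMEM_CONVERT_FLAG_PRESERVE, struct gmem_instance, the
	 * finalized bit) is illustrative only, not the uAPI under discussion.
	 */
	#include <errno.h>
	#include <stdbool.h>
	#include <stdint.h>

	#define GMEM_CONVERT_FLAG_PRESERVE	(1u << 0)	/* hypothetical flag */

	struct gmem_instance {
		uint64_t supported_flags;	/* fixed at guest_memfd creation */
		bool image_finalized;		/* e.g. set once a CC guest is measured */
	};

	int gmem_check_convert_to_private(const struct gmem_instance *gmem,
					  uint64_t flags)
	{
		/* Reject anything this instance never advertised to userspace. */
		if (flags & ~gmem->supported_flags)
			return -EINVAL;

		/*
		 * For SNP/TDX-style instances, PRESERVE could be restricted to
		 * the pre-finalization window where in-place PAGE.ADD is possible.
		 */
		if ((flags & GMEM_CONVERT_FLAG_PRESERVE) && gmem->image_finalized)
			return -EINVAL;

		return 0;
	}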
>
> > > Common operations in guest_memfd will need to either check for the
> > > userspace passed flag or the VM type, so no major change in
> > > guest_memfd implementation for either mechanism.
> >
> > While we discuss ABI, we should allow ourselves to think ahead. So, is a gmemfd
> > fd tied to a VM?
>
> Yes.
>
> > I think there is interest in de-coupling it?
>
> No? Even if we get to a point where multiple distinct VMs can bind to a single
> guest_memfd, e.g. for inter-VM shared memory, there will still need to be a sole
> owner of the memory. AFAICT, fully decoupling guest_memfd from a VM would add
> non-trivial complexity for zero practical benefit.
>
> > Is the VM type sticky?
> >
> > It seems the more they are separate, the better it will be to not have VM-aware
> > behavior living in gmem.
>
> Ya. A guest_memfd instance may have capabilities/features that are restricted
> and/or defined based on the properties of the owning VM, but we should do our
> best to make guest_memfd itself blissfully unaware of the VM type.
On Tue, Jul 08, 2025, Vishal Annapurve wrote:
> On Tue, Jul 8, 2025 at 11:03 AM Sean Christopherson <seanjc@google.com> wrote:
> Few points that seem important here:
> 1) Userspace can and should be able to only dictate if memory contents
> need to be preserved on shared to private conversion.

No, I was wrong, pKVM has use cases where it's desirable to preserve data on
private => shared conversions.

Side topic, if you're going to use fancy indentation, align the indentation so
it's actually readable.

> -> For SNP/TDX VMs:
> * Only usecase for preserving contents is initial memory
> population, which can be achieved by:
> - Userspace converting the ranges to shared, populating the contents,
> converting them back to private and then calling SNP/TDX specific
> existing ABI functions.
> * For runtime conversions, guest_memfd can't ensure memory contents are
> preserved during shared to private conversions as the architectures
> don't support that behavior.
> * So IMO, this "preserve" flag doesn't make sense for SNP/TDX VMs, even

It makes sense, it's just not supported by the architecture *at runtime*. Case
in point, *something* needs to allow preserving data prior to launching the VM.
If we want to go with the PRIVATE => SHARED => FILL => PRIVATE approach for TDX
and SNP, then we'll probably want to allow PRESERVE only until the VM image is
finalized.

> if we add this flag, today guest_memfd should effectively mark this
> unsupported based on the backing architecture support.
>
> 2) For pKVM, if userspace wants to specify a "preserve" flag then this

There is no "For pKVM". We are defining uAPI for guest_memfd. I.e. this
statement holds true for all implementations: PRESERVE is allowed based on the
capabilities of the architecture.

> So this topic is still orthogonal to "zeroing on private to shared conversion".

As above, no. pKVM might not expose PRESERVE to _userspace_ since all current
conversions are initiated by the guest, but for guest_memfd itself, this is all
one and the same.
On Tue, Jul 8, 2025 at 12:59 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jul 08, 2025, Vishal Annapurve wrote:
> > On Tue, Jul 8, 2025 at 11:03 AM Sean Christopherson <seanjc@google.com> wrote:
> > Few points that seem important here:
> > 1) Userspace can and should be able to only dictate if memory contents
> > need to be preserved on shared to private conversion.
>
> No, I was wrong, pKVM has use cases where it's desirable to preserve data on
> private => shared conversions.
>
> Side topic, if you're going to use fancy indentation, align the indentation so
> it's actually readable.
>
> > -> For SNP/TDX VMs:
> > * Only usecase for preserving contents is initial memory
> > population, which can be achieved by:
> > - Userspace converting the ranges to shared, populating the contents,
> > converting them back to private and then calling SNP/TDX specific
> > existing ABI functions.
> > * For runtime conversions, guest_memfd can't ensure memory contents are
> > preserved during shared to private conversions as the architectures
> > don't support that behavior.
> > * So IMO, this "preserve" flag doesn't make sense for SNP/TDX VMs, even
>
> It makes sense, it's just not supported by the architecture *at runtime*. Case
> in point, *something* needs to allow preserving data prior to launching the VM.
> If we want to go with the PRIVATE => SHARED => FILL => PRIVATE approach for TDX
> and SNP, then we'll probably want to allow PRESERVE only until the VM image is
> finalized.
Maybe we can simplify the story a bit here for today, how about:

1) For shared to private conversions:
   * Is it safe to say that the conversion itself is always content
     preserving, and it's up to the architecture what it does with memory
     contents on the private faults?
     - During initial memory setup, userspace can control how private
       memory would be faulted in by architecture supported ABI
       operations.
     - After initial memory setup, userspace can't control how private
       memory would be faulted in.
2) For private to shared conversions:
   * Architecture decides what should be done with the memory on shared
     faults.
     - guest_memfd can query architecture whether to zero memory or not.
       -> guest_memfd will only take on the responsibility of zeroing if
          needed by the architecture on shared faults.
       -> Architecture is responsible for the behavior on private faults.

In future, if there is a usecase for controlling runtime behavior of
private faults, architecture can expose additional ABI that userspace
can use after initiating guest_memfd conversion.
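
An illustrative sketch of point 2 above, where guest_memfd asks the
architecture whether shared faults need zeroing; the ops struct and
function names are made up for illustration and do not match the series'
code:

	#include <stdbool.h>
	#include <stddef.h>
	#include <string.h>

	struct gmem_arch_ops {
		/*
		 * e.g. a TDX/SNP backend might return true so stale private
		 * contents are never exposed, while a pKVM backend might return
		 * false and rely on zeroing at allocation time instead.
		 */
		bool (*zero_on_shared_fault)(void);
	};

	void gmem_prepare_shared_page(const struct gmem_arch_ops *ops,
				      void *page, size_t size)
	{
		if (ops->zero_on_shared_fault && ops->zero_on_shared_fault())
			memset(page, 0, size);

		/* Behavior on private faults stays entirely with the architecture. */
	}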
On Tue, 2025-07-08 at 11:03 -0700, Sean Christopherson wrote:
> > I think there is interest in de-coupling it?
>
> No?

I'm talking about the intra-host migration/reboot optimization stuff. And not
doing a good job, sorry.

> Even if we get to a point where multiple distinct VMs can bind to a single
> guest_memfd, e.g. for inter-VM shared memory, there will still need to be a
> sole
> owner of the memory. AFAICT, fully decoupling guest_memfd from a VM would add
> non-trivial complexity for zero practical benefit.

I'm talking about moving a gmem fd between different VMs or something using
KVM_LINK_GUEST_MEMFD [0]. Not advocating to try to support it. But trying to
feel out where the concepts are headed. It kind of allows gmem fds (or just
their source memory?) to live beyond a VM lifecycle.

[0] https://lore.kernel.org/all/cover.1747368092.git.afranji@google.com/
    https://lore.kernel.org/kvm/cover.1749672978.git.afranji@google.com/
On Tue, Jul 08, 2025, Rick P Edgecombe wrote:
> On Tue, 2025-07-08 at 11:03 -0700, Sean Christopherson wrote:
> > > I think there is interest in de-coupling it?
> >
> > No?
>
> I'm talking about the intra-host migration/reboot optimization stuff. And not
> doing a good job, sorry.
>
> > Even if we get to a point where multiple distinct VMs can bind to a single
> > guest_memfd, e.g. for inter-VM shared memory, there will still need to be a
> > sole
> > owner of the memory. AFAICT, fully decoupling guest_memfd from a VM would add
> > non-trivial complexity for zero practical benefit.
>
> I'm talking about moving a gmem fd between different VMs or something using
> KVM_LINK_GUEST_MEMFD [0]. Not advocating to try to support it. But trying to
> feel out where the concepts are headed. It kind of allows gmem fds (or just
> their source memory?) to live beyond a VM lifecycle.

I think the answer is that we want to let guest_memfd live beyond the "struct kvm"
instance, but not beyond the Virtual Machine. From a past discussion on this topic[*].

 : No go. Because again, the inode (physical memory) is coupled to the virtual machine
 : as a thing, not to a "struct kvm". Or more concretely, the inode is coupled to an
 : ASID or an HKID, and there can be multiple "struct kvm" objects associated with a
 : single ASID. And at some point in the future, I suspect we'll have multiple KVM
 : objects per HKID too.
 :
 : The current SEV use case is for the migration helper, where two KVM objects share
 : a single ASID (the "real" VM and the helper). I suspect TDX will end up with
 : similar behavior where helper "VMs" can use the HKID of the "real" VM. For KVM,
 : that means multiple struct kvm objects being associated with a single HKID.
 :
 : To prevent use-after-free, KVM "just" needs to ensure the helper instances can't
 : outlive the real instance, i.e. can't use the HKID/ASID after the owning virtual
 : machine has been destroyed.
 :
 : To put it differently, "struct kvm" is a KVM software construct that _usually_,
 : but not always, is associated 1:1 with a virtual machine.
 :
 : And FWIW, stashing the pointer without holding a reference would not be a complete
 : solution, because it couldn't guard against KVM reusing a pointer. E.g. if a
 : struct kvm was unbound and then freed, KVM could reuse the same memory for a new
 : struct kvm, with a different ASID/HKID, and get a false negative on the rebinding
 : check.

Exactly what that will look like in code is TBD, but the concept/logic holds up.

[*] https://lore.kernel.org/all/ZOO782YGRY0YMuPu@google.com

> [0] https://lore.kernel.org/all/cover.1747368092.git.afranji@google.com/
>     https://lore.kernel.org/kvm/cover.1749672978.git.afranji@google.com/
On Tue, Jul 8, 2025 at 11:55 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jul 08, 2025, Rick P Edgecombe wrote:
> > On Tue, 2025-07-08 at 11:03 -0700, Sean Christopherson wrote:
> > > > I think there is interest in de-coupling it?
> > >
> > > No?
> >
> > I'm talking about the intra-host migration/reboot optimization stuff. And not
> > doing a good job, sorry.
> >
> > > Even if we get to a point where multiple distinct VMs can bind to a single
> > > guest_memfd, e.g. for inter-VM shared memory, there will still need to be a
> > > sole
> > > owner of the memory. AFAICT, fully decoupling guest_memfd from a VM would add
> > > non-trivial complexity for zero practical benefit.
> >
> > I'm talking about moving a gmem fd between different VMs or something using
> > KVM_LINK_GUEST_MEMFD [0]. Not advocating to try to support it. But trying to
> > feel out where the concepts are headed. It kind of allows gmem fds (or just
> > their source memory?) to live beyond a VM lifecycle.
>
> I think the answer is that we want to let guest_memfd live beyond the "struct kvm"
> instance, but not beyond the Virtual Machine. From a past discussion on this topic[*].
>
> : No go. Because again, the inode (physical memory) is coupled to the virtual machine
> : as a thing, not to a "struct kvm". Or more concretely, the inode is coupled to an
> : ASID or an HKID, and there can be multiple "struct kvm" objects associated with a
> : single ASID. And at some point in the future, I suspect we'll have multiple KVM
> : objects per HKID too.
> :
> : The current SEV use case is for the migration helper, where two KVM objects share
> : a single ASID (the "real" VM and the helper). I suspect TDX will end up with
> : similar behavior where helper "VMs" can use the HKID of the "real" VM. For KVM,
> : that means multiple struct kvm objects being associated with a single HKID.
> :
> : To prevent use-after-free, KVM "just" needs to ensure the helper instances can't
> : outlive the real instance, i.e. can't use the HKID/ASID after the owning virtual
> : machine has been destroyed.
> :
> : To put it differently, "struct kvm" is a KVM software construct that _usually_,
> : but not always, is associated 1:1 with a virtual machine.
> :
> : And FWIW, stashing the pointer without holding a reference would not be a complete
> : solution, because it couldn't guard against KVM reusing a pointer. E.g. if a
> : struct kvm was unbound and then freed, KVM could reuse the same memory for a new
> : struct kvm, with a different ASID/HKID, and get a false negative on the rebinding
> : check.
>
> Exactly what that will look like in code is TBD, but the concept/logic holds up.
I think we can simplify the role of guest_memfd in line with discussion [1]:
1) guest_memfd is a memory provider for userspace, KVM, IOMMU.
- It allows fallocate to populate/deallocate memory
2) guest_memfd supports the notion of private/shared faults.
3) guest_memfd supports memory access control:
- It allows shared faults from userspace, KVM, IOMMU
- It allows private faults from KVM, IOMMU
4) guest_memfd supports changing access control on its ranges between
shared/private.
- It notifies the users to invalidate their mappings for the
ranges getting converted/truncated.
Responsibilities that ideally should not be taken up by guest_memfd:
1) guest_memfd can not initiate pre-faulting on behalf of it's users.
2) guest_memfd should not be directly communicating with the
underlying architecture layers.
- All communication should go via KVM/IOMMU.
3) KVM should ideally associate the lifetime of backing
pagetables/protection tables/RMP tables with the lifetime of the
binding of memslots with guest_memfd.
- Today KVM SNP logic ties RMP table entry lifetimes with how
long the folios are mapped in guest_memfd, which I think should be
revisited.
Some very early thoughts on how guest_memfd could be laid out for the long term:
1) guest_memfd code ideally should be built-in to the kernel.
2) guest_memfd instances should still be created using KVM IOCTLs that
carry specific capabilities/restrictions for its users based on the
backing VM/arch.
3) Any outgoing communication from guest_memfd to it's users like
userspace/KVM/IOMMU should be via notifiers to invalidate similar to
how MMU notifiers work.
4) KVM and IOMMU can implement intermediate layers to handle
interaction with guest_memfd.
- e.g. there could be a layer within kvm that handles:
- creating guest_memfd files and associating a
kvm_gmem_context with those files.
- memslot binding
- kvm_gmem_context will be used to bind kvm
memslots with the context ranges.
- invalidate notifier handling
- kvm_gmem_context will be used to intercept
guest_memfd callbacks and
translate them to the right GPA ranges.
- linking
- kvm_gmem_context can be linked to different
KVM instances.
This line of thinking can allow cleaner separation between
guest_memfd/KVM/IOMMU [2].
[1] https://lore.kernel.org/lkml/CAGtprH-+gPN8J_RaEit=M_ErHWTmFHeCipC6viT6PHhG3ELg6A@mail.gmail.com/#t
[2] https://lore.kernel.org/lkml/31beeed3-b1be-439b-8a5b-db8c06dadc30@amd.com/
>
> [*] https://lore.kernel.org/all/ZOO782YGRY0YMuPu@google.com
>
> > [0] https://lore.kernel.org/all/cover.1747368092.git.afranji@google.com/
> > https://lore.kernel.org/kvm/cover.1749672978.git.afranji@google.com/
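
To illustrate the notifier idea in point 3 of the layout sketch above,
here is a rough interface sketch modeled loosely on MMU notifiers; all
names are invented for illustration and do not match any in-flight series:

	#include <stddef.h>

	struct gmem_invalidate_range {
		unsigned long start_index;	/* file index, not GPA */
		unsigned long nr_pages;
	};

	struct gmem_notifier;

	struct gmem_notifier_ops {
		/*
		 * Called by guest_memfd before ranges are converted/truncated;
		 * KVM/IOMMU translate the file range to GFNs/IOVAs and unmap.
		 */
		void (*invalidate)(struct gmem_notifier *n,
				   const struct gmem_invalidate_range *range);
	};

	struct gmem_notifier {
		const struct gmem_notifier_ops *ops;
		struct gmem_notifier *next;	/* one list per guest_memfd inode */
	};

	void gmem_notify_invalidate(struct gmem_notifier *head,
				    const struct gmem_invalidate_range *range)
	{
		for (; head; head = head->next)
			head->ops->invalidate(head, range);
	}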
On Wed, 2025-07-09 at 07:28 -0700, Vishal Annapurve wrote:
> I think we can simplify the role of guest_memfd in line with discussion [1]:
> 1) guest_memfd is a memory provider for userspace, KVM, IOMMU.
> - It allows fallocate to populate/deallocate memory
> 2) guest_memfd supports the notion of private/shared faults.
> 3) guest_memfd supports memory access control:
> - It allows shared faults from userspace, KVM, IOMMU
> - It allows private faults from KVM, IOMMU
> 4) guest_memfd supports changing access control on its ranges between
> shared/private.
> - It notifies the users to invalidate their mappings for the
> ranges getting converted/truncated.

KVM needs to know if a GFN is private/shared. I think it is also intended to now
be a repository for this information, right? Besides invalidations, it needs to
be queryable.

>
> Responsibilities that ideally should not be taken up by guest_memfd:
> 1) guest_memfd can not initiate pre-faulting on behalf of it's users.
> 2) guest_memfd should not be directly communicating with the
> underlying architecture layers.
> - All communication should go via KVM/IOMMU.

Maybe stronger, there should be generic gmem behaviors. Not any special
if (vm_type == tdx) type logic.

> 3) KVM should ideally associate the lifetime of backing
> pagetables/protection tables/RMP tables with the lifetime of the
> binding of memslots with guest_memfd.
> - Today KVM SNP logic ties RMP table entry lifetimes with how
> long the folios are mapped in guest_memfd, which I think should be
> revisited.

I don't understand the problem. KVM needs to respond to user accessible
invalidations, but how long it keeps other resources around could be useful for
various optimizations. Like deferring work to a work queue or something.

I think it would help to just target the Ackerley series goals. We should get
that code into shape and this kind of stuff will fall out of it.

>
> Some very early thoughts on how guest_memfd could be laid out for the long term:
> 1) guest_memfd code ideally should be built-in to the kernel.
> 2) guest_memfd instances should still be created using KVM IOCTLs that
> carry specific capabilities/restrictions for its users based on the
> backing VM/arch.
> 3) Any outgoing communication from guest_memfd to it's users like
> userspace/KVM/IOMMU should be via notifiers to invalidate similar to
> how MMU notifiers work.
> 4) KVM and IOMMU can implement intermediate layers to handle
> interaction with guest_memfd.
> - e.g. there could be a layer within kvm that handles:
> - creating guest_memfd files and associating a
> kvm_gmem_context with those files.
> - memslot binding
> - kvm_gmem_context will be used to bind kvm
> memslots with the context ranges.
> - invalidate notifier handling
> - kvm_gmem_context will be used to intercept
> guest_memfd callbacks and
> translate them to the right GPA ranges.
> - linking
> - kvm_gmem_context can be linked to different
> KVM instances.

We can probably look at the code to decide these.

>
> This line of thinking can allow cleaner separation between
> guest_memfd/KVM/IOMMU [2].
>
> [1] https://lore.kernel.org/lkml/CAGtprH-+gPN8J_RaEit=M_ErHWTmFHeCipC6viT6PHhG3ELg6A@mail.gmail.com/#t
> [2] https://lore.kernel.org/lkml/31beeed3-b1be-439b-8a5b-db8c06dadc30@amd.com/
>
>
>
> >
> > [*] https://lore.kernel.org/all/ZOO782YGRY0YMuPu@google.com
> >
> > > [0] https://lore.kernel.org/all/cover.1747368092.git.afranji@google.com/
> > > https://lore.kernel.org/kvm/cover.1749672978.git.afranji@google.com/
On Wed, Jul 9, 2025 at 8:17 AM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > > On Wed, 2025-07-09 at 07:28 -0700, Vishal Annapurve wrote: > > I think we can simplify the role of guest_memfd in line with discussion [1]: > > 1) guest_memfd is a memory provider for userspace, KVM, IOMMU. > > - It allows fallocate to populate/deallocate memory > > 2) guest_memfd supports the notion of private/shared faults. > > 3) guest_memfd supports memory access control: > > - It allows shared faults from userspace, KVM, IOMMU > > - It allows private faults from KVM, IOMMU > > 4) guest_memfd supports changing access control on its ranges between > > shared/private. > > - It notifies the users to invalidate their mappings for the > > ranges getting converted/truncated. > > KVM needs to know if a GFN is private/shared. I think it is also intended to now > be a repository for this information, right? Besides invalidations, it needs to > be queryable. Yeah, that interface can be added as well. Though, if possible KVM can just directly pass the fault type to guest_memfd and it can return an error if the fault type doesn't match the permission. Additionally KVM does query the mapping order for a certain pfn/gfn which will need to be supported as well. > > > > > Responsibilities that ideally should not be taken up by guest_memfd: > > 1) guest_memfd can not initiate pre-faulting on behalf of it's users. > > 2) guest_memfd should not be directly communicating with the > > underlying architecture layers. > > - All communication should go via KVM/IOMMU. > > Maybe stronger, there should be generic gmem behaviors. Not any special > if (vm_type == tdx) type logic. > > > 3) KVM should ideally associate the lifetime of backing > > pagetables/protection tables/RMP tables with the lifetime of the > > binding of memslots with guest_memfd. > > - Today KVM SNP logic ties RMP table entry lifetimes with how > > long the folios are mapped in guest_memfd, which I think should be > > revisited. > > I don't understand the problem. KVM needs to respond to user accessible > invalidations, but how long it keeps other resources around could be useful for > various optimizations. Like deferring work to a work queue or something. I don't think it could be deferred to a work queue as the RMP table entries will need to be removed synchronously once the last reference on the guest_memfd drops, unless memory itself is kept around after filemap eviction. I can see benefits of this approach for handling scenarios like intrahost-migration. > > I think it would help to just target the ackerly series goals. We should get > that code into shape and this kind of stuff will fall out of it. > > > > > Some very early thoughts on how guest_memfd could be laid out for the long term: > > 1) guest_memfd code ideally should be built-in to the kernel. > > 2) guest_memfd instances should still be created using KVM IOCTLs that > > carry specific capabilities/restrictions for its users based on the > > backing VM/arch. > > 3) Any outgoing communication from guest_memfd to it's users like > > userspace/KVM/IOMMU should be via notifiers to invalidate similar to > > how MMU notifiers work. > > 4) KVM and IOMMU can implement intermediate layers to handle > > interaction with guest_memfd. > > - e.g. there could be a layer within kvm that handles: > > - creating guest_memfd files and associating a > > kvm_gmem_context with those files. > > - memslot binding > > - kvm_gmem_context will be used to bind kvm > > memslots with the context ranges. 
> > - invalidate notifier handling > > - kvm_gmem_context will be used to intercept > > guest_memfd callbacks and > > translate them to the right GPA ranges. > > - linking > > - kvm_gmem_context can be linked to different > > KVM instances. > > We can probably look at the code to decide these. > Agree.
On Wed, Jul 09, 2025, Vishal Annapurve wrote:
> I think we can simplify the role of guest_memfd in line with discussion [1]:

I genuinely don't understand what you're trying to "simplify". We need to define
an ABI that is flexible and robust, but beyond that most of these guidelines boil
down to "don't write bad code".

> 1) guest_memfd is a memory provider for userspace, KVM, IOMMU.

No, guest_memfd is a memory provider for KVM guests. That memory *might* be
mapped by userspace and/or into IOMMU page tables in order out of functional
necessity, but guest_memfd exists solely to serve memory to KVM guests, full stop.

> 3) KVM should ideally associate the lifetime of backing
> pagetables/protection tables/RMP tables with the lifetime of the
> binding of memslots with guest_memfd.

Again, please align your indentation.

> - Today KVM SNP logic ties RMP table entry lifetimes with how
> long the folios are mapped in guest_memfd, which I think should be
> revisited.

Why? Memslots are ephemeral per-"struct kvm" mappings. RMP entries and guest_memfd
inodes are tied to the Virtual Machine, not to the "struct kvm" instance.

> Some very early thoughts on how guest_memfd could be laid out for the long term:
> 1) guest_memfd code ideally should be built-in to the kernel.

Why? How is this at all relevant? If we need to bake some parts of guest_memfd
into the kernel in order to avoid nasty exports and/or ordering dependencies, then
we can do so. But that is 100% an implementation detail and in no way a design
goal.
On Wed, Jul 9, 2025 at 8:00 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Jul 09, 2025, Vishal Annapurve wrote:
> > I think we can simplify the role of guest_memfd in line with discussion [1]:
>
> I genuinely don't understand what you're trying to "simplify". We need to define
> an ABI that is flexible and robust, but beyond that most of these guidelines boil
> down to "don't write bad code".

My goal for bringing this discussion up is to see if we can better
define the role of guest_memfd and how it interacts with other layers,
as I see some scenarios that can be improved like kvm_gmem_populate[1]
where guest_memfd is trying to fault in pages on behalf of KVM.

[1] https://lore.kernel.org/lkml/20250703062641.3247-1-yan.y.zhao@intel.com/

>
> > 1) guest_memfd is a memory provider for userspace, KVM, IOMMU.
>
> No, guest_memfd is a memory provider for KVM guests. That memory *might* be
> mapped by userspace and/or into IOMMU page tables in order out of functional
> necessity, but guest_memfd exists solely to serve memory to KVM guests, full stop.

I look at this as guest_memfd should serve memory to KVM guests and to
other users by following some KVM/Arch related guidelines e.g. for CC
VMs, guest_memfd can handle certain behavior differently.

>
> > 3) KVM should ideally associate the lifetime of backing
> > pagetables/protection tables/RMP tables with the lifetime of the
> > binding of memslots with guest_memfd.
>
> Again, please align your indentation.
>
> > - Today KVM SNP logic ties RMP table entry lifetimes with how
> > long the folios are mapped in guest_memfd, which I think should be
> > revisited.
>
> Why? Memslots are ephemeral per-"struct kvm" mappings. RMP entries and guest_memfd
> inodes are tied to the Virtual Machine, not to the "struct kvm" instance.

IIUC guest_memfd can only be accessed through the window of memslots
and if there are no memslots I don't see the reason for memory still
being associated with "virtual machine". Likely because I am yet to
completely wrap my head around 'guest_memfd inodes are tied to the
Virtual Machine, not to the "struct kvm" instance', I need to spend
more time on this one.

>
> > Some very early thoughts on how guest_memfd could be laid out for the long term:
> > 1) guest_memfd code ideally should be built-in to the kernel.
>
> Why? How is this at all relevant? If we need to bake some parts of guest_memfd
> into the kernel in order to avoid nasty exports and/or ordering dependencies, then
> we can do so. But that is 100% an implementation detail and in no way a design
> goal.

I agree, this is implementation detail and we need real code to
discuss this better.
On Wed, Jul 9, 2025 at 6:30 PM Vishal Annapurve <vannapurve@google.com> wrote:
> > > 3) KVM should ideally associate the lifetime of backing
> > > pagetables/protection tables/RMP tables with the lifetime of the
> > > binding of memslots with guest_memfd.
> >
> > Again, please align your indentation.
> >
> > > - Today KVM SNP logic ties RMP table entry lifetimes with how
> > > long the folios are mapped in guest_memfd, which I think should be
> > > revisited.
> >
> > Why? Memslots are ephemeral per-"struct kvm" mappings. RMP entries and guest_memfd
> > inodes are tied to the Virtual Machine, not to the "struct kvm" instance.
>
> IIUC guest_memfd can only be accessed through the window of memslots
> and if there are no memslots I don't see the reason for memory still
> being associated with "virtual machine". Likely because I am yet to
> completely wrap my head around 'guest_memfd inodes are tied to the
> Virtual Machine, not to the "struct kvm" instance', I need to spend
> more time on this one.
>
I see the benefits of tying inodes to the virtual machine and
different guest_memfd files to different KVM instances. This allows us
to exercise intra-host migration usecases for TDX/SNP. But I think
this model doesn't allow us to reuse guest_memfd files for SNP VMs
during reboot.
Reboot scenario assuming reuse of existing guest_memfd inode for the
next instance:
1) Create a VM
2) Create guest_memfd files that pin KVM instance
3) Create memslots
4) Start the VM
5) For reboot/shutdown, Execute VM specific Termination (e.g.
KVM_TDX_TERMINATE_VM)
6) if allowed, delete the memslots
7) Create a new VM instance
8) Link the existing guest_memfd files to the new VM -> which creates
new files for the same inode.
9) Close the existing guest_memfd files and the existing VM
10) Jump to step 3
The difference between SNP and TDX is that TDX memory ownership is
limited to the duration the pages are mapped in the second stage
secure EPT tables, whereas SNP/RMP memory ownership lasts beyond
memslots and effectively remains till folios are punched out from
guest_memfd filemap. IIUC CCA might follow the suite of SNP in this
regard with the pfns populated in GPT entries.
I don't have a sense of how critical this problem could be, but this
would mean for every reboot all large memory allocations will have to
let go and need to be reallocated. For 1G support, we will be freeing
guest_memfd pages using a background thread which may add some delays
in being able to free up the memory in time.
Instead if we did this:
1) Support creating guest_memfd files for a certain VM type that
allows KVM to dictate the behavior of the guest_memfd.
2) Tie lifetime of KVM SNP/TDX memory ownership with guest_memfd and
memslot bindings
- Each binding will increase a refcount on both guest_memfd file
and KVM, so both can't go away while the binding exists.
3) For SNP/CCA, pfns are invalidated from RMP/GPT tables during unbind
operations while for TDX, KVM will invalidate secure EPT entries.
This can allow us to decouple memory lifecycle from VM lifecycle and
match the behavior with non-confidential VMs where memory can outlast
VMs. Though this approach will mean change in intrahost migration
implementation as we don't need to differentiate guest_memfd files and
inodes.
That being said, I might be missing something here and I don't have
any data to back the criticality of this usecase for SNP and possibly
CCA VMs.
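
A very rough sketch of the bind/unbind lifetime proposed in (2) and (3)
above; all names, the refcounting, and the invalidation hook are
placeholders, and the follow-up below argues the KVM reference may not
even be needed:

	#include <stdatomic.h>

	struct gmem_file  { atomic_int refcount; };
	struct kvm        { atomic_int refcount; };

	struct gmem_binding {
		struct gmem_file *file;
		struct kvm *kvm;
		unsigned long start_index, nr_pages;
	};

	/*
	 * Binding pins both objects so neither can vanish while arch state
	 * (RMP/GPT/S-EPT) derived from this range exists.
	 */
	void gmem_bind(struct gmem_binding *b)
	{
		atomic_fetch_add(&b->file->refcount, 1);
		atomic_fetch_add(&b->kvm->refcount, 1);
	}

	/*
	 * Unbind is where SNP/CCA would invalidate RMP/GPT entries and TDX
	 * would zap secure EPT, before the references are dropped.
	 */
	void gmem_unbind(struct gmem_binding *b)
	{
		/* arch_invalidate_range(b); -- placeholder for the real hook */
		atomic_fetch_sub(&b->file->refcount, 1);
		atomic_fetch_sub(&b->kvm->refcount, 1);
	}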
On Fri, Jul 11, 2025 at 2:18 PM Vishal Annapurve <vannapurve@google.com> wrote: > > On Wed, Jul 9, 2025 at 6:30 PM Vishal Annapurve <vannapurve@google.com> wrote: > > > > 3) KVM should ideally associate the lifetime of backing > > > > pagetables/protection tables/RMP tables with the lifetime of the > > > > binding of memslots with guest_memfd. > > > > > > Again, please align your indentation. > > > > > > > - Today KVM SNP logic ties RMP table entry lifetimes with how > > > > long the folios are mapped in guest_memfd, which I think should be > > > > revisited. > > > > > > Why? Memslots are ephemeral per-"struct kvm" mappings. RMP entries and guest_memfd > > > inodes are tied to the Virtual Machine, not to the "struct kvm" instance. > > > > IIUC guest_memfd can only be accessed through the window of memslots > > and if there are no memslots I don't see the reason for memory still > > being associated with "virtual machine". Likely because I am yet to > > completely wrap my head around 'guest_memfd inodes are tied to the > > Virtual Machine, not to the "struct kvm" instance', I need to spend > > more time on this one. > > > > I see the benefits of tying inodes to the virtual machine and > different guest_memfd files to different KVM instances. This allows us > to exercise intra-host migration usecases for TDX/SNP. But I think > this model doesn't allow us to reuse guest_memfd files for SNP VMs > during reboot. > > Reboot scenario assuming reuse of existing guest_memfd inode for the > next instance: > 1) Create a VM > 2) Create guest_memfd files that pin KVM instance > 3) Create memslots > 4) Start the VM > 5) For reboot/shutdown, Execute VM specific Termination (e.g. > KVM_TDX_TERMINATE_VM) > 6) if allowed, delete the memslots > 7) Create a new VM instance > 8) Link the existing guest_memfd files to the new VM -> which creates > new files for the same inode. > 9) Close the existing guest_memfd files and the existing VM > 10) Jump to step 3 > > The difference between SNP and TDX is that TDX memory ownership is > limited to the duration the pages are mapped in the second stage > secure EPT tables, whereas SNP/RMP memory ownership lasts beyond > memslots and effectively remains till folios are punched out from > guest_memfd filemap. IIUC CCA might follow the suite of SNP in this > regard with the pfns populated in GPT entries. > > I don't have a sense of how critical this problem could be, but this > would mean for every reboot all large memory allocations will have to > let go and need to be reallocated. For 1G support, we will be freeing > guest_memfd pages using a background thread which may add some delays > in being able to free up the memory in time. > > Instead if we did this: > 1) Support creating guest_memfd files for a certain VM type that > allows KVM to dictate the behavior of the guest_memfd. > 2) Tie lifetime of KVM SNP/TDX memory ownership with guest_memfd and > memslot bindings > - Each binding will increase a refcount on both guest_memfd file > and KVM, so both can't go away while the binding exists. I think if we can ensure that any guest_memfd initiated interaction with KVM is only for invalidation and is based on binding and under filemap_invalidate_lock then there is no need to pin KVM on each binding, as binding/unbinding should be protected using filemap_invalidate_lock and so KVM can't go away during invalidation. > 3) For SNP/CCA, pfns are invalidated from RMP/GPT tables during unbind > operations while for TDX, KVM will invalidate secure EPT entries. 
> > This can allow us to decouple memory lifecycle from VM lifecycle and > match the behavior with non-confidential VMs where memory can outlast > VMs. Though this approach will mean change in intrahost migration > implementation as we don't need to differentiate guest_memfd files and > inodes. > > That being said, I might be missing something here and I don't have > any data to back the criticality of this usecase for SNP and possibly > CCA VMs.
On Wed, Jul 09, 2025, Vishal Annapurve wrote:
> On Wed, Jul 9, 2025 at 8:00 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Jul 09, 2025, Vishal Annapurve wrote:
> > > I think we can simplify the role of guest_memfd in line with discussion [1]:
> >
> > I genuinely don't understand what you're trying to "simplify". We need to define
> > an ABI that is flexible and robust, but beyond that most of these guidelines boil
> > down to "don't write bad code".
>
> My goal for bringing this discussion up is to see if we can better
> define the role of guest_memfd and how it interacts with other layers,
> as I see some scenarios that can be improved like kvm_gmem_populate[1]
> where guest_memfd is trying to fault in pages on behalf of KVM.

Ah, gotcha. From my perspective, it's all just KVM, which is why I'm not feeling
the same sense of urgency to formally define anything. We want to encapsulate
code, have separation of concerns, etc., but I don't see that as being anything
unique or special to guest_memfd. We try to achieve the same for all major areas
of KVM, though obviously with mixed results :-)
On Tue, 2025-07-08 at 11:55 -0700, Sean Christopherson wrote:
> I think the answer is that we want to let guest_memfd live beyond the "struct kvm"
> instance, but not beyond the Virtual Machine. From a past discussion on this topic[*].
>
> [snip]
> Exactly what that will look like in code is TBD, but the concept/logic holds up.
>
> [*] https://lore.kernel.org/all/ZOO782YGRY0YMuPu@google.com

Thanks for digging this up. Makes sense. One gmemfd per VM, but struct kvm != a
VM.
On 6/19/2025 4:59 PM, Xiaoyao Li wrote: > On 6/19/2025 4:13 PM, Yan Zhao wrote: >> On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote: >>> Hello, >>> >>> This patchset builds upon discussion at LPC 2024 and many guest_memfd >>> upstream calls to provide 1G page support for guest_memfd by taking >>> pages from HugeTLB. >>> >>> This patchset is based on Linux v6.15-rc6, and requires the mmap support >>> for guest_memfd patchset (Thanks Fuad!) [1]. >>> >>> For ease of testing, this series is also available, stitched together, >>> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page- >>> support-rfc-v2 >> Just to record a found issue -- not one that must be fixed. >> >> In TDX, the initial memory region is added as private memory during >> TD's build >> time, with its initial content copied from source pages in shared memory. >> The copy operation requires simultaneous access to both shared source >> memory >> and private target memory. >> >> Therefore, userspace cannot store the initial content in shared memory >> at the >> mmap-ed VA of a guest_memfd that performs in-place conversion between >> shared and >> private memory. This is because the guest_memfd will first unmap a PFN >> in shared >> page tables and then check for any extra refcount held for the shared >> PFN before >> converting it to private. > > I have an idea. > > If I understand correctly, the KVM_GMEM_CONVERT_PRIVATE of in-place > conversion unmap the PFN in shared page tables while keeping the content > of the page unchanged, right? > > So KVM_GMEM_CONVERT_PRIVATE can be used to initialize the private memory > actually for non-CoCo case actually, that userspace first mmap() it and > ensure it's shared and writes the initial content to it, after it > userspace convert it to private with KVM_GMEM_CONVERT_PRIVATE. > > For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it > wants the private memory to be initialized with initial content, and > just do in-place TDH.PAGE.ADD in the hook. And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to explicitly request that the page range is converted to private and the content needs to be retained. So that TDX can identify which case needs to call in-place TDH.PAGE.ADD.
On Thu, Jun 19, 2025 at 05:18:44PM +0800, Xiaoyao Li wrote: > On 6/19/2025 4:59 PM, Xiaoyao Li wrote: > > On 6/19/2025 4:13 PM, Yan Zhao wrote: > > > On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote: > > > > Hello, > > > > > > > > This patchset builds upon discussion at LPC 2024 and many guest_memfd > > > > upstream calls to provide 1G page support for guest_memfd by taking > > > > pages from HugeTLB. > > > > > > > > This patchset is based on Linux v6.15-rc6, and requires the mmap support > > > > for guest_memfd patchset (Thanks Fuad!) [1]. > > > > > > > > For ease of testing, this series is also available, stitched together, > > > > at > > > > https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page- > > > > support-rfc-v2 > > > Just to record a found issue -- not one that must be fixed. > > > > > > In TDX, the initial memory region is added as private memory during > > > TD's build > > > time, with its initial content copied from source pages in shared memory. > > > The copy operation requires simultaneous access to both shared > > > source memory > > > and private target memory. > > > > > > Therefore, userspace cannot store the initial content in shared > > > memory at the > > > mmap-ed VA of a guest_memfd that performs in-place conversion > > > between shared and > > > private memory. This is because the guest_memfd will first unmap a > > > PFN in shared > > > page tables and then check for any extra refcount held for the > > > shared PFN before > > > converting it to private. > > > > I have an idea. > > > > If I understand correctly, the KVM_GMEM_CONVERT_PRIVATE of in-place > > conversion unmap the PFN in shared page tables while keeping the content > > of the page unchanged, right? However, whenever there's a GUP in TDX to get the source page, there will be an extra page refcount. > > So KVM_GMEM_CONVERT_PRIVATE can be used to initialize the private memory > > actually for non-CoCo case actually, that userspace first mmap() it and > > ensure it's shared and writes the initial content to it, after it > > userspace convert it to private with KVM_GMEM_CONVERT_PRIVATE. The conversion request here will be declined therefore. > > For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it > > wants the private memory to be initialized with initial content, and > > just do in-place TDH.PAGE.ADD in the hook. > > And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to > explicitly request that the page range is converted to private and the > content needs to be retained. So that TDX can identify which case needs to > call in-place TDH.PAGE.ADD. >
On 6/19/2025 5:28 PM, Yan Zhao wrote: > On Thu, Jun 19, 2025 at 05:18:44PM +0800, Xiaoyao Li wrote: >> On 6/19/2025 4:59 PM, Xiaoyao Li wrote: >>> On 6/19/2025 4:13 PM, Yan Zhao wrote: >>>> On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote: >>>>> Hello, >>>>> >>>>> This patchset builds upon discussion at LPC 2024 and many guest_memfd >>>>> upstream calls to provide 1G page support for guest_memfd by taking >>>>> pages from HugeTLB. >>>>> >>>>> This patchset is based on Linux v6.15-rc6, and requires the mmap support >>>>> for guest_memfd patchset (Thanks Fuad!) [1]. >>>>> >>>>> For ease of testing, this series is also available, stitched together, >>>>> at >>>>> https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page- >>>>> support-rfc-v2 >>>> Just to record a found issue -- not one that must be fixed. >>>> >>>> In TDX, the initial memory region is added as private memory during >>>> TD's build >>>> time, with its initial content copied from source pages in shared memory. >>>> The copy operation requires simultaneous access to both shared >>>> source memory >>>> and private target memory. >>>> >>>> Therefore, userspace cannot store the initial content in shared >>>> memory at the >>>> mmap-ed VA of a guest_memfd that performs in-place conversion >>>> between shared and >>>> private memory. This is because the guest_memfd will first unmap a >>>> PFN in shared >>>> page tables and then check for any extra refcount held for the >>>> shared PFN before >>>> converting it to private. >>> >>> I have an idea. >>> >>> If I understand correctly, the KVM_GMEM_CONVERT_PRIVATE of in-place >>> conversion unmap the PFN in shared page tables while keeping the content >>> of the page unchanged, right? > However, whenever there's a GUP in TDX to get the source page, there will be an > extra page refcount. The GUP in TDX happens after the gmem converts the page to private. In the view of TDX, the physical page is converted to private already and it contains the initial content. But the content is not usable for TDX until TDX calls in-place PAGE.ADD >>> So KVM_GMEM_CONVERT_PRIVATE can be used to initialize the private memory >>> actually for non-CoCo case actually, that userspace first mmap() it and >>> ensure it's shared and writes the initial content to it, after it >>> userspace convert it to private with KVM_GMEM_CONVERT_PRIVATE. > The conversion request here will be declined therefore. > > >>> For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it >>> wants the private memory to be initialized with initial content, and >>> just do in-place TDH.PAGE.ADD in the hook. >> >> And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to >> explicitly request that the page range is converted to private and the >> content needs to be retained. So that TDX can identify which case needs to >> call in-place TDH.PAGE.ADD. >>
On 6/19/2025 5:45 PM, Xiaoyao Li wrote: > On 6/19/2025 5:28 PM, Yan Zhao wrote: >> On Thu, Jun 19, 2025 at 05:18:44PM +0800, Xiaoyao Li wrote: >>> On 6/19/2025 4:59 PM, Xiaoyao Li wrote: >>>> On 6/19/2025 4:13 PM, Yan Zhao wrote: >>>>> On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote: >>>>>> Hello, >>>>>> >>>>>> This patchset builds upon discussion at LPC 2024 and many guest_memfd >>>>>> upstream calls to provide 1G page support for guest_memfd by taking >>>>>> pages from HugeTLB. >>>>>> >>>>>> This patchset is based on Linux v6.15-rc6, and requires the mmap >>>>>> support >>>>>> for guest_memfd patchset (Thanks Fuad!) [1]. >>>>>> >>>>>> For ease of testing, this series is also available, stitched >>>>>> together, >>>>>> at >>>>>> https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page- >>>>>> support-rfc-v2 >>>>> Just to record a found issue -- not one that must be fixed. >>>>> >>>>> In TDX, the initial memory region is added as private memory during >>>>> TD's build >>>>> time, with its initial content copied from source pages in shared >>>>> memory. >>>>> The copy operation requires simultaneous access to both shared >>>>> source memory >>>>> and private target memory. >>>>> >>>>> Therefore, userspace cannot store the initial content in shared >>>>> memory at the >>>>> mmap-ed VA of a guest_memfd that performs in-place conversion >>>>> between shared and >>>>> private memory. This is because the guest_memfd will first unmap a >>>>> PFN in shared >>>>> page tables and then check for any extra refcount held for the >>>>> shared PFN before >>>>> converting it to private. >>>> >>>> I have an idea. >>>> >>>> If I understand correctly, the KVM_GMEM_CONVERT_PRIVATE of in-place >>>> conversion unmap the PFN in shared page tables while keeping the >>>> content >>>> of the page unchanged, right? >> However, whenever there's a GUP in TDX to get the source page, there >> will be an >> extra page refcount. > > The GUP in TDX happens after the gmem converts the page to private. May it's not GUP since the page has been unmapped from userspace? (Sorry that I'm not familiar with the terminology) > In the view of TDX, the physical page is converted to private already > and it contains the initial content. But the content is not usable for > TDX until TDX calls in-place PAGE.ADD > >>>> So KVM_GMEM_CONVERT_PRIVATE can be used to initialize the private >>>> memory >>>> actually for non-CoCo case actually, that userspace first mmap() it and >>>> ensure it's shared and writes the initial content to it, after it >>>> userspace convert it to private with KVM_GMEM_CONVERT_PRIVATE. >> The conversion request here will be declined therefore. >> >> >>>> For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it >>>> wants the private memory to be initialized with initial content, and >>>> just do in-place TDH.PAGE.ADD in the hook. >>> >>> And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to >>> explicitly request that the page range is converted to private and the >>> content needs to be retained. So that TDX can identify which case >>> needs to >>> call in-place TDH.PAGE.ADD. >>> > >
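
The userspace flow sketched in the exchange above (fill the page while it
is shared, convert it back to private, then let the architecture do the
in-place add) could look roughly like this; GMEM_CONVERT_PRIVATE and
struct gmem_convert are placeholders rather than the ioctl actually
proposed in the series, and the mmap depends on the guest_memfd mmap
support series:

	#include <stdint.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>

	struct gmem_convert {			/* placeholder layout */
		uint64_t offset;
		uint64_t size;
		uint64_t flags;
	};
	#define GMEM_CONVERT_PRIVATE	0x4701	/* placeholder request number */

	int provision_initial_payload(int gmem_fd, uint64_t offset,
				      const void *payload, size_t len)
	{
		struct gmem_convert req = { .offset = offset, .size = len, .flags = 0 };
		void *va;

		/* Range is assumed to currently be shared. */
		va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, gmem_fd, offset);
		if (va == MAP_FAILED)
			return -1;

		memcpy(va, payload, len);	/* stage contents while shared */
		munmap(va, len);		/* drop the extra mapping/refcount */

		/*
		 * After this succeeds, TDX could perform in-place TDH.PAGE.ADD
		 * from its conversion hook, as suggested above.
		 */
		return ioctl(gmem_fd, GMEM_CONVERT_PRIVATE, &req);
	}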
Ackerley Tng <ackerleytng@google.com> writes:

> <snip>
>
> Here are some remaining issues/TODOs:
>
> 1. Memory error handling such as machine check errors have not been
>    implemented.
> 2. I've not looked into preparedness of pages, only zeroing has been
>    considered.
> 3. When allocating HugeTLB pages, if two threads allocate indices
>    mapping to the same huge page, the utilization in guest_memfd inode's
>    subpool may momentarily go over the subpool limit (the requested size
>    of the inode at guest_memfd creation time), causing one of the two
>    threads to get -ENOMEM. Suggestions to solve this are appreciated!
> 4. max_usage_in_bytes statistic (cgroups v1) for guest_memfd HugeTLB
>    pages should be correct but needs testing and could be wrong.
> 5. memcg charging (charge_memcg()) for cgroups v2 for guest_memfd
>    HugeTLB pages after splitting should be correct but needs testing and
>    could be wrong.
> 6. Page cache accounting: When a hugetlb page is split, guest_memfd will
>    incur page count in both NR_HUGETLB (counted at hugetlb allocation
>    time) and NR_FILE_PAGES stats (counted when split pages are added to
>    the filemap). Is this aligned with what people expect?
>

For people who might be testing this series with non-Coco VMs (heads up,
Patrick and Nikita!), this currently splits the folio as long as some
shareability in the huge folio is shared, which is probably unnecessary?

IIUC core-mm doesn't support mapping at 1G but from a cursory reading it
seems like the faulting function calling kvm_gmem_fault_shared() could
possibly be able to map a 1G page at 4K.

Looks like we might need another flag like
GUEST_MEMFD_FLAG_SUPPORT_CONVERSION, which will gate initialization of
the shareability maple tree/xarray. If shareability is NULL for the
entire hugepage range, then no splitting will occur.

For Coco VMs, this should be safe, since if this flag is not set,
kvm_gmem_fault_shared() will always not be able to fault (the
shareability value will be NULL).

> Here are some optimizations that could be explored in future series:
>
> 1. Pages could be split from 1G to 2M first and only split to 4K if
>    necessary.
> 2. Zeroing could be skipped for Coco VMs if hardware already zeroes the
>    pages.
>
> <snip>
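
A tiny sketch of the gating idea above, with the flag value and struct
fields assumed for illustration rather than taken from the series:

	#include <stdbool.h>
	#include <stddef.h>
	#include <stdint.h>

	#define GUEST_MEMFD_FLAG_SUPPORT_CONVERSION	(1u << 1)	/* assumed value */

	struct gmem_inode_state {
		uint64_t flags;
		void *shareability;	/* maple tree/xarray in the real series */
	};

	/*
	 * Only instances that can convert need per-index shareability
	 * tracking; everything else stays "all shared" and huge folios never
	 * need to be split on a shared fault.
	 */
	bool gmem_needs_split_for_shared_fault(const struct gmem_inode_state *s)
	{
		if (!(s->flags & GUEST_MEMFD_FLAG_SUPPORT_CONVERSION))
			return false;	/* shareability tree never initialized */

		return s->shareability != NULL;	/* placeholder for a real range lookup */
	}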
Ackerley Tng wrote:
> Hello,
>
> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> upstream calls to provide 1G page support for guest_memfd by taking
> pages from HugeTLB.
>
> This patchset is based on Linux v6.15-rc6, and requires the mmap support
> for guest_memfd patchset (Thanks Fuad!) [1].

Trying to manage dependencies I find that Ryan's just released series[1]
is required to build this set.

[1] https://lore.kernel.org/all/cover.1747368092.git.afranji@google.com/

Specifically this patch:
https://lore.kernel.org/all/1f42c32fc18d973b8ec97c8be8b7cd921912d42a.1747368092.git.afranji@google.com/

defines

alloc_anon_secure_inode()

Am I wrong in that?

>
> For ease of testing, this series is also available, stitched together,
> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
>

I went digging in your git tree and then found Ryan's set. So thanks for
the git tree. :-D

However, it seems this add another dependency which should be managed in
David's email of dependencies?

Ira

> This patchset can be divided into two sections:
>
> (a) Patches from the beginning up to and including "KVM: selftests:
> Update script to map shared memory from guest_memfd" are a modified
> version of "conversion support for guest_memfd", which Fuad is
> managing [2].
>
> (b) Patches after "KVM: selftests: Update script to map shared memory
> from guest_memfd" till the end are patches that actually bring in 1G
> page support for guest_memfd.
>
> These are the significant differences between (a) and [2]:
>
> + [2] uses an xarray to track sharability, but I used a maple tree
>   because for 1G pages, iterating pagewise to update shareability was
>   prohibitively slow even for testing. I was choosing from among
>   multi-index xarrays, interval trees and maple trees [3], and picked
>   maple trees because
>   + Maple trees were easier to figure out since I didn't have to
>     compute the correct multi-index order and handle edge cases if the
>     converted range wasn't a neat power of 2.
>   + Maple trees were easier to figure out as compared to updating
>     parts of a multi-index xarray.
>   + Maple trees had an easier API to use than interval trees.
> + [2] doesn't yet have a conversion ioctl, but I needed it to test 1G
>   support end-to-end.
> + (a) Removes guest_memfd from participating in LRU, which I needed, to
>   get conversion selftests to work as expected, since participation in
>   LRU was causing some unexpected refcounts on folios which was blocking
>   conversions.
>
> I am sending (a) in emails as well, as opposed to just leaving it on
> GitHub, so that we can discuss by commenting inline on emails. If you'd
> like to just look at 1G page support, here are some key takeaways from
> the first section (a):
>
> + If GUEST_MEMFD_FLAG_SUPPORT_SHARED is requested during guest_memfd
>   creation, guest_memfd will
>   + Track shareability (whether an index in the inode is guest-only or
>     if the host is allowed to fault memory at a given index).
>   + Always be used for guest faults - specifically, kvm_gmem_get_pfn()
>     will be used to provide pages for the guest.
>   + Always be used by KVM to check private/shared status of a gfn.
> + guest_memfd now has conversion ioctls, allowing conversion to
>   private/shared
>   + Conversion can fail if there are unexpected refcounts on any
>     folios in the range.
>
> Focusing on (b) 1G page support, here's an overview:
>
> 1. A bunch of refactoring patches for HugeTLB that isolates the
>    allocation of a HugeTLB folio from other HugeTLB concepts such as
>    VMA-level reservations, and HugeTLBfs-specific concepts, such as
>    where memory policy is stored in the VMA, or where the subpool is
>    stored on the inode.
> 2. A few patches that add a guestmem_hugetlb allocator within mm/. The
>    guestmem_hugetlb allocator is a wrapper around HugeTLB to modularize
>    the memory management functions, and to cleanly handle cleanup, so
>    that folio cleanup can happen after the guest_memfd inode (and even
>    KVM) goes away.
> 3. Some updates to guest_memfd to use the guestmem_hugetlb allocator.
> 4. Selftests for 1G page support.
>
> Here are some remaining issues/TODOs:
>
> 1. Memory error handling such as machine check errors have not been
>    implemented.
> 2. I've not looked into preparedness of pages, only zeroing has been
>    considered.
> 3. When allocating HugeTLB pages, if two threads allocate indices
>    mapping to the same huge page, the utilization in guest_memfd inode's
>    subpool may momentarily go over the subpool limit (the requested size
>    of the inode at guest_memfd creation time), causing one of the two
>    threads to get -ENOMEM. Suggestions to solve this are appreciated!
> 4. max_usage_in_bytes statistic (cgroups v1) for guest_memfd HugeTLB
>    pages should be correct but needs testing and could be wrong.
> 5. memcg charging (charge_memcg()) for cgroups v2 for guest_memfd
>    HugeTLB pages after splitting should be correct but needs testing and
>    could be wrong.
> 6. Page cache accounting: When a hugetlb page is split, guest_memfd will
>    incur page count in both NR_HUGETLB (counted at hugetlb allocation
>    time) and NR_FILE_PAGES stats (counted when split pages are added to
>    the filemap). Is this aligned with what people expect?
>
> Here are some optimizations that could be explored in future series:
>
> 1. Pages could be split from 1G to 2M first and only split to 4K if
>    necessary.
> 2. Zeroing could be skipped for Coco VMs if hardware already zeroes the
>    pages.
>
> Here's RFC v1 [4] if you're interested in the motivation behind choosing
> HugeTLB, or the history of this patch series.
>
> [1] https://lore.kernel.org/all/20250513163438.3942405-11-tabba@google.com/T/
> [2] https://lore.kernel.org/all/20250328153133.3504118-1-tabba@google.com/T/
> [3] https://lore.kernel.org/all/diqzzfih8q7r.fsf@ackerleytng-ctop.c.googlers.com/
> [4] https://lore.kernel.org/all/cover.1726009989.git.ackerleytng@google.com/T/
>
Ira Weiny wrote:
> Ackerley Tng wrote:
> > Hello,
> >
> > This patchset builds upon discussion at LPC 2024 and many guest_memfd
> > upstream calls to provide 1G page support for guest_memfd by taking
> > pages from HugeTLB.
> >
> > This patchset is based on Linux v6.15-rc6, and requires the mmap support
> > for guest_memfd patchset (Thanks Fuad!) [1].
>
> Trying to manage dependencies I find that Ryan's just released series[1]
> is required to build this set.
>
> [1] https://lore.kernel.org/all/cover.1747368092.git.afranji@google.com/
>
> Specifically this patch:
> https://lore.kernel.org/all/1f42c32fc18d973b8ec97c8be8b7cd921912d42a.1747368092.git.afranji@google.com/
>
> defines
>
> alloc_anon_secure_inode()

Perhaps Ryan's set is not required? Just that patch?

It looks like Ryan's 2/13 is the same as your 1/51 patch?

https://lore.kernel.org/all/754b4898c3362050071f6dd09deb24f3c92a41c3.1747368092.git.afranji@google.com/

I'll pull 1/13 and see where I get.

Ira

>
> Am I wrong in that?
>
> >
> > For ease of testing, this series is also available, stitched together,
> > at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> >
>
> I went digging in your git tree and then found Ryan's set. So thanks for
> the git tree. :-D
>
> However, it seems this add another dependency which should be managed in
> David's email of dependencies?
>
> Ira
>
Ira Weiny <ira.weiny@intel.com> writes:
> Ira Weiny wrote:
>> Ackerley Tng wrote:
>> > Hello,
>> >
>> > This patchset builds upon discussion at LPC 2024 and many guest_memfd
>> > upstream calls to provide 1G page support for guest_memfd by taking
>> > pages from HugeTLB.
>> >
>> > This patchset is based on Linux v6.15-rc6, and requires the mmap support
>> > for guest_memfd patchset (Thanks Fuad!) [1].
>>
>> Trying to manage dependencies I find that Ryan's just released series[1]
>> is required to build this set.
>>
>> [1] https://lore.kernel.org/all/cover.1747368092.git.afranji@google.com/
>>
>> Specifically this patch:
>> https://lore.kernel.org/all/1f42c32fc18d973b8ec97c8be8b7cd921912d42a.1747368092.git.afranji@google.com/
>>
>> defines
>>
>> alloc_anon_secure_inode()
>
> Perhaps Ryan's set is not required? Just that patch?
>
> It looks like Ryan's 2/13 is the same as your 1/51 patch?
>
> https://lore.kernel.org/all/754b4898c3362050071f6dd09deb24f3c92a41c3.1747368092.git.afranji@google.com/
>
> I'll pull 1/13 and see where I get.
>
> Ira
>
>>
>> Am I wrong in that?
>>
My bad, this patch was missing from this series:
From bd629d1ec6ffb7091a5f996dc7835abed8467f3e Mon Sep 17 00:00:00 2001
Message-ID: <bd629d1ec6ffb7091a5f996dc7835abed8467f3e.1747426836.git.ackerleytng@google.com>
From: Ackerley Tng <ackerleytng@google.com>
Date: Wed, 7 May 2025 07:59:28 -0700
Subject: [RFC PATCH v2 1/1] fs: Refactor to provide function that allocates a
secure anonymous inode
alloc_anon_secure_inode() returns an inode after running checks in
security_inode_init_security_anon().
Also refactor secretmem's file creation process to use the new
function.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Change-Id: I4eb8622775bc3d544ec695f453ffd747d9490e40
---
fs/anon_inodes.c | 22 ++++++++++++++++------
include/linux/fs.h | 1 +
mm/secretmem.c | 9 +--------
3 files changed, 18 insertions(+), 14 deletions(-)
diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 583ac81669c2..4c3110378647 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -55,17 +55,20 @@ static struct file_system_type anon_inode_fs_type = {
.kill_sb = kill_anon_super,
};
-static struct inode *anon_inode_make_secure_inode(
- const char *name,
- const struct inode *context_inode)
+static struct inode *anon_inode_make_secure_inode(struct super_block *s,
+ const char *name, const struct inode *context_inode,
+ bool fs_internal)
{
struct inode *inode;
int error;
- inode = alloc_anon_inode(anon_inode_mnt->mnt_sb);
+ inode = alloc_anon_inode(s);
if (IS_ERR(inode))
return inode;
- inode->i_flags &= ~S_PRIVATE;
+
+ if (!fs_internal)
+ inode->i_flags &= ~S_PRIVATE;
+
error = security_inode_init_security_anon(inode, &QSTR(name),
context_inode);
if (error) {
@@ -75,6 +78,12 @@ static struct inode *anon_inode_make_secure_inode(
return inode;
}
+struct inode *alloc_anon_secure_inode(struct super_block *s, const char *name)
+{
+ return anon_inode_make_secure_inode(s, name, NULL, true);
+}
+EXPORT_SYMBOL_GPL(alloc_anon_secure_inode);
+
static struct file *__anon_inode_getfile(const char *name,
const struct file_operations *fops,
void *priv, int flags,
@@ -88,7 +97,8 @@ static struct file *__anon_inode_getfile(const char *name,
return ERR_PTR(-ENOENT);
if (make_inode) {
- inode = anon_inode_make_secure_inode(name, context_inode);
+ inode = anon_inode_make_secure_inode(anon_inode_mnt->mnt_sb,
+ name, context_inode, false);
if (IS_ERR(inode)) {
file = ERR_CAST(inode);
goto err;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 016b0fe1536e..0fded2e3c661 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3550,6 +3550,7 @@ extern int simple_write_begin(struct file *file, struct address_space *mapping,
extern const struct address_space_operations ram_aops;
extern int always_delete_dentry(const struct dentry *);
extern struct inode *alloc_anon_inode(struct super_block *);
+extern struct inode *alloc_anon_secure_inode(struct super_block *, const char *);
extern int simple_nosetlease(struct file *, int, struct file_lease **, void **);
extern const struct dentry_operations simple_dentry_operations;
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 1b0a214ee558..c0e459e58cb6 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -195,18 +195,11 @@ static struct file *secretmem_file_create(unsigned long flags)
struct file *file;
struct inode *inode;
const char *anon_name = "[secretmem]";
- int err;
- inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
+ inode = alloc_anon_secure_inode(secretmem_mnt->mnt_sb, anon_name);
if (IS_ERR(inode))
return ERR_CAST(inode);
- err = security_inode_init_security_anon(inode, &QSTR(anon_name), NULL);
- if (err) {
- file = ERR_PTR(err);
- goto err_free_inode;
- }
-
file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
O_RDWR, &secretmem_fops);
if (IS_ERR(file))
--
2.49.0.1101.gccaa498523-goog
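For a quick illustration of how a caller of the new helper might look (the
internal mount kvm_gmem_mnt and the surrounding function are hypothetical,
loosely modelled on patch 01/51; only alloc_anon_secure_inode() itself comes
from the patch above):

static struct vfsmount *kvm_gmem_mnt;	/* hypothetical fs-internal mount */

static struct inode *kvm_gmem_inode_create(const char *name, loff_t size)
{
	struct inode *inode;

	inode = alloc_anon_secure_inode(kvm_gmem_mnt->mnt_sb, name);
	if (IS_ERR(inode))
		return inode;

	/*
	 * With fs_internal == true the helper keeps S_PRIVATE set, so the
	 * inode stays fs-internal while still having gone through
	 * security_inode_init_security_anon().
	 */
	inode->i_size = size;
	return inode;
}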
>> >
>> > For ease of testing, this series is also available, stitched together,
>> > at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
>> >
>>
>> I went digging in your git tree and then found Ryan's set. So thanks for
>> the git tree. :-D
Glad that helped!
>>
>> However, it seems this add another dependency which should be managed in
>> David's email of dependencies?
This is a good idea. David, do you think these two patches should be
managed as a separate patch series in the email of dependencies?
+ (left out of RFCv2, but is above) "fs: Refactor to provide function that allocates a secure anonymous inode"
+ 01/51 "KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes"
They're being used by a few patch series now.
>>
>> Ira
>>