[RFC PATCH v2 00/51] 1G page support for guest_memfd

Posted by Ackerley Tng 7 months ago
Hello,

This patchset builds upon discussion at LPC 2024 and many guest_memfd
upstream calls to provide 1G page support for guest_memfd by taking
pages from HugeTLB.

This patchset is based on Linux v6.15-rc6, and requires the mmap support
for guest_memfd patchset (Thanks Fuad!) [1].

For ease of testing, this series is also available, stitched together,
at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2

This patchset can be divided into two sections:

(a) Patches from the beginning up to and including "KVM: selftests:
    Update script to map shared memory from guest_memfd" are a modified
    version of "conversion support for guest_memfd", which Fuad is
    managing [2].

(b) Patches after "KVM: selftests: Update script to map shared memory
    from guest_memfd" till the end are patches that actually bring in 1G
    page support for guest_memfd.

These are the significant differences between (a) and [2]:

+ [2] uses an xarray to track shareability, but I used a maple tree
  because, for 1G pages, iterating pagewise to update shareability was
  prohibitively slow even for testing (see the sketch after this list).
  I was choosing from among multi-index xarrays, interval trees and
  maple trees [3], and picked maple trees because
    + Unlike multi-index xarrays, maple trees don't require computing
      the correct multi-index order or handling edge cases when the
      converted range isn't a neat power of 2.
    + Updating only part of an existing multi-index xarray entry was
      also harder to reason about than a maple tree range store.
    + Maple trees have an easier API than interval trees.
+ [2] doesn't yet have a conversion ioctl, but I needed it to test 1G
  support end-to-end.
+ (a) removes guest_memfd folios from the LRU, which I needed to get
  the conversion selftests working as expected, since LRU participation
  was leaving unexpected refcounts on folios and blocking conversions.
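
For reference, here is a minimal sketch of the kind of range-based
shareability tracking a maple tree enables. The SHAREABILITY_* names
mirror those used later in this thread; the helpers, values and exact
shape are illustrative assumptions, not the code in this series:

/*
 * Illustrative sketch only: tracking per-index shareability with a
 * maple tree, so converting a 1G range is a single range store rather
 * than 262144 per-4K updates.
 */
#include <linux/maple_tree.h>
#include <linux/xarray.h>	/* xa_mk_value()/xa_to_value() */
#include <linux/gfp.h>
#include <linux/types.h>

enum shareability {
	SHAREABILITY_GUEST = 1,	/* guest-only (private) */
	SHAREABILITY_ALL   = 2,	/* host may fault this index */
};

static int gmem_set_shareability(struct maple_tree *mt, pgoff_t start,
				 pgoff_t end, enum shareability s)
{
	/* One store covers the whole [start, end] index range. */
	return mtree_store_range(mt, start, end, xa_mk_value(s), GFP_KERNEL);
}

static enum shareability gmem_get_shareability(struct maple_tree *mt,
					       pgoff_t index)
{
	void *entry = mtree_load(mt, index);

	return entry ? xa_to_value(entry) : SHAREABILITY_GUEST;
}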

I am sending (a) in emails as well, as opposed to just leaving it on
GitHub, so that we can discuss by commenting inline on emails. If you'd
like to just look at 1G page support, here are some key takeaways from
the first section (a):

+ If GUEST_MEMFD_FLAG_SUPPORT_SHARED is requested during guest_memfd
  creation, guest_memfd will
    + Track shareability (whether an index in the inode is guest-only or
      if the host is allowed to fault memory at a given index).
    + Always be used for guest faults - specifically, kvm_gmem_get_pfn()
      will be used to provide pages for the guest.
    + Always be used by KVM to check private/shared status of a gfn.
+ guest_memfd now has conversion ioctls, allowing ranges to be
  converted between private and shared (see the sketch after this
  list).
    + Conversion can fail if there are unexpected refcounts on any
      folios in the range.
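
For orientation, a rough userspace sketch of the flow above.
KVM_CREATE_GUEST_MEMFD and struct kvm_create_guest_memfd are existing
UAPI; GUEST_MEMFD_FLAG_SUPPORT_SHARED and the conversion ioctls come
from this series, and the argument struct below is a placeholder for
illustration, not the actual UAPI layout:

/*
 * Rough sketch; the conversion ioctl's argument struct is a
 * placeholder, not the UAPI defined by this series.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

struct gmem_convert_range {		/* placeholder layout */
	uint64_t offset;
	uint64_t size;
};

/* Create a guest_memfd that also tracks shareability. */
int create_shared_capable_gmem(int vm_fd, uint64_t size,
			       uint64_t support_shared_flag)
{
	struct kvm_create_guest_memfd gmem = {
		.size = size,
		.flags = support_shared_flag, /* GUEST_MEMFD_FLAG_SUPPORT_SHARED */
	};

	return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
}

/* Convert a range; fails if any folio in it has unexpected refcounts. */
int gmem_convert(int gmem_fd, unsigned long convert_ioctl,
		 uint64_t offset, uint64_t size)
{
	struct gmem_convert_range range = { .offset = offset, .size = size };

	return ioctl(gmem_fd, convert_ioctl, &range);
}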

Focusing on (b) 1G page support, here's an overview:

1. A bunch of refactoring patches for HugeTLB that isolate the
   allocation of a HugeTLB folio from other HugeTLB concepts, such as
   VMA-level reservations, and from HugeTLBfs-specific concepts, such
   as where the memory policy is stored in the VMA or where the subpool
   is stored on the inode.
2. A few patches that add a guestmem_hugetlb allocator within mm/. The
   guestmem_hugetlb allocator is a wrapper around HugeTLB that
   modularizes the memory management functions and handles cleanup, so
   that folio cleanup can happen after the guest_memfd inode (and even
   KVM) goes away (see the sketch after this list).
3. Some updates to guest_memfd to use the guestmem_hugetlb allocator.
4. Selftests for 1G page support.

Here are some remaining issues/TODOs:

1. Memory error handling, such as for machine check errors, has not
   been implemented.
2. I've not looked into page preparedness; only zeroing has been
   considered.
3. When allocating HugeTLB pages, if two threads allocate indices
   mapping to the same huge page, the guest_memfd inode's subpool
   utilization may momentarily exceed the subpool limit (the requested
   size of the inode at guest_memfd creation time), causing one of the
   two threads to get -ENOMEM. Suggestions to solve this are
   appreciated!
4. max_usage_in_bytes statistic (cgroups v1) for guest_memfd HugeTLB
   pages should be correct but needs testing and could be wrong.
5. memcg charging (charge_memcg()) for cgroups v2 for guest_memfd
   HugeTLB pages after splitting should be correct but needs testing and
   could be wrong.
6. Page cache accounting: when a HugeTLB page is split, guest_memfd
   pages will be counted in both the NR_HUGETLB stat (counted at
   HugeTLB allocation time) and the NR_FILE_PAGES stat (counted when
   the split pages are added to the filemap). Is this aligned with what
   people expect?

Here are some optimizations that could be explored in future series:

1. Pages could be split from 1G to 2M first and only split further to
   4K if necessary.
2. Zeroing could be skipped for CoCo VMs if hardware already zeroes the
   pages.

Here's RFC v1 [4] if you're interested in the motivation behind choosing
HugeTLB, or the history of this patch series.

[1] https://lore.kernel.org/all/20250513163438.3942405-11-tabba@google.com/T/
[2] https://lore.kernel.org/all/20250328153133.3504118-1-tabba@google.com/T/
[3] https://lore.kernel.org/all/diqzzfih8q7r.fsf@ackerleytng-ctop.c.googlers.com/
[4] https://lore.kernel.org/all/cover.1726009989.git.ackerleytng@google.com/T/

---

Ackerley Tng (49):
  KVM: guest_memfd: Make guest mem use guest mem inodes instead of
    anonymous inodes
  KVM: guest_memfd: Introduce and use shareability to guard faulting
  KVM: selftests: Update guest_memfd_test for INIT_PRIVATE flag
  KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  KVM: guest_memfd: Skip LRU for guest_memfd folios
  KVM: Query guest_memfd for private/shared status
  KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION
  KVM: selftests: Test flag validity after guest_memfd supports
    conversions
  KVM: selftests: Test faulting with respect to
    GUEST_MEMFD_FLAG_INIT_PRIVATE
  KVM: selftests: Refactor vm_mem_add to be more flexible
  KVM: selftests: Allow cleanup of ucall_pool from host
  KVM: selftests: Test conversion flows for guest_memfd
  KVM: selftests: Add script to exercise private_mem_conversions_test
  KVM: selftests: Update private_mem_conversions_test to mmap
    guest_memfd
  KVM: selftests: Update script to map shared memory from guest_memfd
  mm: hugetlb: Consolidate interpretation of gbl_chg within
    alloc_hugetlb_folio()
  mm: hugetlb: Cleanup interpretation of gbl_chg in
    alloc_hugetlb_folio()
  mm: hugetlb: Cleanup interpretation of map_chg_state within
    alloc_hugetlb_folio()
  mm: hugetlb: Rename alloc_surplus_hugetlb_folio
  mm: mempolicy: Refactor out policy_node_nodemask()
  mm: hugetlb: Inline huge_node() into callers
  mm: hugetlb: Refactor hugetlb allocation functions
  mm: hugetlb: Refactor out hugetlb_alloc_folio()
  mm: hugetlb: Add option to create new subpool without using surplus
  mm: truncate: Expose preparation steps for truncate_inode_pages_final
  mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages()
  mm: Introduce guestmem_hugetlb to support folio_put() handling of
    guestmem pages
  mm: guestmem_hugetlb: Wrap HugeTLB as an allocator for guest_memfd
  mm: truncate: Expose truncate_inode_folio()
  KVM: x86: Set disallow_lpage on base_gfn and guest_memfd pgoff
    misalignment
  KVM: guest_memfd: Support guestmem_hugetlb as custom allocator
  KVM: guest_memfd: Allocate and truncate from custom allocator
  mm: hugetlb: Add functions to add/delete folio from hugetlb lists
  mm: guestmem_hugetlb: Add support for splitting and merging pages
  mm: Convert split_folio() macro to function
  KVM: guest_memfd: Split allocator pages for guest_memfd use
  KVM: guest_memfd: Merge and truncate on fallocate(PUNCH_HOLE)
  KVM: guest_memfd: Update kvm_gmem_mapping_order to account for page
    status
  KVM: Add CAP to indicate support for HugeTLB as custom allocator
  KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd
  KVM: selftests: Update conversion flows test for HugeTLB
  KVM: selftests: Test truncation paths of guest_memfd
  KVM: selftests: Test allocation and conversion of subfolios
  KVM: selftests: Test that guest_memfd usage is reported via hugetlb
  KVM: selftests: Support various types of backing sources for private
    memory
  KVM: selftests: Update test for various private memory backing source
    types
  KVM: selftests: Update private_mem_conversions_test.sh to test with
    HugeTLB pages
  KVM: selftests: Add script to test HugeTLB statistics
  KVM: selftests: Test guest_memfd for accuracy of st_blocks

Elliot Berman (1):
  filemap: Pass address_space mapping to ->free_folio()

Fuad Tabba (1):
  mm: Consolidate freeing of typed folios on final folio_put()

 Documentation/filesystems/locking.rst         |    2 +-
 Documentation/filesystems/vfs.rst             |   15 +-
 Documentation/virt/kvm/api.rst                |    5 +
 arch/arm64/include/asm/kvm_host.h             |    5 -
 arch/x86/include/asm/kvm_host.h               |   10 -
 arch/x86/kvm/x86.c                            |   53 +-
 fs/hugetlbfs/inode.c                          |    2 +-
 fs/nfs/dir.c                                  |    9 +-
 fs/orangefs/inode.c                           |    3 +-
 include/linux/fs.h                            |    2 +-
 include/linux/guestmem.h                      |   23 +
 include/linux/huge_mm.h                       |    6 +-
 include/linux/hugetlb.h                       |   19 +-
 include/linux/kvm_host.h                      |   32 +-
 include/linux/mempolicy.h                     |   11 +-
 include/linux/mm.h                            |    2 +
 include/linux/page-flags.h                    |   32 +
 include/uapi/linux/guestmem.h                 |   29 +
 include/uapi/linux/kvm.h                      |   16 +
 include/uapi/linux/magic.h                    |    1 +
 mm/Kconfig                                    |   13 +
 mm/Makefile                                   |    1 +
 mm/debug.c                                    |    1 +
 mm/filemap.c                                  |   12 +-
 mm/guestmem_hugetlb.c                         |  512 +++++
 mm/guestmem_hugetlb.h                         |    9 +
 mm/hugetlb.c                                  |  488 ++---
 mm/internal.h                                 |    1 -
 mm/memcontrol.c                               |    2 +
 mm/memory.c                                   |    1 +
 mm/mempolicy.c                                |   44 +-
 mm/secretmem.c                                |    3 +-
 mm/swap.c                                     |   32 +-
 mm/truncate.c                                 |   27 +-
 mm/vmscan.c                                   |    4 +-
 tools/testing/selftests/kvm/Makefile.kvm      |    2 +
 .../kvm/guest_memfd_conversions_test.c        |  797 ++++++++
 .../kvm/guest_memfd_hugetlb_reporting_test.c  |  384 ++++
 ...uest_memfd_provide_hugetlb_cgroup_mount.sh |   36 +
 .../testing/selftests/kvm/guest_memfd_test.c  |  293 ++-
 ...memfd_wrap_test_check_hugetlb_reporting.sh |   95 +
 .../testing/selftests/kvm/include/kvm_util.h  |  104 +-
 .../testing/selftests/kvm/include/test_util.h |   20 +-
 .../selftests/kvm/include/ucall_common.h      |    1 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  465 +++--
 tools/testing/selftests/kvm/lib/test_util.c   |  102 +
 .../testing/selftests/kvm/lib/ucall_common.c  |   16 +-
 .../kvm/x86/private_mem_conversions_test.c    |  195 +-
 .../kvm/x86/private_mem_conversions_test.sh   |  100 +
 virt/kvm/Kconfig                              |    5 +
 virt/kvm/guest_memfd.c                        | 1655 ++++++++++++++++-
 virt/kvm/kvm_main.c                           |   14 +-
 virt/kvm/kvm_mm.h                             |    9 +-
 53 files changed, 5080 insertions(+), 640 deletions(-)
 create mode 100644 include/linux/guestmem.h
 create mode 100644 include/uapi/linux/guestmem.h
 create mode 100644 mm/guestmem_hugetlb.c
 create mode 100644 mm/guestmem_hugetlb.h
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_conversions_test.c
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
 create mode 100755 tools/testing/selftests/kvm/guest_memfd_provide_hugetlb_cgroup_mount.sh
 create mode 100755 tools/testing/selftests/kvm/guest_memfd_wrap_test_check_hugetlb_reporting.sh
 create mode 100755 tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh

--
2.49.0.1045.g170613ef41-goog
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Edgecombe, Rick P 7 months ago
On Wed, 2025-05-14 at 16:41 -0700, Ackerley Tng wrote:
> Hello,
> 
> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> upstream calls to provide 1G page support for guest_memfd by taking
> pages from HugeTLB.

Do you have any more concrete numbers on benefits of 1GB huge pages for
guestmemfd/coco VMs? I saw in the LPC talk it has the benefits as:
- Increase TLB hit rate and reduce page walks on TLB miss
- Improved IO performance
- Memory savings of ~1.6% from HugeTLB Vmemmap Optimization (HVO)
- Bring guest_memfd to parity with existing VMs that use HugeTLB pages for
backing memory

Do you know how often the 1GB TDP mappings get shattered by shared pages?

Thinking from the TDX perspective, we might have bigger fish to fry than 1.6%
memory savings (for example dynamic PAMT), and the rest of the benefits don't
have numbers. How much are we getting for all the complexity, over say buddy
allocated 2MB pages?
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Vishal Annapurve 7 months ago
On Thu, May 15, 2025 at 11:03 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Wed, 2025-05-14 at 16:41 -0700, Ackerley Tng wrote:
> > Hello,
> >
> > This patchset builds upon discussion at LPC 2024 and many guest_memfd
> > upstream calls to provide 1G page support for guest_memfd by taking
> > pages from HugeTLB.
>
> Do you have any more concrete numbers on benefits of 1GB huge pages for
> guestmemfd/coco VMs? I saw in the LPC talk it has the benefits as:
> - Increase TLB hit rate and reduce page walks on TLB miss
> - Improved IO performance
> - Memory savings of ~1.6% from HugeTLB Vmemmap Optimization (HVO)
> - Bring guest_memfd to parity with existing VMs that use HugeTLB pages for
> backing memory
>
> Do you know how often the 1GB TDP mappings get shattered by shared pages?
>
> Thinking from the TDX perspective, we might have bigger fish to fry than 1.6%
> memory savings (for example dynamic PAMT), and the rest of the benefits don't
> have numbers. How much are we getting for all the complexity, over say buddy
> allocated 2MB pages?

This series should work for any page sizes backed by hugetlb memory.
Non-CoCo VMs, pKVM and Confidential VMs all need hugepages that are
essential for certain workloads and will emerge as guest_memfd users.
Features like KHO/memory persistence in addition also depend on
hugepage support in guest_memfd.

This series takes strides towards making guest_memfd compatible with
usecases where 1G pages are essential and non-confidential VMs are
already exercising them.

I think the main complexity here lies in supporting in-place
conversion which applies to any huge page size even for buddy
allocated 2MB pages or THP.

This complexity arises because page structs work at a fixed
granularity, future roadmap towards not having page structs for guest
memory (at least private memory to begin with) should help towards
greatly reducing this complexity.

That being said, DPAMT and huge page EPT mappings for TDX VMs remain
essential and complement this series well for better memory footprint
and overall performance of TDX VMs.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Edgecombe, Rick P 7 months ago
On Thu, 2025-05-15 at 11:42 -0700, Vishal Annapurve wrote:
> On Thu, May 15, 2025 at 11:03 AM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> > 
> > On Wed, 2025-05-14 at 16:41 -0700, Ackerley Tng wrote:
> > > Hello,
> > > 
> > > This patchset builds upon discussion at LPC 2024 and many guest_memfd
> > > upstream calls to provide 1G page support for guest_memfd by taking
> > > pages from HugeTLB.
> > 
> > Do you have any more concrete numbers on benefits of 1GB huge pages for
> > guestmemfd/coco VMs? I saw in the LPC talk it has the benefits as:
> > - Increase TLB hit rate and reduce page walks on TLB miss
> > - Improved IO performance
> > - Memory savings of ~1.6% from HugeTLB Vmemmap Optimization (HVO)
> > - Bring guest_memfd to parity with existing VMs that use HugeTLB pages for
> > backing memory
> > 
> > Do you know how often the 1GB TDP mappings get shattered by shared pages?
> > 
> > Thinking from the TDX perspective, we might have bigger fish to fry than 1.6%
> > memory savings (for example dynamic PAMT), and the rest of the benefits don't
> > have numbers. How much are we getting for all the complexity, over say buddy
> > allocated 2MB pages?
> 
> This series should work for any page sizes backed by hugetlb memory.
> Non-CoCo VMs, pKVM and Confidential VMs all need hugepages that are
> essential for certain workloads and will emerge as guest_memfd users.
> Features like KHO/memory persistence in addition also depend on
> hugepage support in guest_memfd.
> 
> This series takes strides towards making guest_memfd compatible with
> usecases where 1G pages are essential and non-confidential VMs are
> already exercising them.
> 
> I think the main complexity here lies in supporting in-place
> conversion which applies to any huge page size even for buddy
> allocated 2MB pages or THP.
> 
> This complexity arises because page structs work at a fixed
> granularity, future roadmap towards not having page structs for guest
> memory (at least private memory to begin with) should help towards
> greatly reducing this complexity.
> 
> That being said, DPAMT and huge page EPT mappings for TDX VMs remain
> essential and complement this series well for better memory footprint
> and overall performance of TDX VMs.

Hmm, this didn't really answer my questions about the concrete benefits.

I think it would help to include this kind of justification for the 1GB
guestmemfd pages. "essential for certain workloads and will emerge" is a bit
hard to review against...

I think one of the challenges with coco is that it's almost like a sprint to
reimplement virtualization. But enough things are changing at once that not all
of the normal assumptions hold, so it can't copy all the same solutions. The
recent example was that for TDX huge pages we found that normal promotion paths
weren't actually yielding any benefit for surprising TDX specific reasons.

On the TDX side we are also, at least currently, unmapping private pages while
they are mapped shared, so any 1GB pages would get split to 2MB if there are any
shared pages in them. I wonder how many 1GB pages there would be after all the
shared pages are converted. At smaller TD sizes, it could be not much.

So for TDX in isolation, it seems like jumping out too far ahead to effectively
consider the value. But presumably you guys are testing this on SEV or
something? Have you measured any performance improvement? For what kind of
applications? Or is the idea basically to make guestmemfd work like however
Google does guest memory?

Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Sean Christopherson 7 months ago
On Thu, May 15, 2025, Rick P Edgecombe wrote:
> On Thu, 2025-05-15 at 11:42 -0700, Vishal Annapurve wrote:
> > On Thu, May 15, 2025 at 11:03 AM Edgecombe, Rick P
> > <rick.p.edgecombe@intel.com> wrote:
> > > 
> > > On Wed, 2025-05-14 at 16:41 -0700, Ackerley Tng wrote:
> > > > Hello,
> > > > 
> > > > This patchset builds upon discussion at LPC 2024 and many guest_memfd
> > > > upstream calls to provide 1G page support for guest_memfd by taking
> > > > pages from HugeTLB.
> > > 
> > > Do you have any more concrete numbers on benefits of 1GB huge pages for
> > > guestmemfd/coco VMs? I saw in the LPC talk it has the benefits as:
> > > - Increase TLB hit rate and reduce page walks on TLB miss
> > > - Improved IO performance
> > > - Memory savings of ~1.6% from HugeTLB Vmemmap Optimization (HVO)
> > > - Bring guest_memfd to parity with existing VMs that use HugeTLB pages for
> > > backing memory
> > > 
> > > Do you know how often the 1GB TDP mappings get shattered by shared pages?
> > > 
> > > Thinking from the TDX perspective, we might have bigger fish to fry than 1.6%
> > > memory savings (for example dynamic PAMT), and the rest of the benefits don't
> > > have numbers. How much are we getting for all the complexity, over say buddy
> > > allocated 2MB pages?

TDX may have bigger fish to fry, but some of us have bigger fish to fry than TDX :-)

> > This series should work for any page sizes backed by hugetlb memory.
> > Non-CoCo VMs, pKVM and Confidential VMs all need hugepages that are
> > essential for certain workloads and will emerge as guest_memfd users.
> > Features like KHO/memory persistence in addition also depend on
> > hugepage support in guest_memfd.
> > 
> > This series takes strides towards making guest_memfd compatible with
> > usecases where 1G pages are essential and non-confidential VMs are
> > already exercising them.
> > 
> > I think the main complexity here lies in supporting in-place
> > conversion which applies to any huge page size even for buddy
> > allocated 2MB pages or THP.
> > 
> > This complexity arises because page structs work at a fixed
> > granularity, future roadmap towards not having page structs for guest
> > memory (at least private memory to begin with) should help towards
> > greatly reducing this complexity.
> > 
> > That being said, DPAMT and huge page EPT mappings for TDX VMs remain
> > essential and complement this series well for better memory footprint
> > and overall performance of TDX VMs.
> 
> Hmm, this didn't really answer my questions about the concrete benefits.
> 
> I think it would help to include this kind of justification for the 1GB
> guestmemfd pages. "essential for certain workloads and will emerge" is a bit
> hard to review against...
> 
> I think one of the challenges with coco is that it's almost like a sprint to
> reimplement virtualization. But enough things are changing at once that not all
> of the normal assumptions hold, so it can't copy all the same solutions. The
> recent example was that for TDX huge pages we found that normal promotion paths
> weren't actually yielding any benefit for surprising TDX specific reasons.
> 
> On the TDX side we are also, at least currently, unmapping private pages while
> they are mapped shared, so any 1GB pages would get split to 2MB if there are any
> shared pages in them. I wonder how many 1GB pages there would be after all the
> shared pages are converted. At smaller TD sizes, it could be not much.

You're conflating two different things.  guest_memfd allocating and managing
1GiB physical pages, and KVM mapping memory into the guest at 1GiB/2MiB
granularity.  Allocating memory in 1GiB chunks is useful even if KVM can only
map memory into the guest using 4KiB pages.

> So for TDX in isolation, it seems like jumping out too far ahead to effectively
> consider the value. But presumably you guys are testing this on SEV or
> something? Have you measured any performance improvement? For what kind of
> applications? Or is the idea to basically to make guestmemfd work like however
> Google does guest memory?

The longer term goal of guest_memfd is to make it suitable for backing all VMs,
hence Vishal's "Non-CoCo VMs" comment.  Yes, some of this is useful for TDX, but
we (and others) want to use guest_memfd for far more than just CoCo VMs.  And
for non-CoCo VMs, 1GiB hugepages are mandatory for various workloads.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Jason Gunthorpe 7 months ago
On Thu, May 15, 2025 at 05:57:57PM -0700, Sean Christopherson wrote:

> You're conflating two different things.  guest_memfd allocating and managing
> 1GiB physical pages, and KVM mapping memory into the guest at 1GiB/2MiB
> granularity.  Allocating memory in 1GiB chunks is useful even if KVM can only
> map memory into the guest using 4KiB pages.

Even if KVM is limited to 4K the IOMMU might not be - a lot of these
workloads have a heavy IO component and we need the IOMMU to perform
well too.

Frankly, I don't think there should be objection to making memory more
contiguous. There is a lot of data that this always brings wins
somewhere for someone.

> The longer term goal of guest_memfd is to make it suitable for backing all VMs,
> hence Vishal's "Non-CoCo VMs" comment.  Yes, some of this is useful for TDX, but
> we (and others) want to use guest_memfd for far more than just CoCo VMs.  And
> for non-CoCo VMs, 1GiB hugepages are mandatory for various workloads.

Yes, even from an iommu perspective with 2D translation we need to
have the 1G pages from the S2 resident in the IOTLB or performance
falls off a cliff.

Jason
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Edgecombe, Rick P 7 months ago
On Fri, 2025-05-16 at 10:09 -0300, Jason Gunthorpe wrote:
> > You're conflating two different things.  guest_memfd allocating and managing
> > 1GiB physical pages, and KVM mapping memory into the guest at 1GiB/2MiB
> > granularity.  Allocating memory in 1GiB chunks is useful even if KVM can
> > only
> > map memory into the guest using 4KiB pages.
> 
> Even if KVM is limited to 4K the IOMMU might not be - alot of these
> workloads have a heavy IO component and we need the iommu to perform
> well too.

Oh, interesting point.

> 
> Frankly, I don't think there should be objection to making memory more
> contiguous. 

No objections from me to anything except the lack of concrete justification.

> There is alot of data that this always brings wins
> somewhere for someone.

For the direct map huge page benchmarking, they saw that sometimes 1GB pages
helped, but also sometimes 2MB pages helped. That 1GB will help *some* workload
doesn't seem surprising.

> 
> > The longer term goal of guest_memfd is to make it suitable for backing all
> > VMs,
> > hence Vishal's "Non-CoCo VMs" comment.  Yes, some of this is useful for TDX,
> > but
> > we (and others) want to use guest_memfd for far more than just CoCo VMs. 
> > And
> > for non-CoCo VMs, 1GiB hugepages are mandatory for various workloads.
> 
> Yes, even from an iommu perspective with 2D translation we need to
> have the 1G pages from the S2 resident in the IOTLB or performance
> falls off a cliff.

"falls off a cliff" is the level of detail and the direction of hand waving I
have been hearing. But it also seems modern CPUs are quite good at hiding the
cost of walks with caches etc. Like how 5 level paging was made unconditional. I
didn't think about IOTLB though. Thanks for mentioning it.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Edgecombe, Rick P 7 months ago
On Thu, 2025-05-15 at 17:57 -0700, Sean Christopherson wrote:
> > > > Thinking from the TDX perspective, we might have bigger fish to fry than
> > > > 1.6% memory savings (for example dynamic PAMT), and the rest of the
> > > > benefits don't have numbers. How much are we getting for all the
> > > > complexity, over say buddy allocated 2MB pages?
> 
> TDX may have bigger fish to fry, but some of us have bigger fish to fry than
> TDX :-)

Fair enough. But TDX is on the "roadmap". So it helps to say what the target of
this series is.

> 
> > > This series should work for any page sizes backed by hugetlb memory.
> > > Non-CoCo VMs, pKVM and Confidential VMs all need hugepages that are
> > > essential for certain workloads and will emerge as guest_memfd users.
> > > Features like KHO/memory persistence in addition also depend on
> > > hugepage support in guest_memfd.
> > > 
> > > This series takes strides towards making guest_memfd compatible with
> > > usecases where 1G pages are essential and non-confidential VMs are
> > > already exercising them.
> > > 
> > > I think the main complexity here lies in supporting in-place
> > > conversion which applies to any huge page size even for buddy
> > > allocated 2MB pages or THP.
> > > 
> > > This complexity arises because page structs work at a fixed
> > > granularity, future roadmap towards not having page structs for guest
> > > memory (at least private memory to begin with) should help towards
> > > greatly reducing this complexity.
> > > 
> > > That being said, DPAMT and huge page EPT mappings for TDX VMs remain
> > > essential and complement this series well for better memory footprint
> > > and overall performance of TDX VMs.
> > 
> > Hmm, this didn't really answer my questions about the concrete benefits.
> > 
> > I think it would help to include this kind of justification for the 1GB
> > guestmemfd pages. "essential for certain workloads and will emerge" is a bit
> > hard to review against...
> > 
> > I think one of the challenges with coco is that it's almost like a sprint to
> > reimplement virtualization. But enough things are changing at once that not
> > all of the normal assumptions hold, so it can't copy all the same solutions.
> > The recent example was that for TDX huge pages we found that normal
> > promotion paths weren't actually yielding any benefit for surprising TDX
> > specific reasons.
> > 
> > On the TDX side we are also, at least currently, unmapping private pages
> > while they are mapped shared, so any 1GB pages would get split to 2MB if
> > there are any shared pages in them. I wonder how many 1GB pages there would
> > be after all the shared pages are converted. At smaller TD sizes, it could
> > be not much.
> 
> You're conflating two different things.  guest_memfd allocating and managing
> 1GiB physical pages, and KVM mapping memory into the guest at 1GiB/2MiB
> granularity.  Allocating memory in 1GiB chunks is useful even if KVM can only
> map memory into the guest using 4KiB pages.

I'm aware of the 1.6% vmemmap benefits from the LPC talk. Is there more? The
list quoted there was more about guest performance. Or maybe the clever page
table walkers that find contiguous small mappings could benefit guest
performance too? It's the kind of thing I'd like to see at least broadly called
out.

I'm thinking that Google must have a ridiculous amount of learnings about VM
memory management. And this is probably designed around those learnings. But
reviewers can't really evaluate it if they don't know the reasons and tradeoffs
taken. If it's going upstream, I think it should have at least the high level
reasoning explained.

I don't mean to harp on the point so hard, but I didn't expect it to be
controversial either.

> 
> > So for TDX in isolation, it seems like jumping out too far ahead to
> > effectively consider the value. But presumably you guys are testing this on
> > SEV or something? Have you measured any performance improvement? For what
> > kind of applications? Or is the idea to basically to make guestmemfd work
> > like however Google does guest memory?
> 
> The longer term goal of guest_memfd is to make it suitable for backing all
> VMs, hence Vishal's "Non-CoCo VMs" comment.

Oh, I actually wasn't aware of this. Or maybe I remember now. I thought he was
talking about pKVM.

>   Yes, some of this is useful for TDX, but we (and others) want to use
> guest_memfd for far more than just CoCo VMs. 


>  And for non-CoCo VMs, 1GiB hugepages are mandatory for various workloads.
I've heard this a lot. It must be true, but I've never seen the actual numbers.
For a long time people believed 1GB huge pages on the direct map were critical,
but then benchmarking on a contemporary CPU couldn't find much difference
between 2MB and 1GB. I'd expect TDP huge pages to be different than that because
the combined walks are huge, iTLB, etc, but I'd love to see a real number.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Vishal Annapurve 7 months ago
On Thu, May 15, 2025 at 7:12 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Thu, 2025-05-15 at 17:57 -0700, Sean Christopherson wrote:
> > > > > Thinking from the TDX perspective, we might have bigger fish to fry than
> > > > > 1.6% memory savings (for example dynamic PAMT), and the rest of the
> > > > > benefits don't have numbers. How much are we getting for all the
> > > > > complexity, over say buddy allocated 2MB pages?
> >
> > TDX may have bigger fish to fry, but some of us have bigger fish to fry than
> > TDX :-)
>
> Fair enough. But TDX is on the "roadmap". So it helps to say what the target of
> this series is.
>
> >
> > > > This series should work for any page sizes backed by hugetlb memory.
> > > > Non-CoCo VMs, pKVM and Confidential VMs all need hugepages that are
> > > > essential for certain workloads and will emerge as guest_memfd users.
> > > > Features like KHO/memory persistence in addition also depend on
> > > > hugepage support in guest_memfd.
> > > >
> > > > This series takes strides towards making guest_memfd compatible with
> > > > usecases where 1G pages are essential and non-confidential VMs are
> > > > already exercising them.
> > > >
> > > > I think the main complexity here lies in supporting in-place
> > > > conversion which applies to any huge page size even for buddy
> > > > allocated 2MB pages or THP.
> > > >
> > > > This complexity arises because page structs work at a fixed
> > > > granularity, future roadmap towards not having page structs for guest
> > > > memory (at least private memory to begin with) should help towards
> > > > greatly reducing this complexity.
> > > >
> > > > That being said, DPAMT and huge page EPT mappings for TDX VMs remain
> > > > essential and complement this series well for better memory footprint
> > > > and overall performance of TDX VMs.
> > >
> > > Hmm, this didn't really answer my questions about the concrete benefits.
> > >
> > > I think it would help to include this kind of justification for the 1GB
> > > guestmemfd pages. "essential for certain workloads and will emerge" is a bit
> > > hard to review against...
> > >
> > > I think one of the challenges with coco is that it's almost like a sprint to
> > > reimplement virtualization. But enough things are changing at once that not
> > > all of the normal assumptions hold, so it can't copy all the same solutions.
> > > The recent example was that for TDX huge pages we found that normal
> > > promotion paths weren't actually yielding any benefit for surprising TDX
> > > specific reasons.
> > >
> > > On the TDX side we are also, at least currently, unmapping private pages
> > > while they are mapped shared, so any 1GB pages would get split to 2MB if
> > > there are any shared pages in them. I wonder how many 1GB pages there would
> > > be after all the shared pages are converted. At smaller TD sizes, it could
> > > be not much.
> >
> > You're conflating two different things.  guest_memfd allocating and managing
> > 1GiB physical pages, and KVM mapping memory into the guest at 1GiB/2MiB
> > granularity.  Allocating memory in 1GiB chunks is useful even if KVM can only
> > map memory into the guest using 4KiB pages.
>
> I'm aware of the 1.6% vmemmap benefits from the LPC talk. Is there more? The
> list quoted there was more about guest performance. Or maybe the clever page
> table walkers that find contiguous small mappings could benefit guest
> performance too? It's the kind of thing I'd like to see at least broadly called
> out.

The crux of this series really is hugetlb backing support for
guest_memfd and handling CoCo VMs irrespective of the page size as I
suggested earlier, so 2M page sizes will need to handle similar
complexity of in-place conversion.

Google internally uses 1G hugetlb pages to achieve high bandwidth IO,
lower memory footprint using HVO and lower MMU/IOMMU page table memory
footprint among other improvements. These percentages carry a
substantial impact when working at the scale of large fleets of hosts
each carrying significant memory capacity.

guest_memfd hugepage support + hugepage EPT mapping support for TDX
VMs significantly help:
1) ~70% decrease in TDX VM boot up time
2) ~65% decrease in TDX VM shutdown time
3) ~90% decrease in TDX VM PAMT memory overhead
4) Improvement in TDX SEPT memory overhead

And we believe this combination should also help achieve better
performance with TDX connect in future.

Hugetlb huge pages are preferred as they are statically carved out at
boot and so provide much better guarantees of availability. Once the
pages are carved out, any VMs scheduled on such a host will need to
work with the same hugetlb memory sizes. This series attempts to use
hugetlb pages with in-place conversion, avoiding the double allocation
problem that otherwise results in significant memory overheads for
CoCo VMs.

>
> I'm thinking that Google must have a ridiculous amount of learnings about VM
> memory management. And this is probably designed around those learnings. But
> reviewers can't really evaluate it if they don't know the reasons and tradeoffs
> taken. If it's going upstream, I think it should have at least the high level
> reasoning explained.
>
> I don't mean to harp on the point so hard, but I didn't expect it to be
> controversial either.
>
> >
> > > So for TDX in isolation, it seems like jumping out too far ahead to
> > > effectively consider the value. But presumably you guys are testing this on
> > > SEV or something? Have you measured any performance improvement? For what
> > > kind of applications? Or is the idea to basically to make guestmemfd work
> > > like however Google does guest memory?
> >
> > The longer term goal of guest_memfd is to make it suitable for backing all
> > VMs, hence Vishal's "Non-CoCo VMs" comment.
>
> Oh, I actually wasn't aware of this. Or maybe I remember now. I thought he was
> talking about pKVM.
>
> >   Yes, some of this is useful for TDX, but we (and others) want to use
> > guest_memfd for far more than just CoCo VMs.
>
>
> >  And for non-CoCo VMs, 1GiB hugepages are mandatory for various workloads.
> I've heard this a lot. It must be true, but I've never seen the actual numbers.
> For a long time people believed 1GB huge pages on the direct map were critical,
> but then benchmarking on a contemporary CPU couldn't find much difference
> between 2MB and 1GB. I'd expect TDP huge pages to be different than that because
> the combined walks are huge, iTLB, etc, but I'd love to see a real number.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Sean Christopherson 7 months ago
On Fri, May 16, 2025, Vishal Annapurve wrote:
> On Thu, May 15, 2025 at 7:12 PM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> > On Thu, 2025-05-15 at 17:57 -0700, Sean Christopherson wrote:
> > > You're conflating two different things.  guest_memfd allocating and managing
> > > 1GiB physical pages, and KVM mapping memory into the guest at 1GiB/2MiB
> > > granularity.  Allocating memory in 1GiB chunks is useful even if KVM can only
> > > map memory into the guest using 4KiB pages.
> >
> > I'm aware of the 1.6% vmemmap benefits from the LPC talk. Is there more? The
> > list quoted there was more about guest performance. Or maybe the clever page
> > table walkers that find contiguous small mappings could benefit guest
> > performance too? It's the kind of thing I'd like to see at least broadly called
> > out.
> 
> The crux of this series really is hugetlb backing support for guest_memfd and
> handling CoCo VMs irrespective of the page size as I suggested earlier, so 2M
> page sizes will need to handle similar complexity of in-place conversion.
> 
> Google internally uses 1G hugetlb pages to achieve high bandwidth IO,

E.g. hitting target networking line rates is only possible with 1GiB mappings,
otherwise TLB pressure gets in the way.

> lower memory footprint using HVO and lower MMU/IOMMU page table memory
> footprint among other improvements. These percentages carry a substantial
> impact when working at the scale of large fleets of hosts each carrying
> significant memory capacity.

Yeah, 1.6% might sound small, but over however many bytes of RAM there are in
the fleet, it's a huge (lol) amount of memory saved.

> > >   Yes, some of this is useful for TDX, but we (and others) want to use
> > > guest_memfd for far more than just CoCo VMs.
> >
> >
> > >  And for non-CoCo VMs, 1GiB hugepages are mandatory for various workloads.
> > I've heard this a lot. It must be true, but I've never seen the actual numbers.
> > For a long time people believed 1GB huge pages on the direct map were critical,
> > but then benchmarking on a contemporary CPU couldn't find much difference
> > between 2MB and 1GB. I'd expect TDP huge pages to be different than that because
> > the combined walks are huge, iTLB, etc, but I'd love to see a real number.

The direct map is very, very different than userspace and thus guest mappings.
Software (hopefully) isn't using the direct map to index multi-TiB databases,
or to transfer GiBs of data over the network.  The amount of memory the kernel
is regularly accessing is an order of magnitude or two smaller than single
process use cases.

A few examples from a quick search:

http://pvk.ca/Blog/2014/02/18/how-bad-can-1gb-pages-be
https://www.percona.com/blog/benchmark-postgresql-with-linux-hugepages/
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Edgecombe, Rick P 7 months ago
On Fri, 2025-05-16 at 06:11 -0700, Vishal Annapurve wrote:
> The crux of this series really is hugetlb backing support for
> guest_memfd and handling CoCo VMs irrespective of the page size as I
> suggested earlier, so 2M page sizes will need to handle similar
> complexity of in-place conversion.

I assumed this part was the added 1GB complexity:
 mm/hugetlb.c                                  |  488 ++---

I'll dig into the series and try to understand the point better.

> 
> Google internally uses 1G hugetlb pages to achieve high bandwidth IO,
> lower memory footprint using HVO and lower MMU/IOMMU page table memory
> footprint among other improvements. These percentages carry a
> substantial impact when working at the scale of large fleets of hosts
> each carrying significant memory capacity.

There must have been a lot of measuring involved in that. But the numbers I was
hoping for were how much does *this* series help upstream.

> 
> guest_memfd hugepage support + hugepage EPT mapping support for TDX
> VMs significantly help:
> 1) ~70% decrease in TDX VM boot up time
> 2) ~65% decrease in TDX VM shutdown time
> 3) ~90% decrease in TDX VM PAMT memory overhead
> 4) Improvement in TDX SEPT memory overhead

Thanks. It is the difference between 4k mappings and 2MB mappings I guess? Or
are you saying this is the difference between 1GB contiguous pages for TDX at
2MB mapping, and 2MB contiguous pages at TDX 2MB mappings? The 1GB part is the
one I was curious about.

> 
> And we believe this combination should also help achieve better
> performance with TDX connect in future.

Please don't take this query as an objection that the series doesn't help TDX
enough or something like that. If it doesn't help TDX at all (not the case),
that is fine. The objection is only that the specific benefits and tradeoffs
around 1GB pages are not clear in the upstream posting.

> 
> Hugetlb huge pages are preferred as they are statically carved out at
> boot and so provide much better guarantees of availability.
> 

Reserved memory can provide physically contiguous pages more frequently. Seems
not surprising at all, and something that could have a number. 

>  Once the
> pages are carved out, any VMs scheduled on such a host will need to
> work with the same hugetlb memory sizes. This series attempts to use
> hugetlb pages with in-place conversion, avoiding the double allocation
> problem that otherwise results in significant memory overheads for
> CoCo VMs.

I asked this question assuming there were some measurements for the 1GB part of
this series. It sounds like the reasoning is instead that this is how Google
does things, which is backed by way more benchmarking than kernel patches are
used to getting. So it can just be reasonably assumed to be helpful.

But for upstream code, I'd expect there to be a bit more concrete than "we
believe" and "substantial impact". It seems like I'm in the minority here
though. So if no one else wants to pressure test the thinking in the usual way,
I guess I'll just have to wonder.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Sean Christopherson 7 months ago
On Fri, May 16, 2025, Rick P Edgecombe wrote:
> On Fri, 2025-05-16 at 06:11 -0700, Vishal Annapurve wrote:
> > Google internally uses 1G hugetlb pages to achieve high bandwidth IO,
> > lower memory footprint using HVO and lower MMU/IOMMU page table memory
> > footprint among other improvements. These percentages carry a
> > substantial impact when working at the scale of large fleets of hosts
> > each carrying significant memory capacity.
> 
> There must have been a lot of measuring involved in that. But the numbers I was
> hoping for were how much does *this* series help upstream.

...

> I asked this question assuming there were some measurements for the 1GB part of
> this series. It sounds like the reasoning is instead that this is how Google
> does things, which is backed by way more benchmarking than kernel patches are
> used to getting. So it can just be reasonable assumed to be helpful.
> 
> But for upstream code, I'd expect there to be a bit more concrete than "we
> believe" and "substantial impact". It seems like I'm in the minority here
> though. So if no one else wants to pressure test the thinking in the usual way,
> I guess I'll just have to wonder.

From my perspective, 1GiB hugepage support in guest_memfd isn't about improving
CoCo performance, it's about achieving feature parity on guest_memfd with respect
to existing backing stores so that it's possible to use guest_memfd to back all
VM shapes in a fleet.

Let's assume there is significant value in backing non-CoCo VMs with 1GiB pages,
unless you want to re-litigate the existence of 1GiB support in HugeTLBFS.

If we assume 1GiB support is mandatory for non-CoCo VMs, then it becomes mandatory
for CoCo VMs as well, because it's the only realistic way to run CoCo VMs and
non-CoCo VMs on a single host.  Mixing 1GiB HugeTLBFS with any other backing store
for VMs simply isn't tenable due to the nature of 1GiB allocations.  E.g. grabbing
sub-1GiB chunks of memory for CoCo VMs quickly fragments memory to the point where
HugeTLBFS can't allocate memory for non-CoCo VMs.

Teaching HugeTLBFS to play nice with TDX and SNP isn't happening, which leaves
adding 1GiB support to guest_memfd as the only way forward.

Any boost to TDX (or SNP) performance is purely a bonus.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Edgecombe, Rick P 7 months ago
On Fri, 2025-05-16 at 10:51 -0700, Sean Christopherson wrote:
> From my perspective, 1GiB hugepage support in guest_memfd isn't about improving
> CoCo performance, it's about achieving feature parity on guest_memfd with respect
> to existing backing stores so that it's possible to use guest_memfd to back all
> VM shapes in a fleet.
> 
> Let's assume there is significant value in backing non-CoCo VMs with 1GiB pages,
> unless you want to re-litigate the existence of 1GiB support in HugeTLBFS.

I didn't expect to go in that direction when I first asked. But everyone says
it's huge, yet no one knows the numbers. It can be a sign of things.

Meanwhile I'm watching patches to make 5 level paging walks unconditional fly by
because people couldn't find a cost to the extra level of walk. So re-litigate,
no. But I'll probably remain quietly suspicious of the exact cost/value. At
least on the CPU side, I totally missed the IOTLB side at first, sorry.

> 
> If we assume 1GiB support is mandatory for non-CoCo VMs, then it becomes mandatory
> for CoCo VMs as well, because it's the only realistic way to run CoCo VMs and
> non-CoCo VMs on a single host.  Mixing 1GiB HugeTLBFS with any other backing store
> for VMs simply isn't tenable due to the nature of 1GiB allocations.  E.g. grabbing
> sub-1GiB chunks of memory for CoCo VMs quickly fragments memory to the point where
> HugeTLBFS can't allocate memory for non-CoCo VMs.

It makes sense that there would be a difference in how many huge pages the
non-CoCo guests would get. Where I start to lose you is when you guys talk about
"mandatory" or similar. If you want upstream review, it would help to have more
numbers on the "why" question. At least for us folks outside the hyperscalers,
where such things are not as obvious.

> 
> Teaching HugeTLBFS to play nice with TDX and SNP isn't happening, which leaves
> adding 1GiB support to guest_memfd as the only way forward.
> 
> Any boost to TDX (or SNP) performance is purely a bonus.

Most of the bullets in the talk were about mapping sizes AFAICT, so this is the
kind of reasoning I was hoping for. Thanks for elaborating on it, even though
still no one has any numbers besides the vmemmap savings.


Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Dave Hansen 7 months ago
On 5/16/25 12:14, Edgecombe, Rick P wrote:
> Meanwhile I'm watching patches to make 5 level paging walks unconditional fly by
> because people couldn't find a cost to the extra level of walk. So re-litigate,
> no. But I'll probably remain quietly suspicious of the exact cost/value. At
> least on the CPU side, I totally missed the IOTLB side at first, sorry.

It's a little more complicated than just the depth of the worst-case walk.

In practice, many page walks can use the mid-level paging structure
caches because the mappings aren't sparse.

With 5-level paging in particular, userspace doesn't actually change
much at all. Its layout is pretty much the same unless folks are opting
in to the higher (5-level only) address space. So userspace isn't
sparse, at least at the scale of what 5-level paging is capable of.

For the kernel, things are a bit more spread out than they were before.
For instance, the direct map and vmalloc() are in separate p4d pages
when they used to be nestled together in the same half of one pgd.

But, again, they're not *that* sparse. The direct map, for example,
doesn't become more sparse, it just moves to a lower virtual address.
Ditto for vmalloc().  Just because 5-level paging has a massive
vmalloc() area doesn't mean we use it.

Basically, 5-level paging adds a level to the top of the page walk, and
we're really good at caching those when they're not accessed sparsely.

CPUs are not as good at caching the leaf side of the page walk. There
are tricks like AMD's TLB coalescing that help. But, generally, each
walk on the leaf end of the walks eats a TLB entry. Those just don't
cache as well as the top of the tree.

That's why we need to be more maniacal about reducing leaf levels than
the levels toward the root.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Edgecombe, Rick P 7 months ago
On Fri, 2025-05-16 at 13:25 -0700, Dave Hansen wrote:
> It's a little more complicated than just the depth of the worst-case walk.
> 
> In practice, many page walks can use the mid-level paging structure
> caches because the mappings aren't sparse.
> 
> With 5-level paging in particular, userspace doesn't actually change
> much at all. Its layout is pretty much the same unless folks are opting
> in to the higher (5-level only) address space. So userspace isn't
> sparse, at least at the scale of what 5-level paging is capable of.
> 
> For the kernel, things are a bit more spread out than they were before.
> For instance, the direct map and vmalloc() are in separate p4d pages
> when they used to be nestled together in the same half of one pgd.
> 
> But, again, they're not *that* sparse. The direct map, for example,
> doesn't become more sparse, it just moves to a lower virtual address.
> Ditto for vmalloc().  Just because 5-level paging has a massive
> vmalloc() area doesn't mean we use it.
> 
> Basically, 5-level paging adds a level to the top of the page walk, and
> we're really good at caching those when they're not accessed sparsely.
> 
> CPUs are not as good at caching the leaf side of the page walk. There
> are tricks like AMD's TLB coalescing that help. But, generally, each
> walk on the leaf end of the walks eats a TLB entry. Those just don't
> cache as well as the top of the tree.
> 
> That's why we need to be more maniacal about reducing leaf levels than
> the levels toward the root.

Makes sense. For what is easy for the CPU to cache, it can be more about the
address space layout than the length of the walk.

Going off topic from this patchset...

I have a possibly fun related anecdote. A while ago when I was doing the KVM XO
stuff, I was trying to test how much worse the performance was from caches being
forced to deal with the sparser GPA accesses. The test was to modify the guest
to force all the executable GVA mappings to go on the XO alias. I was confused
to find that KVM XO was faster than the normal layout by a small, but consistent
amount. It had me scratching my head. It turned out that the NX huge page
mitigation was able to maintain large pages for the data accesses because all
the executable accesses were moved off of the main GPA alias.

My takeaway was that real-world implementations can interact in surprising
ways, and at least for my ability to reason about it, it's good to verify with
a test when possible.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Ackerley Tng 5 months, 3 weeks ago
Ackerley Tng <ackerleytng@google.com> writes:

> Hello,
>
> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> upstream calls to provide 1G page support for guest_memfd by taking
> pages from HugeTLB.
>
> [...]

At the guest_memfd upstream call today (2025-06-26), we talked about
when to merge folios with respect to conversions.

Just want to call out that in this RFCv2, we managed to get conversions
working with merges happening as soon as possible.

"As soon as possible" means merges happen as long as shareability is all
private (or all meaningless) within an aligned hugepage range. We try to
merge after every conversion request and on truncation. On truncation,
shareability becomes meaningless.
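
A minimal sketch of that eligibility check, reusing the SHAREABILITY_GUEST name
mentioned later in this mail; the per-subpage array and the truncated flag are
illustrative assumptions, not the RFC's actual data structures:

  /*
   * Merge an aligned hugepage range only when every subpage is guest-private
   * (SHAREABILITY_GUEST), or when the range has been truncated, in which
   * case shareability is meaningless and merging is always allowed.
   */
  static bool gmem_range_mergeable(const u8 *shareability,
                                   unsigned long nr_subpages, bool truncated)
  {
          unsigned long i;

          if (truncated)
                  return true;

          for (i = 0; i < nr_subpages; i++)
                  if (shareability[i] != SHAREABILITY_GUEST)
                          return false;

          return true;
  }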

On explicit truncation (e.g. fallocate(PUNCH_HOLE)), truncation can fail if
there are unexpected refcounts (because we can't merge with unexpected
refcounts). Explicit truncation will succeed only if refcounts are as
expected, and the merge is performed before the folio is finally removed from
the filemap.

On truncation caused by file close or inode release, guest_memfd may not hold
the last refcount on the folio. Only in this case do we defer merging to the
folio_put() callback, and because the callback can be called from atomic
context, the merge is further deferred to a kernel worker.
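
A minimal sketch of that deferral path; the work item, function names and
helper below are placeholders for illustration, not names from the RFC:

  #include <linux/workqueue.h>

  /* Runs in process context and performs the actual merge back to a huge folio. */
  static void gmem_deferred_merge_fn(struct work_struct *work)
  {
          /* ... restructure the folios here ... */
  }
  static DECLARE_WORK(gmem_deferred_merge_work, gmem_deferred_merge_fn);

  /* The folio_put() callback may run in atomic context, so only schedule work. */
  static void gmem_handle_last_folio_put(struct folio *folio)
  {
          queue_work(system_unbound_wq, &gmem_deferred_merge_work);
  }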

Deferment of merging is already minimized, so most of the restructuring is
synchronous with some userspace-initiated action (conversion or explicit
truncation). The only deferred merge is on file close, and in that case
there's no way to reject or fail the close.

(There are possible optimizations here - Yan suggested [1] checking if
the folio_put() was called from interrupt context - I have not tried
implementing that yet)


I did propose an explicit guest_memfd merge ioctl, but since RFCv2 works, I was
thinking of having the merge ioctl be a separate optimization/project/patch
series, to be added if merging as soon as possible turns out to be an
inefficient strategy, or if some VM use cases prefer to have an explicit merge
ioctl.


During the call, Michael also brought up that SNP adds some constraints with
respect to the guest accepting pages at particular levels.

Could you please expand on that? Suppose for an SNP guest,

1. Guest accepted a page at 2M level
2. Guest converts a 4K sub page to shared
3. guest_memfd requests unmapping of the guest-requested 4K range
   (the rest of the 2M remains mapped into stage 2 page tables)
4. guest_memfd splits the huge page to 4K pages (the 4K is set to
   SHAREABILITY_ALL, the rest of the 2M is still SHAREABILITY_GUEST)

Can the SNP guest continue to use the rest of the 2M page or must it
re-accept all the pages at 4K?

And for the reverse:

1. Guest accepted a 2M range at 4K
2. guest_memfd merges the full 2M range to a single 2M page

Must the SNP guest re-accept at 2M for the guest to continue
functioning, or will the SNP guest continue to work (just with poorer
performance than if the memory was accepted at 2M)?

[1] https://lore.kernel.org/all/aDfT35EsYP%2FByf7Z@yzhao56-desk.sh.intel.com/
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Yan Zhao 6 months ago
On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
> Hello,
> 
> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> upstream calls to provide 1G page support for guest_memfd by taking
> pages from HugeTLB.
> 
> This patchset is based on Linux v6.15-rc6, and requires the mmap support
> for guest_memfd patchset (Thanks Fuad!) [1].
> 
> For ease of testing, this series is also available, stitched together,
> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
 
Just to record an issue we found -- not one that must be fixed.

In TDX, the initial memory region is added as private memory during TD's build
time, with its initial content copied from source pages in shared memory.
The copy operation requires simultaneous access to both shared source memory
and private target memory.

Therefore, userspace cannot store the initial content in shared memory at the
mmap-ed VA of a guest_memfd that performs in-place conversion between shared and
private memory. This is because the guest_memfd will first unmap a PFN in shared
page tables and then check for any extra refcount held for the shared PFN before
converting it to private.

We currently test the initial memory region with the in-place-conversion
version of guest_memfd as the backend by modifying QEMU to add an extra
anonymous backend that holds the initial content in shared memory. The extra
anonymous backend is freed after the initial memory region has been added.

This issue is benign for TDX, as the initial memory region can also utilize the
traditional guest_memfd, which only allows 4KB mappings. This is acceptable for
now, as the initial memory region typically involves a small amount of memory,
and we may not enable huge pages for ranges covered by the initial memory region
in the near future.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Xiaoyao Li 6 months ago
On 6/19/2025 4:13 PM, Yan Zhao wrote:
> On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
>> Hello,
>>
>> This patchset builds upon discussion at LPC 2024 and many guest_memfd
>> upstream calls to provide 1G page support for guest_memfd by taking
>> pages from HugeTLB.
>>
>> This patchset is based on Linux v6.15-rc6, and requires the mmap support
>> for guest_memfd patchset (Thanks Fuad!) [1].
>>
>> For ease of testing, this series is also available, stitched together,
>> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
>   
> Just to record a found issue -- not one that must be fixed.
> 
> In TDX, the initial memory region is added as private memory during TD's build
> time, with its initial content copied from source pages in shared memory.
> The copy operation requires simultaneous access to both shared source memory
> and private target memory.
> 
> Therefore, userspace cannot store the initial content in shared memory at the
> mmap-ed VA of a guest_memfd that performs in-place conversion between shared and
> private memory. This is because the guest_memfd will first unmap a PFN in shared
> page tables and then check for any extra refcount held for the shared PFN before
> converting it to private.

I have an idea.

If I understand correctly, the KVM_GMEM_CONVERT_PRIVATE of in-place 
conversion unmaps the PFN in shared page tables while keeping the content 
of the page unchanged, right?

So KVM_GMEM_CONVERT_PRIVATE can actually be used to initialize the private 
memory for the non-CoCo case: userspace first mmap()s it, ensures it's 
shared, writes the initial content to it, and afterwards converts it to 
private with KVM_GMEM_CONVERT_PRIVATE.

For the CoCo case, like TDX, it can hook into KVM_GMEM_CONVERT_PRIVATE if it 
wants the private memory to be initialized with initial content, and just do 
an in-place TDH.PAGE.ADD in the hook.

> Currently, we tested the initial memory region using the in-place conversion
> version of guest_memfd as backend by modifying QEMU to add an extra anonymous
> backend to hold the source initial content in shared memory. The extra anonymous
> backend is freed after finishing ading the initial memory region.
> 
> This issue is benign for TDX, as the initial memory region can also utilize the
> traditional guest_memfd, which only allows 4KB mappings. This is acceptable for
> now, as the initial memory region typically involves a small amount of memory,
> and we may not enable huge pages for ranges covered by the initial memory region
> in the near future.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Vishal Annapurve 5 months, 2 weeks ago
On Thu, Jun 19, 2025 at 1:59 AM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>
> On 6/19/2025 4:13 PM, Yan Zhao wrote:
> > On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
> >> Hello,
> >>
> >> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> >> upstream calls to provide 1G page support for guest_memfd by taking
> >> pages from HugeTLB.
> >>
> >> This patchset is based on Linux v6.15-rc6, and requires the mmap support
> >> for guest_memfd patchset (Thanks Fuad!) [1].
> >>
> >> For ease of testing, this series is also available, stitched together,
> >> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> >
> > Just to record a found issue -- not one that must be fixed.
> >
> > In TDX, the initial memory region is added as private memory during TD's build
> > time, with its initial content copied from source pages in shared memory.
> > The copy operation requires simultaneous access to both shared source memory
> > and private target memory.
> >
> > Therefore, userspace cannot store the initial content in shared memory at the
> > mmap-ed VA of a guest_memfd that performs in-place conversion between shared and
> > private memory. This is because the guest_memfd will first unmap a PFN in shared
> > page tables and then check for any extra refcount held for the shared PFN before
> > converting it to private.
>
> I have an idea.
>
> If I understand correctly, the KVM_GMEM_CONVERT_PRIVATE of in-place
> conversion unmap the PFN in shared page tables while keeping the content
> of the page unchanged, right?

That's correct.

>
> So KVM_GMEM_CONVERT_PRIVATE can be used to initialize the private memory
> actually for non-CoCo case actually, that userspace first mmap() it and
> ensure it's shared and writes the initial content to it, after it
> userspace convert it to private with KVM_GMEM_CONVERT_PRIVATE.

I think by non-CoCo VMs that care about private memory you mean pKVM.
Yes, initial memory regions can start as shared, which userspace can
populate and then convert to private.

>
> For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it
> wants the private memory to be initialized with initial content, and
> just do in-place TDH.PAGE.ADD in the hook.

I think this scheme will be cleaner:
1) Userspace marks the guest_memfd ranges corresponding to initial
payload as shared.
2) Userspace mmaps and populates the ranges.
3) Userspace converts those guest_memfd ranges to private.
4) For both SNP and TDX, userspace continues to invoke corresponding
initial payload preparation operations via existing KVM ioctls e.g.
KVM_SEV_SNP_LAUNCH_UPDATE/KVM_TDX_INIT_MEM_REGION.
   - SNP/TDX KVM logic fetches the right pfns for the target gfns
using the normal paths supported by KVM and passes those pfns directly
to the right trusted module to initialize the "encrypted" memory
contents.
       - Avoiding any GUP or memcpy from source addresses.

i.e. for TDX VMs, KVM_TDX_INIT_MEM_REGION still does the in-place TDH.PAGE.ADD.
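
A rough userspace sketch of steps 1)-4) above. KVM_GMEM_CONVERT_PRIVATE and
the argument layout are placeholders based on this thread's proposed uAPI and
may well differ in the actual series; KVM_TDX_INIT_MEM_REGION and
KVM_SEV_SNP_LAUNCH_UPDATE are the existing ioctls referred to above:

  #include <stdint.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>

  /* Hypothetical argument layout for the guest_memfd conversion ioctl. */
  struct gmem_convert_range {
          uint64_t offset;
          uint64_t size;
  };

  static int load_initial_payload(int gmem_fd, uint64_t offset, uint64_t size,
                                  const void *payload)
  {
          struct gmem_convert_range range = { .offset = offset, .size = size };
          void *va;

          /* Steps 1) and 2): range is shared; mmap the guest_memfd range and fill it. */
          va = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, gmem_fd, offset);
          if (va == MAP_FAILED)
                  return -1;
          memcpy(va, payload, size);
          munmap(va, size);

          /* Step 3): convert the populated range to private; contents stay in place. */
          if (ioctl(gmem_fd, KVM_GMEM_CONVERT_PRIVATE, &range))
                  return -1;

          /*
           * Step 4): the caller then invokes the existing per-arch ioctl
           * (KVM_TDX_INIT_MEM_REGION or KVM_SEV_SNP_LAUNCH_UPDATE) on the VM fd,
           * which measures/encrypts those pfns in place with no source address,
           * i.e. no GUP and no memcpy from a separate source buffer.
           */
          return 0;
  }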

Since we need to support VMs that will/won't use in-place conversion,
I think operations like KVM_TDX_INIT_MEM_REGION can introduce explicit
flags to allow userspace to indicate whether to assume in-place
conversion or not. Maybe
kvm_tdx_init_mem_region.source_addr/kvm_sev_snp_launch_update.uaddr
can be null in the scenarios where in-place conversion is used.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Yan Zhao 5 months, 2 weeks ago
On Sun, Jun 29, 2025 at 11:28:22AM -0700, Vishal Annapurve wrote:
> On Thu, Jun 19, 2025 at 1:59 AM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> >
> > On 6/19/2025 4:13 PM, Yan Zhao wrote:
> > > On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
> > >> Hello,
> > >>
> > >> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> > >> upstream calls to provide 1G page support for guest_memfd by taking
> > >> pages from HugeTLB.
> > >>
> > >> This patchset is based on Linux v6.15-rc6, and requires the mmap support
> > >> for guest_memfd patchset (Thanks Fuad!) [1].
> > >>
> > >> For ease of testing, this series is also available, stitched together,
> > >> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> > >
> > > Just to record a found issue -- not one that must be fixed.
> > >
> > > In TDX, the initial memory region is added as private memory during TD's build
> > > time, with its initial content copied from source pages in shared memory.
> > > The copy operation requires simultaneous access to both shared source memory
> > > and private target memory.
> > >
> > > Therefore, userspace cannot store the initial content in shared memory at the
> > > mmap-ed VA of a guest_memfd that performs in-place conversion between shared and
> > > private memory. This is because the guest_memfd will first unmap a PFN in shared
> > > page tables and then check for any extra refcount held for the shared PFN before
> > > converting it to private.
> >
> > I have an idea.
> >
> > If I understand correctly, the KVM_GMEM_CONVERT_PRIVATE of in-place
> > conversion unmap the PFN in shared page tables while keeping the content
> > of the page unchanged, right?
> 
> That's correct.
> 
> >
> > So KVM_GMEM_CONVERT_PRIVATE can be used to initialize the private memory
> > actually for non-CoCo case actually, that userspace first mmap() it and
> > ensure it's shared and writes the initial content to it, after it
> > userspace convert it to private with KVM_GMEM_CONVERT_PRIVATE.
> 
> I think you mean pKVM by non-coco VMs that care about private memory.
> Yes, initial memory regions can start as shared which userspace can
> populate and then convert the ranges to private.
> 
> >
> > For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it
> > wants the private memory to be initialized with initial content, and
> > just do in-place TDH.PAGE.ADD in the hook.
> 
> I think this scheme will be cleaner:
> 1) Userspace marks the guest_memfd ranges corresponding to initial
> payload as shared.
> 2) Userspace mmaps and populates the ranges.
> 3) Userspace converts those guest_memfd ranges to private.
> 4) For both SNP and TDX, userspace continues to invoke corresponding
> initial payload preparation operations via existing KVM ioctls e.g.
> KVM_SEV_SNP_LAUNCH_UPDATE/KVM_TDX_INIT_MEM_REGION.
>    - SNP/TDX KVM logic fetches the right pfns for the target gfns
> using the normal paths supported by KVM and passes those pfns directly
> to the right trusted module to initialize the "encrypted" memory
> contents.
>        - Avoiding any GUP or memcpy from source addresses.
One caveat:

when TDX populates the mirror root, kvm_gmem_get_pfn() is invoked.
Then kvm_gmem_prepare_folio() is further invoked to zero the folio.

> i.e. for TDX VMs, KVM_TDX_INIT_MEM_REGION still does the in-place TDH.PAGE.ADD.
So, at this point, the pages would no longer contain the original content?

> Since we need to support VMs that will/won't use in-place conversion,
> I think operations like KVM_TDX_INIT_MEM_REGION can introduce explicit
> flags to allow userspace to indicate whether to assume in-place
> conversion or not. Maybe
> kvm_tdx_init_mem_region.source_addr/kvm_sev_snp_launch_update.uaddr
> can be null in the scenarios where in-place conversion is used.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Vishal Annapurve 5 months, 2 weeks ago
On Sun, Jun 29, 2025 at 8:17 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Sun, Jun 29, 2025 at 11:28:22AM -0700, Vishal Annapurve wrote:
> > On Thu, Jun 19, 2025 at 1:59 AM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> > >
> > > On 6/19/2025 4:13 PM, Yan Zhao wrote:
> > > > On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
> > > >> Hello,
> > > >>
> > > >> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> > > >> upstream calls to provide 1G page support for guest_memfd by taking
> > > >> pages from HugeTLB.
> > > >>
> > > >> This patchset is based on Linux v6.15-rc6, and requires the mmap support
> > > >> for guest_memfd patchset (Thanks Fuad!) [1].
> > > >>
> > > >> For ease of testing, this series is also available, stitched together,
> > > >> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> > > >
> > > > Just to record a found issue -- not one that must be fixed.
> > > >
> > > > In TDX, the initial memory region is added as private memory during TD's build
> > > > time, with its initial content copied from source pages in shared memory.
> > > > The copy operation requires simultaneous access to both shared source memory
> > > > and private target memory.
> > > >
> > > > Therefore, userspace cannot store the initial content in shared memory at the
> > > > mmap-ed VA of a guest_memfd that performs in-place conversion between shared and
> > > > private memory. This is because the guest_memfd will first unmap a PFN in shared
> > > > page tables and then check for any extra refcount held for the shared PFN before
> > > > converting it to private.
> > >
> > > I have an idea.
> > >
> > > If I understand correctly, the KVM_GMEM_CONVERT_PRIVATE of in-place
> > > conversion unmap the PFN in shared page tables while keeping the content
> > > of the page unchanged, right?
> >
> > That's correct.
> >
> > >
> > > So KVM_GMEM_CONVERT_PRIVATE can be used to initialize the private memory
> > > actually for non-CoCo case actually, that userspace first mmap() it and
> > > ensure it's shared and writes the initial content to it, after it
> > > userspace convert it to private with KVM_GMEM_CONVERT_PRIVATE.
> >
> > I think you mean pKVM by non-coco VMs that care about private memory.
> > Yes, initial memory regions can start as shared which userspace can
> > populate and then convert the ranges to private.
> >
> > >
> > > For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it
> > > wants the private memory to be initialized with initial content, and
> > > just do in-place TDH.PAGE.ADD in the hook.
> >
> > I think this scheme will be cleaner:
> > 1) Userspace marks the guest_memfd ranges corresponding to initial
> > payload as shared.
> > 2) Userspace mmaps and populates the ranges.
> > 3) Userspace converts those guest_memfd ranges to private.
> > 4) For both SNP and TDX, userspace continues to invoke corresponding
> > initial payload preparation operations via existing KVM ioctls e.g.
> > KVM_SEV_SNP_LAUNCH_UPDATE/KVM_TDX_INIT_MEM_REGION.
> >    - SNP/TDX KVM logic fetches the right pfns for the target gfns
> > using the normal paths supported by KVM and passes those pfns directly
> > to the right trusted module to initialize the "encrypted" memory
> > contents.
> >        - Avoiding any GUP or memcpy from source addresses.
> One caveat:
>
> when TDX populates the mirror root, kvm_gmem_get_pfn() is invoked.
> Then kvm_gmem_prepare_folio() is further invoked to zero the folio.

Given that confidential VMs have their own way of initializing private
memory, I think zeroing only makes sense for shared memory ranges,
i.e. something like below:
1) Don't zero at allocation time.
2) If faulting in a shared page and it's not uptodate, then zero the
page and set the page as uptodate.
3) Clear the uptodate flag on private to shared conversion.
4) For faults on private ranges, don't zero the memory.
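
A minimal kernel-side sketch of 2) and 3), assuming the fault path already has
the folio and knows whether the fault is against a shared range; the function
names here are made up:

  #include <linux/pagemap.h>
  #include <linux/highmem.h>

  /* 2) Zero only on the first shared fault; never zero for private faults. */
  static void gmem_prepare_folio_for_fault(struct folio *folio, bool shared)
  {
          if (shared && !folio_test_uptodate(folio)) {
                  folio_zero_range(folio, 0, folio_size(folio));
                  folio_mark_uptodate(folio);
          }
  }

  /* 3) On private -> shared conversion, force re-zeroing on the next shared fault. */
  static void gmem_on_convert_to_shared(struct folio *folio)
  {
          folio_clear_uptodate(folio);
  }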

There might be some other considerations here, e.g. pKVM needs a
non-destructive conversion operation, which might require a way to enable
zeroing at allocation time only.

On a TDX specific note, IIUC, KVM TDX logic doesn't need to clear
pages on future platforms [1].

[1] https://lore.kernel.org/lkml/6de76911-5007-4170-bf74-e1d045c68465@intel.com/

>
> > i.e. for TDX VMs, KVM_TDX_INIT_MEM_REGION still does the in-place TDH.PAGE.ADD.
> So, upon here, the pages should not contain the original content?
>

Pages should contain the original content. Michael is already
experimenting with similar logic [2] for SNP.

[2] https://lore.kernel.org/lkml/20250613005400.3694904-6-michael.roth@amd.com/
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Yan Zhao 5 months, 2 weeks ago
On Mon, Jun 30, 2025 at 07:14:07AM -0700, Vishal Annapurve wrote:
> On Sun, Jun 29, 2025 at 8:17 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Sun, Jun 29, 2025 at 11:28:22AM -0700, Vishal Annapurve wrote:
> > > On Thu, Jun 19, 2025 at 1:59 AM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> > > >
> > > > On 6/19/2025 4:13 PM, Yan Zhao wrote:
> > > > > On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
> > > > >> Hello,
> > > > >>
> > > > >> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> > > > >> upstream calls to provide 1G page support for guest_memfd by taking
> > > > >> pages from HugeTLB.
> > > > >>
> > > > >> This patchset is based on Linux v6.15-rc6, and requires the mmap support
> > > > >> for guest_memfd patchset (Thanks Fuad!) [1].
> > > > >>
> > > > >> For ease of testing, this series is also available, stitched together,
> > > > >> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> > > > >
> > > > > Just to record a found issue -- not one that must be fixed.
> > > > >
> > > > > In TDX, the initial memory region is added as private memory during TD's build
> > > > > time, with its initial content copied from source pages in shared memory.
> > > > > The copy operation requires simultaneous access to both shared source memory
> > > > > and private target memory.
> > > > >
> > > > > Therefore, userspace cannot store the initial content in shared memory at the
> > > > > mmap-ed VA of a guest_memfd that performs in-place conversion between shared and
> > > > > private memory. This is because the guest_memfd will first unmap a PFN in shared
> > > > > page tables and then check for any extra refcount held for the shared PFN before
> > > > > converting it to private.
> > > >
> > > > I have an idea.
> > > >
> > > > If I understand correctly, the KVM_GMEM_CONVERT_PRIVATE of in-place
> > > > conversion unmap the PFN in shared page tables while keeping the content
> > > > of the page unchanged, right?
> > >
> > > That's correct.
> > >
> > > >
> > > > So KVM_GMEM_CONVERT_PRIVATE can be used to initialize the private memory
> > > > actually for non-CoCo case actually, that userspace first mmap() it and
> > > > ensure it's shared and writes the initial content to it, after it
> > > > userspace convert it to private with KVM_GMEM_CONVERT_PRIVATE.
> > >
> > > I think you mean pKVM by non-coco VMs that care about private memory.
> > > Yes, initial memory regions can start as shared which userspace can
> > > populate and then convert the ranges to private.
> > >
> > > >
> > > > For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it
> > > > wants the private memory to be initialized with initial content, and
> > > > just do in-place TDH.PAGE.ADD in the hook.
> > >
> > > I think this scheme will be cleaner:
> > > 1) Userspace marks the guest_memfd ranges corresponding to initial
> > > payload as shared.
> > > 2) Userspace mmaps and populates the ranges.
> > > 3) Userspace converts those guest_memfd ranges to private.
> > > 4) For both SNP and TDX, userspace continues to invoke corresponding
> > > initial payload preparation operations via existing KVM ioctls e.g.
> > > KVM_SEV_SNP_LAUNCH_UPDATE/KVM_TDX_INIT_MEM_REGION.
> > >    - SNP/TDX KVM logic fetches the right pfns for the target gfns
> > > using the normal paths supported by KVM and passes those pfns directly
> > > to the right trusted module to initialize the "encrypted" memory
> > > contents.
> > >        - Avoiding any GUP or memcpy from source addresses.
> > One caveat:
> >
> > when TDX populates the mirror root, kvm_gmem_get_pfn() is invoked.
> > Then kvm_gmem_prepare_folio() is further invoked to zero the folio.
> 
> Given that confidential VMs have their own way of initializing private
> memory, I think zeroing makes sense for only shared memory ranges.
> i.e. something like below:
> 1) Don't zero at allocation time.
> 2) If faulting in a shared page and its not uptodate, then zero the
> page and set the page as uptodate.
> 3) Clear uptodate flag on private to shared conversion.
> 4) For faults on private ranges, don't zero the memory.
> 
> There might be some other considerations here e.g. pKVM needs
> non-destructive conversion operation, which might need a way to enable
> zeroing at allocation time only.
> 
> On a TDX specific note, IIUC, KVM TDX logic doesn't need to clear
> pages on future platforms [1].
Yes, TDX does not need to clear pages on private page allocation.
But the current kvm_gmem_prepare_folio() clears private pages in the common
path for both TDX and SEV-SNP.

I just wanted to point out that this is an obstacle that needs to be removed
to implement the proposed approach.


> [1] https://lore.kernel.org/lkml/6de76911-5007-4170-bf74-e1d045c68465@intel.com/
> 
> >
> > > i.e. for TDX VMs, KVM_TDX_INIT_MEM_REGION still does the in-place TDH.PAGE.ADD.
> > So, upon here, the pages should not contain the original content?
> >
> 
> Pages should contain the original content. Michael is already
> experimenting with similar logic [2] for SNP.
> 
> [2] https://lore.kernel.org/lkml/20250613005400.3694904-6-michael.roth@amd.com/
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Vishal Annapurve 5 months, 2 weeks ago
On Mon, Jun 30, 2025 at 10:26 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Mon, Jun 30, 2025 at 07:14:07AM -0700, Vishal Annapurve wrote:
> > On Sun, Jun 29, 2025 at 8:17 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Sun, Jun 29, 2025 at 11:28:22AM -0700, Vishal Annapurve wrote:
> > > > On Thu, Jun 19, 2025 at 1:59 AM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> > > > >
> > > > > On 6/19/2025 4:13 PM, Yan Zhao wrote:
> > > > > > On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
> > > > > >> Hello,
> > > > > >>
> > > > > >> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> > > > > >> upstream calls to provide 1G page support for guest_memfd by taking
> > > > > >> pages from HugeTLB.
> > > > > >>
> > > > > >> This patchset is based on Linux v6.15-rc6, and requires the mmap support
> > > > > >> for guest_memfd patchset (Thanks Fuad!) [1].
> > > > > >>
> > > > > >> For ease of testing, this series is also available, stitched together,
> > > > > >> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> > > > > >
> > > > > > Just to record a found issue -- not one that must be fixed.
> > > > > >
> > > > > > In TDX, the initial memory region is added as private memory during TD's build
> > > > > > time, with its initial content copied from source pages in shared memory.
> > > > > > The copy operation requires simultaneous access to both shared source memory
> > > > > > and private target memory.
> > > > > >
> > > > > > Therefore, userspace cannot store the initial content in shared memory at the
> > > > > > mmap-ed VA of a guest_memfd that performs in-place conversion between shared and
> > > > > > private memory. This is because the guest_memfd will first unmap a PFN in shared
> > > > > > page tables and then check for any extra refcount held for the shared PFN before
> > > > > > converting it to private.
> > > > >
> > > > > I have an idea.
> > > > >
> > > > > If I understand correctly, the KVM_GMEM_CONVERT_PRIVATE of in-place
> > > > > conversion unmap the PFN in shared page tables while keeping the content
> > > > > of the page unchanged, right?
> > > >
> > > > That's correct.
> > > >
> > > > >
> > > > > So KVM_GMEM_CONVERT_PRIVATE can be used to initialize the private memory
> > > > > actually for non-CoCo case actually, that userspace first mmap() it and
> > > > > ensure it's shared and writes the initial content to it, after it
> > > > > userspace convert it to private with KVM_GMEM_CONVERT_PRIVATE.
> > > >
> > > > I think you mean pKVM by non-coco VMs that care about private memory.
> > > > Yes, initial memory regions can start as shared which userspace can
> > > > populate and then convert the ranges to private.
> > > >
> > > > >
> > > > > For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it
> > > > > wants the private memory to be initialized with initial content, and
> > > > > just do in-place TDH.PAGE.ADD in the hook.
> > > >
> > > > I think this scheme will be cleaner:
> > > > 1) Userspace marks the guest_memfd ranges corresponding to initial
> > > > payload as shared.
> > > > 2) Userspace mmaps and populates the ranges.
> > > > 3) Userspace converts those guest_memfd ranges to private.
> > > > 4) For both SNP and TDX, userspace continues to invoke corresponding
> > > > initial payload preparation operations via existing KVM ioctls e.g.
> > > > KVM_SEV_SNP_LAUNCH_UPDATE/KVM_TDX_INIT_MEM_REGION.
> > > >    - SNP/TDX KVM logic fetches the right pfns for the target gfns
> > > > using the normal paths supported by KVM and passes those pfns directly
> > > > to the right trusted module to initialize the "encrypted" memory
> > > > contents.
> > > >        - Avoiding any GUP or memcpy from source addresses.
> > > One caveat:
> > >
> > > when TDX populates the mirror root, kvm_gmem_get_pfn() is invoked.
> > > Then kvm_gmem_prepare_folio() is further invoked to zero the folio.
> >
> > Given that confidential VMs have their own way of initializing private
> > memory, I think zeroing makes sense for only shared memory ranges.
> > i.e. something like below:
> > 1) Don't zero at allocation time.
> > 2) If faulting in a shared page and its not uptodate, then zero the
> > page and set the page as uptodate.
> > 3) Clear uptodate flag on private to shared conversion.
> > 4) For faults on private ranges, don't zero the memory.
> >
> > There might be some other considerations here e.g. pKVM needs
> > non-destructive conversion operation, which might need a way to enable
> > zeroing at allocation time only.
> >
> > On a TDX specific note, IIUC, KVM TDX logic doesn't need to clear
> > pages on future platforms [1].
> Yes, TDX does not need to clear pages on private page allocation.
> But current kvm_gmem_prepare_folio() clears private pages in the common path
> for both TDX and SEV-SNP.
>
> I just wanted to point out that it's a kind of obstacle that need to be removed
> to implement the proposed approach.
>

The proposed approach will work with 4K pages without any additional
changes. For huge pages, it's easy to prototype this approach by just
disabling the zeroing logic in guest_memfd on faulting and instead always
zeroing on allocation.

I would be curious to understand whether we need zeroing on conversion for
confidential VMs. If not, then the simple rule of zeroing on allocation
only will work for all use cases.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Sean Christopherson 5 months, 1 week ago
On Tue, Jul 01, 2025, Vishal Annapurve wrote:
> I would be curious to understand if we need zeroing on conversion for
> Confidential VMs. If not, then the simple rule of zeroing on
> allocation only will work for all usecases.

Unless I'm misunderstanding what you're asking, pKVM very specifically does NOT want
zeroing on conversion, because one of its use cases is in-place conversion, e.g.
to fill a shared buffer and then convert it to private so that the buffer can be
processed in the TEE.

Some architectures, e.g. SNP and TDX, may effectively require zeroing on conversion,
but that's essentially a property of the architecture, i.e. an arch/vendor specific
detail.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Vishal Annapurve 5 months, 1 week ago
On Mon, Jul 7, 2025 at 4:25 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jul 01, 2025, Vishal Annapurve wrote:
> > I would be curious to understand if we need zeroing on conversion for
> > Confidential VMs. If not, then the simple rule of zeroing on
> > allocation only will work for all usecases.
>
> Unless I'm misunderstanding what your asking, pKVM very specific does NOT want
> zeroing on conversion, because one of its use cases is in-place conversion, e.g.
> to fill a shared buffer and then convert it to private so that the buffer can be
> processed in the TEE.

Yeah, that makes sense. So a "just zero on allocation" policy (with no more
zeroing during conversion) will work for pKVM.

>
> Some architectures, e.g. SNP and TDX, may effectively require zeroing on conversion,
> but that's essentially a property of the architecture, i.e. an arch/vendor specific
> detail.

The conversion operation is a unique capability supported by guest_memfd
files, so my intention in bringing up zeroing was to better understand
the need and clarify the role of guest_memfd in handling zeroing
during conversion.

Not sure if I am misinterpreting you, but treating "zeroing during
conversion" as the responsibility of arch/vendor-specific
implementation outside of guest_memfd sounds good to me.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Edgecombe, Rick P 5 months, 1 week ago
On Mon, 2025-07-07 at 17:14 -0700, Vishal Annapurve wrote:
> > 
> > Some architectures, e.g. SNP and TDX, may effectively require zeroing on
> > conversion,
> > but that's essentially a property of the architecture, i.e. an arch/vendor
> > specific
> > detail.
> 
> Conversion operation is a unique capability supported by guest_memfd
> files so my intention of bringing up zeroing was to better understand
> the need and clarify the role of guest_memfd in handling zeroing
> during conversion.
> 
> Not sure if I am misinterpreting you, but treating "zeroing during
> conversion" as the responsibility of arch/vendor specific
> implementation outside of guest_memfd sounds good to me.

For TDX, if we don't zero on conversion from private->shared, we will be dependent
on the behavior of the CPU when reading memory with keyid 0 that was previously
encrypted and has some protection bits set. I don't *think* the behavior is
architectural. So it might be prudent to either make it so, or zero it in the
kernel, in order to not turn non-architectural behavior into userspace ABI.

Up the thread, Vishal says we need to support operations that use in-place
conversion (an overloaded term now, I think, btw). Why exactly is pKVM using
private/shared conversion for this private data provisioning, instead of a
special provisioning operation like the others? (Xiaoyao's suggestion)


Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Sean Christopherson 5 months, 1 week ago
On Tue, Jul 08, 2025, Rick P Edgecombe wrote:
> On Mon, 2025-07-07 at 17:14 -0700, Vishal Annapurve wrote:
> > > 
> > > Some architectures, e.g. SNP and TDX, may effectively require zeroing on
> > > conversion,
> > > but that's essentially a property of the architecture, i.e. an arch/vendor
> > > specific
> > > detail.
> > 
> > Conversion operation is a unique capability supported by guest_memfd
> > files so my intention of bringing up zeroing was to better understand
> > the need and clarify the role of guest_memfd in handling zeroing
> > during conversion.
> > 
> > Not sure if I am misinterpreting you, but treating "zeroing during
> > conversion" as the responsibility of arch/vendor specific
> > implementation outside of guest_memfd sounds good to me.
> 
> For TDX if we don't zero on conversion from private->shared we will be dependent
> on behavior of the CPU when reading memory with keyid 0, which was previously
> encrypted and has some protection bits set. I don't *think* the behavior is
> architectural. So it might be prudent to either make it so, or zero it in the
> kernel in order to not make non-architectual behavior into userspace ABI.

Ya, by "vendor specific", I was also lumping in cases where the kernel would need
to zero memory in order to not end up with effectively undefined behavior.

> Up the thread Vishal says we need to support operations that use in-place
> conversion (overloaded term now I think, btw). Why exactly is pKVM using
> private/shared conversion for this private data provisioning?

Because it's literally converting memory from shared to private?  And IIUC, it's
not a one-time provisioning, e.g. memory can go:

  shared => fill => private => consume => shared => fill => private => consume

> Instead of a special provisioning operation like the others? (Xiaoyao's
> suggestion)

Are you referring to this suggestion?

 : And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
 : explicitly request that the page range is converted to private and the
 : content needs to be retained. So that TDX can identify which case needs
 : to call in-place TDH.PAGE.ADD.

If so, I agree with that idea, e.g. add a PRESERVE flag or whatever.  That way
userspace has explicit control over what happens to the data during conversion,
and KVM can reject unsupported conversions, e.g. PRESERVE is only allowed for
shared => private and only for select VM types.
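
A minimal sketch of that rejection logic; the flag and the per-VM-type helper
are made-up names for illustration, not existing uAPI:

  #include <linux/kvm_host.h>

  /* Hypothetical flag on the guest_memfd conversion ioctl. */
  #define GMEM_CONVERT_FLAG_PRESERVE      (1ULL << 0)

  static int gmem_check_convert_flags(struct kvm *kvm, u64 flags, bool to_private)
  {
          if (flags & ~GMEM_CONVERT_FLAG_PRESERVE)
                  return -EINVAL;

          if (flags & GMEM_CONVERT_FLAG_PRESERVE) {
                  /* Preserving contents is only allowed for shared => private... */
                  if (!to_private)
                          return -EINVAL;
                  /* ...and only for VM types that support it (made-up helper). */
                  if (!kvm_arch_gmem_supports_preserve(kvm))
                          return -EOPNOTSUPP;
          }

          return 0;
  }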
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Edgecombe, Rick P 5 months, 1 week ago
On Tue, 2025-07-08 at 07:20 -0700, Sean Christopherson wrote:
> > For TDX if we don't zero on conversion from private->shared we will be
> > dependent
> > on behavior of the CPU when reading memory with keyid 0, which was
> > previously
> > encrypted and has some protection bits set. I don't *think* the behavior is
> > architectural. So it might be prudent to either make it so, or zero it in
> > the
> > kernel in order to not make non-architectual behavior into userspace ABI.
> 
> Ya, by "vendor specific", I was also lumping in cases where the kernel would
> need to zero memory in order to not end up with effectively undefined
> behavior.

Yea, this was more of an answer to Vishal's question about whether CC VMs need
zeroing. And the answer is sort of yes, even though TDX doesn't require it. But we
actually don't want to zero memory when reclaiming it. So the TDX KVM code needs to
know that the operation is a to-shared conversion and not another type of private
zap. That could be a callback from gmem, or maybe more simply a kernel-internal
flag set in gmem so that gmem knows it should zero the pages.

> 
> > Up the thread Vishal says we need to support operations that use in-place
> > conversion (overloaded term now I think, btw). Why exactly is pKVM using
> > private/shared conversion for this private data provisioning?
> 
> Because it's literally converting memory from shared to private?  And IICU,
> it's
> not a one-time provisioning, e.g. memory can go:
> 
>   shared => fill => private => consume => shared => fill => private => consume
> 
> > Instead of a special provisioning operation like the others? (Xiaoyao's
> > suggestion)
> 
> Are you referring to this suggestion?

Yea, in general, to make it a specific, content-preserving operation.

> 
>  : And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
>  : explicitly request that the page range is converted to private and the
>  : content needs to be retained. So that TDX can identify which case needs
>  : to call in-place TDH.PAGE.ADD.
> 
> If so, I agree with that idea, e.g. add a PRESERVE flag or whatever.  That way
> userspace has explicit control over what happens to the data during
> conversion,
> and KVM can reject unsupported conversions, e.g. PRESERVE is only allowed for
> shared => private and only for select VM types.

Ok, we should POC how it works with TDX.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Vishal Annapurve 5 months, 1 week ago
On Tue, Jul 8, 2025 at 7:52 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2025-07-08 at 07:20 -0700, Sean Christopherson wrote:
> > > For TDX if we don't zero on conversion from private->shared we will be
> > > dependent
> > > on behavior of the CPU when reading memory with keyid 0, which was
> > > previously
> > > encrypted and has some protection bits set. I don't *think* the behavior is
> > > architectural. So it might be prudent to either make it so, or zero it in
> > > the
> > > kernel in order to not make non-architectual behavior into userspace ABI.
> >
> > Ya, by "vendor specific", I was also lumping in cases where the kernel would
> > need to zero memory in order to not end up with effectively undefined
> > behavior.
>
> Yea, more of an answer to Vishal's question about if CC VMs need zeroing. And
> the answer is sort of yes, even though TDX doesn't require it. But we actually
> don't want to zero memory when reclaiming memory. So TDX KVM code needs to know
> that the operation is a to-shared conversion and not another type of private
> zap. Like a callback from gmem, or maybe more simply a kernel internal flag to
> set in gmem such that it knows it should zero it.

If the answer is "always zero on private to shared conversions"
for all CC VMs, then does the scheme outlined in [1] make sense for
handling the private -> shared conversions? For pKVM, there can be a
VM type check to skip the zeroing during conversions and instead just
zero on allocation. This allows delaying zeroing until fault time
for CC VMs, and it can be done centrally in guest_memfd. We will need
more input from the SEV side for this discussion.

[1] https://lore.kernel.org/lkml/CAGtprH-83EOz8rrUjE+O8m7nUDjt=THyXx=kfft1xQry65mtQg@mail.gmail.com/

>
> >
> > > Up the thread Vishal says we need to support operations that use in-place
> > > conversion (overloaded term now I think, btw). Why exactly is pKVM using
> > > private/shared conversion for this private data provisioning?
> >
> > Because it's literally converting memory from shared to private?  And IICU,
> > it's
> > not a one-time provisioning, e.g. memory can go:
> >
> >   shared => fill => private => consume => shared => fill => private => consume
> >
> > > Instead of a special provisioning operation like the others? (Xiaoyao's
> > > suggestion)
> >
> > Are you referring to this suggestion?
>
> Yea, in general to make it a specific operation preserving operation.
>
> >
> >  : And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
> >  : explicitly request that the page range is converted to private and the
> >  : content needs to be retained. So that TDX can identify which case needs
> >  : to call in-place TDH.PAGE.ADD.
> >
> > If so, I agree with that idea, e.g. add a PRESERVE flag or whatever.  That way
> > userspace has explicit control over what happens to the data during
> > conversion,
> > and KVM can reject unsupported conversions, e.g. PRESERVE is only allowed for
> > shared => private and only for select VM types.
>
> Ok, we should POC how it works with TDX.

I don't think we need a flag to preserve memory as I mentioned in [2]. IIUC,
1) Conversions are always content-preserving for pKVM.
2) Shared to private conversions are always content-preserving for all
VMs as far as guest_memfd is concerned.
3) Private to shared conversions are not content-preserving for CC VMs
as far as guest_memfd is concerned, subject to more discussions.

[2] https://lore.kernel.org/lkml/CAGtprH-Kzn2kOGZ4JuNtUT53Hugw64M-_XMmhz_gCiDS6BAFtQ@mail.gmail.com/
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Sean Christopherson 5 months, 1 week ago
On Tue, Jul 08, 2025, Vishal Annapurve wrote:
> On Tue, Jul 8, 2025 at 7:52 AM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> >
> > On Tue, 2025-07-08 at 07:20 -0700, Sean Christopherson wrote:
> > > > For TDX if we don't zero on conversion from private->shared we will be
> > > > dependent
> > > > on behavior of the CPU when reading memory with keyid 0, which was
> > > > previously
> > > > encrypted and has some protection bits set. I don't *think* the behavior is
> > > > architectural. So it might be prudent to either make it so, or zero it in
> > > > the
> > > > kernel in order to not make non-architectual behavior into userspace ABI.
> > >
> > > Ya, by "vendor specific", I was also lumping in cases where the kernel would
> > > need to zero memory in order to not end up with effectively undefined
> > > behavior.
> >
> > Yea, more of an answer to Vishal's question about if CC VMs need zeroing. And
> > the answer is sort of yes, even though TDX doesn't require it. But we actually
> > don't want to zero memory when reclaiming memory. So TDX KVM code needs to know
> > that the operation is a to-shared conversion and not another type of private
> > zap. Like a callback from gmem, or maybe more simply a kernel internal flag to
> > set in gmem such that it knows it should zero it.
> 
> If the answer is that "always zero on private to shared conversions"
> for all CC VMs,

pKVM VMs *are* CoCo VMs.  Just because pKVM doesn't rely on third party firmware
to provide confidentiality and integrity doesn't make it any less of a CoCo VM.

> > >  : And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
> > >  : explicitly request that the page range is converted to private and the
> > >  : content needs to be retained. So that TDX can identify which case needs
> > >  : to call in-place TDH.PAGE.ADD.
> > >
> > > If so, I agree with that idea, e.g. add a PRESERVE flag or whatever.  That way
> > > userspace has explicit control over what happens to the data during
> > > conversion,
> > > and KVM can reject unsupported conversions, e.g. PRESERVE is only allowed for
> > > shared => private and only for select VM types.
> >
> > Ok, we should POC how it works with TDX.
> 
> I don't think we need a flag to preserve memory as I mentioned in [2]. IIUC,
> 1) Conversions are always content-preserving for pKVM.

No?  Preserving contents on private => shared is a security vulnerability waiting
to happen.

> 2) Shared to private conversions are always content-preserving for all
> VMs as far as guest_memfd is concerned.

There is no "as far as guest_memfd is concerned".  Userspace doesn't care whether
code lives in guest_memfd.c versus arch/xxx/kvm, the only thing that matters is
the behavior that userspace sees.  I don't want to end up with userspace ABI that
is vendor/VM specific.

> 3) Private to shared conversions are not content-preserving for CC VMs
> as far as guest_memfd is concerned, subject to more discussions.
> 
> [2] https://lore.kernel.org/lkml/CAGtprH-Kzn2kOGZ4JuNtUT53Hugw64M-_XMmhz_gCiDS6BAFtQ@mail.gmail.com/
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Fuad Tabba 5 months, 1 week ago
Hi Sean,

On Tue, 8 Jul 2025 at 16:39, Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jul 08, 2025, Vishal Annapurve wrote:
> > On Tue, Jul 8, 2025 at 7:52 AM Edgecombe, Rick P
> > <rick.p.edgecombe@intel.com> wrote:
> > >
> > > On Tue, 2025-07-08 at 07:20 -0700, Sean Christopherson wrote:
> > > > > For TDX if we don't zero on conversion from private->shared we will be
> > > > > dependent
> > > > > on behavior of the CPU when reading memory with keyid 0, which was
> > > > > previously
> > > > > encrypted and has some protection bits set. I don't *think* the behavior is
> > > > > architectural. So it might be prudent to either make it so, or zero it in
> > > > > the
> > > > > kernel in order to not make non-architectual behavior into userspace ABI.
> > > >
> > > > Ya, by "vendor specific", I was also lumping in cases where the kernel would
> > > > need to zero memory in order to not end up with effectively undefined
> > > > behavior.
> > >
> > > Yea, more of an answer to Vishal's question about if CC VMs need zeroing. And
> > > the answer is sort of yes, even though TDX doesn't require it. But we actually
> > > don't want to zero memory when reclaiming memory. So TDX KVM code needs to know
> > > that the operation is a to-shared conversion and not another type of private
> > > zap. Like a callback from gmem, or maybe more simply a kernel internal flag to
> > > set in gmem such that it knows it should zero it.
> >
> > If the answer is that "always zero on private to shared conversions"
> > for all CC VMs,
>
> pKVM VMs *are* CoCo VMs.  Just because pKVM doesn't rely on third party firmware
> to provide confidentiality and integrity doesn't make it any less of a CoCo VM.



> > > >  : And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
> > > >  : explicitly request that the page range is converted to private and the
> > > >  : content needs to be retained. So that TDX can identify which case needs
> > > >  : to call in-place TDH.PAGE.ADD.
> > > >
> > > > If so, I agree with that idea, e.g. add a PRESERVE flag or whatever.  That way
> > > > userspace has explicit control over what happens to the data during
> > > > conversion,
> > > > and KVM can reject unsupported conversions, e.g. PRESERVE is only allowed for
> > > > shared => private and only for select VM types.
> > >
> > > Ok, we should POC how it works with TDX.
> >
> > I don't think we need a flag to preserve memory as I mentioned in [2]. IIUC,
> > 1) Conversions are always content-preserving for pKVM.
>
> No?  Perserving contents on private => shared is a security vulnerability waiting
> to happen.

Actually it is one of the requirements for pKVM as well as its current
behavior. We would like to preserve contents both ways, private <=>
shared, since it is required by some of the potential use cases (e.g.,
guest handling video encoding/decoding).

To make it clear, I'm talking about explicit sharing from the guest,
not relinquishing memory back to the host. In the case of
relinquishing (and guest teardown), relinquished memory is poisoned
(zeroed) in pKVM.

Cheers,
/fuad

> > 2) Shared to private conversions are always content-preserving for all
> > VMs as far as guest_memfd is concerned.
>
> There is no "as far as guest_memfd is concerned".  Userspace doesn't care whether
> code lives in guest_memfd.c versus arch/xxx/kvm, the only thing that matters is
> the behavior that userspace sees.  I don't want to end up with userspace ABI that
> is vendor/VM specific.
>
> > 3) Private to shared conversions are not content-preserving for CC VMs
> > as far as guest_memfd is concerned, subject to more discussions.
> >
> > [2] https://lore.kernel.org/lkml/CAGtprH-Kzn2kOGZ4JuNtUT53Hugw64M-_XMmhz_gCiDS6BAFtQ@mail.gmail.com/
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Sean Christopherson 5 months, 1 week ago
On Tue, Jul 08, 2025, Fuad Tabba wrote:
> > > I don't think we need a flag to preserve memory as I mentioned in [2]. IIUC,
> > > 1) Conversions are always content-preserving for pKVM.
> >
> > No?  Perserving contents on private => shared is a security vulnerability waiting
> > to happen.
> 
> Actually it is one of the requirements for pKVM as well as its current
> behavior. We would like to preserve contents both ways, private <=>
> shared, since it is required by some of the potential use cases (e.g.,
> guest handling video encoding/decoding).
> 
> To make it clear, I'm talking about explicit sharing from the guest,
> not relinquishing memory back to the host. In the case of
> relinquishing (and guest teardown), relinquished memory is poisoned
> (zeroed) in pKVM.

I forget, what does the "explicit sharing" flow look like?  E.g. how/when does pKVM
know it's ok to convert memory from private to shared?  I think we'd still want
to make data preservation optional, e.g. to avoid potential leakage with setups
where memory is private by default, but a flag in KVM's uAPI might not be a good
fit since whether or not to preserve data is more of a guest decision (or at least
needs to be ok'd by the guest).
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Fuad Tabba 5 months, 1 week ago
On Tue, 8 Jul 2025 at 18:25, Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jul 08, 2025, Fuad Tabba wrote:
> > > > I don't think we need a flag to preserve memory as I mentioned in [2]. IIUC,
> > > > 1) Conversions are always content-preserving for pKVM.
> > >
> > > No?  Perserving contents on private => shared is a security vulnerability waiting
> > > to happen.
> >
> > Actually it is one of the requirements for pKVM as well as its current
> > behavior. We would like to preserve contents both ways, private <=>
> > shared, since it is required by some of the potential use cases (e.g.,
> > guest handling video encoding/decoding).
> >
> > To make it clear, I'm talking about explicit sharing from the guest,
> > not relinquishing memory back to the host. In the case of
> > relinquishing (and guest teardown), relinquished memory is poisoned
> > (zeroed) in pKVM.
>
> I forget, what's the "explicit sharing" flow look like?  E.g. how/when does pKVM
> know it's ok to convert memory from private to shared?  I think we'd still want
> to make data preservation optional, e.g. to avoid potential leakage with setups
> where memory is private by default, but a flag in KVM's uAPI might not be a good
> fit since whether or not to preserve data is more of a guest decision (or at least
> needs to be ok'd by the guest).

In pKVM all sharing and unsharing is triggered by the guest via
hypercalls. The host cannot unshare. That said, making data
preservation optional works for pKVM and is a good idea, for the
reasons that you've mentioned.

Cheers,
/fuad
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Ackerley Tng 5 months ago
Fuad Tabba <tabba@google.com> writes:

> On Tue, 8 Jul 2025 at 18:25, Sean Christopherson <seanjc@google.com> wrote:
>>
>> On Tue, Jul 08, 2025, Fuad Tabba wrote:
>> > > > I don't think we need a flag to preserve memory as I mentioned in [2]. IIUC,
>> > > > 1) Conversions are always content-preserving for pKVM.
>> > >
>> > > No?  Perserving contents on private => shared is a security vulnerability waiting
>> > > to happen.
>> >
>> > Actually it is one of the requirements for pKVM as well as its current
>> > behavior. We would like to preserve contents both ways, private <=>
>> > shared, since it is required by some of the potential use cases (e.g.,
>> > guest handling video encoding/decoding).
>> >
>> > To make it clear, I'm talking about explicit sharing from the guest,
>> > not relinquishing memory back to the host. In the case of
>> > relinquishing (and guest teardown), relinquished memory is poisoned
>> > (zeroed) in pKVM.
>>
>> I forget, what's the "explicit sharing" flow look like?  E.g. how/when does pKVM
>> know it's ok to convert memory from private to shared?  I think we'd still want
>> to make data preservation optional, e.g. to avoid potential leakage with setups
>> where memory is private by default, but a flag in KVM's uAPI might not be a good
>> fit since whether or not to preserve data is more of a guest decision (or at least
>> needs to be ok'd by the guest).
>
> In pKVM all sharing and unsharing is triggered by the guest via
> hypercalls. The host cannot unshare.

In pKVM's case, would the conversion ioctl be disabled completely, or
would the ioctl be allowed, with conversion always checking with pKVM to
see whether the guest had previously requested an unshare?

> That said, making data
> preservation optional works for pKVM and is a good idea, for the
> reasons that you've mentioned.
>
> Cheers,
> /fuad
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Edgecombe, Rick P 5 months, 1 week ago
On Tue, 2025-07-08 at 08:07 -0700, Vishal Annapurve wrote:
> On Tue, Jul 8, 2025 at 7:52 AM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> > 
> > On Tue, 2025-07-08 at 07:20 -0700, Sean Christopherson wrote:
> > > > For TDX if we don't zero on conversion from private->shared we will be
> > > > dependent
> > > > on behavior of the CPU when reading memory with keyid 0, which was
> > > > previously
> > > > encrypted and has some protection bits set. I don't *think* the behavior is
> > > > architectural. So it might be prudent to either make it so, or zero it in
> > > > the
> > > > kernel in order to not make non-architectual behavior into userspace ABI.
> > > 
> > > Ya, by "vendor specific", I was also lumping in cases where the kernel would
> > > need to zero memory in order to not end up with effectively undefined
> > > behavior.
> > 
> > Yea, more of an answer to Vishal's question about if CC VMs need zeroing. And
> > the answer is sort of yes, even though TDX doesn't require it. But we actually
> > don't want to zero memory when reclaiming memory. So TDX KVM code needs to know
> > that the operation is a to-shared conversion and not another type of private
> > zap. Like a callback from gmem, or maybe more simply a kernel internal flag to
> > set in gmem such that it knows it should zero it.
> 
> If the answer is that "always zero on private to shared conversions"
> for all CC VMs, then does the scheme outlined in [1] make sense for
> handling the private -> shared conversions? For pKVM, there can be a
> VM type check to avoid the zeroing during conversions and instead just
> zero on allocations. This allows delaying zeroing until the fault time
> for CC VMs and can be done in guest_memfd centrally. We will need more
> inputs from the SEV side for this discussion.
> 
> [1] https://lore.kernel.org/lkml/CAGtprH-83EOz8rrUjE+O8m7nUDjt=THyXx=kfft1xQry65mtQg@mail.gmail.com/

It's nice that we don't double-zero (since the TDX module will do it too) for
private allocation/mapping. Seems OK to me.

> 
> > 
> > > 
> > > > Up the thread Vishal says we need to support operations that use in-place
> > > > conversion (overloaded term now I think, btw). Why exactly is pKVM using
> > > > private/shared conversion for this private data provisioning?
> > > 
> > > Because it's literally converting memory from shared to private?  And IICU,
> > > it's
> > > not a one-time provisioning, e.g. memory can go:
> > > 
> > >   shared => fill => private => consume => shared => fill => private => consume
> > > 
> > > > Instead of a special provisioning operation like the others? (Xiaoyao's
> > > > suggestion)
> > > 
> > > Are you referring to this suggestion?
> > 
> > Yea, in general to make it a specific operation preserving operation.
> > 
> > > 
> > >  : And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
> > >  : explicitly request that the page range is converted to private and the
> > >  : content needs to be retained. So that TDX can identify which case needs
> > >  : to call in-place TDH.PAGE.ADD.
> > > 
> > > If so, I agree with that idea, e.g. add a PRESERVE flag or whatever.  That way
> > > userspace has explicit control over what happens to the data during
> > > conversion,
> > > and KVM can reject unsupported conversions, e.g. PRESERVE is only allowed for
> > > shared => private and only for select VM types.
> > 
> > Ok, we should POC how it works with TDX.
> 
> I don't think we need a flag to preserve memory as I mentioned in [2]. IIUC,
> 1) Conversions are always content-preserving for pKVM.
> 2) Shared to private conversions are always content-preserving for all
> VMs as far as guest_memfd is concerned.
> 3) Private to shared conversions are not content-preserving for CC VMs
> as far as guest_memfd is concerned, subject to more discussions.
> 
> [2] https://lore.kernel.org/lkml/CAGtprH-Kzn2kOGZ4JuNtUT53Hugw64M-_XMmhz_gCiDS6BAFtQ@mail.gmail.com/

Right, I read that. I still don't see why pKVM needs to do normal private/shared
conversion for data provisioning. Vs a dedicated operation/flag to make it a
special case.

I'm trying to suggest there could be a benefit to making all gmem VM types
behave the same. If conversions are always content preserving for pKVM, why
can't userspace always use the operation that says preserve content, instead
of changing the behavior of the common operations?

So for all VM types, the user ABI would be:
private->shared          - Always zeroes the page
shared->private          - Always destructive
shared->private (w/flag) - Always preserves data, or returns an error if not possible
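
To make that concrete, here is a rough userspace-side sketch of what I mean
(the ioctl numbers, struct layout and PRESERVE flag below are all placeholders,
nothing here is settled guest_memfd uAPI):

    /* Hypothetical sketch only: ioctl numbers, struct layout and flag names
     * are placeholders, not settled guest_memfd uAPI. */
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/ioctl.h>

    struct kvm_gmem_convert {
            uint64_t offset;        /* byte offset into the guest_memfd */
            uint64_t size;          /* length of the range to convert */
            uint64_t flags;         /* KVM_GMEM_CONVERT_FLAG_* */
    };

    #define KVM_GMEM_CONVERT_SHARED        _IOWR('f', 0xc0, struct kvm_gmem_convert)
    #define KVM_GMEM_CONVERT_PRIVATE       _IOWR('f', 0xc1, struct kvm_gmem_convert)
    #define KVM_GMEM_CONVERT_FLAG_PRESERVE (1ULL << 0)

    /* private -> shared: contents are always zeroed. */
    static int convert_to_shared(int gmem_fd, uint64_t offset, uint64_t size)
    {
            struct kvm_gmem_convert c = { .offset = offset, .size = size };

            return ioctl(gmem_fd, KVM_GMEM_CONVERT_SHARED, &c);
    }

    /* shared -> private: destructive by default; PRESERVE keeps the payload,
     * or the ioctl fails if the architecture can't support preserving it. */
    static int convert_to_private(int gmem_fd, uint64_t offset, uint64_t size,
                                  int preserve)
    {
            struct kvm_gmem_convert c = {
                    .offset = offset,
                    .size = size,
                    .flags = preserve ? KVM_GMEM_CONVERT_FLAG_PRESERVE : 0,
            };

            return ioctl(gmem_fd, KVM_GMEM_CONVERT_PRIVATE, &c);
    }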


Do you see a problem?

Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Vishal Annapurve 5 months, 1 week ago
On Tue, Jul 8, 2025 at 8:31 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2025-07-08 at 08:07 -0700, Vishal Annapurve wrote:
> > On Tue, Jul 8, 2025 at 7:52 AM Edgecombe, Rick P
> > <rick.p.edgecombe@intel.com> wrote:
> > >
> > > On Tue, 2025-07-08 at 07:20 -0700, Sean Christopherson wrote:
> > > > > For TDX if we don't zero on conversion from private->shared we will be
> > > > > dependent
> > > > > on behavior of the CPU when reading memory with keyid 0, which was
> > > > > previously
> > > > > encrypted and has some protection bits set. I don't *think* the behavior is
> > > > > architectural. So it might be prudent to either make it so, or zero it in
> > > > > the
> > > > > kernel in order to not make non-architectual behavior into userspace ABI.
> > > >
> > > > Ya, by "vendor specific", I was also lumping in cases where the kernel would
> > > > need to zero memory in order to not end up with effectively undefined
> > > > behavior.
> > >
> > > Yea, more of an answer to Vishal's question about if CC VMs need zeroing. And
> > > the answer is sort of yes, even though TDX doesn't require it. But we actually
> > > don't want to zero memory when reclaiming memory. So TDX KVM code needs to know
> > > that the operation is a to-shared conversion and not another type of private
> > > zap. Like a callback from gmem, or maybe more simply a kernel internal flag to
> > > set in gmem such that it knows it should zero it.
> >
> > If the answer is that "always zero on private to shared conversions"
> > for all CC VMs, then does the scheme outlined in [1] make sense for
> > handling the private -> shared conversions? For pKVM, there can be a
> > VM type check to avoid the zeroing during conversions and instead just
> > zero on allocations. This allows delaying zeroing until the fault time
> > for CC VMs and can be done in guest_memfd centrally. We will need more
> > inputs from the SEV side for this discussion.
> >
> > [1] https://lore.kernel.org/lkml/CAGtprH-83EOz8rrUjE+O8m7nUDjt=THyXx=kfft1xQry65mtQg@mail.gmail.com/
>
> It's nice that we don't double zero (since TDX module will do it too) for
> private allocation/mapping. Seems ok to me.
>
> >
> > >
> > > >
> > > > > Up the thread Vishal says we need to support operations that use in-place
> > > > > conversion (overloaded term now I think, btw). Why exactly is pKVM using
> > > > > private/shared conversion for this private data provisioning?
> > > >
> > > > Because it's literally converting memory from shared to private?  And IICU,
> > > > it's
> > > > not a one-time provisioning, e.g. memory can go:
> > > >
> > > >   shared => fill => private => consume => shared => fill => private => consume
> > > >
> > > > > Instead of a special provisioning operation like the others? (Xiaoyao's
> > > > > suggestion)
> > > >
> > > > Are you referring to this suggestion?
> > >
> > > Yea, in general to make it a specific operation preserving operation.
> > >
> > > >
> > > >  : And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
> > > >  : explicitly request that the page range is converted to private and the
> > > >  : content needs to be retained. So that TDX can identify which case needs
> > > >  : to call in-place TDH.PAGE.ADD.
> > > >
> > > > If so, I agree with that idea, e.g. add a PRESERVE flag or whatever.  That way
> > > > userspace has explicit control over what happens to the data during
> > > > conversion,
> > > > and KVM can reject unsupported conversions, e.g. PRESERVE is only allowed for
> > > > shared => private and only for select VM types.
> > >
> > > Ok, we should POC how it works with TDX.
> >
> > I don't think we need a flag to preserve memory as I mentioned in [2]. IIUC,
> > 1) Conversions are always content-preserving for pKVM.
> > 2) Shared to private conversions are always content-preserving for all
> > VMs as far as guest_memfd is concerned.
> > 3) Private to shared conversions are not content-preserving for CC VMs
> > as far as guest_memfd is concerned, subject to more discussions.
> >
> > [2] https://lore.kernel.org/lkml/CAGtprH-Kzn2kOGZ4JuNtUT53Hugw64M-_XMmhz_gCiDS6BAFtQ@mail.gmail.com/
>
> Right, I read that. I still don't see why pKVM needs to do normal private/shared
> conversion for data provisioning. Vs a dedicated operation/flag to make it a
> special case.

It's dictated by pKVM use cases: memory contents need to be preserved
for every conversion, not just for the initial payload population.

>
> I'm trying to suggest there could be a benefit to making all gmem VM types
> behave the same. If conversions are always content preserving for pKVM, why
> can't userspace  always use the operation that says preserve content? Vs
> changing the behavior of the common operations?

I don't see a benefit in userspace passing a flag that is effectively the
default for the VM type (assuming pKVM will use a special VM type).
Common operations in guest_memfd will need to check either the
userspace-passed flag or the VM type, so there is no major change in the
guest_memfd implementation for either mechanism.

>
> So for all VM types, the user ABI would be:
> private->shared          - Always zero's page
> shared->private          - Always destructive
> shared->private (w/flag) - Always preserves data or return error if not possible
>
>
> Do you see a problem?
>
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Edgecombe, Rick P 5 months, 1 week ago
On Tue, 2025-07-08 at 10:16 -0700, Vishal Annapurve wrote:
> > Right, I read that. I still don't see why pKVM needs to do normal
> > private/shared
> > conversion for data provisioning. Vs a dedicated operation/flag to make it a
> > special case.
> 
> It's dictated by pKVM usecases, memory contents need to be preserved
> for every conversion not just for initial payload population.

We are weighing the pros/cons between:
 - Unifying this uABI across all gmemfd VM types
 - Userspace for one VM type passing a flag for its special non-shared use case

I don't see how passing a flag or not is dictated by the pKVM use case.

P.S. This doesn't really impact TDX, I think, except that TDX development needs
to work in this code without bumping into anything. So I'm just wishing to work
in code with fewer conditionals.

> 
> > 
> > I'm trying to suggest there could be a benefit to making all gmem VM types
> > behave the same. If conversions are always content preserving for pKVM, why
> > can't userspace  always use the operation that says preserve content? Vs
> > changing the behavior of the common operations?
> 
> I don't see a benefit of userspace passing a flag that's kind of
> default for the VM type (assuming pKVM will use a special VM type).

The benefit is that we don't need to have special per-VM default behavior for
gmemfd. Think about if some day (very hypothetical and made up) we want to add a
mode for TDX that adds new private data to a running guest (with special accept
on the guest side or something). Then we might want to add a flag to override
the default destructive behavior. Then maybe pKVM wants to add a "don't
preserve" operation and adds a second flag for that. Now gmemfd has lots of
VM-specific flags. The point of this example is to show how a unified uABI
can be helpful.

> Common operations in guest_memfd will need to either check for the
> userspace passed flag or the VM type, so no major change in
> guest_memfd implementation for either mechanism.

While we discuss ABI, we should allow ourselves to think ahead. So, is a gmemfd
fd tied to a VM? I think there is interest in de-coupling it? Is the VM type
sticky?

It seems the more they are separate, the better it will be to not have VM-aware
behavior living in gmem.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Sean Christopherson 5 months, 1 week ago
On Tue, Jul 08, 2025, Rick P Edgecombe wrote:
> On Tue, 2025-07-08 at 10:16 -0700, Vishal Annapurve wrote:
> > > Right, I read that. I still don't see why pKVM needs to do normal
> > > private/shared
> > > conversion for data provisioning. Vs a dedicated operation/flag to make it a
> > > special case.
> > 
> > It's dictated by pKVM usecases, memory contents need to be preserved
> > for every conversion not just for initial payload population.
> 
> We are weighing pros/cons between:
>  - Unifying this uABI across all gmemfd VM types
>  - Userspace for one VM type passing a flag for it's special non-shared use case
> 
> I don't see how passing a flag or not is dictated by pKVM use case.

Yep.  Baking the behavior of a single use case into the kernel's ABI is rarely a
good idea.  Just because pKVM's current use cases always want contents to be
preserved doesn't mean that pKVM will never change.

As a general rule, KVM should push policy to userspace whenever possible.

> P.S. This doesn't really impact TDX I think. Except that TDX development needs
> to work in the code without bumping anything. So just wishing to work in code
> with less conditionals.
> 
> > 
> > > 
> > > I'm trying to suggest there could be a benefit to making all gmem VM types
> > > behave the same. If conversions are always content preserving for pKVM, why
> > > can't userspace  always use the operation that says preserve content? Vs
> > > changing the behavior of the common operations?
> > 
> > I don't see a benefit of userspace passing a flag that's kind of
> > default for the VM type (assuming pKVM will use a special VM type).
> 
> The benefit is that we don't need to have special VM default behavior for
> gmemfd. Think about if some day (very hypothetical and made up) we want to add a
> mode for TDX that adds new private data to a running guest (with special accept
> on the guest side or something). Then we might want to add a flag to override
> the default destructive behavior. Then maybe pKVM wants to add a "don't
> preserve" operation and it adds a second flag to not destroy. Now gmemfd has
> lots of VM specific flags. The point of this example is to show how unified uABI
> can he helpful.

Yep again. Pivoting on the VM type would be completely inflexible.  If pKVM gains
a usecase that wants to zero memory on conversions, we're hosed.  If SNP or TDX
gains the ability to preserve data on conversions, we're hosed.

The VM type may restrict what is possible, but (a) that should be abstracted,
e.g. by defining the allowed flags during guest_memfd creation, and (b) the
capabilities of the guest_memfd instance need to be communicated to userspace.
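
Roughly something like this on the guest_memfd side (pure sketch, made-up
names): the owning VM's backend reports the supported conversion flags at
creation time, and the conversion path then checks that set instead of the
VM type:

    /* Sketch only -- not actual KVM code; all names are illustrative. */
    #include <errno.h>

    /* Conversion flags a guest_memfd instance may support. */
    #define GMEM_CONV_PRESERVE      (1ULL << 0)

    struct gmem_instance {
            unsigned long long supported_conv_flags;   /* fixed at creation */
    };

    /* At creation, the owning VM (via its arch backend) reports what the
     * hardware/firmware can do; guest_memfd just records it and can report
     * it back to userspace. */
    static void gmem_init_caps(struct gmem_instance *gmem,
                               unsigned long long arch_supported_flags)
    {
            gmem->supported_conv_flags = arch_supported_flags;
    }

    /* The conversion path checks the per-instance capability set, never the
     * VM type. */
    static int gmem_check_conv_flags(struct gmem_instance *gmem,
                                     unsigned long long flags)
    {
            if (flags & ~gmem->supported_conv_flags)
                    return -EINVAL;     /* unsupported flag, e.g. PRESERVE */
            return 0;
    }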
 
> > Common operations in guest_memfd will need to either check for the
> > userspace passed flag or the VM type, so no major change in
> > guest_memfd implementation for either mechanism.
> 
> While we discuss ABI, we should allow ourselves to think ahead. So, is a gmemfd
> fd tied to a VM?

Yes.

> I think there is interest in de-coupling it?

No?  Even if we get to a point where multiple distinct VMs can bind to a single
guest_memfd, e.g. for inter-VM shared memory, there will still need to be a sole
owner of the memory.  AFAICT, fully decoupling guest_memfd from a VM would add
non-trivial complexity for zero practical benefit.

> Is the VM type sticky?
> 
> It seems the more they are separate, the better it will be to not have VM-aware
> behavior living in gmem.

Ya.  A guest_memfd instance may have capabilities/features that are restricted
and/or defined based on the properties of the owning VM, but we should do our
best to make guest_memfd itself blissfully unaware of the VM type.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Vishal Annapurve 5 months, 1 week ago
On Tue, Jul 8, 2025 at 11:03 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jul 08, 2025, Rick P Edgecombe wrote:
> > On Tue, 2025-07-08 at 10:16 -0700, Vishal Annapurve wrote:
> > > > Right, I read that. I still don't see why pKVM needs to do normal
> > > > private/shared
> > > > conversion for data provisioning. Vs a dedicated operation/flag to make it a
> > > > special case.
> > >
> > > It's dictated by pKVM usecases, memory contents need to be preserved
> > > for every conversion not just for initial payload population.
> >
> > We are weighing pros/cons between:
> >  - Unifying this uABI across all gmemfd VM types
> >  - Userspace for one VM type passing a flag for it's special non-shared use case
> >
> > I don't see how passing a flag or not is dictated by pKVM use case.
>
> Yep.  Baking the behavior of a single usecase into the kernel's ABI is rarely a
> good idea.  Just because pKVM's current usecases always wants contents to be
> preserved doesn't mean that pKVM will never change.
>
> As a general rule, KVM should push policy to userspace whenever possible.
>
> > P.S. This doesn't really impact TDX I think. Except that TDX development needs
> > to work in the code without bumping anything. So just wishing to work in code
> > with less conditionals.
> >
> > >
> > > >
> > > > I'm trying to suggest there could be a benefit to making all gmem VM types
> > > > behave the same. If conversions are always content preserving for pKVM, why
> > > > can't userspace  always use the operation that says preserve content? Vs
> > > > changing the behavior of the common operations?
> > >
> > > I don't see a benefit of userspace passing a flag that's kind of
> > > default for the VM type (assuming pKVM will use a special VM type).
> >
> > The benefit is that we don't need to have special VM default behavior for
> > gmemfd. Think about if some day (very hypothetical and made up) we want to add a
> > mode for TDX that adds new private data to a running guest (with special accept
> > on the guest side or something). Then we might want to add a flag to override
> > the default destructive behavior. Then maybe pKVM wants to add a "don't
> > preserve" operation and it adds a second flag to not destroy. Now gmemfd has
> > lots of VM specific flags. The point of this example is to show how unified uABI
> > can he helpful.
>
> Yep again. Pivoting on the VM type would be completely inflexible.  If pKVM gains
> a usecase that wants to zero memory on conversions, we're hosed.  If SNP or TDX
> gains the ability to preserve data on conversions, we're hosed.
>
> The VM type may restrict what is possible, but (a) that should be abstracted,
> e.g. by defining the allowed flags during guest_memfd creation, and (b) the
> capabilities of the guest_memfd instance need to be communicated to userspace.

Ok, I concur with this: It's beneficial to keep a unified ABI that
allows guest_memfd to make runtime decisions without relying on VM
type as far as possible.

A few points that seem important here:
1) Userspace can and should be able to only dictate if memory contents
   need to be preserved on shared to private conversion.
   -> For SNP/TDX VMs:
        * The only use case for preserving contents is initial memory
          population, which can be achieved by:
               - Userspace converting the ranges to shared, populating
                 the contents, converting them back to private and then
                 calling SNP/TDX-specific existing ABI functions.
        * For runtime conversions, guest_memfd can't ensure memory
          contents are preserved during shared to private conversions as
          the architectures don't support that behavior.
        * So IMO, this "preserve" flag doesn't make sense for SNP/TDX
          VMs; even if we add this flag, today guest_memfd should
          effectively mark it unsupported based on the backing
          architecture support.
2) For pKVM, if userspace wants to specify a "preserve" flag then this
   flag can be allowed based on the known capabilities of the backing
   architecture.

So this topic is still orthogonal to "zeroing on private to shared conversion".





>
> > > Common operations in guest_memfd will need to either check for the
> > > userspace passed flag or the VM type, so no major change in
> > > guest_memfd implementation for either mechanism.
> >
> > While we discuss ABI, we should allow ourselves to think ahead. So, is a gmemfd
> > fd tied to a VM?
>
> Yes.
>
> > I think there is interest in de-coupling it?
>
> No?  Even if we get to a point where multiple distinct VMs can bind to a single
> guest_memfd, e.g. for inter-VM shared memory, there will still need to be a sole
> owner of the memory.  AFAICT, fully decoupling guest_memfd from a VM would add
> non-trivial complexity for zero practical benefit.
>
> > Is the VM type sticky?
> >
> > It seems the more they are separate, the better it will be to not have VM-aware
> > behavior living in gmem.
>
> Ya.  A guest_memfd instance may have capabilities/features that are restricted
> and/or defined based on the properties of the owning VM, but we should do our
> best to make guest_memfd itself blissly unaware of the VM type.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Sean Christopherson 5 months, 1 week ago
On Tue, Jul 08, 2025, Vishal Annapurve wrote:
> On Tue, Jul 8, 2025 at 11:03 AM Sean Christopherson <seanjc@google.com> wrote:
> Few points that seem important here:
> 1) Userspace can and should be able to only dictate if memory contents
> need to be preserved on shared to private conversion.

No, I was wrong, pKVM has use cases where it's desirable to preserve data on
private => shared conversions.

Side topic, if you're going to use fancy indentation, align the indentation so
it's actually readable.

>   -> For SNP/TDX VMs:
>        * Only usecase for preserving contents is initial memory
>          population, which can be achieved by:
>               -  Userspace converting the ranges to shared, populating the contents,
>                  converting them back to private and then calling SNP/TDX specific
>                  existing ABI functions.
>        * For runtime conversions, guest_memfd can't ensure memory contents are
>          preserved during shared to private conversions as the architectures
>          don't support that behavior.
>        * So IMO, this "preserve" flag doesn't make sense for SNP/TDX VMs, even

It makes sense, it's just not supported by the architecture *at runtime*.  Case
in point, *something* needs to allow preserving data prior to launching the VM.
If we want to go with the PRIVATE => SHARED => FILL => PRIVATE approach for TDX
and SNP, then we'll probably want to allow PRESERVE only until the VM image is
finalized.
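
E.g. something like this in the conversion path (sketch only, made-up names):

    /* Sketch: gate PRESERVE on whether the VM image has been finalized.
     * Names are made up; this is not actual KVM code. */
    #include <errno.h>
    #include <stdbool.h>

    struct gmem_owner {
            bool image_finalized;        /* set once the initial payload is measured */
            bool arch_runtime_preserve;  /* can the arch preserve data post-finalize? */
    };

    static int gmem_check_preserve(const struct gmem_owner *owner, bool preserve)
    {
            if (!preserve)
                    return 0;

            /* Pre-finalization, PRESERVE is what lets userspace stage the
             * initial payload via shared => fill => private.  After
             * finalization, allow it only if the architecture can actually
             * preserve data at runtime. */
            if (owner->image_finalized && !owner->arch_runtime_preserve)
                    return -EINVAL;

            return 0;
    }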

>          if we add this flag, today guest_memfd should effectively mark this
>          unsupported based on the backing architecture support.
>
> 2) For pKVM, if userspace wants to specify a "preserve" flag then this

There is no "For pKVM".  We are defining uAPI for guest_memfd.  I.e. this statement
holds true for all implementations: PRESERVE is allowed based on the capabilities
of the architecture.

> So this topic is still orthogonal to "zeroing on private to shared conversion".

As above, no.  pKVM might not expose PRESERVE to _userspace_ since all current
conversions are initiated by the guest, but for guest_memfd itself, this is all
one and the same.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Vishal Annapurve 5 months, 1 week ago
On Tue, Jul 8, 2025 at 12:59 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jul 08, 2025, Vishal Annapurve wrote:
> > On Tue, Jul 8, 2025 at 11:03 AM Sean Christopherson <seanjc@google.com> wrote:
> > Few points that seem important here:
> > 1) Userspace can and should be able to only dictate if memory contents
> > need to be preserved on shared to private conversion.
>
> No, I was wrong, pKVM has use cases where it's desirable to preserve data on
> private => shared conversions.
>
> Side topic, if you're going to use fancy indentation, align the indentation so
> it's actually readable.
>
> >   -> For SNP/TDX VMs:
> >        * Only usecase for preserving contents is initial memory
> >          population, which can be achieved by:
> >               -  Userspace converting the ranges to shared, populating the contents,
> >                  converting them back to private and then calling SNP/TDX specific
> >                  existing ABI functions.
> >        * For runtime conversions, guest_memfd can't ensure memory contents are
> >          preserved during shared to private conversions as the architectures
> >          don't support that behavior.
> >        * So IMO, this "preserve" flag doesn't make sense for SNP/TDX VMs, even
>
> It makes sense, it's just not supported by the architecture *at runtime*.  Case
> in point, *something* needs to allow preserving data prior to launching the VM.
> If we want to go with the PRIVATE => SHARED => FILL => PRIVATE approach for TDX
> and SNP, then we'll probably want to allow PRESERVE only until the VM image is
> finalized.

Maybe we can simplify the story a bit here for today. How about:
1) For shared to private conversions:
       * Is it safe to say that the conversion itself is always content
         preserving, and it's up to the architecture what it does with memory
         contents on the private faults?
             - During initial memory setup, userspace can control how private
               memory will be faulted in via architecture-supported ABI
               operations.
             - After initial memory setup, userspace can't control how private
               memory will be faulted in.
2) For private to shared conversions:
       * The architecture decides what should be done with the memory on
         shared faults.
             - guest_memfd can query the architecture whether to zero the
               memory or not.

-> guest_memfd will only take on the responsibility of zeroing if needed by
   the architecture on shared faults.
-> The architecture is responsible for the behavior on private faults.

In the future, if there is a use case for controlling the runtime behavior of
private faults, the architecture can expose additional ABI that userspace can
use after initiating a guest_memfd conversion.
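
As a rough sketch of how that split could look in guest_memfd's shared-fault
path (illustrative only; none of these helpers/types exist today):

    /* Illustrative sketch of the proposed split -- not real guest_memfd code. */
    #include <stdbool.h>
    #include <string.h>

    struct gmem_arch_ops {
            /* Does this architecture require the kernel to zero a page when
             * it is faulted in as shared after having been private (e.g.
             * because reading it with the default keyid is not guaranteed to
             * be architectural)? */
            bool (*zero_on_shared_fault)(void);
    };

    struct gmem_folio {
            void *kaddr;
            unsigned long size;
            bool was_private;
    };

    /* Shared-fault path: guest_memfd only zeroes if the arch says it must.
     * Private faults are the architecture's responsibility (e.g. the TDX
     * module zeroes pages when adding them as private). */
    static void gmem_prepare_shared_fault(struct gmem_folio *folio,
                                          const struct gmem_arch_ops *ops)
    {
            if (folio->was_private && ops->zero_on_shared_fault())
                    memset(folio->kaddr, 0, folio->size);

            folio->was_private = false;
    }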
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Edgecombe, Rick P 5 months, 1 week ago
On Tue, 2025-07-08 at 11:03 -0700, Sean Christopherson wrote:
> > I think there is interest in de-coupling it?
> 
> No?

I'm talking about the intra-host migration/reboot optimization stuff. And not
doing a good job, sorry.

>   Even if we get to a point where multiple distinct VMs can bind to a single
> guest_memfd, e.g. for inter-VM shared memory, there will still need to be a
> sole
> owner of the memory.  AFAICT, fully decoupling guest_memfd from a VM would add
> non-trivial complexity for zero practical benefit.

I'm talking about moving a gmem fd between different VMs or something using
KVM_LINK_GUEST_MEMFD [0]. Not advocating to try to support it. But trying to
feel out where the concepts are headed. It kind of allows gmem fds (or just
their source memory?) to live beyond a VM lifecycle.

[0] https://lore.kernel.org/all/cover.1747368092.git.afranji@google.com/
https://lore.kernel.org/kvm/cover.1749672978.git.afranji@google.com/
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Sean Christopherson 5 months, 1 week ago
On Tue, Jul 08, 2025, Rick P Edgecombe wrote:
> On Tue, 2025-07-08 at 11:03 -0700, Sean Christopherson wrote:
> > > I think there is interest in de-coupling it?
> > 
> > No?
> 
> I'm talking about the intra-host migration/reboot optimization stuff. And not
> doing a good job, sorry.
> 
> >   Even if we get to a point where multiple distinct VMs can bind to a single
> > guest_memfd, e.g. for inter-VM shared memory, there will still need to be a
> > sole
> > owner of the memory.  AFAICT, fully decoupling guest_memfd from a VM would add
> > non-trivial complexity for zero practical benefit.
> 
> I'm talking about moving a gmem fd between different VMs or something using
> KVM_LINK_GUEST_MEMFD [0]. Not advocating to try to support it. But trying to
> feel out where the concepts are headed. It kind of allows gmem fds (or just
> their source memory?) to live beyond a VM lifecycle.

I think the answer is that we want to let guest_memfd live beyond the "struct kvm"
instance, but not beyond the Virtual Machine.  From a past discussion on this topic[*].

 : No go.  Because again, the inode (physical memory) is coupled to the virtual machine
 : as a thing, not to a "struct kvm".  Or more concretely, the inode is coupled to an
 : ASID or an HKID, and there can be multiple "struct kvm" objects associated with a
 : single ASID.  And at some point in the future, I suspect we'll have multiple KVM
 : objects per HKID too.
 : 
 : The current SEV use case is for the migration helper, where two KVM objects share
 : a single ASID (the "real" VM and the helper).  I suspect TDX will end up with
 : similar behavior where helper "VMs" can use the HKID of the "real" VM.  For KVM,
 : that means multiple struct kvm objects being associated with a single HKID.
 : 
 : To prevent use-after-free, KVM "just" needs to ensure the helper instances can't
 : outlive the real instance, i.e. can't use the HKID/ASID after the owning virtual
 : machine has been destroyed.
 : 
 : To put it differently, "struct kvm" is a KVM software construct that _usually_,
 : but not always, is associated 1:1 with a virtual machine.
 : 
 : And FWIW, stashing the pointer without holding a reference would not be a complete
 : solution, because it couldn't guard against KVM reusing a pointer.  E.g. if a
 : struct kvm was unbound and then freed, KVM could reuse the same memory for a new
 : struct kvm, with a different ASID/HKID, and get a false negative on the rebinding
 : check.

Exactly what that will look like in code is TBD, but the concept/logic holds up.

[*] https://lore.kernel.org/all/ZOO782YGRY0YMuPu@google.com

> [0] https://lore.kernel.org/all/cover.1747368092.git.afranji@google.com/
> https://lore.kernel.org/kvm/cover.1749672978.git.afranji@google.com/
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Vishal Annapurve 5 months, 1 week ago
On Tue, Jul 8, 2025 at 11:55 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jul 08, 2025, Rick P Edgecombe wrote:
> > On Tue, 2025-07-08 at 11:03 -0700, Sean Christopherson wrote:
> > > > I think there is interest in de-coupling it?
> > >
> > > No?
> >
> > I'm talking about the intra-host migration/reboot optimization stuff. And not
> > doing a good job, sorry.
> >
> > >   Even if we get to a point where multiple distinct VMs can bind to a single
> > > guest_memfd, e.g. for inter-VM shared memory, there will still need to be a
> > > sole
> > > owner of the memory.  AFAICT, fully decoupling guest_memfd from a VM would add
> > > non-trivial complexity for zero practical benefit.
> >
> > I'm talking about moving a gmem fd between different VMs or something using
> > KVM_LINK_GUEST_MEMFD [0]. Not advocating to try to support it. But trying to
> > feel out where the concepts are headed. It kind of allows gmem fds (or just
> > their source memory?) to live beyond a VM lifecycle.
>
> I think the answer is that we want to let guest_memfd live beyond the "struct kvm"
> instance, but not beyond the Virtual Machine.  From a past discussion on this topic[*].
>
>  : No go.  Because again, the inode (physical memory) is coupled to the virtual machine
>  : as a thing, not to a "struct kvm".  Or more concretely, the inode is coupled to an
>  : ASID or an HKID, and there can be multiple "struct kvm" objects associated with a
>  : single ASID.  And at some point in the future, I suspect we'll have multiple KVM
>  : objects per HKID too.
>  :
>  : The current SEV use case is for the migration helper, where two KVM objects share
>  : a single ASID (the "real" VM and the helper).  I suspect TDX will end up with
>  : similar behavior where helper "VMs" can use the HKID of the "real" VM.  For KVM,
>  : that means multiple struct kvm objects being associated with a single HKID.
>  :
>  : To prevent use-after-free, KVM "just" needs to ensure the helper instances can't
>  : outlive the real instance, i.e. can't use the HKID/ASID after the owning virtual
>  : machine has been destroyed.
>  :
>  : To put it differently, "struct kvm" is a KVM software construct that _usually_,
>  : but not always, is associated 1:1 with a virtual machine.
>  :
>  : And FWIW, stashing the pointer without holding a reference would not be a complete
>  : solution, because it couldn't guard against KVM reusing a pointer.  E.g. if a
>  : struct kvm was unbound and then freed, KVM could reuse the same memory for a new
>  : struct kvm, with a different ASID/HKID, and get a false negative on the rebinding
>  : check.
>
> Exactly what that will look like in code is TBD, but the concept/logic holds up.

I think we can simplify the role of guest_memfd in line with discussion [1]:
1) guest_memfd is a memory provider for userspace, KVM, IOMMU.
         - It allows fallocate to populate/deallocate memory.
2) guest_memfd supports the notion of private/shared faults.
3) guest_memfd supports memory access control:
         - It allows shared faults from userspace, KVM, IOMMU.
         - It allows private faults from KVM, IOMMU.
4) guest_memfd supports changing access control on its ranges between
   shared/private.
         - It notifies the users to invalidate their mappings for the
           ranges getting converted/truncated.

Responsibilities that ideally should not be taken up by guest_memfd:
1) guest_memfd cannot initiate pre-faulting on behalf of its users.
2) guest_memfd should not be directly communicating with the
   underlying architecture layers.
         - All communication should go via KVM/IOMMU.
3) KVM should ideally associate the lifetime of backing
   pagetables/protection tables/RMP tables with the lifetime of the
   binding of memslots with guest_memfd.
         - Today KVM SNP logic ties RMP table entry lifetimes to how
           long the folios are mapped in guest_memfd, which I think
           should be revisited.

Some very early thoughts on how guest_memfd could be laid out for the long term:
1) guest_memfd code ideally should be built into the kernel.
2) guest_memfd instances should still be created using KVM IOCTLs that
   carry specific capabilities/restrictions for their users based on the
   backing VM/arch.
3) Any outgoing communication from guest_memfd to its users, like
   userspace/KVM/IOMMU, should be via notifiers to invalidate, similar to
   how MMU notifiers work.
4) KVM and IOMMU can implement intermediate layers to handle
   interaction with guest_memfd.
     - e.g. there could be a layer within KVM that handles:
             - creating guest_memfd files and associating a
               kvm_gmem_context with those files.
             - memslot binding
                       - kvm_gmem_context will be used to bind kvm
                         memslots with the context ranges.
             - invalidate notifier handling
                       - kvm_gmem_context will be used to intercept
                         guest_memfd callbacks and translate them to
                         the right GPA ranges.
             - linking
                       - kvm_gmem_context can be linked to different
                         KVM instances.

This line of thinking can allow cleaner separation between
guest_memfd/KVM/IOMMU [2].

[1] https://lore.kernel.org/lkml/CAGtprH-+gPN8J_RaEit=M_ErHWTmFHeCipC6viT6PHhG3ELg6A@mail.gmail.com/#t
[2] https://lore.kernel.org/lkml/31beeed3-b1be-439b-8a5b-db8c06dadc30@amd.com/
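
For point 3 above, a very rough sketch of what such an invalidation notifier
could look like (purely illustrative, loosely modeled on mmu_notifier; none of
these types exist):

    /* Purely illustrative -- loosely modeled on mmu_notifier; nothing here
     * exists in the kernel today. */
    #include <stddef.h>

    struct gmem_notifier;

    struct gmem_notifier_ops {
            /* Called before a range is converted or truncated so the user
             * (KVM, IOMMU, ...) can drop its mappings for that range. */
            void (*invalidate)(struct gmem_notifier *nb,
                               unsigned long long offset,
                               unsigned long long size);
    };

    struct gmem_notifier {
            const struct gmem_notifier_ops *ops;
            struct gmem_notifier *next;
    };

    struct gmem_file {
            struct gmem_notifier *notifiers;    /* registered users */
    };

    /* guest_memfd side: walk the registered users and tell them to invalidate.
     * A KVM-internal layer would translate (offset, size) to GPA ranges via
     * its memslot binding before zapping stage-2 mappings. */
    static void gmem_invalidate_range(struct gmem_file *gmem,
                                      unsigned long long offset,
                                      unsigned long long size)
    {
            struct gmem_notifier *nb;

            for (nb = gmem->notifiers; nb; nb = nb->next)
                    nb->ops->invalidate(nb, offset, size);
    }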



>
> [*] https://lore.kernel.org/all/ZOO782YGRY0YMuPu@google.com
>
> > [0] https://lore.kernel.org/all/cover.1747368092.git.afranji@google.com/
> > https://lore.kernel.org/kvm/cover.1749672978.git.afranji@google.com/
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Edgecombe, Rick P 5 months, 1 week ago
On Wed, 2025-07-09 at 07:28 -0700, Vishal Annapurve wrote:
> I think we can simplify the role of guest_memfd in line with discussion [1]:
> 1) guest_memfd is a memory provider for userspace, KVM, IOMMU.
>          - It allows fallocate to populate/deallocate memory
> 2) guest_memfd supports the notion of private/shared faults.
> 3) guest_memfd supports memory access control:
>          - It allows shared faults from userspace, KVM, IOMMU
>          - It allows private faults from KVM, IOMMU
> 4) guest_memfd supports changing access control on its ranges between
> shared/private.
>          - It notifies the users to invalidate their mappings for the
> ranges getting converted/truncated.

KVM needs to know if a GFN is private/shared. I think guest_memfd is also now
intended to be the repository for this information, right? Besides
invalidations, it needs to be queryable.

> 
> Responsibilities that ideally should not be taken up by guest_memfd:
> 1) guest_memfd can not initiate pre-faulting on behalf of it's users.
> 2) guest_memfd should not be directly communicating with the
> underlying architecture layers.
>          - All communication should go via KVM/IOMMU.

Maybe stronger, there should be generic gmem behaviors. Not any special
if (vm_type == tdx) type logic. 

> 3) KVM should ideally associate the lifetime of backing
> pagetables/protection tables/RMP tables with the lifetime of the
> binding of memslots with guest_memfd.
>          - Today KVM SNP logic ties RMP table entry lifetimes with how
> long the folios are mapped in guest_memfd, which I think should be
> revisited.

I don't understand the problem. KVM needs to respond to user-accessible
invalidations, but how long it keeps other resources around could be useful for
various optimizations, like deferring work to a work queue or something.

I think it would help to just target the goals of Ackerley's series. We should
get that code into shape and this kind of stuff will fall out of it.

> 
> Some very early thoughts on how guest_memfd could be laid out for the long term:
> 1) guest_memfd code ideally should be built-in to the kernel.
> 2) guest_memfd instances should still be created using KVM IOCTLs that
> carry specific capabilities/restrictions for its users based on the
> backing VM/arch.
> 3) Any outgoing communication from guest_memfd to it's users like
> userspace/KVM/IOMMU should be via notifiers to invalidate similar to
> how MMU notifiers work.
> 4) KVM and IOMMU can implement intermediate layers to handle
> interaction with guest_memfd.
>      - e.g. there could be a layer within kvm that handles:
>              - creating guest_memfd files and associating a
> kvm_gmem_context with those files.
>              - memslot binding
>                        - kvm_gmem_context will be used to bind kvm
> memslots with the context ranges.
>              - invalidate notifier handling
>                         - kvm_gmem_context will be used to intercept
> guest_memfd callbacks and
>                           translate them to the right GPA ranges.
>              - linking
>                         - kvm_gmem_context can be linked to different
> KVM instances.

We can probably look at the code to decide these.

> 
> This line of thinking can allow cleaner separation between
> guest_memfd/KVM/IOMMU [2].
> 
> [1] https://lore.kernel.org/lkml/CAGtprH-+gPN8J_RaEit=M_ErHWTmFHeCipC6viT6PHhG3ELg6A@mail.gmail.com/#t
> [2] https://lore.kernel.org/lkml/31beeed3-b1be-439b-8a5b-db8c06dadc30@amd.com/
> 
> 
> 
> > 
> > [*] https://lore.kernel.org/all/ZOO782YGRY0YMuPu@google.com
> > 
> > > [0] https://lore.kernel.org/all/cover.1747368092.git.afranji@google.com/
> > > https://lore.kernel.org/kvm/cover.1749672978.git.afranji@google.com/

Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Vishal Annapurve 5 months, 1 week ago
On Wed, Jul 9, 2025 at 8:17 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Wed, 2025-07-09 at 07:28 -0700, Vishal Annapurve wrote:
> > I think we can simplify the role of guest_memfd in line with discussion [1]:
> > 1) guest_memfd is a memory provider for userspace, KVM, IOMMU.
> >          - It allows fallocate to populate/deallocate memory
> > 2) guest_memfd supports the notion of private/shared faults.
> > 3) guest_memfd supports memory access control:
> >          - It allows shared faults from userspace, KVM, IOMMU
> >          - It allows private faults from KVM, IOMMU
> > 4) guest_memfd supports changing access control on its ranges between
> > shared/private.
> >          - It notifies the users to invalidate their mappings for the
> > ranges getting converted/truncated.
>
> KVM needs to know if a GFN is private/shared. I think it is also intended to now
> be a repository for this information, right? Besides invalidations, it needs to
> be queryable.

Yeah, that interface can be added as well. Though, if possible, KVM can
just directly pass the fault type to guest_memfd, and guest_memfd can
return an error if the fault type doesn't match the permission.
Additionally, KVM does query the mapping order for a certain pfn/gfn,
which will need to be supported as well.
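
Something along these lines, as a sketch (made-up types, just to illustrate
the fault-type check and the order query):

    /* Sketch only -- made-up types to illustrate the fault-type check. */
    #include <errno.h>
    #include <stdbool.h>

    enum gmem_fault_type {
            GMEM_FAULT_SHARED,
            GMEM_FAULT_PRIVATE,
    };

    struct gmem_range_state {
            bool is_private;        /* current shareability of the range */
            unsigned int order;     /* max mapping order backing this offset */
    };

    /* KVM passes the fault type along with the offset; guest_memfd rejects
     * the fault if it doesn't match the range's current shareability, and
     * reports the mapping order so KVM can pick a hugepage level. */
    static int gmem_get_pfn_for_fault(const struct gmem_range_state *state,
                                      enum gmem_fault_type type,
                                      unsigned int *max_order)
    {
            bool want_private = (type == GMEM_FAULT_PRIVATE);

            if (want_private != state->is_private)
                    return -EAGAIN;     /* state changed or mismatched fault */

            *max_order = state->order;
            return 0;
    }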

>
> >
> > Responsibilities that ideally should not be taken up by guest_memfd:
> > 1) guest_memfd can not initiate pre-faulting on behalf of it's users.
> > 2) guest_memfd should not be directly communicating with the
> > underlying architecture layers.
> >          - All communication should go via KVM/IOMMU.
>
> Maybe stronger, there should be generic gmem behaviors. Not any special
> if (vm_type == tdx) type logic.
>
> > 3) KVM should ideally associate the lifetime of backing
> > pagetables/protection tables/RMP tables with the lifetime of the
> > binding of memslots with guest_memfd.
> >          - Today KVM SNP logic ties RMP table entry lifetimes with how
> > long the folios are mapped in guest_memfd, which I think should be
> > revisited.
>
> I don't understand the problem. KVM needs to respond to user accessible
> invalidations, but how long it keeps other resources around could be useful for
> various optimizations. Like deferring work to a work queue or something.

I don't think it could be deferred to a work queue, as the RMP table
entries will need to be removed synchronously once the last reference
on the guest_memfd drops, unless the memory itself is kept around after
filemap eviction. I can see benefits of that approach for handling
scenarios like intra-host migration.

>
> I think it would help to just target the ackerly series goals. We should get
> that code into shape and this kind of stuff will fall out of it.
>
> >
> > Some very early thoughts on how guest_memfd could be laid out for the long term:
> > 1) guest_memfd code ideally should be built-in to the kernel.
> > 2) guest_memfd instances should still be created using KVM IOCTLs that
> > carry specific capabilities/restrictions for its users based on the
> > backing VM/arch.
> > 3) Any outgoing communication from guest_memfd to it's users like
> > userspace/KVM/IOMMU should be via notifiers to invalidate similar to
> > how MMU notifiers work.
> > 4) KVM and IOMMU can implement intermediate layers to handle
> > interaction with guest_memfd.
> >      - e.g. there could be a layer within kvm that handles:
> >              - creating guest_memfd files and associating a
> > kvm_gmem_context with those files.
> >              - memslot binding
> >                        - kvm_gmem_context will be used to bind kvm
> > memslots with the context ranges.
> >              - invalidate notifier handling
> >                         - kvm_gmem_context will be used to intercept
> > guest_memfd callbacks and
> >                           translate them to the right GPA ranges.
> >              - linking
> >                         - kvm_gmem_context can be linked to different
> > KVM instances.
>
> We can probably look at the code to decide these.
>

Agree.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Sean Christopherson 5 months, 1 week ago
On Wed, Jul 09, 2025, Vishal Annapurve wrote:
> I think we can simplify the role of guest_memfd in line with discussion [1]:

I genuinely don't understand what you're trying to "simplify".  We need to define
an ABI that is flexible and robust, but beyond that most of these guidelines boil
down to "don't write bad code".

> 1) guest_memfd is a memory provider for userspace, KVM, IOMMU.

No, guest_memfd is a memory provider for KVM guests.  That memory *might* be
mapped by userspace and/or into IOMMU page tables out of functional
necessity, but guest_memfd exists solely to serve memory to KVM guests, full stop.

> 3) KVM should ideally associate the lifetime of backing
> pagetables/protection tables/RMP tables with the lifetime of the
> binding of memslots with guest_memfd.

Again, please align your indentation.

>          - Today KVM SNP logic ties RMP table entry lifetimes with how
>            long the folios are mapped in guest_memfd, which I think should be
>            revisited.

Why?  Memslots are ephemeral per-"struct kvm" mappings.  RMP entries and guest_memfd
inodes are tied to the Virtual Machine, not to the "struct kvm" instance.

> Some very early thoughts on how guest_memfd could be laid out for the long term:
> 1) guest_memfd code ideally should be built-in to the kernel.

Why?  How is this at all relevant?  If we need to bake some parts of guest_memfd
into the kernel in order to avoid nasty exports and/or ordering dependencies, then
we can do so.  But that is 100% an implementation detail and in no way a design
goal.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Vishal Annapurve 5 months, 1 week ago
On Wed, Jul 9, 2025 at 8:00 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Jul 09, 2025, Vishal Annapurve wrote:
> > I think we can simplify the role of guest_memfd in line with discussion [1]:
>
> I genuinely don't understand what you're trying to "simplify".  We need to define
> an ABI that is flexible and robust, but beyond that most of these guidelines boil
> down to "don't write bad code".

My goal for bringing this discussion up is to see if we can better
define the role of guest_memfd and how it interacts with other layers,
as I see some scenarios that can be improved like kvm_gmem_populate[1]
where guest_memfd is trying to fault in pages on behalf of KVM.

[1] https://lore.kernel.org/lkml/20250703062641.3247-1-yan.y.zhao@intel.com/

>
> > 1) guest_memfd is a memory provider for userspace, KVM, IOMMU.
>
> No, guest_memfd is a memory provider for KVM guests.  That memory *might* be
> mapped by userspace and/or into IOMMU page tables in order out of functional
> necessity, but guest_memfd exists solely to serve memory to KVM guests, full stop.

I look at it as: guest_memfd should serve memory to KVM guests and to
other users while following some KVM/arch-related guidelines, e.g. for CC
VMs, guest_memfd can handle certain behavior differently.

>
> > 3) KVM should ideally associate the lifetime of backing
> > pagetables/protection tables/RMP tables with the lifetime of the
> > binding of memslots with guest_memfd.
>
> Again, please align your indentation.
>
> >          - Today KVM SNP logic ties RMP table entry lifetimes with how
> >            long the folios are mapped in guest_memfd, which I think should be
> >            revisited.
>
> Why?  Memslots are ephemeral per-"struct kvm" mappings.  RMP entries and guest_memfd
> inodes are tied to the Virtual Machine, not to the "struct kvm" instance.

IIUC, guest_memfd can only be accessed through the window of memslots,
and if there are no memslots, I don't see the reason for memory still
being associated with the "virtual machine". That's likely because I have
yet to completely wrap my head around 'guest_memfd inodes are tied to the
Virtual Machine, not to the "struct kvm" instance'; I need to spend
more time on this one.

>
> > Some very early thoughts on how guest_memfd could be laid out for the long term:
> > 1) guest_memfd code ideally should be built-in to the kernel.
>
> Why?  How is this at all relevant?  If we need to bake some parts of guest_memfd
> into the kernel in order to avoid nasty exports and/or ordering dependencies, then
> we can do so.  But that is 100% an implementation detail and in no way a design
> goal.

I agree, this is implementation detail and we need real code to
discuss this better.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Vishal Annapurve 5 months, 1 week ago
On Wed, Jul 9, 2025 at 6:30 PM Vishal Annapurve <vannapurve@google.com> wrote:
> > > 3) KVM should ideally associate the lifetime of backing
> > > pagetables/protection tables/RMP tables with the lifetime of the
> > > binding of memslots with guest_memfd.
> >
> > Again, please align your indentation.
> >
> > >          - Today KVM SNP logic ties RMP table entry lifetimes with how
> > >            long the folios are mapped in guest_memfd, which I think should be
> > >            revisited.
> >
> > Why?  Memslots are ephemeral per-"struct kvm" mappings.  RMP entries and guest_memfd
> > inodes are tied to the Virtual Machine, not to the "struct kvm" instance.
>
> IIUC guest_memfd can only be accessed through the window of memslots
> and if there are no memslots I don't see the reason for memory still
> being associated with "virtual machine". Likely because I am yet to
> completely wrap my head around 'guest_memfd inodes are tied to the
> Virtual Machine, not to the "struct kvm" instance', I need to spend
> more time on this one.
>

I see the benefits of tying inodes to the virtual machine and
different guest_memfd files to different KVM instances. This allows us
to exercise intra-host migration usecases for TDX/SNP. But I think
this model doesn't allow us to reuse guest_memfd files for SNP VMs
during reboot.

Reboot scenario, assuming reuse of the existing guest_memfd inode for the
next instance:
1) Create a VM
2) Create guest_memfd files that pin the KVM instance
3) Create memslots
4) Start the VM
5) For reboot/shutdown, execute VM-specific termination (e.g.
   KVM_TDX_TERMINATE_VM)
6) If allowed, delete the memslots
7) Create a new VM instance
8) Link the existing guest_memfd files to the new VM, which creates
   new files for the same inode
9) Close the existing guest_memfd files and the existing VM
10) Jump to step 3

The difference between SNP and TDX is that TDX memory ownership is
limited to the duration the pages are mapped in the second-stage
secure EPT tables, whereas SNP/RMP memory ownership lasts beyond
memslots and effectively remains until folios are punched out of the
guest_memfd filemap. IIUC, CCA might follow suit with SNP in this
regard, with the pfns populated in GPT entries.

I don't have a sense of how critical this problem could be, but it would
mean that on every reboot all large memory allocations have to be let go
and reallocated. For 1G support, we will be freeing guest_memfd pages
using a background thread, which may add some delays in being able to
free up the memory in time.

Instead, if we did this:
1) Support creating guest_memfd files for a certain VM type that allows
   KVM to dictate the behavior of the guest_memfd.
2) Tie the lifetime of KVM SNP/TDX memory ownership to the guest_memfd and
   memslot bindings.
    - Each binding will increase a refcount on both the guest_memfd file
      and KVM, so neither can go away while the binding exists.
3) For SNP/CCA, pfns are invalidated from RMP/GPT tables during unbind
   operations, while for TDX, KVM will invalidate secure EPT entries.

This can allow us to decouple the memory lifecycle from the VM lifecycle and
match the behavior of non-confidential VMs, where memory can outlast VMs.
Though this approach will mean a change in the intra-host migration
implementation, as we won't need to differentiate guest_memfd files and
inodes.

That being said, I might be missing something here and I don't have
any data to back the criticality of this usecase for SNP and possibly
CCA VMs.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Vishal Annapurve 5 months, 1 week ago
On Fri, Jul 11, 2025 at 2:18 PM Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Wed, Jul 9, 2025 at 6:30 PM Vishal Annapurve <vannapurve@google.com> wrote:
> > > > 3) KVM should ideally associate the lifetime of backing
> > > > pagetables/protection tables/RMP tables with the lifetime of the
> > > > binding of memslots with guest_memfd.
> > >
> > > Again, please align your indentation.
> > >
> > > >          - Today KVM SNP logic ties RMP table entry lifetimes with how
> > > >            long the folios are mapped in guest_memfd, which I think should be
> > > >            revisited.
> > >
> > > Why?  Memslots are ephemeral per-"struct kvm" mappings.  RMP entries and guest_memfd
> > > inodes are tied to the Virtual Machine, not to the "struct kvm" instance.
> >
> > IIUC guest_memfd can only be accessed through the window of memslots
> > and if there are no memslots I don't see the reason for memory still
> > being associated with "virtual machine". Likely because I am yet to
> > completely wrap my head around 'guest_memfd inodes are tied to the
> > Virtual Machine, not to the "struct kvm" instance', I need to spend
> > more time on this one.
> >
>
> I see the benefits of tying inodes to the virtual machine and
> different guest_memfd files to different KVM instances. This allows us
> to exercise intra-host migration usecases for TDX/SNP. But I think
> this model doesn't allow us to reuse guest_memfd files for SNP VMs
> during reboot.
>
> Reboot scenario assuming reuse of existing guest_memfd inode for the
> next instance:
> 1) Create a VM
> 2) Create guest_memfd files that pin KVM instance
> 3) Create memslots
> 4) Start the VM
> 5) For reboot/shutdown, Execute VM specific Termination (e.g.
> KVM_TDX_TERMINATE_VM)
> 6) if allowed, delete the memslots
> 7) Create a new VM instance
> 8) Link the existing guest_memfd files to the new VM -> which creates
> new files for the same inode.
> 9) Close the existing guest_memfd files and the existing VM
> 10) Jump to step 3
>
> The difference between SNP and TDX is that TDX memory ownership is
> limited to the duration the pages are mapped in the second stage
> secure EPT tables, whereas SNP/RMP memory ownership lasts beyond
> memslots and effectively remains till folios are punched out from
> guest_memfd filemap. IIUC CCA might follow the suite of SNP in this
> regard with the pfns populated in GPT entries.
>
> I don't have a sense of how critical this problem could be, but this
> would mean for every reboot all large memory allocations will have to
> let go and need to be reallocated. For 1G support, we will be freeing
> guest_memfd pages using a background thread which may add some delays
> in being able to free up the memory in time.
>
> Instead if we did this:
> 1) Support creating guest_memfd files for a certain VM type that
> allows KVM to dictate the behavior of the guest_memfd.
> 2) Tie lifetime of KVM SNP/TDX memory ownership with guest_memfd and
> memslot bindings
>     - Each binding will increase a refcount on both guest_memfd file
> and KVM, so both can't go away while the binding exists.

I think if we can ensure that any guest_memfd-initiated interaction
with KVM is only for invalidation, is based on the binding, and happens
under filemap_invalidate_lock, then there is no need to pin KVM on each
binding: binding/unbinding should be protected using
filemap_invalidate_lock, so KVM can't go away during invalidation.
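
As a sketch of that locking idea (illustrative only; in real code this would
be filemap_invalidate_lock() on the guest_memfd inode's mapping rather than a
plain mutex):

    /* Illustrative sketch of the locking scheme -- not real guest_memfd code.
     * A plain mutex stands in for filemap_invalidate_lock() here. */
    #include <pthread.h>
    #include <stddef.h>

    struct kvm;                              /* opaque here */

    struct gmem_binding {
            struct kvm *kvm;                 /* NULL once unbound */
            unsigned long long base_offset;
            unsigned long long nr_pages;
    };

    struct gmem_inode {
            pthread_mutex_t invalidate_lock; /* stand-in for filemap_invalidate_lock */
            struct gmem_binding binding;
    };

    /* Unbind takes the same lock as invalidation, so an in-flight invalidation
     * either sees the binding (and a live struct kvm) or doesn't see it at
     * all; no refcount on struct kvm is needed for this path. */
    static void gmem_unbind(struct gmem_inode *gmem)
    {
            pthread_mutex_lock(&gmem->invalidate_lock);
            gmem->binding.kvm = NULL;
            pthread_mutex_unlock(&gmem->invalidate_lock);
    }

    static void gmem_invalidate(struct gmem_inode *gmem,
                                unsigned long long offset,
                                unsigned long long size)
    {
            pthread_mutex_lock(&gmem->invalidate_lock);
            if (gmem->binding.kvm) {
                    /* ... notify KVM to zap mappings for [offset, offset + size) ... */
            }
            pthread_mutex_unlock(&gmem->invalidate_lock);
    }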



> 3) For SNP/CCA, pfns are invalidated from RMP/GPT tables during unbind
> operations while for TDX, KVM will invalidate secure EPT entries.
>
> This can allow us to decouple memory lifecycle from VM lifecycle and
> match the behavior with non-confidential VMs where memory can outlast
> VMs. Though this approach will mean change in intrahost migration
> implementation as we don't need to differentiate guest_memfd files and
> inodes.
>
> That being said, I might be missing something here and I don't have
> any data to back the criticality of this usecase for SNP and possibly
> CCA VMs.
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Sean Christopherson 5 months, 1 week ago
On Wed, Jul 09, 2025, Vishal Annapurve wrote:
> On Wed, Jul 9, 2025 at 8:00 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Jul 09, 2025, Vishal Annapurve wrote:
> > > I think we can simplify the role of guest_memfd in line with discussion [1]:
> >
> > I genuinely don't understand what you're trying to "simplify".  We need to define
> > an ABI that is flexible and robust, but beyond that most of these guidelines boil
> > down to "don't write bad code".
> 
> My goal for bringing this discussion up is to see if we can better
> define the role of guest_memfd and how it interacts with other layers,
> as I see some scenarios that can be improved like kvm_gmem_populate[1]
> where guest_memfd is trying to fault in pages on behalf of KVM.

Ah, gotcha.  From my perspective, it's all just KVM, which is why I'm not feeling
the same sense of urgency to formally define anything.  We want to encapsulate
code, have separation of concerns, etc., but I don't see that as being anything
unique or special to guest_memfd.  We try to achieve the same for all major areas
of KVM, though obviously with mixed results :-)
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Edgecombe, Rick P 5 months, 1 week ago
On Tue, 2025-07-08 at 11:55 -0700, Sean Christopherson wrote:
> I think the answer is that we want to let guest_memfd live beyond the "struct kvm"
> instance, but not beyond the Virtual Machine.  From a past discussion on this topic[*].
> 
> 
[snip]
> Exactly what that will look like in code is TBD, but the concept/logic holds up.
> 
> [*] https://lore.kernel.org/all/ZOO782YGRY0YMuPu@google.com

Thanks for digging this up. Makes sense. One gmemfd per VM, but 
struct kvm != a VM.


Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Xiaoyao Li 6 months ago
On 6/19/2025 4:59 PM, Xiaoyao Li wrote:
> On 6/19/2025 4:13 PM, Yan Zhao wrote:
>> On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
>>> Hello,
>>>
>>> This patchset builds upon discussion at LPC 2024 and many guest_memfd
>>> upstream calls to provide 1G page support for guest_memfd by taking
>>> pages from HugeTLB.
>>>
>>> This patchset is based on Linux v6.15-rc6, and requires the mmap support
>>> for guest_memfd patchset (Thanks Fuad!) [1].
>>>
>>> For ease of testing, this series is also available, stitched together, at
>>> https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
>> Just to record a found issue -- not one that must be fixed.
>>
>> In TDX, the initial memory region is added as private memory during TD's
>> build time, with its initial content copied from source pages in shared
>> memory. The copy operation requires simultaneous access to both shared
>> source memory and private target memory.
>>
>> Therefore, userspace cannot store the initial content in shared memory at
>> the mmap-ed VA of a guest_memfd that performs in-place conversion between
>> shared and private memory. This is because the guest_memfd will first
>> unmap a PFN in shared page tables and then check for any extra refcount
>> held for the shared PFN before converting it to private.
> 
> I have an idea.
> 
> If I understand correctly, KVM_GMEM_CONVERT_PRIVATE of the in-place
> conversion unmaps the PFN in shared page tables while keeping the content
> of the page unchanged, right?
> 
> So KVM_GMEM_CONVERT_PRIVATE can actually be used to initialize the private
> memory for the non-CoCo case as well: userspace first mmap()s it, ensures
> it's shared, writes the initial content to it, and afterwards converts it
> to private with KVM_GMEM_CONVERT_PRIVATE.
> 
> For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it 
> wants the private memory to be initialized with initial content, and 
> just do in-place TDH.PAGE.ADD in the hook.

And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to 
explicitly request that the page range is converted to private and the 
content needs to be retained. So that TDX can identify which case needs 
to call in-place TDH.PAGE.ADD.
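
To illustrate the proposed flow from userspace, here is a rough sketch. It
assumes the conversion is an ioctl on the guest_memfd fd; the argument
struct, the RETAIN_CONTENT flag and the helper below are placeholders for
the suggestion above, not this series' actual UAPI.

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/* Placeholder UAPI: the real request number, struct layout and flag name
 * would come from the series' headers. */
struct gmem_convert {
	uint64_t offset;
	uint64_t size;
	uint64_t flags;
};
#define GMEM_CONVERT_RETAIN_CONTENT	(1ULL << 0)

/* Fill a guest_memfd range while it is still shared, then convert it in
 * place so that TDX can do the in-place TDH.PAGE.ADD on the already
 * resident content. */
static int gmem_load_initial_image(int gmem_fd, unsigned long convert_ioctl,
				   const void *image, size_t len)
{
	struct gmem_convert conv = {
		.offset = 0,
		.size = len,
		.flags = GMEM_CONVERT_RETAIN_CONTENT,
	};
	void *va;

	va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, gmem_fd, 0);
	if (va == MAP_FAILED)
		return -1;
	memcpy(va, image, len);
	/* Drop the shared mapping so no stray refcounts block conversion. */
	munmap(va, len);

	/* convert_ioctl would be KVM_GMEM_CONVERT_PRIVATE. */
	return ioctl(gmem_fd, convert_ioctl, &conv);
}

With a flag like this, the TDX hook can tell the initial-content case apart
from a plain conversion and do the in-place TDH.PAGE.ADD only for the former.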
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Yan Zhao 6 months ago
On Thu, Jun 19, 2025 at 05:18:44PM +0800, Xiaoyao Li wrote:
> On 6/19/2025 4:59 PM, Xiaoyao Li wrote:
> > On 6/19/2025 4:13 PM, Yan Zhao wrote:
> > > On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
> > > > <snip>
> > > Just to record a found issue -- not one that must be fixed.
> > > 
> > > In TDX, the initial memory region is added as private memory during
> > > TD's build
> > > time, with its initial content copied from source pages in shared memory.
> > > The copy operation requires simultaneous access to both shared
> > > source memory
> > > and private target memory.
> > > 
> > > Therefore, userspace cannot store the initial content in shared
> > > memory at the
> > > mmap-ed VA of a guest_memfd that performs in-place conversion
> > > between shared and
> > > private memory. This is because the guest_memfd will first unmap a
> > > PFN in shared
> > > page tables and then check for any extra refcount held for the
> > > shared PFN before
> > > converting it to private.
> > 
> > I have an idea.
> > 
> > If I understand correctly, KVM_GMEM_CONVERT_PRIVATE of the in-place
> > conversion unmaps the PFN in shared page tables while keeping the content
> > of the page unchanged, right?
However, whenever there's a GUP in TDX to get the source page, there will be an
extra page refcount.

> > So KVM_GMEM_CONVERT_PRIVATE can actually be used to initialize the private
> > memory for the non-CoCo case as well: userspace first mmap()s it, ensures
> > it's shared, writes the initial content to it, and afterwards converts it
> > to private with KVM_GMEM_CONVERT_PRIVATE.
The conversion request here will be declined therefore.
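
For reference, a rough sketch of the kind of refcount gate described above
(illustrative only; the expected-count computation is an assumption, not
this series' code): conversion to private is refused while anything beyond
the page cache, e.g. a GUP pin on the TDH.PAGE.ADD source page, still holds
a reference.

#include <linux/mm.h>

static bool gmem_folio_has_unexpected_refs(struct folio *folio)
{
	/* Assumption: the filemap holds one reference per page in the folio. */
	long expected = folio_nr_pages(folio);

	return folio_ref_count(folio) > expected;
}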


> > For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it
> > wants the private memory to be initialized with initial content, and
> > just do in-place TDH.PAGE.ADD in the hook.
> 
> And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
> explicitly request that the page range is converted to private and the
> content needs to be retained. So that TDX can identify which case needs to
> call in-place TDH.PAGE.ADD.
>
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Xiaoyao Li 6 months ago
On 6/19/2025 5:28 PM, Yan Zhao wrote:
> On Thu, Jun 19, 2025 at 05:18:44PM +0800, Xiaoyao Li wrote:
>> On 6/19/2025 4:59 PM, Xiaoyao Li wrote:
>>> On 6/19/2025 4:13 PM, Yan Zhao wrote:
>>>> <snip>
>>>
>>> I have an idea.
>>>
>>> If I understand correctly, KVM_GMEM_CONVERT_PRIVATE of the in-place
>>> conversion unmaps the PFN in shared page tables while keeping the content
>>> of the page unchanged, right?
> However, whenever there's a GUP in TDX to get the source page, there will be an
> extra page refcount.

The GUP in TDX happens after the gmem converts the page to private.

In the view of TDX, the physical page is converted to private already 
and it contains the initial content. But the content is not usable for 
TDX until TDX calls in-place PAGE.ADD

>>> So KVM_GMEM_CONVERT_PRIVATE can actually be used to initialize the private
>>> memory for the non-CoCo case as well: userspace first mmap()s it, ensures
>>> it's shared, writes the initial content to it, and afterwards converts it
>>> to private with KVM_GMEM_CONVERT_PRIVATE.
> The conversion request here will be declined therefore.
> 
> 
>>> For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it
>>> wants the private memory to be initialized with initial content, and
>>> just do in-place TDH.PAGE.ADD in the hook.
>>
>> And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
>> explicitly request that the page range is converted to private and the
>> content needs to be retained. So that TDX can identify which case needs to
>> call in-place TDH.PAGE.ADD.
>>
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Xiaoyao Li 6 months ago
On 6/19/2025 5:45 PM, Xiaoyao Li wrote:
> On 6/19/2025 5:28 PM, Yan Zhao wrote:
>> On Thu, Jun 19, 2025 at 05:18:44PM +0800, Xiaoyao Li wrote:
>>> On 6/19/2025 4:59 PM, Xiaoyao Li wrote:
>>>> On 6/19/2025 4:13 PM, Yan Zhao wrote:
>>>>> <snip>
>>>>
>>>> I have an idea.
>>>>
>>>> If I understand correctly, KVM_GMEM_CONVERT_PRIVATE of the in-place
>>>> conversion unmaps the PFN in shared page tables while keeping the content
>>>> of the page unchanged, right?
>> However, whenever there's a GUP in TDX to get the source page, there 
>> will be an
>> extra page refcount.
> 
> The GUP in TDX happens after the gmem converts the page to private.

Maybe it's not GUP, since the page has been unmapped from userspace? (Sorry
that I'm not familiar with the terminology.)

> In the view of TDX, the physical page is converted to private already 
> and it contains the initial content. But the content is not usable for 
> TDX until TDX calls in-place PAGE.ADD
> 
>>>> So KVM_GMEM_CONVERT_PRIVATE can actually be used to initialize the private
>>>> memory for the non-CoCo case as well: userspace first mmap()s it, ensures
>>>> it's shared, writes the initial content to it, and afterwards converts it
>>>> to private with KVM_GMEM_CONVERT_PRIVATE.
>> The conversion request here will be declined therefore.
>>
>>
>>>> For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it
>>>> wants the private memory to be initialized with initial content, and
>>>> just do in-place TDH.PAGE.ADD in the hook.
>>>
>>> And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
>>> explicitly request that the page range is converted to private and the
>>> content needs to be retained. So that TDX can identify which case 
>>> needs to
>>> call in-place TDH.PAGE.ADD.
>>>
> 
>
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Ackerley Tng 7 months ago
Ackerley Tng <ackerleytng@google.com> writes:

> <snip>
>
> Here are some remaining issues/TODOs:
>
> 1. Memory error handling such as machine check errors have not been
>    implemented.
> 2. I've not looked into preparedness of pages, only zeroing has been
>    considered.
> 3. When allocating HugeTLB pages, if two threads allocate indices
>    mapping to the same huge page, the utilization in guest_memfd inode's
>    subpool may momentarily go over the subpool limit (the requested size
>    of the inode at guest_memfd creation time), causing one of the two
>    threads to get -ENOMEM. Suggestions to solve this are appreciated!
> 4. max_usage_in_bytes statistic (cgroups v1) for guest_memfd HugeTLB
>    pages should be correct but needs testing and could be wrong.
> 5. memcg charging (charge_memcg()) for cgroups v2 for guest_memfd
>    HugeTLB pages after splitting should be correct but needs testing and
>    could be wrong.
> 6. Page cache accounting: When a hugetlb page is split, guest_memfd will
>    incur page count in both NR_HUGETLB (counted at hugetlb allocation
>    time) and NR_FILE_PAGES stats (counted when split pages are added to
>    the filemap). Is this aligned with what people expect?
>

For people who might be testing this series with non-Coco VMs (heads up,
Patrick and Nikita!), this currently splits the folio as long as the
shareability of any part of the huge folio is shared, which is probably
unnecessary?

IIUC core-mm doesn't support mapping at 1G, but from a cursory reading it
seems like the fault path calling kvm_gmem_fault_shared() could possibly
map a 1G page at 4K granularity.

Looks like we might need another flag like
GUEST_MEMFD_FLAG_SUPPORT_CONVERSION, which will gate initialization of
the shareability maple tree/xarray.

If shareability is NULL for the entire hugepage range, then no splitting
will occur.

For Coco VMs, this should be safe: if this flag is not set,
kvm_gmem_fault_shared() will never be able to fault anything in (the
shareability value will be NULL).
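
A minimal sketch of that gating idea, assuming a per-inode shareability
maple tree that only exists when GUEST_MEMFD_FLAG_SUPPORT_CONVERSION was
requested at guest_memfd creation. All structure, helper and value names
below, and the flag's value, are illustrative rather than this series' code.

#include <linux/bits.h>
#include <linux/errno.h>
#include <linux/maple_tree.h>
#include <linux/slab.h>
#include <linux/xarray.h>

#define GUEST_MEMFD_FLAG_SUPPORT_CONVERSION	BIT(1)	/* placeholder value */

struct gmem_priv {				/* hypothetical per-inode state */
	struct maple_tree *shareability;	/* NULL: conversion unsupported */
};

static int gmem_init_shareability(struct gmem_priv *priv, u64 flags,
				  pgoff_t nr_pages)
{
	if (!(flags & GUEST_MEMFD_FLAG_SUPPORT_CONVERSION))
		return 0;	/* tree stays NULL: never mappable, never split */

	priv->shareability = kzalloc(sizeof(*priv->shareability), GFP_KERNEL);
	if (!priv->shareability)
		return -ENOMEM;

	mt_init(priv->shareability);
	/* A single range entry covers the whole inode; default to guest-only. */
	return mtree_store_range(priv->shareability, 0, nr_pages - 1,
				 xa_mk_value(0 /* GUEST_ONLY */), GFP_KERNEL);
}

/*
 * kvm_gmem_fault_shared() always fails, and no folio splitting is ever
 * needed, when the tree was never initialized.
 */
static bool gmem_index_host_mappable(struct gmem_priv *priv, pgoff_t index)
{
	return priv->shareability &&
	       mtree_load(priv->shareability, index) == xa_mk_value(1 /* ALL */);
}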

> Here are some optimizations that could be explored in future series:
>
> 1. Pages could be split from 1G to 2M first and only split to 4K if
>    necessary.
> 2. Zeroing could be skipped for Coco VMs if hardware already zeroes the
>    pages.
>
> <snip>
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Ira Weiny 7 months ago
Ackerley Tng wrote:
> Hello,
> 
> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> upstream calls to provide 1G page support for guest_memfd by taking
> pages from HugeTLB.
> 
> This patchset is based on Linux v6.15-rc6, and requires the mmap support
> for guest_memfd patchset (Thanks Fuad!) [1].

Trying to manage dependencies, I find that Ryan's just-released series [1]
is required to build this set.

[1] https://lore.kernel.org/all/cover.1747368092.git.afranji@google.com/

Specifically this patch:
	https://lore.kernel.org/all/1f42c32fc18d973b8ec97c8be8b7cd921912d42a.1747368092.git.afranji@google.com/

	defines

	alloc_anon_secure_inode()

Am I wrong in that?

> 
> For ease of testing, this series is also available, stitched together,
> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> 

I went digging in your git tree and then found Ryan's set.  So thanks for
the git tree.  :-D

However, it seems this adds another dependency, which should be managed in
David's email of dependencies?

Ira

> This patchset can be divided into two sections:
> 
> (a) Patches from the beginning up to and including "KVM: selftests:
>     Update script to map shared memory from guest_memfd" are a modified
>     version of "conversion support for guest_memfd", which Fuad is
>     managing [2].
> 
> (b) Patches after "KVM: selftests: Update script to map shared memory
>     from guest_memfd" till the end are patches that actually bring in 1G
>     page support for guest_memfd.
> 
> These are the significant differences between (a) and [2]:
> 
> + [2] uses an xarray to track sharability, but I used a maple tree
>   because for 1G pages, iterating pagewise to update shareability was
>   prohibitively slow even for testing. I was choosing from among
>   multi-index xarrays, interval trees and maple trees [3], and picked
>   maple trees because
>     + Maple trees were easier to figure out since I didn't have to
>       compute the correct multi-index order and handle edge cases if the
>       converted range wasn't a neat power of 2.
>     + Maple trees were easier to figure out as compared to updating
>       parts of a multi-index xarray.
>     + Maple trees had an easier API to use than interval trees.
> + [2] doesn't yet have a conversion ioctl, but I needed it to test 1G
>   support end-to-end.
> + (a) Removes guest_memfd from participating in LRU, which I needed, to
>   get conversion selftests to work as expected, since participation in
>   LRU was causing some unexpected refcounts on folios which was blocking
>   conversions.
> 
> I am sending (a) in emails as well, as opposed to just leaving it on
> GitHub, so that we can discuss by commenting inline on emails. If you'd
> like to just look at 1G page support, here are some key takeaways from
> the first section (a):
> 
> + If GUEST_MEMFD_FLAG_SUPPORT_SHARED is requested during guest_memfd
>   creation, guest_memfd will
>     + Track shareability (whether an index in the inode is guest-only or
>       if the host is allowed to fault memory at a given index).
>     + Always be used for guest faults - specifically, kvm_gmem_get_pfn()
>       will be used to provide pages for the guest.
>     + Always be used by KVM to check private/shared status of a gfn.
> + guest_memfd now has conversion ioctls, allowing conversion to
>   private/shared
>     + Conversion can fail if there are unexpected refcounts on any
>       folios in the range.
> 
> Focusing on (b) 1G page support, here's an overview:
> 
> 1. A bunch of refactoring patches for HugeTLB that isolates the
>    allocation of a HugeTLB folio from other HugeTLB concepts such as
>    VMA-level reservations, and HugeTLBfs-specific concepts, such as
>    where memory policy is stored in the VMA, or where the subpool is
>    stored on the inode.
> 2. A few patches that add a guestmem_hugetlb allocator within mm/. The
>    guestmem_hugetlb allocator is a wrapper around HugeTLB to modularize
>    the memory management functions, and to cleanly handle cleanup, so
>    that folio cleanup can happen after the guest_memfd inode (and even
>    KVM) goes away.
> 3. Some updates to guest_memfd to use the guestmem_hugetlb allocator.
> 4. Selftests for 1G page support.
> 
> Here are some remaining issues/TODOs:
> 
> 1. Memory error handling such as machine check errors have not been
>    implemented.
> 2. I've not looked into preparedness of pages, only zeroing has been
>    considered.
> 3. When allocating HugeTLB pages, if two threads allocate indices
>    mapping to the same huge page, the utilization in guest_memfd inode's
>    subpool may momentarily go over the subpool limit (the requested size
>    of the inode at guest_memfd creation time), causing one of the two
>    threads to get -ENOMEM. Suggestions to solve this are appreciated!
> 4. max_usage_in_bytes statistic (cgroups v1) for guest_memfd HugeTLB
>    pages should be correct but needs testing and could be wrong.
> 5. memcg charging (charge_memcg()) for cgroups v2 for guest_memfd
>    HugeTLB pages after splitting should be correct but needs testing and
>    could be wrong.
> 6. Page cache accounting: When a hugetlb page is split, guest_memfd will
>    incur page count in both NR_HUGETLB (counted at hugetlb allocation
>    time) and NR_FILE_PAGES stats (counted when split pages are added to
>    the filemap). Is this aligned with what people expect?
> 
> Here are some optimizations that could be explored in future series:
> 
> 1. Pages could be split from 1G to 2M first and only split to 4K if
>    necessary.
> 2. Zeroing could be skipped for Coco VMs if hardware already zeroes the
>    pages.
> 
> Here's RFC v1 [4] if you're interested in the motivation behind choosing
> HugeTLB, or the history of this patch series.
> 
> [1] https://lore.kernel.org/all/20250513163438.3942405-11-tabba@google.com/T/
> [2] https://lore.kernel.org/all/20250328153133.3504118-1-tabba@google.com/T/
> [3] https://lore.kernel.org/all/diqzzfih8q7r.fsf@ackerleytng-ctop.c.googlers.com/
> [4] https://lore.kernel.org/all/cover.1726009989.git.ackerleytng@google.com/T/
>
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Ira Weiny 7 months ago
Ira Weiny wrote:
> Ackerley Tng wrote:
> > Hello,
> > 
> > This patchset builds upon discussion at LPC 2024 and many guest_memfd
> > upstream calls to provide 1G page support for guest_memfd by taking
> > pages from HugeTLB.
> > 
> > This patchset is based on Linux v6.15-rc6, and requires the mmap support
> > for guest_memfd patchset (Thanks Fuad!) [1].
> 
> Trying to manage dependencies I find that Ryan's just released series[1]
> is required to build this set.
> 
> [1] https://lore.kernel.org/all/cover.1747368092.git.afranji@google.com/
> 
> Specifically this patch:
> 	https://lore.kernel.org/all/1f42c32fc18d973b8ec97c8be8b7cd921912d42a.1747368092.git.afranji@google.com/
> 
> 	defines
> 
> 	alloc_anon_secure_inode()

Perhaps Ryan's set is not required?  Just that patch?

It looks like Ryan's 2/13 is the same as your 1/51 patch?

https://lore.kernel.org/all/754b4898c3362050071f6dd09deb24f3c92a41c3.1747368092.git.afranji@google.com/

I'll pull 1/13 and see where I get.

Ira

> 
> Am I wrong in that?
> 
> > 
> > For ease of testing, this series is also available, stitched together,
> > at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> > 
> 
> I went digging in your git tree and then found Ryan's set.  So thanks for
> the git tree.  :-D
> 
> However, it seems this adds another dependency, which should be managed in
> David's email of dependencies?
> 
> Ira
>
Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Posted by Ackerley Tng 7 months ago
Ira Weiny <ira.weiny@intel.com> writes:

> Ira Weiny wrote:
>> Ackerley Tng wrote:
>> > Hello,
>> > 
>> > This patchset builds upon discussion at LPC 2024 and many guest_memfd
>> > upstream calls to provide 1G page support for guest_memfd by taking
>> > pages from HugeTLB.
>> > 
>> > This patchset is based on Linux v6.15-rc6, and requires the mmap support
>> > for guest_memfd patchset (Thanks Fuad!) [1].
>> 
>> Trying to manage dependencies I find that Ryan's just released series[1]
>> is required to build this set.
>> 
>> [1] https://lore.kernel.org/all/cover.1747368092.git.afranji@google.com/
>> 
>> Specifically this patch:
>> 	https://lore.kernel.org/all/1f42c32fc18d973b8ec97c8be8b7cd921912d42a.1747368092.git.afranji@google.com/
>> 
>> 	defines
>> 
>> 	alloc_anon_secure_inode()
>
> Perhaps Ryan's set is not required?  Just that patch?
>
> It looks like Ryan's 2/13 is the same as your 1/51 patch?
>
> https://lore.kernel.org/all/754b4898c3362050071f6dd09deb24f3c92a41c3.1747368092.git.afranji@google.com/
>
> I'll pull 1/13 and see where I get.
>
> Ira
>
>> 
>> Am I wrong in that?
>>

My bad, this patch was missing from this series:

From bd629d1ec6ffb7091a5f996dc7835abed8467f3e Mon Sep 17 00:00:00 2001
Message-ID: <bd629d1ec6ffb7091a5f996dc7835abed8467f3e.1747426836.git.ackerleytng@google.com>
From: Ackerley Tng <ackerleytng@google.com>
Date: Wed, 7 May 2025 07:59:28 -0700
Subject: [RFC PATCH v2 1/1] fs: Refactor to provide function that allocates a
 secure anonymous inode

alloc_anon_secure_inode() returns an inode after running checks in
security_inode_init_security_anon().

Also refactor secretmem's file creation process to use the new
function.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Change-Id: I4eb8622775bc3d544ec695f453ffd747d9490e40
---
 fs/anon_inodes.c   | 22 ++++++++++++++++------
 include/linux/fs.h |  1 +
 mm/secretmem.c     |  9 +--------
 3 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 583ac81669c2..4c3110378647 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -55,17 +55,20 @@ static struct file_system_type anon_inode_fs_type = {
 	.kill_sb	= kill_anon_super,
 };
 
-static struct inode *anon_inode_make_secure_inode(
-	const char *name,
-	const struct inode *context_inode)
+static struct inode *anon_inode_make_secure_inode(struct super_block *s,
+		const char *name, const struct inode *context_inode,
+		bool fs_internal)
 {
 	struct inode *inode;
 	int error;
 
-	inode = alloc_anon_inode(anon_inode_mnt->mnt_sb);
+	inode = alloc_anon_inode(s);
 	if (IS_ERR(inode))
 		return inode;
-	inode->i_flags &= ~S_PRIVATE;
+
+	if (!fs_internal)
+		inode->i_flags &= ~S_PRIVATE;
+
 	error =	security_inode_init_security_anon(inode, &QSTR(name),
 						  context_inode);
 	if (error) {
@@ -75,6 +78,12 @@ static struct inode *anon_inode_make_secure_inode(
 	return inode;
 }
 
+struct inode *alloc_anon_secure_inode(struct super_block *s, const char *name)
+{
+	return anon_inode_make_secure_inode(s, name, NULL, true);
+}
+EXPORT_SYMBOL_GPL(alloc_anon_secure_inode);
+
 static struct file *__anon_inode_getfile(const char *name,
 					 const struct file_operations *fops,
 					 void *priv, int flags,
@@ -88,7 +97,8 @@ static struct file *__anon_inode_getfile(const char *name,
 		return ERR_PTR(-ENOENT);
 
 	if (make_inode) {
-		inode =	anon_inode_make_secure_inode(name, context_inode);
+		inode = anon_inode_make_secure_inode(anon_inode_mnt->mnt_sb,
+						     name, context_inode, false);
 		if (IS_ERR(inode)) {
 			file = ERR_CAST(inode);
 			goto err;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 016b0fe1536e..0fded2e3c661 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3550,6 +3550,7 @@ extern int simple_write_begin(struct file *file, struct address_space *mapping,
 extern const struct address_space_operations ram_aops;
 extern int always_delete_dentry(const struct dentry *);
 extern struct inode *alloc_anon_inode(struct super_block *);
+extern struct inode *alloc_anon_secure_inode(struct super_block *, const char *);
 extern int simple_nosetlease(struct file *, int, struct file_lease **, void **);
 extern const struct dentry_operations simple_dentry_operations;
 
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 1b0a214ee558..c0e459e58cb6 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -195,18 +195,11 @@ static struct file *secretmem_file_create(unsigned long flags)
 	struct file *file;
 	struct inode *inode;
 	const char *anon_name = "[secretmem]";
-	int err;
 
-	inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
+	inode = alloc_anon_secure_inode(secretmem_mnt->mnt_sb, anon_name);
 	if (IS_ERR(inode))
 		return ERR_CAST(inode);
 
-	err = security_inode_init_security_anon(inode, &QSTR(anon_name), NULL);
-	if (err) {
-		file = ERR_PTR(err);
-		goto err_free_inode;
-	}
-
 	file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
 				 O_RDWR, &secretmem_fops);
 	if (IS_ERR(file))
-- 
2.49.0.1101.gccaa498523-goog
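
For context, a caller with its own fs-internal mount could then allocate its
inode the same way secretmem does above. The guest_memfd-side sketch below
(the mount variable and the anon inode name) is illustrative, not part of
the patch.

#include <linux/fs.h>
#include <linux/mount.h>

static struct inode *gmem_alloc_inode(struct vfsmount *gmem_mnt)
{
	/* Runs the LSM anon-inode checks; S_PRIVATE stays set because the
	 * mount is fs-internal. */
	return alloc_anon_secure_inode(gmem_mnt->mnt_sb, "[kvm-gmem]");
}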

>> > 
>> > For ease of testing, this series is also available, stitched together,
>> > at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
>> > 
>> 
>> I went digging in your git tree and then found Ryan's set.  So thanks for
>> the git tree.  :-D

Glad that helped!

>> 
>> However, it seems this adds another dependency, which should be managed in
>> David's email of dependencies?

This is a good idea. David, do you think these two patches should be
managed as a separate patch series in the email of dependencies?

+ (left out of RFC v2, but included above) "fs: Refactor to provide function that allocates a secure anonymous inode"
+ 01/51 "KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes"

They're being used by a few patch series now.

>> 
>> Ira
>>