[RFC PATCH 00/39] 1G page support for guest_memfd

Posted by Ackerley Tng 1 year, 4 months ago
Hello,

This patchset is our exploration of how to support 1G pages in guest_memfd, and
how the pages will be used in Confidential VMs.

The patchset covers:

+ How to get 1G pages
+ Allowing mmap() of guest_memfd to userspace so that both private and shared
  memory can use the same physical pages
+ Splitting and reconstructing pages to support conversions and mmap()
+ How the VM, userspace and guest_memfd interact to support conversions
+ Selftests to test all the above
    + Selftests also demonstrate the conversion flow between VM, userspace and
      guest_memfd.

Why 1G pages in guest_memfd?

To bring guest_memfd to parity with HugeTLBfs-backed VMs in both performance
and memory savings.

+ Performance improves with 1G pages through more TLB hits and faster page
  walks on TLB misses.
+ Memory savings from 1G pages come from HugeTLB Vmemmap Optimization (HVO).
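
As a rough illustration of the HVO savings mentioned above, the arithmetic can
be sketched in C. This is only a back-of-the-envelope sketch under stated
assumptions: 4K base pages, a 64-byte struct page, and HVO retaining a single
vmemmap page per huge page. The helper names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define BASE_PAGE_SIZE   4096ULL /* assumption: 4K base pages */
#define STRUCT_PAGE_SIZE 64ULL   /* assumption: sizeof(struct page) == 64 */
#define HVO_PAGES_KEPT   1ULL    /* assumption: one vmemmap page kept per huge page */

/* Number of base pages of vmemmap needed to describe one huge page. */
static uint64_t vmemmap_pages(uint64_t huge_page_size)
{
	uint64_t nr_struct_pages = huge_page_size / BASE_PAGE_SIZE;

	return nr_struct_pages * STRUCT_PAGE_SIZE / BASE_PAGE_SIZE;
}

/* Bytes of vmemmap that could be freed for one huge page under the assumptions. */
static uint64_t hvo_bytes_saved(uint64_t huge_page_size)
{
	uint64_t pages = vmemmap_pages(huge_page_size);

	assert(pages >= HVO_PAGES_KEPT);
	return (pages - HVO_PAGES_KEPT) * BASE_PAGE_SIZE;
}
```

Under these assumptions a 1G page needs 4096 vmemmap pages (16M) and HVO can
free 4095 of them, versus 7 of 8 for a 2M page.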

Options for 1G page support:

1. HugeTLB
2. Contiguous Memory Allocator (CMA)
3. Other suggestions are welcome!

Comparison between options:

1. HugeTLB
    + Refactor HugeTLB to separate allocator from the rest of HugeTLB
    + Pro: Graceful transition for VMs backed with HugeTLB to guest_memfd
        + Near term: Allows co-tenancy of HugeTLB and guest_memfd backed VMs
    + Pro: Can provide iterative steps toward new future allocator
        + Unexplored: Managing userspace-visible changes
            + e.g. HugeTLB's free_hugepages will decrease if HugeTLB is used,
              but not when future allocator is used
2. CMA
    + Port some HugeTLB features to be applied on CMA
    + Pro: Clean slate

What would refactoring HugeTLB involve?

(Some refactoring was done in this RFC, more can be done.)

1. Broadly involves separating the HugeTLB allocator from the rest of HugeTLB
    + Brings more modularity to HugeTLB
    + No functionality change intended
    + Likely step towards HugeTLB's integration into core-mm
2. guest_memfd will use just the allocator component of HugeTLB, not including
   the complex parts of HugeTLB like
    + Userspace reservations (resv_map)
    + Shared PMD mappings
    + Special page walkers

What features would need to be ported to CMA?

+ Improved allocation guarantees
    + Per NUMA node pool of huge pages
    + Subpools per guest_memfd
+ Memory savings
    + Something like HugeTLB Vmemmap Optimization
+ Configuration/reporting features
    + Configuration of number of pages available (and per NUMA node) at and
      after host boot
    + Reporting of memory usage/availability statistics at runtime

HugeTLB was picked as the source of 1G pages for this RFC because it allows a
graceful transition, and retains memory savings from HVO.

To illustrate the alternative: if 1G pages came from CMA instead, and a host
machine that uses HugeTLBfs to back VMs had to schedule a confidential VM, some
HugeTLBfs pages would have to be given up and returned to CMA so that
guest_memfd pages could be rebuilt from that memory. HVO would have to be
removed from those pages and reapplied on the new guest_memfd memory, which
requires reserving memory on the host to facilitate the transition. This not
only slows down memory allocation but also trims the benefits of HVO.

Improving how guest_memfd uses the allocator in a future revision of this RFC:

To provide an easier transition away from HugeTLB, guest_memfd's use of HugeTLB
should be limited to these allocator functions:

+ reserve(node, page_size, num_pages) => opaque handle
    + Used when a guest_memfd inode is created to reserve memory from backend
      allocator
+ allocate(handle, mempolicy, page_size) => folio
    + To allocate a folio from guest_memfd's reservation
+ split(handle, folio, target_page_size) => void
    + To take a huge folio, and split it to smaller folios, restore to filemap
+ reconstruct(handle, first_folio, nr_pages) => void
    + To take a folio, and reconstruct a huge folio out of nr_pages from the
      first_folio
+ free(handle, folio) => void
    + To return folio to guest_memfd's reservation
+ error(handle, folio) => void
    + To handle memory errors
+ unreserve(handle) => void
    + To return guest_memfd's reservation to allocator backend
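
The list above could be expressed as an ops table along the following lines.
This is a userspace mock for discussion only: struct gmem_allocator_ops, the
toy_* backend and the toy struct folio definition are all hypothetical and do
not exist in the kernel or in this series. Only reserve/allocate/free/unreserve
are filled in; split, reconstruct and error are left unimplemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct mempolicy;             /* opaque stand-in for the kernel type */
struct folio { size_t size; }; /* toy definition, not the kernel's struct folio */

/* Hypothetical ops table mirroring the proposed allocator functions. */
struct gmem_allocator_ops {
	void *(*reserve)(int node, size_t page_size, size_t num_pages);
	struct folio *(*allocate)(void *handle, struct mempolicy *mpol,
				  size_t page_size);
	void (*split)(void *handle, struct folio *folio, size_t target_page_size);
	void (*reconstruct)(void *handle, struct folio *first_folio,
			    size_t nr_pages);
	void (*free)(void *handle, struct folio *folio);
	void (*error)(void *handle, struct folio *folio);
	void (*unreserve)(void *handle);
};

/* Toy backend: the opaque handle is just a counter of reserved pages. */
struct toy_reservation {
	size_t page_size;
	size_t free_pages;
};

static void *toy_reserve(int node, size_t page_size, size_t num_pages)
{
	struct toy_reservation *r = malloc(sizeof(*r));

	(void)node; /* the toy backend ignores NUMA placement */
	if (!r)
		return NULL;
	r->page_size = page_size;
	r->free_pages = num_pages;
	return r;
}

static struct folio *toy_allocate(void *handle, struct mempolicy *mpol,
				  size_t page_size)
{
	struct toy_reservation *r = handle;
	struct folio *folio;

	(void)mpol;
	/* Allocation can only be satisfied from what was reserved up front. */
	if (page_size != r->page_size || r->free_pages == 0)
		return NULL;
	folio = malloc(sizeof(*folio));
	if (!folio)
		return NULL;
	folio->size = page_size;
	r->free_pages--;
	return folio;
}

static void toy_free(void *handle, struct folio *folio)
{
	struct toy_reservation *r = handle;

	free(folio);
	r->free_pages++; /* the page goes back to the reservation */
}

static void toy_unreserve(void *handle)
{
	free(handle);
}

static const struct gmem_allocator_ops toy_ops = {
	.reserve = toy_reserve,
	.allocate = toy_allocate,
	.free = toy_free,
	.unreserve = toy_unreserve,
};
```

The point of the opaque handle is that guest_memfd would never look inside the
reservation, so the backend could later be swapped without touching callers.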

Userspace should only provide a page size when creating a guest_memfd and should
not have to specify HugeTLB.

Overview of patches:

+ Patches 01-12
    + Many small changes to HugeTLB, mostly to separate HugeTLBfs concepts from
      HugeTLB, and to expose HugeTLB functions.
+ Patches 13-16
    + Letting guest_memfd use HugeTLB
    + Creation of each guest_memfd reserves pages from HugeTLB's global hstate
      and puts them into the guest_memfd inode's subpool
    + Each folio allocation takes a page from the guest_memfd inode's subpool
+ Patches 17-21
    + Selftests for new HugeTLB features in guest_memfd
+ Patches 22-24
    + More small changes on the HugeTLB side to expose functions needed by
      guest_memfd
+ Patch 25
    + Uses the newly available functions from patches 22-24 to split HugeTLB
      pages. In this patch, HugeTLB folios are always split to 4K before any
      usage, private or shared.
+ Patches 26-28
    + Allow mmap() in guest_memfd and faulting in shared pages
+ Patch 29
    + Enables conversion between private/shared pages
+ Patch 30
    + Required to zero folios after conversions to avoid leaking initialized
      kernel memory
+ Patches 31-38
    + Add selftests to test mapping pages to userspace, guest/host memory
      sharing and update conversions tests
    + Patch 33 illustrates the conversion flow between VM/userspace/guest_memfd
+ Patch 39
    + Dynamically split and reconstruct HugeTLB pages instead of always
      splitting before use. All earlier selftests are expected to still pass.

TODOs:

+ Add logic to wait for safe_refcount [1]
+ Look into lazy splitting/reconstruction of pages
    + Currently, when the KVM_SET_MEMORY_ATTRIBUTES ioctl is invoked, not only
      are mem_attr_array and faultability updated, but the pages in the
      requested range are also split/reconstructed as necessary. We want to
      look into delaying splitting/reconstruction to fault time.
+ Solve race between folios being faulted in and being truncated
    + When running private_mem_conversions_test with more than 1 vCPU, a folio
      getting truncated may get faulted in by another process, causing elevated
      mapcounts when the folio is freed (VM_BUG_ON_FOLIO).
+ Add intermediate splits (1G should first split to 2M and not split directly to
  4K)
+ Use guest's lock instead of hugetlb_lock
+ Use multi-index xarray/replace xarray with some other data struct for
  faultability flag
+ Refactor HugeTLB better, present generic allocator interface

Please let us know your thoughts on:

+ HugeTLB as the choice of transitional allocator backend
+ Refactoring HugeTLB to provide generic allocator interface
+ Shared/private conversion flow
    + Requiring user to request kernel to unmap pages from userspace using
      madvise(MADV_DONTNEED)
    + Failing conversion on elevated mapcounts/pincounts/refcounts
+ Process of splitting/reconstructing page
+ Anything else!
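
For reference, the shared-to-private direction of the conversion flow above
might look roughly like this from userspace. It is a sketch under assumptions,
not code from the selftests: convert_to_private() is a hypothetical helper, and
hva is assumed to be an mmap() of the guest_memfd range that backs gpa.
KVM_SET_MEMORY_ATTRIBUTES and struct kvm_memory_attributes are the existing KVM
uAPI; the #else branch only covers uapi headers that predate that ioctl.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/kvm.h>

/*
 * Hypothetical helper: ask the kernel to drop the userspace mappings of the
 * range first, then ask KVM to mark the range private. Returns 0 on success,
 * -1 with errno set on failure.
 */
static int convert_to_private(int vm_fd, void *hva, uint64_t gpa, uint64_t size)
{
	/* Step 1: unmap the pages from userspace page tables. */
	if (madvise(hva, size, MADV_DONTNEED) != 0)
		return -1;

#ifdef KVM_SET_MEMORY_ATTRIBUTES
	/* Step 2: request the conversion on the VM fd. */
	struct kvm_memory_attributes attrs = {
		.address = gpa,
		.size = size,
		.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
	};

	return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
#else
	/* Older uapi headers: the ioctl is not available at build time. */
	(void)vm_fd;
	(void)gpa;
	errno = ENOSYS;
	return -1;
#endif
}
```

With the proposal above, a caller would have to be ready for the ioctl to fail
while mapcounts, pincounts or refcounts on the range are still elevated.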

[1] https://lore.kernel.org/all/20240829-guest-memfd-lib-v2-0-b9afc1ff3656@quicinc.com/T/

Ackerley Tng (37):
  mm: hugetlb: Simplify logic in dequeue_hugetlb_folio_vma()
  mm: hugetlb: Refactor vma_has_reserves() to should_use_hstate_resv()
  mm: hugetlb: Remove unnecessary check for avoid_reserve
  mm: mempolicy: Refactor out policy_node_nodemask()
  mm: hugetlb: Refactor alloc_buddy_hugetlb_folio_with_mpol() to
    interpret mempolicy instead of vma
  mm: hugetlb: Refactor dequeue_hugetlb_folio_vma() to use mpol
  mm: hugetlb: Refactor out hugetlb_alloc_folio
  mm: truncate: Expose preparation steps for truncate_inode_pages_final
  mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages()
  mm: hugetlb: Add option to create new subpool without using surplus
  mm: hugetlb: Expose hugetlb_acct_memory()
  mm: hugetlb: Move and expose hugetlb_zero_partial_page()
  KVM: guest_memfd: Make guest mem use guest mem inodes instead of
    anonymous inodes
  KVM: guest_memfd: hugetlb: initialization and cleanup
  KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
  KVM: guest_memfd: Add page alignment check for hugetlb guest_memfd
  KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd
  KVM: selftests: Support various types of backing sources for private
    memory
  KVM: selftests: Update test for various private memory backing source
    types
  KVM: selftests: Add private_mem_conversions_test.sh
  KVM: selftests: Test that guest_memfd usage is reported via hugetlb
  mm: hugetlb: Expose vmemmap optimization functions
  mm: hugetlb: Expose HugeTLB functions for promoting/demoting pages
  mm: hugetlb: Add functions to add/move/remove from hugetlb lists
  KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  KVM: guest_memfd: Allow mmapping guest_memfd files
  KVM: guest_memfd: Use vm_type to determine default faultability
  KVM: Handle conversions in the SET_MEMORY_ATTRIBUTES ioctl
  KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
  KVM: selftests: Allow vm_set_memory_attributes to be used without
    asserting return value of 0
  KVM: selftests: Test using guest_memfd memory from userspace
  KVM: selftests: Test guest_memfd memory sharing between guest and host
  KVM: selftests: Add notes in private_mem_kvm_exits_test for mmap-able
    guest_memfd
  KVM: selftests: Test that pinned pages block KVM from setting memory
    attributes to PRIVATE
  KVM: selftests: Refactor vm_mem_add to be more flexible
  KVM: selftests: Add helper to perform madvise by memslots
  KVM: selftests: Update private_mem_conversions_test for mmap()able
    guest_memfd

Vishal Annapurve (2):
  KVM: guest_memfd: Split HugeTLB pages for guest_memfd use
  KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

 fs/hugetlbfs/inode.c                          |   35 +-
 include/linux/hugetlb.h                       |   54 +-
 include/linux/kvm_host.h                      |    1 +
 include/linux/mempolicy.h                     |    2 +
 include/linux/mm.h                            |    1 +
 include/uapi/linux/kvm.h                      |   26 +
 include/uapi/linux/magic.h                    |    1 +
 mm/hugetlb.c                                  |  346 ++--
 mm/hugetlb_vmemmap.h                          |   11 -
 mm/mempolicy.c                                |   36 +-
 mm/truncate.c                                 |   26 +-
 tools/include/linux/kernel.h                  |    4 +-
 tools/testing/selftests/kvm/Makefile          |    3 +
 .../kvm/guest_memfd_hugetlb_reporting_test.c  |  222 +++
 .../selftests/kvm/guest_memfd_pin_test.c      |  104 ++
 .../selftests/kvm/guest_memfd_sharing_test.c  |  160 ++
 .../testing/selftests/kvm/guest_memfd_test.c  |  238 ++-
 .../testing/selftests/kvm/include/kvm_util.h  |   45 +-
 .../testing/selftests/kvm/include/test_util.h |   18 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  443 +++--
 tools/testing/selftests/kvm/lib/test_util.c   |   99 ++
 .../kvm/x86_64/private_mem_conversions_test.c |  158 +-
 .../x86_64/private_mem_conversions_test.sh    |   91 +
 .../kvm/x86_64/private_mem_kvm_exits_test.c   |   11 +-
 virt/kvm/guest_memfd.c                        | 1563 ++++++++++++++++-
 virt/kvm/kvm_main.c                           |   17 +
 virt/kvm/kvm_mm.h                             |   16 +
 27 files changed, 3288 insertions(+), 443 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_pin_test.c
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_sharing_test.c
 create mode 100755 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh

--
2.46.0.598.g6f2099f65c-goog
Re: [RFC PATCH 00/39] 1G page support for guest_memfd
Posted by Michal Hocko 1 year, 4 months ago
Cc Oscar for awareness

On Tue 10-09-24 23:43:31, Ackerley Tng wrote:
> Hello,
> 
> This patchset is our exploration of how to support 1G pages in guest_memfd, and
> how the pages will be used in Confidential VMs.
> [...]

-- 
Michal Hocko
SUSE Labs
RE: [RFC PATCH 00/39] 1G page support for guest_memfd
Posted by Du, Fan 1 year, 4 months ago

> -----Original Message-----
> From: Ackerley Tng <ackerleytng@google.com>
> Sent: Wednesday, September 11, 2024 7:44 AM
> To: tabba@google.com; quic_eberman@quicinc.com; roypat@amazon.co.uk;
> jgg@nvidia.com; peterx@redhat.com; david@redhat.com;
> rientjes@google.com; fvdl@google.com; jthoughton@google.com;
> seanjc@google.com; pbonzini@redhat.com; Li, Zhiquan1
> <zhiquan1.li@intel.com>; Du, Fan <fan.du@intel.com>; Miao, Jun
> <jun.miao@intel.com>; Yamahata, Isaku <isaku.yamahata@intel.com>;
> muchun.song@linux.dev; mike.kravetz@oracle.com
> Cc: Aktas, Erdem <erdemaktas@google.com>; Annapurve, Vishal
> <vannapurve@google.com>; ackerleytng@google.com; qperret@google.com;
> jhubbard@nvidia.com; willy@infradead.org; shuah@kernel.org;
> brauner@kernel.org; bfoster@redhat.com; kent.overstreet@linux.dev;
> pvorel@suse.cz; rppt@kernel.org; richard.weiyang@gmail.com;
> anup@brainfault.org; Xu, Haibo1 <haibo1.xu@intel.com>;
> ajones@ventanamicro.com; vkuznets@redhat.com; Wieczor-Retman, Maciej
> <maciej.wieczor-retman@intel.com>; pgonda@google.com;
> oliver.upton@linux.dev; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> kvm@vger.kernel.org; linux-kselftest@vger.kernel.org;
> linux-fsdevel@kvack.org
> Subject: [RFC PATCH 00/39] 1G page support for guest_memfd
> 
> Hello,
>
> This patchset is our exploration of how to support 1G pages in guest_memfd,
> and how the pages will be used in Confidential VMs.
>
> The patchset covers:
>
> + How to get 1G pages
> + Allowing mmap() of guest_memfd to userspace so that both private and
>   shared memory can use the same physical pages

Hi Ackerley,

Thanks for posting the new version :)

Regarding the description above and the summary below of patches 26-29: does
this new design aim to back both shared and private GPAs with a single HugeTLB
subpool whose size equals the VM instance's total memory?

As I understand it, before these changes the shared memfd and the gmem fd each
had a dedicated HugeTLB pool, that is, two separate HugeTLB subpool
reservations.

Does QEMU require changes as well? I'd like to test this series if you can
share a QEMU branch.

> + Patches 26-28
>     + Allow mmap() in guest_memfd and faulting in shared pages
> + Patch 29
>     + Enables conversion between private/shared pages

Thanks!

> + Splitting and reconstructing pages to support conversions and mmap()
> + How the VM, userspace and guest_memfd interact to support conversions
> + Selftests to test all the above
>     + Selftests also demonstrate the conversion flow between VM, userspace
> and
>       guest_memfd.
> 
> Why 1G pages in guest memfd?
> 
> Bring guest_memfd to performance and memory savings parity with VMs that
> are
> backed by HugeTLBfs.
> 
> + Performance is improved with 1G pages by more TLB hits and faster page
> walks
>   on TLB misses.
> + Memory savings from 1G pages comes from HugeTLB Vmemmap
> Optimization (HVO).
> 
> Options for 1G page support:
> 
> 1. HugeTLB
> 2. Contiguous Memory Allocator (CMA)
> 3. Other suggestions are welcome!
> 
> Comparison between options:
> 
> 1. HugeTLB
>     + Refactor HugeTLB to separate allocator from the rest of HugeTLB
>     + Pro: Graceful transition for VMs backed with HugeTLB to guest_memfd
>         + Near term: Allows co-tenancy of HugeTLB and guest_memfd backed
> VMs
>     + Pro: Can provide iterative steps toward new future allocator
>         + Unexplored: Managing userspace-visible changes
>             + e.g. HugeTLB's free_hugepages will decrease if HugeTLB is used,
>               but not when future allocator is used
> 2. CMA
>     + Port some HugeTLB features to be applied on CMA
>     + Pro: Clean slate
> 
> What would refactoring HugeTLB involve?
> 
> (Some refactoring was done in this RFC, more can be done.)
> 
> 1. Broadly involves separating the HugeTLB allocator from the rest of HugeTLB
>     + Brings more modularity to HugeTLB
>     + No functionality change intended
>     + Likely step towards HugeTLB's integration into core-mm
> 2. guest_memfd will use just the allocator component of HugeTLB, not
> including
>    the complex parts of HugeTLB like
>     + Userspace reservations (resv_map)
>     + Shared PMD mappings
>     + Special page walkers
> 
> What features would need to be ported to CMA?
> 
> + Improved allocation guarantees
>     + Per NUMA node pool of huge pages
>     + Subpools per guest_memfd
> + Memory savings
>     + Something like HugeTLB Vmemmap Optimization
> + Configuration/reporting features
>     + Configuration of number of pages available (and per NUMA node) at and
>       after host boot
>     + Reporting of memory usage/availability statistics at runtime
> 
> HugeTLB was picked as the source of 1G pages for this RFC because it allows a
> graceful transition, and retains memory savings from HVO.
> 
> To illustrate this, if a host machine uses HugeTLBfs to back VMs, and a
> confidential VM were to be scheduled on that host, some HugeTLBfs pages
> would
> have to be given up and returned to CMA for guest_memfd pages to be
> rebuilt from
> that memory. This requires memory to be reserved for HVO to be removed
> and
> reapplied on the new guest_memfd memory. This not only slows down
> memory
> allocation but also trims the benefits of HVO. Memory would have to be
> reserved
> on the host to facilitate these transitions.
> 
> Improving how guest_memfd uses the allocator in a future revision of this
> RFC:
> 
> To provide an easier transition away from HugeTLB, guest_memfd's use of
> HugeTLB should be limited to these allocator functions:
> 
> + reserve(node, page_size, num_pages) => opaque handle
>     + Used when a guest_memfd inode is created to reserve memory from
>       backend allocator
> + allocate(handle, mempolicy, page_size) => folio
>     + To allocate a folio from guest_memfd's reservation
> + split(handle, folio, target_page_size) => void
>     + To take a huge folio, and split it to smaller folios, restore to filemap
> + reconstruct(handle, first_folio, nr_pages) => void
>     + To take a folio, and reconstruct a huge folio out of nr_pages from the
>       first_folio
> + free(handle, folio) => void
>     + To return folio to guest_memfd's reservation
> + error(handle, folio) => void
>     + To handle memory errors
> + unreserve(handle) => void
>     + To return guest_memfd's reservation to allocator backend
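The allocator functions listed above can be sketched as a toy model in Python. This is only an illustration of the proposed interface shape, not code from the series: the class names (ToyAllocator, Reservation, Folio), the offset-based bookkeeping, and the return values of split() and reconstruct() (void in the list above) are assumptions made so the flow can be observed. The mempolicy argument and error() are left out.

```python
from dataclasses import dataclass, field

SZ_2M = 2 << 20
SZ_1G = 1 << 30


@dataclass
class Folio:
    """Toy folio: a byte offset into the reservation plus a size."""
    start: int
    size: int


@dataclass
class Reservation:
    """Opaque handle: pages set aside for one guest_memfd inode."""
    node: int
    page_size: int
    free_starts: list = field(default_factory=list)


class ToyAllocator:
    def __init__(self, pages_per_node):
        # e.g. {0: 2} means two free huge pages on NUMA node 0.
        self.pool = dict(pages_per_node)

    def reserve(self, node, page_size, num_pages):
        if self.pool.get(node, 0) < num_pages:
            raise MemoryError("not enough free huge pages on node")
        self.pool[node] -= num_pages
        starts = [i * page_size for i in range(num_pages)]
        return Reservation(node, page_size, starts)

    def allocate(self, handle, page_size):
        # The mempolicy argument from the list above is omitted here.
        if page_size != handle.page_size or not handle.free_starts:
            raise MemoryError("reservation cannot satisfy request")
        return Folio(handle.free_starts.pop(0), page_size)

    def split(self, handle, folio, target_page_size):
        # Returns the pieces (assumption) so the caller can inspect them.
        assert folio.size % target_page_size == 0
        return [Folio(folio.start + off, target_page_size)
                for off in range(0, folio.size, target_page_size)]

    def reconstruct(self, handle, first_folio, nr_pages):
        size = first_folio.size * nr_pages
        assert size == handle.page_size and first_folio.start % size == 0
        return Folio(first_folio.start, size)

    def free(self, handle, folio):
        assert folio.size == handle.page_size
        handle.free_starts.append(folio.start)

    def unreserve(self, handle):
        self.pool[handle.node] += len(handle.free_starts)
        handle.free_starts.clear()


alloc = ToyAllocator({0: 2})
handle = alloc.reserve(node=0, page_size=SZ_1G, num_pages=1)
folio = alloc.allocate(handle, SZ_1G)
pieces = alloc.split(handle, folio, SZ_2M)
print(len(pieces))  # 512
whole = alloc.reconstruct(handle, pieces[0], len(pieces))
alloc.free(handle, whole)
alloc.unreserve(handle)
print(alloc.pool[0])  # 2
```

The point of the sketch is that guest_memfd only ever touches the handle and folios, so a different backend could be swapped in behind the same seven calls.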
> 
> Userspace should only provide a page size when creating a guest_memfd and
> should not have to specify HugeTLB.
> 
> Overview of patches:
> 
> + Patches 01-12
>     + Many small changes to HugeTLB, mostly to separate HugeTLBfs concepts
>       from HugeTLB, and to expose HugeTLB functions.
> + Patches 13-16
>     + Letting guest_memfd use HugeTLB
>     + Creation of each guest_memfd reserves pages from HugeTLB's global
>       hstate and puts them into the guest_memfd inode's subpool
>     + Each folio allocation takes a page from the guest_memfd inode's subpool
> + Patches 17-21
>     + Selftests for new HugeTLB features in guest_memfd
> + Patches 22-24
>     + More small changes on the HugeTLB side to expose functions needed by
>       guest_memfd
> + Patch 25:
>     + Uses the newly available functions from patches 22-24 to split HugeTLB
>       pages. In this patch, HugeTLB folios are always split to 4K before any
>       usage, private or shared.
> + Patches 26-28
>     + Allow mmap() in guest_memfd and faulting in shared pages
> + Patch 29
>     + Enables conversion between private/shared pages
> + Patch 30
>     + Required to zero folios after conversions to avoid leaking initialized
>       kernel memory
> + Patches 31-38
>     + Add selftests to test mapping pages to userspace, guest/host memory
>       sharing and update conversions tests
>     + Patch 33 illustrates the conversion flow between
>       VM/userspace/guest_memfd
> + Patch 39
>     + Dynamically split and reconstruct HugeTLB pages instead of always
>       splitting before use. All earlier selftests are expected to still pass.
> 
> TODOs:
> 
> + Add logic to wait for safe_refcount [1]
> + Look into lazy splitting/reconstruction of pages
>     + Currently, when KVM_SET_MEMORY_ATTRIBUTES is invoked, not only are
>       the mem_attr_array and faultability updated, the pages in the
>       requested range are also split/reconstructed as necessary. We want to
>       look into delaying splitting/reconstruction to fault time.
> + Solve race between folios being faulted in and being truncated
>     + When running private_mem_conversions_test with more than 1 vCPU, a
>       folio getting truncated may get faulted in by another process,
>       causing elevated mapcounts when the folio is freed
>       (VM_BUG_ON_FOLIO).
> + Add intermediate splits (1G should first split to 2M and not split directly to
>   4K)
> + Use guest's lock instead of hugetlb_lock
> + Use multi-index xarray/replace xarray with some other data struct for
>   faultability flag
> + Refactor HugeTLB better, present generic allocator interface
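To put numbers on the "intermediate splits" TODO above, the short Python sketch below compares how many folios result from splitting a 1G page directly to 4K with splitting to 2M first and then splitting only one 2M chunk to 4K. The scenario (only one 2M region needing 4K granularity) and the function names are assumptions for illustration; the series itself currently always splits to 4K.

```python
SZ_4K = 4 * 1024
SZ_2M = 2 * 1024 * 1024
SZ_1G = 1024 * 1024 * 1024


def folios_after_direct_split(huge_size, base_size):
    """Every base-sized piece becomes its own folio."""
    return huge_size // base_size


def folios_after_intermediate_split(huge_size, mid_size, base_size):
    """Split huge -> mid everywhere, then mid -> base for one mid chunk only."""
    mid_chunks = huge_size // mid_size
    base_chunks = mid_size // base_size
    # One mid-sized chunk is replaced by its base-sized pieces.
    return (mid_chunks - 1) + base_chunks


direct = folios_after_direct_split(SZ_1G, SZ_4K)
staged = folios_after_intermediate_split(SZ_1G, SZ_2M, SZ_4K)
print(direct)  # 262144
print(staged)  # 1023
```

Under that assumption, the staged split tracks 1023 folios instead of 262144, and 511 of them remain 2M-sized.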
> 
> Please let us know your thoughts on:
> 
> + HugeTLB as the choice of transitional allocator backend
> + Refactoring HugeTLB to provide generic allocator interface
> + Shared/private conversion flow
>     + Requiring user to request kernel to unmap pages from userspace using
>       madvise(MADV_DONTNEED)
>     + Failing conversion on elevated mapcounts/pincounts/refcounts
> + Process of splitting/reconstructing page
> + Anything else!
> 
> [1] https://lore.kernel.org/all/20240829-guest-memfd-lib-v2-0-b9afc1ff3656@quicinc.com/T/
> 
> Ackerley Tng (37):
>   mm: hugetlb: Simplify logic in dequeue_hugetlb_folio_vma()
>   mm: hugetlb: Refactor vma_has_reserves() to should_use_hstate_resv()
>   mm: hugetlb: Remove unnecessary check for avoid_reserve
>   mm: mempolicy: Refactor out policy_node_nodemask()
>   mm: hugetlb: Refactor alloc_buddy_hugetlb_folio_with_mpol() to
>     interpret mempolicy instead of vma
>   mm: hugetlb: Refactor dequeue_hugetlb_folio_vma() to use mpol
>   mm: hugetlb: Refactor out hugetlb_alloc_folio
>   mm: truncate: Expose preparation steps for truncate_inode_pages_final
>   mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages()
>   mm: hugetlb: Add option to create new subpool without using surplus
>   mm: hugetlb: Expose hugetlb_acct_memory()
>   mm: hugetlb: Move and expose hugetlb_zero_partial_page()
>   KVM: guest_memfd: Make guest mem use guest mem inodes instead of
>     anonymous inodes
>   KVM: guest_memfd: hugetlb: initialization and cleanup
>   KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
>   KVM: guest_memfd: Add page alignment check for hugetlb guest_memfd
>   KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd
>   KVM: selftests: Support various types of backing sources for private
>     memory
>   KVM: selftests: Update test for various private memory backing source
>     types
>   KVM: selftests: Add private_mem_conversions_test.sh
>   KVM: selftests: Test that guest_memfd usage is reported via hugetlb
>   mm: hugetlb: Expose vmemmap optimization functions
>   mm: hugetlb: Expose HugeTLB functions for promoting/demoting pages
>   mm: hugetlb: Add functions to add/move/remove from hugetlb lists
>   KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
>   KVM: guest_memfd: Allow mmapping guest_memfd files
>   KVM: guest_memfd: Use vm_type to determine default faultability
>   KVM: Handle conversions in the SET_MEMORY_ATTRIBUTES ioctl
>   KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
>   KVM: selftests: Allow vm_set_memory_attributes to be used without
>     asserting return value of 0
>   KVM: selftests: Test using guest_memfd memory from userspace
>   KVM: selftests: Test guest_memfd memory sharing between guest and host
>   KVM: selftests: Add notes in private_mem_kvm_exits_test for mmap-able
>     guest_memfd
>   KVM: selftests: Test that pinned pages block KVM from setting memory
>     attributes to PRIVATE
>   KVM: selftests: Refactor vm_mem_add to be more flexible
>   KVM: selftests: Add helper to perform madvise by memslots
>   KVM: selftests: Update private_mem_conversions_test for mmap()able
>     guest_memfd
> 
> Vishal Annapurve (2):
>   KVM: guest_memfd: Split HugeTLB pages for guest_memfd use
>   KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
> 
>  fs/hugetlbfs/inode.c                          |   35 +-
>  include/linux/hugetlb.h                       |   54 +-
>  include/linux/kvm_host.h                      |    1 +
>  include/linux/mempolicy.h                     |    2 +
>  include/linux/mm.h                            |    1 +
>  include/uapi/linux/kvm.h                      |   26 +
>  include/uapi/linux/magic.h                    |    1 +
>  mm/hugetlb.c                                  |  346 ++--
>  mm/hugetlb_vmemmap.h                          |   11 -
>  mm/mempolicy.c                                |   36 +-
>  mm/truncate.c                                 |   26 +-
>  tools/include/linux/kernel.h                  |    4 +-
>  tools/testing/selftests/kvm/Makefile          |    3 +
>  .../kvm/guest_memfd_hugetlb_reporting_test.c  |  222 +++
>  .../selftests/kvm/guest_memfd_pin_test.c      |  104 ++
>  .../selftests/kvm/guest_memfd_sharing_test.c  |  160 ++
>  .../testing/selftests/kvm/guest_memfd_test.c  |  238 ++-
>  .../testing/selftests/kvm/include/kvm_util.h  |   45 +-
>  .../testing/selftests/kvm/include/test_util.h |   18 +
>  tools/testing/selftests/kvm/lib/kvm_util.c    |  443 +++--
>  tools/testing/selftests/kvm/lib/test_util.c   |   99 ++
>  .../kvm/x86_64/private_mem_conversions_test.c |  158 +-
>  .../x86_64/private_mem_conversions_test.sh    |   91 +
>  .../kvm/x86_64/private_mem_kvm_exits_test.c   |   11 +-
>  virt/kvm/guest_memfd.c                        | 1563 ++++++++++++++++-
>  virt/kvm/kvm_main.c                           |   17 +
>  virt/kvm/kvm_mm.h                             |   16 +
>  27 files changed, 3288 insertions(+), 443 deletions(-)
>  create mode 100644 tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
>  create mode 100644 tools/testing/selftests/kvm/guest_memfd_pin_test.c
>  create mode 100644 tools/testing/selftests/kvm/guest_memfd_sharing_test.c
>  create mode 100755 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh
> 
> --
> 2.46.0.598.g6f2099f65c-goog
Re: [RFC PATCH 00/39] 1G page support for guest_memfd
Posted by Vishal Annapurve 1 year, 4 months ago
On Fri, Sep 13, 2024 at 6:08 PM Du, Fan <fan.du@intel.com> wrote:
>
> ...
> >
> > Hello,
> >
> > This patchset is our exploration of how to support 1G pages in
> > guest_memfd, and how the pages will be used in Confidential VMs.
> >
> > The patchset covers:
> >
> > + How to get 1G pages
> > + Allowing mmap() of guest_memfd to userspace so that both private and
> >   shared
>
> Hi Ackerley
>
> Thanks for posting new version :)
>
> W.r.t. the above description and the patch snippet below from Patches
> 26-29, does this new design aim to back shared and private GPAs with a
> single HugeTLB spool equal in size to the VM instance's total memory?

Yes.
>
> By my understanding, before these changes, the shared memfd and the gmem
> fd each had a dedicated hugetlb pool, i.e. two copies/reservations of the
> hugetlb spool.

Selftests attached to this series use a single gmem fd to back guest memory.

>
> Does Qemu require new changes as well? I'd like to test this series if
> you can share a Qemu branch.
>

We are going to discuss this RFC series and related issues at LPC.
Once the next steps are finalized, the plan will be to send out an
improved version. You can use/modify the selftests that are part of
this series to test this feature with software protected VMs for now.

Qemu will require changes for this feature on top of the already-floated
gmem integration series [1] that adds software protected VM support to
Qemu. If you are interested in testing this feature with TDX VMs, then
multiple series are needed to set up the right test environment
(including [2]). We haven't considered posting Qemu patches and it
will be a while before we can get to it.

[1] https://patchew.org/QEMU/20230914035117.3285885-1-xiaoyao.li@intel.com/
[2] https://patchwork.kernel.org/project/kvm/cover/20231115071519.2864957-1-xiaoyao.li@intel.com/
Re: [RFC PATCH 00/39] 1G page support for guest_memfd
Posted by Amit Shah 1 year ago
Hey Ackerley,

On Tue, 2024-09-10 at 23:43 +0000, Ackerley Tng wrote:
> Hello,
> 
> This patchset is our exploration of how to support 1G pages in
> guest_memfd, and
> how the pages will be used in Confidential VMs.

We've discussed this patchset at LPC and in the guest-memfd calls.  Can
you please summarise the discussions here as a follow-up, so we can
also continue discussing on-list, and not repeat things that are
already discussed?

Also - as mentioned in those meetings, we at AMD are interested in this
series along with SEV-SNP support - and I'm also interested in figuring
out how we collaborate on the evolution of this series.

Thanks,

		Amit
Re: [RFC PATCH 00/39] 1G page support for guest_memfd
Posted by Ackerley Tng 1 year ago
Amit Shah <amit@infradead.org> writes:

> Hey Ackerley,

Hi Amit,

> On Tue, 2024-09-10 at 23:43 +0000, Ackerley Tng wrote:
>> Hello,
>> 
>> This patchset is our exploration of how to support 1G pages in
>> guest_memfd, and
>> how the pages will be used in Confidential VMs.
>
> We've discussed this patchset at LPC and in the guest-memfd calls.  Can
> you please summarise the discussions here as a follow-up, so we can
> also continue discussing on-list, and not repeat things that are
> already discussed?

Thanks for this question! Since LPC, Vishal and I have been tied up with
some Google internal work, which slowed down progress on 1G page support
for guest_memfd. We expect to make progress on it this quarter and over
the next few quarters.

The related updates are:

1. No objections on using hugetlb as the source of 1G pages.

2. Prerequisite hugetlb changes.

+ I've separated some of the prerequisite hugetlb changes into another
  patch series hoping to have them merged ahead of and separately from
  this patchset [1].
+ Peter Xu contributed a better patchset, including a bugfix [2].
+ I have an alternative [3].
+ The next revision of this series (1G page support for guest_memfd)
  will be based on alternative [3]. I think there should be no issues
  there.
+ I believe Peter is also waiting on the next revision before we make
  further progress/decide on [2] or [3].

3. No objections to allowing mmap()-ing of guest_memfd physical memory
   when memory is marked shared to avoid double-allocation.

4. No objections to splitting pages when marked shared.

5. folio_put() callback for guest_memfd folio cleanup/merging.

+ In Fuad's series [4], Fuad used the callback to reset the folio's
  mappability status.
+ The catch is that the callback is only invoked when folio->page_type
  == PGTY_guest_memfd, and folio->page_type is a union with folio's
  mapcount, so any folio with a non-zero mapcount cannot have a valid
  page_type.
+ I was concerned that we might not get a callback, and hence
  unintentionally skip merging pages and not correctly restore hugetlb
  pages.
+ This was discussed at the last guest_memfd upstream call (2025-01-23
  07:58 PST), and the conclusion is that using folio->page_type works,
  because
    + We only merge folios in two cases: (1) when converting to private
      and (2) when truncating folios (removing from filemap).
    + When converting to private, in (1), we can forcibly unmap all the
      converted pages or check if the mapcount is 0, and once mapcount
      is 0 we can install the callback by setting folio->page_type =
      PGTY_guest_memfd.
    + When truncating, we will be unmapping the folios anyway, so
      mapcount is also 0 and we can install the callback.
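
A toy Python sketch of the argument in item 5 follows. It is not taken from any of the referenced series: ToyFolio, its single "slot" standing in for the page_type/mapcount union, and the helper names are assumptions, and real kernel refcounting is far more involved. It only shows why the callback can be installed once mapcount has dropped to 0, and why a still-mapped folio cannot carry a page_type.

```python
PGTY_GUEST_MEMFD = "PGTY_guest_memfd"  # stand-in for the kernel constant


class ToyFolio:
    def __init__(self):
        # One slot holds either ("mapcount", n) or ("page_type", t),
        # mimicking the union described above.
        self.slot = ("mapcount", 0)
        self.refcount = 1
        self.callback_ran = False

    @property
    def mapcount(self):
        kind, value = self.slot
        return value if kind == "mapcount" else 0

    def map_to_userspace(self):
        # Mapping overwrites any page_type, since the slot is shared.
        self.slot = ("mapcount", self.mapcount + 1)

    def unmap_from_userspace(self):
        assert self.mapcount > 0
        self.slot = ("mapcount", self.mapcount - 1)

    def install_callback(self):
        """Set page_type; only possible once nothing maps the folio."""
        if self.mapcount != 0:
            return False
        self.slot = ("page_type", PGTY_GUEST_MEMFD)
        return True

    def put(self):
        self.refcount -= 1
        if self.refcount == 0 and self.slot == ("page_type", PGTY_GUEST_MEMFD):
            self.callback_ran = True  # where merging/cleanup would happen


def convert_to_private(folio):
    """Case (1): forcibly unmap, then install the callback."""
    while folio.mapcount:
        folio.unmap_from_userspace()
    return folio.install_callback()


mapped = ToyFolio()
mapped.map_to_userspace()
print(mapped.install_callback())   # False: mapcount is 1

shared = ToyFolio()
shared.map_to_userspace()
print(convert_to_private(shared))  # True
shared.put()
print(shared.callback_ran)         # True
```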

Hope that covers the points that you're referring to. If there are other
parts that you'd like to know the status on, please let me know which
aspects those are!

> Also - as mentioned in those meetings, we at AMD are interested in this
> series along with SEV-SNP support - and I'm also interested in figuring
> out how we collaborate on the evolution of this series.

Thanks for all your help and comments during the guest_memfd upstream
calls, and thanks for the help from AMD.

Extending mmap() support from Fuad with 1G page support introduces more
states, which makes it more complicated (at least for me).

I'm modeling the states in python so I can iterate more quickly. I also
have usage flows (e.g. allocate, guest_use, host_use,
transient_folio_get, close, transient_folio_put) as test cases.

I'm almost done with the model and my next steps are to write up a state
machine (like Fuad's [5]) and share that.

I'd be happy to share the python model too but I have to work through
some internal open-sourcing processes first, so if you think this will
be useful, let me know!

Then, I'll code it all up in a new revision of this series (target:
March 2025), which will be accompanied by source code on GitHub.

I'm happy to collaborate more closely, let me know if you have ideas for
collaboration!

> Thanks,
>
> 		Amit

[1] https://lore.kernel.org/all/cover.1728684491.git.ackerleytng@google.com/T/
[2] https://lore.kernel.org/all/20250107204002.2683356-1-peterx@redhat.com/T/
[3] https://lore.kernel.org/all/diqzjzayz5ho.fsf@ackerleytng-ctop.c.googlers.com/
[4] https://lore.kernel.org/all/20250117163001.2326672-1-tabba@google.com/T/
[5] https://lpc.events/event/18/contributions/1758/attachments/1457/3699/Guestmemfd%20folio%20state%20page_type.pdf
Re: [RFC PATCH 00/39] 1G page support for guest_memfd
Posted by Amit Shah 1 year ago
On Mon, 2025-02-03 at 08:35 +0000, Ackerley Tng wrote:
> Amit Shah <amit@infradead.org> writes:
> 
> > Hey Ackerley,
> 
> Hi Amit,
> 
> > On Tue, 2024-09-10 at 23:43 +0000, Ackerley Tng wrote:
> > > Hello,
> > > 
> > > This patchset is our exploration of how to support 1G pages in
> > > guest_memfd, and
> > > how the pages will be used in Confidential VMs.
> > 
> > We've discussed this patchset at LPC and in the guest-memfd calls.
> > Can you please summarise the discussions here as a follow-up, so we can
> > also continue discussing on-list, and not repeat things that are
> > already discussed?
> 
> Thanks for this question! Since LPC, Vishal and I have been tied up with
> some Google internal work, which slowed down progress on 1G page support
> for guest_memfd. We expect to make progress on it this quarter and over
> the next few quarters.
> 
> The related updates are
> 
> 1. No objections on using hugetlb as the source of 1G pages.
> 
> 2. Prerequisite hugetlb changes.
> 
> + I've separated some of the prerequisite hugetlb changes into another
>   patch series hoping to have them merged ahead of and separately from
>   this patchset [1].
> + Peter Xu contributed a better patchset, including a bugfix [2].
> + I have an alternative [3].
> + The next revision of this series (1G page support for guest_memfd)
>   will be based on alternative [3]. I think there should be no issues
>   there.
> + I believe Peter is also waiting on the next revision before we make
>   further progress/decide on [2] or [3].
> 
> 3. No objections to allowing mmap()-ing of guest_memfd physical memory
>    when memory is marked shared to avoid double-allocation.
> 
> 4. No objections to splitting pages when marked shared.
> 
> 5. folio_put() callback for guest_memfd folio cleanup/merging.
> 
> + In Fuad's series [4], Fuad used the callback to reset the folio's
>   mappability status.
> + The catch is that the callback is only invoked when folio->page_type
>   == PGTY_guest_memfd, and folio->page_type is a union with folio's
>   mapcount, so any folio with a non-zero mapcount cannot have a valid
>   page_type.
> + I was concerned that we might not get a callback, and hence
>   unintentionally skip merging pages and not correctly restore hugetlb
>   pages.
> + This was discussed at the last guest_memfd upstream call (2025-01-23
>   07:58 PST), and the conclusion is that using folio->page_type works,
>   because
>     + We only merge folios in two cases: (1) when converting to private
>       and (2) when truncating folios (removing from filemap).
>     + When converting to private, in (1), we can forcibly unmap all the
>       converted pages or check if the mapcount is 0, and once mapcount
>       is 0 we can install the callback by setting folio->page_type =
>       PGTY_guest_memfd.
>     + When truncating, we will be unmapping the folios anyway, so
>       mapcount is also 0 and we can install the callback.
> 
> Hope that covers the points that you're referring to. If there are other
> parts that you'd like to know the status on, please let me know which
> aspects those are!

Thank you for the nice summary!

> > Also - as mentioned in those meetings, we at AMD are interested in this
> > series along with SEV-SNP support - and I'm also interested in figuring
> > out how we collaborate on the evolution of this series.
> 
> Thanks for all your help and comments during the guest_memfd upstream
> calls, and thanks for the help from AMD.
> 
> Extending mmap() support from Fuad with 1G page support introduces more
> states, which makes it more complicated (at least for me).
> 
> I'm modeling the states in python so I can iterate more quickly. I also
> have usage flows (e.g. allocate, guest_use, host_use,
> transient_folio_get, close, transient_folio_put) as test cases.
> 
> I'm almost done with the model and my next steps are to write up a state
> machine (like Fuad's [5]) and share that.
> 
> I'd be happy to share the python model too but I have to work through
> some internal open-sourcing processes first, so if you think this will
> be useful, let me know!

No problem.  Yes, I'm interested in this - it'll be helpful!

The other thing of note is that while we have the kernel patches, a
userspace to drive them and exercise them is currently missing.

> Then, I'll code it all up in a new revision of this series (target:
> March 2025), which will be accompanied by source code on GitHub.
> 
> I'm happy to collaborate more closely, let me know if you have ideas for
> collaboration!

Thank you.  I think currently the bigger problem we have is allocation
of hugepages -- which is also blocking a lot of the follow-on work.
Vishal briefly mentioned isolating pages from Linux entirely last time.
That's also what I'm interested in: figuring out whether we can
completely bypass the allocation problem by not allocating struct pages
for non-host-use pages at all.  The guest_memfs/KHO/kexec/live-update
patches also take this approach on AWS (for how their VMs are launched).
If we work with those patches together, allocation of 1G hugepages is
simplified.  I'd like to discuss more on these themes to see if this is
an approach that helps as well.


		Amit
Re: [RFC PATCH 00/39] 1G page support for guest_memfd
Posted by Ackerley Tng 1 year ago
Amit Shah <amit@infradead.org> writes:

>> <snip>
>> 
>> Thanks for all your help and comments during the guest_memfd upstream
>> calls, and thanks for the help from AMD.
>> 
>> Extending mmap() support from Fuad with 1G page support introduces more
>> states, which makes it more complicated (at least for me).
>> 
>> I'm modeling the states in python so I can iterate more quickly. I also
>> have usage flows (e.g. allocate, guest_use, host_use,
>> transient_folio_get, close, transient_folio_put) as test cases.
>> 
>> I'm almost done with the model and my next steps are to write up a state
>> machine (like Fuad's [5]) and share that.

Thanks everyone for all the comments at the 2025-02-06 guest_memfd
upstream call! Here are the slides and state diagram:

+ Slides: https://lpc.events/event/18/contributions/1764/attachments/1409/3704/guest-memfd-1g-page-support-2025-02-06.pdf
+ State diagram: https://lpc.events/event/18/contributions/1764/attachments/1409/3702/guest-memfd-state-diagram-split-merge-2025-02-06.drawio.svg
+ For those interested in editing the state diagram using draw.io:
  https://lpc.events/event/18/contributions/1764/attachments/1409/3703/guest-memfd-state-diagram-split-merge-2025-02-06.drawio.xml

>> 
>> I'd be happy to share the python model too but I have to work through
>> some internal open-sourcing processes first, so if you think this will
>> be useful, let me know!
>
> No problem.  Yes, I'm interested in this - it'll be helpful!

I've started working through the internal processes and will update here
when I'm done!

>
> The other thing of note is that while we have the kernel patches, a
> userspace to drive them and exercise them is currently missing.

In this and future patch series, I'll have selftests that will exercise
any new functionality.

>
>> Then, I'll code it all up in a new revision of this series (target:
>> March 2025), which will be accompanied by source code on GitHub.
>> 
>> I'm happy to collaborate more closely, let me know if you have ideas for
>> collaboration!
>
> Thank you.  I think currently the bigger problem we have is allocation
> of hugepages -- which is also blocking a lot of the follow-on work.
> Vishal briefly mentioned isolating pages from Linux entirely last time.
> That's also what I'm interested in: figuring out whether we can
> completely bypass the allocation problem by not allocating struct pages
> for non-host-use pages at all.  The guest_memfs/KHO/kexec/live-update
> patches also take this approach on AWS (for how their VMs are launched).
> If we work with those patches together, allocation of 1G hugepages is
> simplified.  I'd like to discuss more on these themes to see if this is
> an approach that helps as well.
>
>
> 		Amit

Vishal is still very interested in this and will probably be looking
into this while I push ahead assuming that KVM continues to use struct
pages. This was also brought up at the guest_memfd upstream call on
2025-02-06; people were interested in this and think that it will
simplify refcounting for merging and splitting.

I'll push ahead assuming that we use hugetlb as the source of 1G pages,
and assuming that KVM continues to use struct pages to describe guest
private memory.

The series will still be useful as an interim solution/prototype even if
other allocators are preferred and get merged. :)