Hello,
This patchset is our exploration of how to support 1G pages in guest_memfd, and
how the pages will be used in Confidential VMs.
The patchset covers:
+ How to get 1G pages
+ Allowing mmap() of guest_memfd to userspace so that both private and shared
memory can use the same physical pages
+ Splitting and reconstructing pages to support conversions and mmap()
+ How the VM, userspace and guest_memfd interact to support conversions
+ Selftests to test all the above
+ Selftests also demonstrate the conversion flow between VM, userspace and
guest_memfd.
Why 1G pages in guest_memfd?
To bring guest_memfd to performance and memory-savings parity with VMs that are
backed by HugeTLBfs.
+ Performance is improved with 1G pages through more TLB hits and faster page
walks on TLB misses.
+ Memory savings from 1G pages come from HugeTLB Vmemmap Optimization (HVO).
Options for 1G page support:
1. HugeTLB
2. Contiguous Memory Allocator (CMA)
3. Other suggestions are welcome!
Comparison between options:
1. HugeTLB
+ Refactor HugeTLB to separate allocator from the rest of HugeTLB
+ Pro: Graceful transition for VMs backed with HugeTLB to guest_memfd
+ Near term: Allows co-tenancy of HugeTLB and guest_memfd backed VMs
+ Pro: Can provide iterative steps toward new future allocator
+ Unexplored: Managing userspace-visible changes
+ e.g. HugeTLB's free_hugepages will decrease if HugeTLB is used,
but not when a future allocator is used
2. CMA
+ Port some HugeTLB features to be applied on CMA
+ Pro: Clean slate
What would refactoring HugeTLB involve?
(Some refactoring was done in this RFC, more can be done.)
1. Broadly involves separating the HugeTLB allocator from the rest of HugeTLB
+ Brings more modularity to HugeTLB
+ No functionality change intended
+ Likely step towards HugeTLB's integration into core-mm
2. guest_memfd will use just the allocator component of HugeTLB, not including
the complex parts of HugeTLB like
+ Userspace reservations (resv_map)
+ Shared PMD mappings
+ Special page walkers
What features would need to be ported to CMA?
+ Improved allocation guarantees
+ Per NUMA node pool of huge pages
+ Subpools per guest_memfd
+ Memory savings
+ Something like HugeTLB Vmemmap Optimization
+ Configuration/reporting features
+ Configuration of number of pages available (and per NUMA node) at and
after host boot
+ Reporting of memory usage/availability statistics at runtime
HugeTLB was picked as the source of 1G pages for this RFC because it allows a
graceful transition, and retains memory savings from HVO.
To illustrate this: if a host machine uses HugeTLBfs to back VMs, and a
confidential VM were scheduled on that host, some HugeTLBfs pages would have to
be given up and returned to CMA so that guest_memfd pages could be rebuilt from
that memory. Extra memory has to be set aside so that HVO can be removed and
then reapplied on the new guest_memfd memory. This not only slows down memory
allocation but also trims the benefits of HVO, and the host would have to keep
memory reserved to facilitate these transitions.
Improving how guest_memfd uses the allocator in a future revision of this RFC:
To provide an easier transition away from HugeTLB, guest_memfd's use of HugeTLB
should be limited to these allocator functions (a C sketch follows the list):
+ reserve(node, page_size, num_pages) => opaque handle
+ Used when a guest_memfd inode is created to reserve memory from backend
allocator
+ allocate(handle, mempolicy, page_size) => folio
+ To allocate a folio from guest_memfd's reservation
+ split(handle, folio, target_page_size) => void
+ To take a huge folio, split it into smaller folios, and restore them to
the filemap
+ reconstruct(handle, first_folio, nr_pages) => void
+ To take nr_pages folios starting at first_folio and reconstruct a huge
folio from them
+ free(handle, folio) => void
+ To return folio to guest_memfd's reservation
+ error(handle, folio) => void
+ To handle memory errors
+ unreserve(handle) => void
+ To return guest_memfd's reservation to allocator backend
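As a concrete (and purely illustrative) example, the interface above could be
expressed as a set of ops roughly like the C sketch below. The struct and
member names are hypothetical; they only transcribe the operations listed
above and are not code from this patchset.

/*
 * Illustrative sketch only: one possible shape for the backend-allocator
 * interface described above. All names here are hypothetical.
 */
struct gmem_allocator_handle;	/* opaque, returned by ->reserve() */

struct gmem_allocator_ops {
	/* Reserve num_pages pages of page_size bytes on node at inode creation. */
	struct gmem_allocator_handle *(*reserve)(int node, size_t page_size,
						 unsigned long num_pages);
	/* Allocate one folio from the guest_memfd's reservation. */
	struct folio *(*allocate)(struct gmem_allocator_handle *handle,
				  struct mempolicy *mpol, size_t page_size);
	/* Split a huge folio into target_page_size folios, restoring them to the filemap. */
	void (*split)(struct gmem_allocator_handle *handle,
		      struct folio *folio, size_t target_page_size);
	/* Reconstruct a huge folio from nr_pages folios starting at first_folio. */
	void (*reconstruct)(struct gmem_allocator_handle *handle,
			    struct folio *first_folio, unsigned long nr_pages);
	/* Return a folio to the guest_memfd's reservation. */
	void (*free)(struct gmem_allocator_handle *handle, struct folio *folio);
	/* Handle a memory error on a folio. */
	void (*error)(struct gmem_allocator_handle *handle, struct folio *folio);
	/* Return the entire reservation to the backend allocator. */
	void (*unreserve)(struct gmem_allocator_handle *handle);
};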
Userspace should only provide a page size when creating a guest_memfd and should
not have to specify HugeTLB.
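For illustration, creation from userspace already goes through the
KVM_CREATE_GUEST_MEMFD ioctl and struct kvm_create_guest_memfd (existing KVM
UAPI); under this proposal the only additional input would be a page-size
flag. KVM_GUEST_MEMFD_HUGE_1GB below is a placeholder name, not the flag this
series actually defines.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Placeholder value for illustration only; see the uapi changes in this
 * series for the real flag names.
 */
#ifndef KVM_GUEST_MEMFD_HUGE_1GB
#define KVM_GUEST_MEMFD_HUGE_1GB	(1ULL << 1)
#endif

/* Sketch: userspace states only the desired page size at creation time. */
static int create_gmem_1g(int vm_fd, uint64_t size)
{
	struct kvm_create_guest_memfd args = {
		.size = size,			/* must be 1G-aligned */
		.flags = KVM_GUEST_MEMFD_HUGE_1GB,
	};

	return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
}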
Overview of patches:
+ Patches 01-12
+ Many small changes to HugeTLB, mostly to separate HugeTLBfs concepts from
HugeTLB, and to expose HugeTLB functions.
+ Patches 13-16
+ Letting guest_memfd use HugeTLB
+ Creation of each guest_memfd reserves pages from HugeTLB's global hstate
and puts them into the guest_memfd inode's subpool
+ Each folio allocation takes a page from the guest_memfd inode's subpool
+ Patches 17-21
+ Selftests for new HugeTLB features in guest_memfd
+ Patches 22-24
+ More small changes on the HugeTLB side to expose functions needed by
guest_memfd
+ Patch 25:
+ Uses the newly available functions from patches 22-24 to split HugeTLB
pages. In this patch, HugeTLB folios are always split to 4K before any
usage, private or shared.
+ Patches 26-28
+ Allow mmap() in guest_memfd and faulting in shared pages (see the
userspace sketch after this list)
+ Patch 29
+ Enables conversion between private/shared pages
+ Patch 30
+ Required to zero folios after conversions to avoid leaking initialized
kernel memory
+ Patches 31-38
+ Add selftests for mapping pages to userspace and guest/host memory
sharing, and update the conversion tests
+ Patch 33 illustrates the conversion flow between VM/userspace/guest_memfd
+ Patch 39
+ Dynamically split and reconstruct HugeTLB pages instead of always
splitting before use. All earlier selftests are expected to still pass.
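To make the mmap() portion concrete, here is a minimal userspace sketch of
what patches 26-28 enable: mapping a guest_memfd and touching a
currently-shared range. This only works with the mmap() support added by this
series (upstream guest_memfd does not support mmap()), and error handling is
kept minimal.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Sketch: with the mmap() support from patches 26-28, userspace can map a
 * guest_memfd and fault in ranges that are currently shared. gmem_fd is a
 * file descriptor returned by KVM_CREATE_GUEST_MEMFD.
 */
static void *map_gmem_shared(int gmem_fd, size_t size)
{
	void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
			 gmem_fd, 0);

	if (mem == MAP_FAILED) {
		perror("mmap(guest_memfd)");
		return NULL;
	}

	/* Fault in and initialize one shared page. */
	memset(mem, 0, 4096);
	return mem;
}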
TODOs:
+ Add logic to wait for safe_refcount [1]
+ Look into lazy splitting/reconstruction of pages
+ Currently, when the KVM_SET_MEMORY_ATTRIBUTES ioctl is invoked, not only
are the mem_attr_array and faultability updated, but the pages in the
requested range are also split/reconstructed as necessary. We want to look
into delaying splitting/reconstruction to fault time.
+ Solve race between folios being faulted in and being truncated
+ When running private_mem_conversions_test with more than 1 vCPU, a folio
that is being truncated may be faulted in by another process, causing an
elevated mapcount when the folio is freed (VM_BUG_ON_FOLIO).
+ Add intermediate splits (1G should first split to 2M and not split directly to
4K)
+ Use guest's lock instead of hugetlb_lock
+ Use multi-index xarray/replace xarray with some other data struct for
faultability flag
+ Refactor HugeTLB better, present generic allocator interface
Please let us know your thoughts on:
+ HugeTLB as the choice of transitional allocator backend
+ Refactoring HugeTLB to provide generic allocator interface
+ Shared/private conversion flow
+ Requiring the user to ask the kernel to unmap pages from userspace using
madvise(MADV_DONTNEED) (see the sketch after this list)
+ Failing conversion on elevated mapcounts/pincounts/refcounts
+ Process of splitting/reconstructing page
+ Anything else!
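For reference, a hedged userspace sketch of a shared-to-private conversion
under this proposal follows. KVM_SET_MEMORY_ATTRIBUTES, struct
kvm_memory_attributes and KVM_MEMORY_ATTRIBUTE_PRIVATE are existing KVM UAPI;
the ordering (madvise(MADV_DONTNEED) first, so mapcounts are not elevated,
then the attribute ioctl) reflects the flow proposed in this RFC and may
change.

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/*
 * Sketch of a shared -> private conversion as proposed in this RFC:
 * 1. Drop userspace mappings of the range (MADV_DONTNEED), since conversion
 *    is expected to fail on elevated mapcounts/pincounts/refcounts.
 * 2. Ask KVM to mark the range private via KVM_SET_MEMORY_ATTRIBUTES;
 *    guest_memfd then splits/reconstructs folios as needed.
 * 'addr' is where the guest_memfd range is mmap()ed; 'gpa' is the guest
 * physical address of the same range.
 */
static int convert_to_private(int vm_fd, void *addr, uint64_t gpa,
			      uint64_t size)
{
	struct kvm_memory_attributes attrs = {
		.address = gpa,
		.size = size,
		.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
	};

	if (madvise(addr, size, MADV_DONTNEED))
		return -1;

	return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}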
[1] https://lore.kernel.org/all/20240829-guest-memfd-lib-v2-0-b9afc1ff3656@quicinc.com/T/
Ackerley Tng (37):
mm: hugetlb: Simplify logic in dequeue_hugetlb_folio_vma()
mm: hugetlb: Refactor vma_has_reserves() to should_use_hstate_resv()
mm: hugetlb: Remove unnecessary check for avoid_reserve
mm: mempolicy: Refactor out policy_node_nodemask()
mm: hugetlb: Refactor alloc_buddy_hugetlb_folio_with_mpol() to
interpret mempolicy instead of vma
mm: hugetlb: Refactor dequeue_hugetlb_folio_vma() to use mpol
mm: hugetlb: Refactor out hugetlb_alloc_folio
mm: truncate: Expose preparation steps for truncate_inode_pages_final
mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages()
mm: hugetlb: Add option to create new subpool without using surplus
mm: hugetlb: Expose hugetlb_acct_memory()
mm: hugetlb: Move and expose hugetlb_zero_partial_page()
KVM: guest_memfd: Make guest mem use guest mem inodes instead of
anonymous inodes
KVM: guest_memfd: hugetlb: initialization and cleanup
KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
KVM: guest_memfd: Add page alignment check for hugetlb guest_memfd
KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd
KVM: selftests: Support various types of backing sources for private
memory
KVM: selftests: Update test for various private memory backing source
types
KVM: selftests: Add private_mem_conversions_test.sh
KVM: selftests: Test that guest_memfd usage is reported via hugetlb
mm: hugetlb: Expose vmemmap optimization functions
mm: hugetlb: Expose HugeTLB functions for promoting/demoting pages
mm: hugetlb: Add functions to add/move/remove from hugetlb lists
KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
KVM: guest_memfd: Allow mmapping guest_memfd files
KVM: guest_memfd: Use vm_type to determine default faultability
KVM: Handle conversions in the SET_MEMORY_ATTRIBUTES ioctl
KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
KVM: selftests: Allow vm_set_memory_attributes to be used without
asserting return value of 0
KVM: selftests: Test using guest_memfd memory from userspace
KVM: selftests: Test guest_memfd memory sharing between guest and host
KVM: selftests: Add notes in private_mem_kvm_exits_test for mmap-able
guest_memfd
KVM: selftests: Test that pinned pages block KVM from setting memory
attributes to PRIVATE
KVM: selftests: Refactor vm_mem_add to be more flexible
KVM: selftests: Add helper to perform madvise by memslots
KVM: selftests: Update private_mem_conversions_test for mmap()able
guest_memfd
Vishal Annapurve (2):
KVM: guest_memfd: Split HugeTLB pages for guest_memfd use
KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
fs/hugetlbfs/inode.c | 35 +-
include/linux/hugetlb.h | 54 +-
include/linux/kvm_host.h | 1 +
include/linux/mempolicy.h | 2 +
include/linux/mm.h | 1 +
include/uapi/linux/kvm.h | 26 +
include/uapi/linux/magic.h | 1 +
mm/hugetlb.c | 346 ++--
mm/hugetlb_vmemmap.h | 11 -
mm/mempolicy.c | 36 +-
mm/truncate.c | 26 +-
tools/include/linux/kernel.h | 4 +-
tools/testing/selftests/kvm/Makefile | 3 +
.../kvm/guest_memfd_hugetlb_reporting_test.c | 222 +++
.../selftests/kvm/guest_memfd_pin_test.c | 104 ++
.../selftests/kvm/guest_memfd_sharing_test.c | 160 ++
.../testing/selftests/kvm/guest_memfd_test.c | 238 ++-
.../testing/selftests/kvm/include/kvm_util.h | 45 +-
.../testing/selftests/kvm/include/test_util.h | 18 +
tools/testing/selftests/kvm/lib/kvm_util.c | 443 +++--
tools/testing/selftests/kvm/lib/test_util.c | 99 ++
.../kvm/x86_64/private_mem_conversions_test.c | 158 +-
.../x86_64/private_mem_conversions_test.sh | 91 +
.../kvm/x86_64/private_mem_kvm_exits_test.c | 11 +-
virt/kvm/guest_memfd.c | 1563 ++++++++++++++++-
virt/kvm/kvm_main.c | 17 +
virt/kvm/kvm_mm.h | 16 +
27 files changed, 3288 insertions(+), 443 deletions(-)
create mode 100644 tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
create mode 100644 tools/testing/selftests/kvm/guest_memfd_pin_test.c
create mode 100644 tools/testing/selftests/kvm/guest_memfd_sharing_test.c
create mode 100755 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh
--
2.46.0.598.g6f2099f65c-goog
Cc Oscar for awareness
On Tue 10-09-24 23:43:31, Ackerley Tng wrote:
> Hello,
>
> This patchset is our exploration of how to support 1G pages in guest_memfd, and
> how the pages will be used in Confidential VMs.
>
> [full cover letter quoted; trimmed]
--
Michal Hocko
SUSE Labs
> -----Original Message-----
> From: Ackerley Tng <ackerleytng@google.com>
> Sent: Wednesday, September 11, 2024 7:44 AM
> To: tabba@google.com; quic_eberman@quicinc.com; roypat@amazon.co.uk;
> jgg@nvidia.com; peterx@redhat.com; david@redhat.com;
> rientjes@google.com; fvdl@google.com; jthoughton@google.com;
> seanjc@google.com; pbonzini@redhat.com; Li, Zhiquan1
> <zhiquan1.li@intel.com>; Du, Fan <fan.du@intel.com>; Miao, Jun
> <jun.miao@intel.com>; Yamahata, Isaku <isaku.yamahata@intel.com>;
> muchun.song@linux.dev; mike.kravetz@oracle.com
> Cc: Aktas, Erdem <erdemaktas@google.com>; Annapurve, Vishal
> <vannapurve@google.com>; ackerleytng@google.com; qperret@google.com;
> jhubbard@nvidia.com; willy@infradead.org; shuah@kernel.org;
> brauner@kernel.org; bfoster@redhat.com; kent.overstreet@linux.dev;
> pvorel@suse.cz; rppt@kernel.org; richard.weiyang@gmail.com;
> anup@brainfault.org; Xu, Haibo1 <haibo1.xu@intel.com>;
> ajones@ventanamicro.com; vkuznets@redhat.com; Wieczor-Retman, Maciej
> <maciej.wieczor-retman@intel.com>; pgonda@google.com;
> oliver.upton@linux.dev; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> kvm@vger.kernel.org; linux-kselftest@vger.kernel.org; linux-
> fsdevel@kvack.org
> Subject: [RFC PATCH 00/39] 1G page support for guest_memfd
>
> Hello,
>
> This patchset is our exploration of how to support 1G pages in guest_memfd,
> and
> how the pages will be used in Confidential VMs.
>
> The patchset covers:
>
> + How to get 1G pages
> + Allowing mmap() of guest_memfd to userspace so that both private and
> shared
Hi Ackerley,
Thanks for posting the new version :)
W.r.t. the above description and the patch snippet below from Patches 26-29:
does this new design aim to back both shared and private GPAs with a single
hugetlb subpool equal to the VM instance's total memory?
By my understanding, before these changes, the shared memfd and the gmem fd
each had a dedicated hugetlb pool, i.e. two copies/reservations of the
hugetlb subpool.
Does Qemu require new changes as well? I'd like to test this series
if you can share a Qemu branch.
> + Patches 26-28
> + Allow mmap() in guest_memfd and faulting in shared pages
> + Patch 29
> + Enables conversion between private/shared pages
Thanks!
> [...]
On Fri, Sep 13, 2024 at 6:08 PM Du, Fan <fan.du@intel.com> wrote:
>
> [...]
>
> Hi Ackerley,
>
> Thanks for posting the new version :)
>
> W.r.t. the above description and the patch snippet below from Patches 26-29:
> does this new design aim to back both shared and private GPAs with a single
> hugetlb subpool equal to the VM instance's total memory?

Yes.

> By my understanding, before these changes, the shared memfd and the gmem fd
> each had a dedicated hugetlb pool, i.e. two copies/reservations of the
> hugetlb subpool.

Selftests attached to this series use a single gmem fd to back guest memory.

> Does Qemu require new changes as well? I'd like to test this series
> if you can share a Qemu branch.

We are going to discuss this RFC series and related issues at LPC. Once the
next steps are finalized, the plan will be to send out an improved version.
You can use/modify the selftests that are part of this series to test this
feature with software protected VMs for now.

Qemu will require changes for this feature on top of the already floated gmem
integration series [1] that adds software protected VM support to Qemu.

If you are interested in testing this feature with TDX VMs then it needs
multiple series to set up the right test environment (including [2]). We
haven't considered posting Qemu patches and it will be a while before we can
get to it.

[1] https://patchew.org/QEMU/20230914035117.3285885-1-xiaoyao.li@intel.com/
[2] https://patchwork.kernel.org/project/kvm/cover/20231115071519.2864957-1-xiaoyao.li@intel.com/
Hey Ackerley,

On Tue, 2024-09-10 at 23:43 +0000, Ackerley Tng wrote:
> Hello,
>
> This patchset is our exploration of how to support 1G pages in
> guest_memfd, and
> how the pages will be used in Confidential VMs.

We've discussed this patchset at LPC and in the guest-memfd calls. Can
you please summarise the discussions here as a follow-up, so we can
also continue discussing on-list, and not repeat things that are
already discussed?

Also - as mentioned in those meetings, we at AMD are interested in this
series along with SEV-SNP support - and I'm also interested in figuring
out how we collaborate on the evolution of this series.

Thanks,

		Amit
Amit Shah <amit@infradead.org> writes:
> Hey Ackerley,
Hi Amit,
> On Tue, 2024-09-10 at 23:43 +0000, Ackerley Tng wrote:
>> Hello,
>>
>> This patchset is our exploration of how to support 1G pages in
>> guest_memfd, and
>> how the pages will be used in Confidential VMs.
>
> We've discussed this patchset at LPC and in the guest-memfd calls. Can
> you please summarise the discussions here as a follow-up, so we can
> also continue discussing on-list, and not repeat things that are
> already discussed?
Thanks for this question! Since LPC, Vishal and I have been tied up with
some Google internal work, which slowed down progress on 1G page support
for guest_memfd. We will have progress this quarter and the next few
quarters on 1G page support for guest_memfd.
The related updates are
1. No objections to using hugetlb as the source of 1G pages.
2. Prerequisite hugetlb changes.
+ I've separated some of the prerequisite hugetlb changes into another
patch series hoping to have them merged ahead of and separately from
this patchset [1].
+ Peter Xu contributed a better patchset, including a bugfix [2].
+ I have an alternative [3].
+ The next revision of this series (1G page support for guest_memfd)
will be based on alternative [3]. I think there should be no issues
there.
+ I believe Peter is also waiting on the next revision before we make
further progress/decide on [2] or [3].
3. No objections to allowing mmap() of guest_memfd physical memory when the
memory is marked shared, to avoid double allocation.
4. No objections to splitting pages when they are marked shared.
5. folio_put() callback for guest_memfd folio cleanup/merging (a sketch
follows this list).
+ In Fuad's series [4], Fuad used the callback to reset the folio's
mappability status.
+ The catch is that the callback is only invoked when folio->page_type
== PGTY_guest_memfd, and folio->page_type is a union with folio's
mapcount, so any folio with a non-zero mapcount cannot have a valid
page_type.
+ I was concerned that we might not get a callback, and hence
unintentionally skip merging pages and not correctly restore hugetlb
pages
+ This was discussed at the last guest_memfd upstream call (2025-01-23
07:58 PST), and the conclusion is that using folio->page_type works,
because
+ We only merge folios in two cases: (1) when converting to private
(2) when truncating folios (removing from filemap).
+ When converting to private, in (1), we can forcibly unmap all the
converted pages or check if the mapcount is 0, and once mapcount
is 0 we can install the callback by setting folio->page_type =
PGTY_guest_memfd
+ When truncating, we will be unmapping the folios anyway, so
mapcount is also 0 and we can install the callback.
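A rough kernel-side sketch of that conclusion, for illustration only:
PGTY_guest_memfd is the name used in the discussion above, and
kvm_gmem_folio_set_guest_memfd() is a hypothetical helper standing in for
whatever Fuad's series [4] uses to set the page type.

/*
 * Illustration only. The folio_put() callback can only fire while
 * folio->page_type == PGTY_guest_memfd, and page_type aliases the mapcount,
 * so the type (and hence the callback) is installed only once the folio is
 * no longer mapped -- which holds in both the convert-to-private and the
 * truncation paths discussed above.
 */
static void kvm_gmem_install_put_callback(struct folio *folio)
{
	/* Both paths unmap first, so the mapcount must already be zero. */
	if (WARN_ON_ONCE(folio_mapped(folio)))
		return;

	kvm_gmem_folio_set_guest_memfd(folio);	/* hypothetical helper */
}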
Hope that covers the points that you're referring to. If there are other
parts that you'd like to know the status on, please let me know which
aspects those are!
> Also - as mentioned in those meetings, we at AMD are interested in this
> series along with SEV-SNP support - and I'm also interested in figuring
> out how we collaborate on the evolution of this series.
Thanks all your help and comments during the guest_memfd upstream calls,
and thanks for the help from AMD.
Extending mmap() support from Fuad with 1G page support introduces more
states, which makes it more complicated (at least for me).
I'm modeling the states in python so I can iterate more quickly. I also
have usage flows (e.g. allocate, guest_use, host_use,
transient_folio_get, close, transient_folio_put) as test cases.
I'm almost done with the model and my next steps are to write up a state
machine (like Fuad's [5]) and share that.
I'd be happy to share the python model too but I have to work through
some internal open-sourcing processes first, so if you think this will
be useful, let me know!
Then, I'll code it all up in a new revision of this series (target:
March 2025), which will be accompanied by source code on GitHub.
I'm happy to collaborate more closely, let me know if you have ideas for
collaboration!
> Thanks,
>
> Amit
[1] https://lore.kernel.org/all/cover.1728684491.git.ackerleytng@google.com/T/
[2] https://lore.kernel.org/all/20250107204002.2683356-1-peterx@redhat.com/T/
[3] https://lore.kernel.org/all/diqzjzayz5ho.fsf@ackerleytng-ctop.c.googlers.com/
[4] https://lore.kernel.org/all/20250117163001.2326672-1-tabba@google.com/T/
[5] https://lpc.events/event/18/contributions/1758/attachments/1457/3699/Guestmemfd%20folio%20state%20page_type.pdf
On Mon, 2025-02-03 at 08:35 +0000, Ackerley Tng wrote:
> Amit Shah <amit@infradead.org> writes:
>
> [...]
>
> Hope that covers the points that you're referring to. If there are other
> parts that you'd like to know the status on, please let me know which
> aspects those are!

Thank you for the nice summary!

> [...]
>
> I'd be happy to share the python model too but I have to work through
> some internal open-sourcing processes first, so if you think this will
> be useful, let me know!

No problem. Yes, I'm interested in this - it'll be helpful!

The other thing of note is that while we have the kernel patches, a
userspace to drive them and exercise them is currently missing.

> Then, I'll code it all up in a new revision of this series (target:
> March 2025), which will be accompanied by source code on GitHub.
>
> I'm happy to collaborate more closely, let me know if you have ideas for
> collaboration!

Thank you. I think currently the bigger problem we have is allocation of
hugepages -- which is also blocking a lot of the follow-on work. Vishal
briefly mentioned isolating pages from Linux entirely last time - that's
also what I'm interested in to figure out if we can completely bypass the
allocation problem by not allocating struct pages for non-host use pages
entirely. The guest_memfs/KHO/kexec/live-update patches also take this
approach on AWS (for how their VMs are launched). If we work with those
patches together, allocation of 1G hugepages is simplified. I'd like to
discuss more on these themes to see if this is an approach that helps as
well.

		Amit
Amit Shah <amit@infradead.org> writes:

>> <snip>
>>
>> I'm modeling the states in python so I can iterate more quickly. I also
>> have usage flows (e.g. allocate, guest_use, host_use,
>> transient_folio_get, close, transient_folio_put) as test cases.
>>
>> I'm almost done with the model and my next steps are to write up a
>> state machine (like Fuad's [5]) and share that.

Thanks everyone for all the comments at the 2025-02-06 guest_memfd upstream
call! Here are the

+ Slides:
  https://lpc.events/event/18/contributions/1764/attachments/1409/3704/guest-memfd-1g-page-support-2025-02-06.pdf
+ State diagram:
  https://lpc.events/event/18/contributions/1764/attachments/1409/3702/guest-memfd-state-diagram-split-merge-2025-02-06.drawio.svg
+ For those interested in editing the state diagram using draw.io:
  https://lpc.events/event/18/contributions/1764/attachments/1409/3703/guest-memfd-state-diagram-split-merge-2025-02-06.drawio.xml

>> I'd be happy to share the python model too but I have to work through
>> some internal open-sourcing processes first, so if you think this will
>> be useful, let me know!
>
> No problem. Yes, I'm interested in this - it'll be helpful!

I've started working through the internal processes and will update here
when I'm done!

> The other thing of note is that while we have the kernel patches, a
> userspace to drive them and exercise them is currently missing.

In this and future patch series, I'll have selftests that will exercise any
new functionality.

>> Then, I'll code it all up in a new revision of this series (target:
>> March 2025), which will be accompanied by source code on GitHub.
>>
>> I'm happy to collaborate more closely, let me know if you have ideas for
>> collaboration!
>
> Thank you. I think currently the bigger problem we have is allocation of
> hugepages -- which is also blocking a lot of the follow-on work. Vishal
> briefly mentioned isolating pages from Linux entirely last time - that's
> also what I'm interested in to figure out if we can completely bypass the
> allocation problem by not allocating struct pages for non-host use pages
> entirely. The guest_memfs/KHO/kexec/live-update patches also take this
> approach on AWS (for how their VMs are launched). If we work with those
> patches together, allocation of 1G hugepages is simplified. I'd like to
> discuss more on these themes to see if this is an approach that helps as
> well.

Vishal is still very interested in this and will probably be looking into
this while I push ahead assuming that KVM continues to use struct pages.
This was also brought up at the guest_memfd upstream call on 2025-02-06:
people were interested in this and think that it will simplify refcounting
for merging and splitting.

I'll push ahead assuming that we use hugetlb as the source of 1G pages, and
assuming that KVM continues to use struct pages to describe guest private
memory. The series will still be useful as an interim solution/prototype
even if other allocators are preferred and get merged. :)