This series introduces NUMA-aware memory placement support for KVM guests
with guest_memfd memory backends. It builds upon Fuad Tabba's work that
enabled host-mapping for guest_memfd memory [1].

== Background ==

KVM's guest_memfd memory backend currently lacks support for NUMA policy
enforcement, causing guest memory allocations to be distributed across host
nodes according to the kernel's default behavior, irrespective of any policy
specified by the VMM. This limitation arises because conventional userspace
NUMA control mechanisms like mbind(2) don't work, since the memory isn't
directly mapped to userspace when allocations occur.

Fuad's work [1] provides the necessary mmap capability, and this series
leverages it to enable mbind(2).

== Implementation ==

This series implements proper NUMA policy support for guest_memfd by:

1. Adding mempolicy-aware allocation APIs to the filemap layer.
2. Introducing custom inodes (via a dedicated slab-allocated inode cache,
   kvm_gmem_inode_info) to store NUMA policy and metadata for guest memory.
3. Implementing get/set_policy vm_ops in guest_memfd to support NUMA
   policy.

With these changes, VMMs can control guest memory placement by mapping the
guest_memfd file descriptor and using mbind(2) to specify:
- Policy modes: default, bind, interleave, or preferred
- Host NUMA nodes: list of target nodes for memory allocation

These policies affect only future allocations and do not migrate existing
memory. This matches mbind(2)'s default behavior, which affects only new
allocations unless overridden with the MPOL_MF_MOVE/MPOL_MF_MOVE_ALL flags
(not supported for guest_memfd, as it is unmovable by design).
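To make the intended usage concrete, below is a rough userspace sketch (not
taken from the series or its selftests) of how a VMM could bind future
guest_memfd allocations to a single host node. It assumes the guest_memfd fd
was created with the mmap support from Fuad's series enabled; the helper name
and error handling are illustrative only, and mbind(2) is called through the
libnuma <numaif.h> wrapper (link with -lnuma).

```c
/*
 * Illustrative only: bind future guest_memfd allocations to one NUMA node.
 * Assumes gmem_fd is a guest_memfd created with mmap support enabled
 * (per Fuad's host-mapping series); the flag used to create it is not shown.
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

static void bind_gmem_to_node(int gmem_fd, size_t size, int node)
{
	/* Map the fd so a memory policy can be attached to the range. */
	void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
			  gmem_fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap(guest_memfd)");
		exit(EXIT_FAILURE);
	}

	/* Restrict future allocations in this range to the chosen node. */
	unsigned long nodemask = 1UL << node;
	if (mbind(addr, size, MPOL_BIND, &nodemask,
		  8 * sizeof(nodemask), 0) < 0) {
		perror("mbind(MPOL_BIND)");
		exit(EXIT_FAILURE);
	}

	/*
	 * No MPOL_MF_MOVE here: as described above, the policy only steers
	 * allocations that have not happened yet.
	 */
	munmap(addr, size);
}
```

Because the series stores the policy in the guest_memfd inode's shared policy
rather than in the VMA, the binding is expected to keep steering later folio
allocations for that file range (e.g. on guest faults) even after the mapping
is torn down.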
== Upstream Plan ==

Phased approach as per David's guest_memfd extension overview [2] and
community calls [3]:

Phase 1 (this series):
1. Focuses on shared guest_memfd support (non-CoCo VMs).
2. Builds on Fuad's host-mapping work.

Phase 2 (future work):
1. NUMA support for private guest_memfd (CoCo VMs).
2. Depends on SNP in-place conversion support [4].

This series provides a clean integration path for NUMA-aware memory
management for guest_memfd and lays the groundwork for future confidential
computing NUMA capabilities.

Please review and provide feedback!

Thanks,
Shivank

== Changelog ==

- v1,v2: Extended the KVM_CREATE_GUEST_MEMFD ioctl to pass the mempolicy.
- v3: Introduced an fbind() syscall for VMM memory-placement configuration.
- v4-v6: Current approach using shared_policy support and vm_ops (based on
  suggestions from David [5] and the guest_memfd bi-weekly upstream call
  discussion [6]).
- v7: Use inodes to store the NUMA policy instead of the file [7].
- v8: Rebase on top of Fuad's V12: host mmapping for guest_memfd memory.
- v9: Rebase on top of Fuad's V13 and incorporate review comments.

[1] https://lore.kernel.org/all/20250709105946.4009897-1-tabba@google.com
[2] https://lore.kernel.org/all/c1c9591d-218a-495c-957b-ba356c8f8e09@redhat.com
[3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?tab=t.0#heading=h.svcbod20b5ur
[4] https://lore.kernel.org/all/20250613005400.3694904-1-michael.roth@amd.com
[5] https://lore.kernel.org/all/6fbef654-36e2-4be5-906e-2a648a845278@redhat.com
[6] https://lore.kernel.org/all/2b77e055-98ac-43a1-a7ad-9f9065d7f38f@amd.com
[7] https://lore.kernel.org/all/diqzbjumm167.fsf@ackerleytng-ctop.c.googlers.com

Ackerley Tng (1):
  KVM: guest_memfd: Use guest mem inodes instead of anonymous inodes

Matthew Wilcox (Oracle) (2):
  mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio()
  mm/filemap: Extend __filemap_get_folio() to support NUMA memory policies

Shivank Garg (4):
  mm/mempolicy: Export memory policy symbols
  KVM: guest_memfd: Add slab-allocated inode cache
  KVM: guest_memfd: Enforce NUMA mempolicy using shared policy
  KVM: guest_memfd: selftests: Add tests for mmap and NUMA policy support

 fs/bcachefs/fs-io-buffered.c                  |   2 +-
 fs/btrfs/compression.c                        |   4 +-
 fs/btrfs/verity.c                             |   2 +-
 fs/erofs/zdata.c                              |   2 +-
 fs/f2fs/compress.c                            |   2 +-
 include/linux/pagemap.h                       |  18 +-
 include/uapi/linux/magic.h                    |   1 +
 mm/filemap.c                                  |  23 +-
 mm/mempolicy.c                                |   6 +
 mm/readahead.c                                |   2 +-
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../testing/selftests/kvm/guest_memfd_test.c  | 122 ++++++++-
 virt/kvm/guest_memfd.c                        | 255 ++++++++++++++++--
 virt/kvm/kvm_main.c                           |   7 +-
 virt/kvm/kvm_mm.h                             |  10 +-
 15 files changed, 408 insertions(+), 49 deletions(-)

-- 
2.43.0

---

== Earlier Postings ==

v8: https://lore.kernel.org/all/20250618112935.7629-1-shivankg@amd.com
v7: https://lore.kernel.org/all/20250408112402.181574-1-shivankg@amd.com
v6: https://lore.kernel.org/all/20250226082549.6034-1-shivankg@amd.com
v5: https://lore.kernel.org/all/20250219101559.414878-1-shivankg@amd.com
v4: https://lore.kernel.org/all/20250210063227.41125-1-shivankg@amd.com
v3: https://lore.kernel.org/all/20241105164549.154700-1-shivankg@amd.com
v2: https://lore.kernel.org/all/20240919094438.10987-1-shivankg@amd.com
v1: https://lore.kernel.org/all/20240916165743.201087-1-shivankg@amd.com
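For readers less familiar with the mempolicy plumbing this builds on, below is
a minimal kernel-side sketch of what inode-backed get/set_policy vm_ops can
look like when implemented with the existing shared-policy helpers, modeled on
shmem. The kvm_gmem_* names, the GMEM_I() accessor and the struct layout are
placeholders for illustration, not code lifted from the patches.

```c
/*
 * Illustrative sketch, modeled on shmem: an inode-backed NUMA policy
 * wired up through the get/set_policy vm_ops. Names and layout are
 * placeholders; the actual patches may differ.
 */
#include <linux/fs.h>
#include <linux/mempolicy.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

struct kvm_gmem_inode_info {
	struct shared_policy policy;	/* per-inode policy, keyed by pgoff */
	struct inode vfs_inode;
};

static inline struct kvm_gmem_inode_info *GMEM_I(struct inode *inode)
{
	return container_of(inode, struct kvm_gmem_inode_info, vfs_inode);
}

/* mbind(2) on a guest_memfd mapping lands here and records the policy. */
static int kvm_gmem_set_policy(struct vm_area_struct *vma,
			       struct mempolicy *mpol)
{
	struct inode *inode = file_inode(vma->vm_file);

	return mpol_set_shared_policy(&GMEM_I(inode)->policy, vma, mpol);
}

/* Allocation paths look the policy up by file offset. */
static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
					     unsigned long addr, pgoff_t *ilx)
{
	struct inode *inode = file_inode(vma->vm_file);
	pgoff_t index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;

	*ilx = inode->i_ino;	/* bias interleave per inode, as shmem does */
	return mpol_shared_policy_lookup(&GMEM_I(inode)->policy, index);
}
```

Keeping the policy in a shared_policy hanging off the inode, rather than
per-VMA, is what lets it survive munmap() and be consulted later from the
filemap allocation paths that the first two patches make mempolicy-aware.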
On 13.07.25 19:43, Shivank Garg wrote:
> This series introduces NUMA-aware memory placement support for KVM guests
> with guest_memfd memory backends. It builds upon Fuad Tabba's work that
> enabled host-mapping for guest_memfd memory [1].
>
> == Background ==
> KVM's guest_memfd memory backend currently lacks support for NUMA policy
> enforcement, causing guest memory allocations to be distributed across host
> nodes according to the kernel's default behavior, irrespective of any policy
> specified by the VMM. This limitation arises because conventional userspace
> NUMA control mechanisms like mbind(2) don't work, since the memory isn't
> directly mapped to userspace when allocations occur.
> Fuad's work [1] provides the necessary mmap capability, and this series
> leverages it to enable mbind(2).
>
> == Implementation ==
>
> This series implements proper NUMA policy support for guest_memfd by:
>
> 1. Adding mempolicy-aware allocation APIs to the filemap layer.
> 2. Introducing custom inodes (via a dedicated slab-allocated inode cache,
>    kvm_gmem_inode_info) to store NUMA policy and metadata for guest memory.
> 3. Implementing get/set_policy vm_ops in guest_memfd to support NUMA
>    policy.
>
> With these changes, VMMs can control guest memory placement by mapping the
> guest_memfd file descriptor and using mbind(2) to specify:
> - Policy modes: default, bind, interleave, or preferred
> - Host NUMA nodes: list of target nodes for memory allocation
>
> These policies affect only future allocations and do not migrate existing
> memory. This matches mbind(2)'s default behavior, which affects only new
> allocations unless overridden with the MPOL_MF_MOVE/MPOL_MF_MOVE_ALL flags
> (not supported for guest_memfd, as it is unmovable by design).
>
> == Upstream Plan ==
> Phased approach as per David's guest_memfd extension overview [2] and
> community calls [3]:
>
> Phase 1 (this series):
> 1. Focuses on shared guest_memfd support (non-CoCo VMs).
> 2. Builds on Fuad's host-mapping work.

Just to clarify: this is based on Fuad's stage 1 and should probably still be
tagged "RFC" until stage-1 is finally upstream.

(I was hoping stage-1 would go upstream in 6.17, but I am not sure yet if that
is still feasible looking at the never-ending review)

I'm surprised to see that

  commit cbe4134ea4bc493239786220bd69cb8a13493190
  Author: Shivank Garg <shivankg@amd.com>
  Date:   Fri Jun 20 07:03:30 2025 +0000

      fs: export anon_inode_make_secure_inode() and fix secretmem LSM bypass

was merged with the kvm export

  EXPORT_SYMBOL_GPL_FOR_MODULES(anon_inode_make_secure_inode, "kvm");

I thought I commented that this is something to be done separately and not
really "fix" material.

Anyhow, good for this series, no need to touch that.

-- 
Cheers,

David / dhildenb
On 7/22/2025 8:10 PM, David Hildenbrand wrote:
> On 13.07.25 19:43, Shivank Garg wrote:
>> This series introduces NUMA-aware memory placement support for KVM guests
>> with guest_memfd memory backends. It builds upon Fuad Tabba's work that
>> enabled host-mapping for guest_memfd memory [1].
>>
>> == Background ==
>> KVM's guest_memfd memory backend currently lacks support for NUMA policy
>> enforcement, causing guest memory allocations to be distributed across host
>> nodes according to the kernel's default behavior, irrespective of any policy
>> specified by the VMM. This limitation arises because conventional userspace
>> NUMA control mechanisms like mbind(2) don't work, since the memory isn't
>> directly mapped to userspace when allocations occur.
>> Fuad's work [1] provides the necessary mmap capability, and this series
>> leverages it to enable mbind(2).
>>
>> == Implementation ==
>>
>> This series implements proper NUMA policy support for guest_memfd by:
>>
>> 1. Adding mempolicy-aware allocation APIs to the filemap layer.
>> 2. Introducing custom inodes (via a dedicated slab-allocated inode cache,
>>    kvm_gmem_inode_info) to store NUMA policy and metadata for guest memory.
>> 3. Implementing get/set_policy vm_ops in guest_memfd to support NUMA
>>    policy.
>>
>> With these changes, VMMs can control guest memory placement by mapping the
>> guest_memfd file descriptor and using mbind(2) to specify:
>> - Policy modes: default, bind, interleave, or preferred
>> - Host NUMA nodes: list of target nodes for memory allocation
>>
>> These policies affect only future allocations and do not migrate existing
>> memory. This matches mbind(2)'s default behavior, which affects only new
>> allocations unless overridden with the MPOL_MF_MOVE/MPOL_MF_MOVE_ALL flags
>> (not supported for guest_memfd, as it is unmovable by design).
>>
>> == Upstream Plan ==
>> Phased approach as per David's guest_memfd extension overview [2] and
>> community calls [3]:
>>
>> Phase 1 (this series):
>> 1. Focuses on shared guest_memfd support (non-CoCo VMs).
>> 2. Builds on Fuad's host-mapping work.
>
> Just to clarify: this is based on Fuad's stage 1 and should probably still be
> tagged "RFC" until stage-1 is finally upstream.
>

Sure.

> (I was hoping stage-1 would go upstream in 6.17, but I am not sure yet if that
> is still feasible looking at the never-ending review)
>
> I'm surprised to see that
>
>   commit cbe4134ea4bc493239786220bd69cb8a13493190
>   Author: Shivank Garg <shivankg@amd.com>
>   Date:   Fri Jun 20 07:03:30 2025 +0000
>
>       fs: export anon_inode_make_secure_inode() and fix secretmem LSM bypass
>
> was merged with the kvm export
>
>   EXPORT_SYMBOL_GPL_FOR_MODULES(anon_inode_make_secure_inode, "kvm");
>
> I thought I commented that this is something to be done separately and not
> really "fix" material.
>
> Anyhow, good for this series, no need to touch that.
>

Yeah, V2 got merged instead of V3:
https://lore.kernel.org/all/1ab3381b-1620-485d-8e1b-fff2c48d45c3@amd.com
but backporting did not cause any issues either.

Thank you for the reviews :)

Best Regards,
Shivank
On Tue, Jul 22, 2025, David Hildenbrand wrote:
> Just to clarify: this is based on Fuad's stage 1 and should probably still be
> tagged "RFC" until stage-1 is finally upstream.
>
> (I was hoping stage-1 would go upstream in 6.17, but I am not sure yet if that
> is still feasible looking at the never-ending review)

6.17 is very doable.
On 22.07.25 16:45, Sean Christopherson wrote:
> On Tue, Jul 22, 2025, David Hildenbrand wrote:
>> Just to clarify: this is based on Fuad's stage 1 and should probably still be
>> tagged "RFC" until stage-1 is finally upstream.
>>
>> (I was hoping stage-1 would go upstream in 6.17, but I am not sure yet if that
>> is still feasible looking at the never-ending review)
>
> 6.17 is very doable.

I like your optimism :)

-- 
Cheers,

David / dhildenb
On Tue, Jul 22, 2025, David Hildenbrand wrote:
> On 22.07.25 16:45, Sean Christopherson wrote:
> > On Tue, Jul 22, 2025, David Hildenbrand wrote:
> > > Just to clarify: this is based on Fuad's stage 1 and should probably still
> > > be tagged "RFC" until stage-1 is finally upstream.
> > >
> > > (I was hoping stage-1 would go upstream in 6.17, but I am not sure yet if
> > > that is still feasible looking at the never-ending review)
> >
> > 6.17 is very doable.
>
> I like your optimism :)

I'm not optimistic, just incompetent.  I forgot what kernel we're on.  **6.18**
is very doable, 6.17 not so much.
On 23.07.25 01:07, Sean Christopherson wrote:
> On Tue, Jul 22, 2025, David Hildenbrand wrote:
>> On 22.07.25 16:45, Sean Christopherson wrote:
>>> On Tue, Jul 22, 2025, David Hildenbrand wrote:
>>>> Just to clarify: this is based on Fuad's stage 1 and should probably still
>>>> be tagged "RFC" until stage-1 is finally upstream.
>>>>
>>>> (I was hoping stage-1 would go upstream in 6.17, but I am not sure yet if
>>>> that is still feasible looking at the never-ending review)
>>>
>>> 6.17 is very doable.
>>
>> I like your optimism :)
>
> I'm not optimistic, just incompetent.

Well, I wouldn't agree with that :)

> I forgot what kernel we're on. **6.18** is very doable, 6.17 not so much.

Yes, probably best to target 6.18 rather than rushing this into the upcoming
MR.

-- 
Cheers,

David / dhildenb