 include/linux/mm.h            |  71 +++++++++++++++++++++
 include/linux/shmem_fs.h      |  14 -----
 include/linux/userfaultfd_k.h |  58 ++++------------
 mm/hugetlb.c                  |  19 ++++++
 mm/shmem.c                    |  28 ++++++++-
 mm/userfaultfd.c              | 115 +++++++++++++++++++++--------
 6 files changed, 217 insertions(+), 88 deletions(-)
[based on akpm/mm-new]

This series is an alternative proposal to what Nikita proposed here for the
initial three patches:

https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com

This is not yet relevant to any guest-memfd support, but paves the way for
it.  Here, the major goal is to allow kernel modules to opt in to any form
of userfaultfd support, like guest-memfd.  This alternative option should
hopefully be cleaner, and avoids leaking userfault details into
vm_ops.fault().

It also means this series does not depend on anything.  It is a pure
refactoring of userfaultfd internals to provide a generic API, so that
other types of files, especially RAM-based ones, can support userfaultfd
without touching mm/ at all.

To achieve that, this series introduces a file operation called
vm_uffd_ops.  The ops needs to be provided when a file type supports any
part of userfaultfd.

With that, I moved both hugetlbfs and shmem over.

Hugetlbfs is still very special in that it will only use part of the
vm_uffd_ops API, for a similar reason to why hugetlb_vm_op_fault() has a
BUG() and is so far hard-coded into core mm.  But this should still be an
improvement, because at least hugetlbfs is now always involved in feature
probing (e.g. it used to not support ZEROPAGE, and we had a hard-coded
check to fail that, among others).

Meanwhile, after this series shmem will be completely converted to the new
vm_uffd_ops API; the final vm_uffd_ops for shmem looks like this:

static const vm_uffd_ops shmem_uffd_ops = {
	.uffd_features = __VM_UFFD_FLAGS,
	.uffd_ioctls = BIT(_UFFDIO_COPY) |
		       BIT(_UFFDIO_ZEROPAGE) |
		       BIT(_UFFDIO_WRITEPROTECT) |
		       BIT(_UFFDIO_CONTINUE) |
		       BIT(_UFFDIO_POISON),
	.uffd_get_folio = shmem_uffd_get_folio,
	.uffd_copy = shmem_mfill_atomic_pte,
};

As I mentioned in one of my replies to Nikita, I don't like the current
interface of uffd_copy(), but this is the minimum-change version of such an
API to support completely external-module-ready userfaultfd.  Here, only a
very minimal change is needed on the shmem side to support that.

Meanwhile, vm_uffd_ops is also not the only thing one needs to provide to
support userfaultfd.  Normally vm_ops.fault() will also need to be updated,
but that is a generic function, and it will play together with the new
vm_uffd_ops to make everything fly.

No functional change is expected at all after the whole series is applied.
There might be some slightly stricter checks on uffd ops here and there in
the last patch, but that really shouldn't stand out anywhere to anyone.

For testing: besides the cross-compilation tests, I also tried uffd-stress
in a VM to measure any perf difference before/after the change (the static
call becomes a pointer now).  I really cannot measure anything different,
which is more or less expected.

Comments welcomed, thanks.

Peter Xu (4):
  mm: Introduce vm_uffd_ops API
  mm/shmem: Support vm_uffd_ops API
  mm/hugetlb: Support vm_uffd_ops API
  mm: Apply vm_uffd_ops API to core mm

 include/linux/mm.h            |  71 +++++++++++++++++++++
 include/linux/shmem_fs.h      |  14 -----
 include/linux/userfaultfd_k.h |  58 ++++------------
 mm/hugetlb.c                  |  19 ++++++
 mm/shmem.c                    |  28 ++++++++-
 mm/userfaultfd.c              | 115 +++++++++++++++++++++--------
 6 files changed, 217 insertions(+), 88 deletions(-)

-- 
2.49.0
On 20/06/2025 20:03, Peter Xu wrote:
> [based on akpm/mm-new]
> 
> This series is an alternative proposal of what Nikita proposed here on the
> initial three patches:
> 
> https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com
> 
> This is not yet relevant to any guest-memfd support, but paving way for it.

Hi Peter,

Thanks for posting this.  I confirmed that minor fault handling was working
for guest_memfd based on this series and looked simple (a draft based on
mmap support in guest_memfd v7 [1]):

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 5abb6d52a375..6ddc73419724 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -5,6 +5,9 @@
 #include <linux/pagemap.h>
 #include <linux/anon_inodes.h>
 #include <linux/set_memory.h>
+#ifdef CONFIG_USERFAULTFD
+#include <linux/userfaultfd_k.h>
+#endif
 
 #include "kvm_mm.h"
 
@@ -396,6 +399,14 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
 		kvm_gmem_mark_prepared(folio);
 	}
 
+#ifdef CONFIG_USERFAULTFD
+	if (userfaultfd_minor(vmf->vma)) {
+		folio_unlock(folio);
+		filemap_invalidate_unlock_shared(inode->i_mapping);
+		return handle_userfault(vmf, VM_UFFD_MINOR);
+	}
+#endif
+
 	vmf->page = folio_file_page(folio, vmf->pgoff);
 
 out_folio:
@@ -410,8 +421,39 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
 	return ret;
 }
 
+#ifdef CONFIG_USERFAULTFD
+static int kvm_gmem_uffd_get_folio(struct inode *inode, pgoff_t pgoff,
+				   struct folio **foliop)
+{
+	struct folio *folio;
+
+	folio = kvm_gmem_get_folio(inode, pgoff);
+
+	if (IS_ERR(folio)) {
+		*foliop = NULL;
+		return PTR_ERR(folio);
+	}
+
+	if (!folio_test_uptodate(folio)) {
+		clear_highpage(folio_page(folio, 0));
+		kvm_gmem_mark_prepared(folio);
+	}
+
+	*foliop = folio;
+	return 0;
+}
+
+static const vm_uffd_ops kvm_gmem_uffd_ops = {
+	.uffd_features = VM_UFFD_MINOR,
+	.uffd_ioctls = BIT(_UFFDIO_CONTINUE),
+	.uffd_get_folio = kvm_gmem_uffd_get_folio,
+};
+#endif
+
 static const struct vm_operations_struct kvm_gmem_vm_ops = {
 	.fault = kvm_gmem_fault,
+#ifdef CONFIG_USERFAULTFD
+	.userfaultfd_ops = &kvm_gmem_uffd_ops,
+#endif
 };
 
 static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)

[1]: https://lore.kernel.org/kvm/20250318161823.4005529-1-tabba@google.com/

[...]
On Wed, Jun 25, 2025 at 05:56:23PM +0100, Nikita Kalyazin wrote:
> On 20/06/2025 20:03, Peter Xu wrote:
> > [based on akpm/mm-new]
> > 
> > This series is an alternative proposal of what Nikita proposed here on
> > the initial three patches:
> > 
> > https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com
> > 
> > This is not yet relevant to any guest-memfd support, but paving way for
> > it.
> 
> Hi Peter,

Hi, Nikita,

> Thanks for posting this.  I confirmed that minor fault handling was
> working for guest_memfd based on this series and looked simple (a draft
> based on mmap support in guest_memfd v7 [1]):

Thanks for the quick spin, glad to know it works.  Some trivial things to
mention below..

> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 5abb6d52a375..6ddc73419724 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -5,6 +5,9 @@
>  #include <linux/pagemap.h>
>  #include <linux/anon_inodes.h>
>  #include <linux/set_memory.h>
> +#ifdef CONFIG_USERFAULTFD

This ifdef not needed, userfaultfd_k.h has taken care of all cases.

> +#include <linux/userfaultfd_k.h>
> +#endif
> 
>  #include "kvm_mm.h"
> 
> @@ -396,6 +399,14 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>  		kvm_gmem_mark_prepared(folio);
>  	}
> 
> +#ifdef CONFIG_USERFAULTFD

Same here.  userfaultfd_minor() is always defined.

I'll wait for a few more days for reviewers, and likely send v2 before next
week.

Thanks,

-- 
Peter Xu
On 25/06/2025 21:17, Peter Xu wrote:
> On Wed, Jun 25, 2025 at 05:56:23PM +0100, Nikita Kalyazin wrote:
>> Thanks for posting this.  I confirmed that minor fault handling was
>> working for guest_memfd based on this series and looked simple (a draft
>> based on mmap support in guest_memfd v7 [1]):
>
> Thanks for the quick spin, glad to know it works.  Some trivial things to
> mention below..

Following up, I drafted UFFDIO_COPY support for guest_memfd to confirm it
works as well:

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 8c44e4b9f5f8..b5458a22fff4 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -349,12 +349,19 @@ static bool kvm_gmem_offset_is_shared(struct file *file, pgoff_t index)
 
 static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
 {
+	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	struct folio *folio;
 	vm_fault_t ret = VM_FAULT_LOCKED;
 
 	filemap_invalidate_lock_shared(inode->i_mapping);
 
+	folio = filemap_get_entry(inode->i_mapping, vmf->pgoff);
+	if (!folio && vma && userfaultfd_missing(vma)) {
+		filemap_invalidate_unlock_shared(inode->i_mapping);
+		return handle_userfault(vmf, VM_UFFD_MISSING);
+	}
+
 	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
 	if (IS_ERR(folio)) {
 		int err = PTR_ERR(folio);
@@ -438,10 +445,57 @@ static int kvm_gmem_uffd_get_folio(struct inode *inode, pgoff_t pgoff,
 	return 0;
 }
 
+static int kvm_gmem_mfill_atomic_pte(pmd_t *dst_pmd,
+				     struct vm_area_struct *dst_vma,
+				     unsigned long dst_addr,
+				     unsigned long src_addr,
+				     uffd_flags_t flags,
+				     struct folio **foliop)
+{
+	struct inode *inode = file_inode(dst_vma->vm_file);
+	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
+	struct folio *folio;
+	int ret;
+
+	folio = kvm_gmem_get_folio(inode, pgoff);
+	if (IS_ERR(folio)) {
+		ret = PTR_ERR(folio);
+		goto out;
+	}
+
+	folio_unlock(folio);
+
+	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) {
+		void *vaddr = kmap_local_folio(folio, 0);
+		ret = copy_from_user(vaddr, (const void __user *)src_addr, PAGE_SIZE);
+		kunmap_local(vaddr);
+		if (unlikely(ret)) {
+			*foliop = folio;
+			ret = -ENOENT;
+			goto out;
+		}
+	} else { /* ZEROPAGE */
+		clear_user_highpage(&folio->page, dst_addr);
+	}
+
+	kvm_gmem_mark_prepared(folio);
+
+	ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
+				       &folio->page, true, flags);
+
+	if (ret)
+		folio_put(folio);
+out:
+	return ret;
+}
+
 static const vm_uffd_ops kvm_gmem_uffd_ops = {
-	.uffd_features = VM_UFFD_MINOR,
-	.uffd_ioctls = BIT(_UFFDIO_CONTINUE),
+	.uffd_features = VM_UFFD_MISSING | VM_UFFD_MINOR,
+	.uffd_ioctls = BIT(_UFFDIO_COPY) |
+		       BIT(_UFFDIO_ZEROPAGE) |
+		       BIT(_UFFDIO_CONTINUE),
 	.uffd_get_folio = kvm_gmem_uffd_get_folio,
+	.uffd_copy = kvm_gmem_mfill_atomic_pte,
 };
 #endif

>> +#ifdef CONFIG_USERFAULTFD
>
> This ifdef not needed, userfaultfd_k.h has taken care of all cases.

Good to know, thanks.

>> +#include <linux/userfaultfd_k.h>
>> +#endif
[...]
>> +#ifdef CONFIG_USERFAULTFD
>
> Same here.  userfaultfd_minor() is always defined.

Thank you.

> I'll wait for a few more days for reviewers, and likely send v2 before
> next week.
>
> Thanks,
>
> -- 
> Peter Xu
On Thu, Jun 26, 2025 at 05:09:47PM +0100, Nikita Kalyazin wrote:
> On 25/06/2025 21:17, Peter Xu wrote:
> > Thanks for the quick spin, glad to know it works.  Some trivial things
> > to mention below..
> 
> Following up, I drafted UFFDIO_COPY support for guest_memfd to confirm it
> works as well:

Appreciated.

While at it, I'll comment quickly below.

>  static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>  {
> +	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
>  	struct inode *inode = file_inode(vmf->vma->vm_file);
>  	struct folio *folio;
>  	vm_fault_t ret = VM_FAULT_LOCKED;
> 
>  	filemap_invalidate_lock_shared(inode->i_mapping);
> 
> +	folio = filemap_get_entry(inode->i_mapping, vmf->pgoff);
> +	if (!folio && vma && userfaultfd_missing(vma)) {
> +		filemap_invalidate_unlock_shared(inode->i_mapping);
> +		return handle_userfault(vmf, VM_UFFD_MISSING);
> +	}

Likely a possible refcount leak when folio != NULL here.

[...]

> +	kvm_gmem_mark_prepared(folio);

Since Fuad's series hasn't landed yet, I'm looking at the current code base
with an imagination of what might happen.

In general, MISSING trapping for guest-memfd could turn out to be slightly
trickier.  So far, IIUC, the guest-memfd cache pool needs to be populated
only by a prior fallocate() syscall, not during fault.  So I suppose we
will need to use the uptodate bit to mark a folio ready, like what's done
here.

If so, we may want to make sure that in the fault path any !uptodate fault
gets trapped as missing too, even if it is not strictly a "cache miss"...
so slightly confusing, but it sounds necessary.

Meanwhile, I'm not 100% sure how it goes when taking CoCo into account,
because CoCo needs to prepare the pages, so marking uptodate may not be
enough?  I don't know the CoCo side well enough to tell.  Otherwise we'll
at least need to restrict MISSING traps to only happen on fully shared
guest-memfds.

OTOH, MINOR should be much easier to do for guest-memfd, not only because
the code to support it would be very minimal, which is definitely lovely,
but also because monitoring page table entries is still a pretty common
idea, and it should logically even apply to CoCo: in a fault(), we need to
check whether the guest-memfd folio is "shared" and/or "faultable" first;
the fault() should already fail if it's a private folio.  Then, if it's
visible (aka "faultable") in the HVA namespace, it's legal to trap a MINOR
too.  For !CoCo it will always trap, as the folio is always faultable.

MINOR also makes more sense to use in the future with 1G postcopy support
on top of gmem, because that's almost the only way to go.  It looks like
we've made up our mind to reuse hugetlb pages for gmem, which sounds good;
but hugetlb pages are allocated in 1G granules, and we can't easily do 4K
miss trapping on a 1G huge page.  MINOR is simpler but actually more
powerful from that POV.

To summarize, I think after this we can do MINOR before MISSING for
guest-memfd, if MINOR already works for you.  We can leave MISSING until we
know how we would use it.

Thanks,

[...]

-- 
Peter Xu
On 27/06/2025 14:51, Peter Xu wrote:
> On Thu, Jun 26, 2025 at 05:09:47PM +0100, Nikita Kalyazin wrote:
>> Following up, I drafted UFFDIO_COPY support for guest_memfd to confirm
>> it works as well:
>
> Appreciated.
>
>> +	folio = filemap_get_entry(inode->i_mapping, vmf->pgoff);
>> +	if (!folio && vma && userfaultfd_missing(vma)) {
>> +		filemap_invalidate_unlock_shared(inode->i_mapping);
>> +		return handle_userfault(vmf, VM_UFFD_MISSING);
>> +	}
>
> Likely a possible refcount leak when folio != NULL here.

Thank you.  I was only aiming to cover the happy case for now.  I will keep
it in mind for the future.

>> +	kvm_gmem_mark_prepared(folio);
>
> In general, missing trapping for guest-memfd could start to be slightly
> trickier.  So far IIUC guest-memfd cache pool needs to be populated only
> by a prior fallocate() syscall, not during fault.  So I suppose we will
> need to use uptodate bit to mark folio ready, like what's done here.

I don't think I'm familiar with the fallocate() requirement in guest_memfd.
Fuad's v12 [1] (although I think it has been like that from the beginning)
calls kvm_gmem_get_folio(), which populates the page cache, in the fault
handler (kvm_gmem_fault_shared()).  SEV [2] and TDX [3] seem to use
kvm_gmem_populate() for both allocation and preparation.

[1]: https://lore.kernel.org/kvm/20250611133330.1514028-1-tabba@google.com/T/#m15b53a741e4f328e61f995a01afb9c4682ffe611
[2]: https://elixir.bootlin.com/linux/v6.16-rc3/source/arch/x86/kvm/svm/sev.c#L2331
[3]: https://elixir.bootlin.com/linux/v6.16-rc3/source/arch/x86/kvm/vmx/tdx.c#L3236

> Meanwhile, I'm not 100% sure how it goes especially if taking CoCo into
> account, because CoCo needs to prepare the pages, so mark uptodate may
> not be enough?  I don't know well on the CoCo side to tell.  Otherwise
> we'll at least need to restrict MISSING traps to only happen on fully
> shared guest-memfds.

I am not fluent in CoCo either, but I thought CoCo needed to do preparation
for private pages only, while UFFD shouldn't be dealing with them, so
issuing MISSING only on shared pages looks sensible to me.

> To summarize, I think after this we can do MINOR before MISSING for
> guest-memfd if MINOR already works for you.  We can leave MISSING until
> we know how we would use it.

Starting with MINOR sounds good to me.

> Thanks,
>
> -- 
> Peter Xu
On Fri, Jun 27, 2025 at 05:59:49PM +0100, Nikita Kalyazin wrote:
> On 27/06/2025 14:51, Peter Xu wrote:
> > Likely a possible refcount leak when folio != NULL here.
> 
> Thank you.  I was only aiming to cover the happy case for now.  I will
> keep it in mind for the future.

Yep, that's good enough, thanks.  It was really something I'd only comment
on in passing; it's definitely reassuring to know the happy case works.

> > In general, missing trapping for guest-memfd could start to be slightly
> > trickier.  So far IIUC guest-memfd cache pool needs to be populated
> > only by a prior fallocate() syscall, not during fault.  So I suppose we
> > will need to use uptodate bit to mark folio ready, like what's done
> > here.
> 
> I don't think I'm familiar with the fallocate() requirement in
> guest_memfd.  Fuad's v12 [1] (although I think it has been like that from
> the beginning) calls kvm_gmem_get_folio(), which populates the page
> cache, in the fault handler (kvm_gmem_fault_shared()).  SEV [2] and TDX
> [3] seem to use kvm_gmem_populate() for both allocation and preparation.

I actually didn't notice that fault() uses kvm_gmem_get_folio(), which
indeed has FGP_CREAT.  I checked Ackerley's latest 1G patchset, which does
the same: kvm_gmem_get_folio() will invoke the custom allocator to allocate
1G pages even during a fault().  I'm not sure whether that's intentional,
though; for example, if the tests in userspace always do fallocate() first,
the code runs the same and FGP_CREAT will just never be used.

Thanks for pointing this out.  I definitely didn't notice this detail
before.  It looks like it's not a major issue: if the folio can be
dynamically allocated, then MISSING mode (if/when it's supported) can
capture both the "!folio" and the "folio && !uptodate" cases here as
missing.

> [1]: https://lore.kernel.org/kvm/20250611133330.1514028-1-tabba@google.com/T/#m15b53a741e4f328e61f995a01afb9c4682ffe611
> [2]: https://elixir.bootlin.com/linux/v6.16-rc3/source/arch/x86/kvm/svm/sev.c#L2331
> [3]: https://elixir.bootlin.com/linux/v6.16-rc3/source/arch/x86/kvm/vmx/tdx.c#L3236

-- 
Peter Xu