 include/linux/mm.h            |  71 +++++++++++++++++++++
 include/linux/shmem_fs.h      |  14 -----
 include/linux/userfaultfd_k.h |  58 ++++------------
 mm/hugetlb.c                  |  19 ++++++
 mm/shmem.c                    |  28 ++++++++-
 mm/userfaultfd.c              | 115 +++++++++++++++++++++--------
 6 files changed, 217 insertions(+), 88 deletions(-)
[based on akpm/mm-new]

This series is an alternative proposal to what Nikita proposed here for the
initial three patches:

https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com

This is not yet relevant to any guest-memfd support, but paves the way for
it.  Here, the major goal is to allow kernel modules to opt in to any form
of userfaultfd support, like guest-memfd.  This alternative option should
hopefully be cleaner, and avoids leaking userfault details into
vm_ops.fault().

It also means this series does not depend on anything.  It is a pure
refactoring of userfaultfd internals to provide a generic API, so that
other types of files, especially RAM-based ones, can support userfaultfd
without touching mm/ at all.

To achieve that, this series introduces a file operation called
vm_uffd_ops.  The ops needs to be provided when a file type supports any
part of userfaultfd.

With that, I moved both hugetlbfs and shmem over.

Hugetlbfs is still very special in that it will only use part of the
vm_uffd_ops API, for a similar reason to why hugetlb_vm_op_fault() has a
BUG() and is so far hard-coded into core mm.  But this should still be an
improvement, because at least hugetlbfs is now always involved in feature
probing (e.g. it used to not support ZEROPAGE, and we had a hard-coded
check to fail that, among others).

Meanwhile, after this series shmem will be completely converted to the new
vm_uffd_ops API; the final vm_uffd_ops for shmem looks like this:

static const vm_uffd_ops shmem_uffd_ops = {
	.uffd_features = __VM_UFFD_FLAGS,
	.uffd_ioctls = BIT(_UFFDIO_COPY) |
		       BIT(_UFFDIO_ZEROPAGE) |
		       BIT(_UFFDIO_WRITEPROTECT) |
		       BIT(_UFFDIO_CONTINUE) |
		       BIT(_UFFDIO_POISON),
	.uffd_get_folio = shmem_uffd_get_folio,
	.uffd_copy = shmem_mfill_atomic_pte,
};

As I mentioned in one of my replies to Nikita, I don't like the current
interface of uffd_copy(), but this is the minimum-change version of such an
API to support completely external-module-ready userfaultfd.  Here, only a
very minimal change is needed on the shmem side to support that.

Meanwhile, vm_uffd_ops is also not the only thing one needs to provide to
support userfaultfd.  Normally vm_ops.fault() will also need to be updated,
but that is a generic function, and it will play together with the new
vm_uffd_ops to make everything fly.

No functional change is expected at all after the whole series is applied.
There might be some slightly stricter checks on uffd ops here and there in
the last patch, but that really shouldn't stand out anywhere to anyone.

For testing: besides the cross-compilation tests, I also tried uffd-stress
in a VM to measure any perf difference before/after the change (the static
call becomes a pointer now).  I really cannot measure anything different,
which is more or less expected.

Comments welcomed, thanks.

Peter Xu (4):
  mm: Introduce vm_uffd_ops API
  mm/shmem: Support vm_uffd_ops API
  mm/hugetlb: Support vm_uffd_ops API
  mm: Apply vm_uffd_ops API to core mm

 include/linux/mm.h            |  71 +++++++++++++++++++++
 include/linux/shmem_fs.h      |  14 -----
 include/linux/userfaultfd_k.h |  58 ++++------------
 mm/hugetlb.c                  |  19 ++++++
 mm/shmem.c                    |  28 ++++++++-
 mm/userfaultfd.c              | 115 +++++++++++++++++++++--------
 6 files changed, 217 insertions(+), 88 deletions(-)

-- 
2.49.0
On 20/06/2025 20:03, Peter Xu wrote:
> [based on akpm/mm-new]
> 
> This series is an alternative proposal of what Nikita proposed here on the
> initial three patches:
> 
> https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com
> 
> This is not yet relevant to any guest-memfd support, but paving way for it.

Hi Peter,

Thanks for posting this.  I confirmed that minor fault handling was working
for guest_memfd based on this series and looked simple (a draft based on
mmap support in guest_memfd v7 [1]):

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 5abb6d52a375..6ddc73419724 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -5,6 +5,9 @@
 #include <linux/pagemap.h>
 #include <linux/anon_inodes.h>
 #include <linux/set_memory.h>
+#ifdef CONFIG_USERFAULTFD
+#include <linux/userfaultfd_k.h>
+#endif
 
 #include "kvm_mm.h"
 
@@ -396,6 +399,14 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
 		kvm_gmem_mark_prepared(folio);
 	}
 
+#ifdef CONFIG_USERFAULTFD
+	if (userfaultfd_minor(vmf->vma)) {
+		folio_unlock(folio);
+		filemap_invalidate_unlock_shared(inode->i_mapping);
+		return handle_userfault(vmf, VM_UFFD_MINOR);
+	}
+#endif
+
 	vmf->page = folio_file_page(folio, vmf->pgoff);
 
 out_folio:
@@ -410,8 +421,39 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
 	return ret;
 }
 
+#ifdef CONFIG_USERFAULTFD
+static int kvm_gmem_uffd_get_folio(struct inode *inode, pgoff_t pgoff,
+				   struct folio **foliop)
+{
+	struct folio *folio;
+
+	folio = kvm_gmem_get_folio(inode, pgoff);
+
+	if (IS_ERR(folio)) {
+		*foliop = NULL;
+		return PTR_ERR(folio);
+	}
+
+	if (!folio_test_uptodate(folio)) {
+		clear_highpage(folio_page(folio, 0));
+		kvm_gmem_mark_prepared(folio);
+	}
+
+	*foliop = folio;
+	return 0;
+}
+
+static const vm_uffd_ops kvm_gmem_uffd_ops = {
+	.uffd_features = VM_UFFD_MINOR,
+	.uffd_ioctls = BIT(_UFFDIO_CONTINUE),
+	.uffd_get_folio = kvm_gmem_uffd_get_folio,
+};
+#endif
+
 static const struct vm_operations_struct kvm_gmem_vm_ops = {
 	.fault = kvm_gmem_fault,
+#ifdef CONFIG_USERFAULTFD
+	.userfaultfd_ops = &kvm_gmem_uffd_ops,
+#endif
 };
 
 static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)

[1]: https://lore.kernel.org/kvm/20250318161823.4005529-1-tabba@google.com/

[...]
On Wed, Jun 25, 2025 at 05:56:23PM +0100, Nikita Kalyazin wrote:
> On 20/06/2025 20:03, Peter Xu wrote:
> > [based on akpm/mm-new]
> > 
> > This series is an alternative proposal of what Nikita proposed here on
> > the initial three patches:
> > 
> > https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com
> > 
> > This is not yet relevant to any guest-memfd support, but paving way for
> > it.
> 
> Hi Peter,

Hi, Nikita,

> Thanks for posting this.  I confirmed that minor fault handling was
> working for guest_memfd based on this series and looked simple (a draft
> based on mmap support in guest_memfd v7 [1]):

Thanks for the quick spin, glad to know it works.  Some trivial things to
mention below..

> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 5abb6d52a375..6ddc73419724 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -5,6 +5,9 @@
>  #include <linux/pagemap.h>
>  #include <linux/anon_inodes.h>
>  #include <linux/set_memory.h>
> +#ifdef CONFIG_USERFAULTFD

This ifdef not needed, userfaultfd_k.h has taken care of all cases.

> +#include <linux/userfaultfd_k.h>
> +#endif
> 
>  #include "kvm_mm.h"
> 
> @@ -396,6 +399,14 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>  		kvm_gmem_mark_prepared(folio);
>  	}
> 
> +#ifdef CONFIG_USERFAULTFD

Same here.  userfaultfd_minor() is always defined.

I'll wait for a few more days for reviewers, and likely send v2 before next
week.

Thanks,

-- 
Peter Xu
On 25/06/2025 21:17, Peter Xu wrote:
> On Wed, Jun 25, 2025 at 05:56:23PM +0100, Nikita Kalyazin wrote:
>> Thanks for posting this.  I confirmed that minor fault handling was
>> working for guest_memfd based on this series and looked simple (a draft
>> based on mmap support in guest_memfd v7 [1]):
>
> Thanks for the quick spin, glad to know it works.  Some trivial things to
> mention below..

Following up, I drafted UFFDIO_COPY support for guest_memfd to confirm it
works as well:

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 8c44e4b9f5f8..b5458a22fff4 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -349,12 +349,19 @@ static bool kvm_gmem_offset_is_shared(struct file *file, pgoff_t index)
 
 static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
 {
+	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	struct folio *folio;
 	vm_fault_t ret = VM_FAULT_LOCKED;
 
 	filemap_invalidate_lock_shared(inode->i_mapping);
 
+	folio = filemap_get_entry(inode->i_mapping, vmf->pgoff);
+	if (!folio && vma && userfaultfd_missing(vma)) {
+		filemap_invalidate_unlock_shared(inode->i_mapping);
+		return handle_userfault(vmf, VM_UFFD_MISSING);
+	}
+
 	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
 	if (IS_ERR(folio)) {
 		int err = PTR_ERR(folio);
@@ -438,10 +445,57 @@ static int kvm_gmem_uffd_get_folio(struct inode *inode, pgoff_t pgoff,
 	return 0;
 }
 
+static int kvm_gmem_mfill_atomic_pte(pmd_t *dst_pmd,
+				     struct vm_area_struct *dst_vma,
+				     unsigned long dst_addr,
+				     unsigned long src_addr,
+				     uffd_flags_t flags,
+				     struct folio **foliop)
+{
+	struct inode *inode = file_inode(dst_vma->vm_file);
+	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
+	struct folio *folio;
+	int ret;
+
+	folio = kvm_gmem_get_folio(inode, pgoff);
+	if (IS_ERR(folio)) {
+		ret = PTR_ERR(folio);
+		goto out;
+	}
+
+	folio_unlock(folio);
+
+	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) {
+		void *vaddr = kmap_local_folio(folio, 0);
+		ret = copy_from_user(vaddr, (const void __user *)src_addr, PAGE_SIZE);
+		kunmap_local(vaddr);
+		if (unlikely(ret)) {
+			*foliop = folio;
+			ret = -ENOENT;
+			goto out;
+		}
+	} else { /* ZEROPAGE */
+		clear_user_highpage(&folio->page, dst_addr);
+	}
+
+	kvm_gmem_mark_prepared(folio);
+
+	ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
+				       &folio->page, true, flags);
+
+	if (ret)
+		folio_put(folio);
+out:
+	return ret;
+}
+
 static const vm_uffd_ops kvm_gmem_uffd_ops = {
-	.uffd_features = VM_UFFD_MINOR,
-	.uffd_ioctls = BIT(_UFFDIO_CONTINUE),
+	.uffd_features = VM_UFFD_MISSING | VM_UFFD_MINOR,
+	.uffd_ioctls = BIT(_UFFDIO_COPY) |
+		       BIT(_UFFDIO_ZEROPAGE) |
+		       BIT(_UFFDIO_CONTINUE),
 	.uffd_get_folio = kvm_gmem_uffd_get_folio,
+	.uffd_copy = kvm_gmem_mfill_atomic_pte,
 };
 #endif

>> +#ifdef CONFIG_USERFAULTFD
>
> This ifdef not needed, userfaultfd_k.h has taken care of all cases.

Good to know, thanks.

>> +#include <linux/userfaultfd_k.h>
>> +#endif
[...]
>> +#ifdef CONFIG_USERFAULTFD
>
> Same here.  userfaultfd_minor() is always defined.

Thank you.

> I'll wait for a few more days for reviewers, and likely send v2 before
> next week.
>
> Thanks,
>
> -- 
> Peter Xu
On Thu, Jun 26, 2025 at 05:09:47PM +0100, Nikita Kalyazin wrote:
> On 25/06/2025 21:17, Peter Xu wrote:
> > Thanks for the quick spin, glad to know it works.  Some trivial things
> > to mention below..
> 
> Following up, I drafted UFFDIO_COPY support for guest_memfd to confirm it
> works as well:

Appreciated.

While at it, I'll comment quickly below.

>  static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>  {
> +	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
>  	struct inode *inode = file_inode(vmf->vma->vm_file);
>  	struct folio *folio;
>  	vm_fault_t ret = VM_FAULT_LOCKED;
> 
>  	filemap_invalidate_lock_shared(inode->i_mapping);
> 
> +	folio = filemap_get_entry(inode->i_mapping, vmf->pgoff);
> +	if (!folio && vma && userfaultfd_missing(vma)) {
> +		filemap_invalidate_unlock_shared(inode->i_mapping);
> +		return handle_userfault(vmf, VM_UFFD_MISSING);
> +	}

Likely a possible refcount leak when folio != NULL here.

[...]

> +	kvm_gmem_mark_prepared(folio);

Since Fuad's series hasn't landed yet, I'm looking at the current code base
with an imagination of what might happen.

In general, MISSING trapping for guest-memfd could turn out to be slightly
trickier.  So far, IIUC, the guest-memfd cache pool needs to be populated
only by a prior fallocate() syscall, not during fault.  So I suppose we
will need to use the uptodate bit to mark a folio ready, like what's done
here.

If so, we may want to make sure that in the fault path any !uptodate fault
gets trapped as missing too, even if it is not strictly a "cache miss"...
so slightly confusing, but it sounds necessary.

Meanwhile, I'm not 100% sure how it goes when taking CoCo into account,
because CoCo needs to prepare the pages, so marking uptodate may not be
enough?  I don't know the CoCo side well enough to tell.  Otherwise we'll
at least need to restrict MISSING traps to only happen on fully shared
guest-memfds.

OTOH, MINOR should be much easier to do for guest-memfd, not only because
the code to support it would be very minimal, which is definitely lovely,
but also because monitoring page table entries is still a pretty common
idea, and it should logically even apply to CoCo: in a fault(), we need to
check whether the guest-memfd folio is "shared" and/or "faultable" first;
the fault() should already fail if it's a private folio.  Then, if it's
visible (aka "faultable") in the HVA namespace, it's legal to trap a MINOR
too.  For !CoCo it will always trap, as the folio is always faultable.

MINOR also makes more sense to use in the future with 1G postcopy support
on top of gmem, because that's almost the only way to go.  It looks like
we've made up our mind to reuse hugetlb pages for gmem, which sounds good;
but hugetlb pages are allocated in 1G granules, and we can't easily do 4K
miss trapping on a 1G huge page.  MINOR is simpler but actually more
powerful from that POV.

To summarize, I think after this we can do MINOR before MISSING for
guest-memfd, if MINOR already works for you.  We can leave MISSING until we
know how we would use it.

Thanks,

[...]

-- 
Peter Xu
On 27/06/2025 14:51, Peter Xu wrote:
> On Thu, Jun 26, 2025 at 05:09:47PM +0100, Nikita Kalyazin wrote:
>> Following up, I drafted UFFDIO_COPY support for guest_memfd to confirm
>> it works as well:
>
> Appreciated.
>
>> +	folio = filemap_get_entry(inode->i_mapping, vmf->pgoff);
>> +	if (!folio && vma && userfaultfd_missing(vma)) {
>> +		filemap_invalidate_unlock_shared(inode->i_mapping);
>> +		return handle_userfault(vmf, VM_UFFD_MISSING);
>> +	}
>
> Likely a possible refcount leak when folio != NULL here.

Thank you.  I was only aiming to cover the happy case for now.  I will keep
it in mind for the future.

>> +	kvm_gmem_mark_prepared(folio);
>
> In general, missing trapping for guest-memfd could start to be slightly
> trickier.  So far IIUC guest-memfd cache pool needs to be populated only
> by a prior fallocate() syscall, not during fault.  So I suppose we will
> need to use uptodate bit to mark folio ready, like what's done here.

I don't think I'm familiar with the fallocate() requirement in guest_memfd.
Fuad's v12 [1] (although I think it has been like that from the beginning)
calls kvm_gmem_get_folio(), which populates the page cache, in the fault
handler (kvm_gmem_fault_shared()).  SEV [2] and TDX [3] seem to use
kvm_gmem_populate() for both allocation and preparation.

[1]: https://lore.kernel.org/kvm/20250611133330.1514028-1-tabba@google.com/T/#m15b53a741e4f328e61f995a01afb9c4682ffe611
[2]: https://elixir.bootlin.com/linux/v6.16-rc3/source/arch/x86/kvm/svm/sev.c#L2331
[3]: https://elixir.bootlin.com/linux/v6.16-rc3/source/arch/x86/kvm/vmx/tdx.c#L3236

> Meanwhile, I'm not 100% sure how it goes especially if taking CoCo into
> account, because CoCo needs to prepare the pages, so mark uptodate may
> not be enough?  I don't know well on the CoCo side to tell.  Otherwise
> we'll at least need to restrict MISSING traps to only happen on fully
> shared guest-memfds.

I am not fluent in CoCo either, but I thought CoCo needed to do preparation
for private pages only, while UFFD shouldn't be dealing with them, so
issuing MISSING only on shared pages looks sensible to me.

> To summarize, I think after this we can do MINOR before MISSING for
> guest-memfd if MINOR already works for you.  We can leave MISSING until
> we know how we would use it.

Starting with MINOR sounds good to me.

> Thanks,
>
> -- 
> Peter Xu
On Fri, Jun 27, 2025 at 05:59:49PM +0100, Nikita Kalyazin wrote:
> On 27/06/2025 14:51, Peter Xu wrote:
> > Likely a possible refcount leak when folio != NULL here.
> 
> Thank you.  I was only aiming to cover the happy case for now.  I will
> keep it in mind for the future.

Yep, that's good enough, thanks.  It was really something I'd only comment
on in passing; it's definitely reassuring to know the happy case works.

> > In general, missing trapping for guest-memfd could start to be slightly
> > trickier.  So far IIUC guest-memfd cache pool needs to be populated
> > only by a prior fallocate() syscall, not during fault.  So I suppose we
> > will need to use uptodate bit to mark folio ready, like what's done
> > here.
> 
> I don't think I'm familiar with the fallocate() requirement in
> guest_memfd.  Fuad's v12 [1] (although I think it has been like that from
> the beginning) calls kvm_gmem_get_folio(), which populates the page
> cache, in the fault handler (kvm_gmem_fault_shared()).  SEV [2] and TDX
> [3] seem to use kvm_gmem_populate() for both allocation and preparation.

I actually didn't notice that fault() uses kvm_gmem_get_folio(), which
indeed has FGP_CREAT.  I checked Ackerley's latest 1G patchset, which does
the same: kvm_gmem_get_folio() will invoke the custom allocator to allocate
1G pages even during a fault().  I'm not sure whether that's intentional,
though; for example, if the tests in userspace always do fallocate() first,
the code runs the same and FGP_CREAT will just never be used.

Thanks for pointing this out.  I definitely didn't notice this detail
before.  It looks like it's not a major issue: if the folio can be
dynamically allocated, then MISSING mode (if/when it's supported) can
capture both the "!folio" and the "folio && !uptodate" cases here as
missing.

> [1]: https://lore.kernel.org/kvm/20250611133330.1514028-1-tabba@google.com/T/#m15b53a741e4f328e61f995a01afb9c4682ffe611
> [2]: https://elixir.bootlin.com/linux/v6.16-rc3/source/arch/x86/kvm/svm/sev.c#L2331
> [3]: https://elixir.bootlin.com/linux/v6.16-rc3/source/arch/x86/kvm/vmx/tdx.c#L3236

-- 
Peter Xu