[PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()

Posted by David Hildenbrand 9 months, 2 weeks ago
Let's use our new interface. In remap_pfn_range(), we'll now decide
whether we have to track (full VMA covered) or only sanitize the pgprot
(partial VMA covered).

Remember what we have to untrack by linking it from the VMA. When
duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similarly
to anon VMA names and use a kref to share the tracking.

Once the last VMA un-refs our tracking data, we'll do the untracking,
which simplifies things a lot and should sort out the various issues we
saw recently, for example, when partially unmapping/zapping a tracked VMA.

This change implies that we'll keep tracking the original PFN range even
after splitting + partially unmapping it: not too bad, because it was
not working reliably before. The only thing that kind-of worked before
was shrinking such a mapping using mremap(): we managed to adjust the
reservation in a hacky way; now we won't adjust the reservation but will
leave it around until all involved VMAs are gone.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm_inline.h |  2 +
 include/linux/mm_types.h  | 11 ++++++
 kernel/fork.c             | 54 ++++++++++++++++++++++++--
 mm/memory.c               | 81 +++++++++++++++++++++++++++++++--------
 mm/mremap.c               |  4 --
 5 files changed, 128 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index f9157a0c42a5c..89b518ff097e6 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
 
 #endif  /* CONFIG_ANON_VMA_NAME */
 
+void pfnmap_track_ctx_release(struct kref *ref);
+
 static inline void init_tlb_flush_pending(struct mm_struct *mm)
 {
 	atomic_set(&mm->tlb_flush_pending, 0);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 56d07edd01f91..91124761cfda8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -764,6 +764,14 @@ struct vma_numab_state {
 	int prev_scan_seq;
 };
 
+#ifdef __HAVE_PFNMAP_TRACKING
+struct pfnmap_track_ctx {
+	struct kref kref;
+	unsigned long pfn;
+	unsigned long size;
+};
+#endif
+
 /*
  * This struct describes a virtual memory area. There is one of these
  * per VM-area/task. A VM area is any part of the process virtual memory
@@ -877,6 +885,9 @@ struct vm_area_struct {
 	struct anon_vma_name *anon_name;
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef __HAVE_PFNMAP_TRACKING
+	struct pfnmap_track_ctx *pfnmap_track_ctx;
+#endif
 } __randomize_layout;
 
 #ifdef CONFIG_NUMA
diff --git a/kernel/fork.c b/kernel/fork.c
index 168681fc4b25a..ae518b8fe752c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -481,7 +481,51 @@ static void vm_area_init_from(const struct vm_area_struct *src,
 #ifdef CONFIG_NUMA
 	dest->vm_policy = src->vm_policy;
 #endif
+#ifdef __HAVE_PFNMAP_TRACKING
+	dest->pfnmap_track_ctx = NULL;
+#endif
+}
+
+#ifdef __HAVE_PFNMAP_TRACKING
+static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
+		struct vm_area_struct *new)
+{
+	struct pfnmap_track_ctx *ctx = orig->pfnmap_track_ctx;
+
+	if (likely(!ctx))
+		return 0;
+
+	/*
+	 * We don't expect to ever hit this. If ever required, we would have
+	 * to duplicate the tracking.
+	 */
+	if (unlikely(kref_read(&ctx->kref) >= REFCOUNT_MAX))
+		return -ENOMEM;
+	kref_get(&ctx->kref);
+	new->pfnmap_track_ctx = ctx;
+	return 0;
+}
+
+static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
+{
+	struct pfnmap_track_ctx *ctx = vma->pfnmap_track_ctx;
+
+	if (likely(!ctx))
+		return;
+
+	kref_put(&ctx->kref, pfnmap_track_ctx_release);
+	vma->pfnmap_track_ctx = NULL;
+}
+#else
+static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
+		struct vm_area_struct *new)
+{
+	return 0;
 }
+static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
+{
+}
+#endif
 
 struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 {
@@ -493,6 +537,11 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
 	ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
 	vm_area_init_from(orig, new);
+
+	if (vma_pfnmap_track_ctx_dup(orig, new)) {
+		kmem_cache_free(vm_area_cachep, new);
+		return NULL;
+	}
 	vma_lock_init(new, true);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
 	vma_numab_state_init(new);
@@ -507,6 +556,7 @@ void vm_area_free(struct vm_area_struct *vma)
 	vma_assert_detached(vma);
 	vma_numab_state_free(vma);
 	free_anon_vma_name(vma);
+	vma_pfnmap_track_ctx_release(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
 
@@ -669,10 +719,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		if (!tmp)
 			goto fail_nomem;
 
-		/* track_pfn_copy() will later take care of copying internal state. */
-		if (unlikely(tmp->vm_flags & VM_PFNMAP))
-			untrack_pfn_clear(tmp);
-
 		retval = vma_dup_policy(mpnt, tmp);
 		if (retval)
 			goto fail_nomem_policy;
diff --git a/mm/memory.c b/mm/memory.c
index c737a8625866a..eb2b3f10a97ec 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1370,7 +1370,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 	struct mm_struct *dst_mm = dst_vma->vm_mm;
 	struct mm_struct *src_mm = src_vma->vm_mm;
 	struct mmu_notifier_range range;
-	unsigned long next, pfn = 0;
+	unsigned long next;
 	bool is_cow;
 	int ret;
 
@@ -1380,12 +1380,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 	if (is_vm_hugetlb_page(src_vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
 
-	if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
-		ret = track_pfn_copy(dst_vma, src_vma, &pfn);
-		if (ret)
-			return ret;
-	}
-
 	/*
 	 * We need to invalidate the secondary MMU mappings only when
 	 * there could be a permission downgrade on the ptes of the
@@ -1427,8 +1421,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 		raw_write_seqcount_end(&src_mm->write_protect_seq);
 		mmu_notifier_invalidate_range_end(&range);
 	}
-	if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
-		untrack_pfn_copy(dst_vma, pfn);
 	return ret;
 }
 
@@ -1923,9 +1915,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 	if (vma->vm_file)
 		uprobe_munmap(vma, start, end);
 
-	if (unlikely(vma->vm_flags & VM_PFNMAP))
-		untrack_pfn(vma, 0, 0, mm_wr_locked);
-
 	if (start != end) {
 		if (unlikely(is_vm_hugetlb_page(vma))) {
 			/*
@@ -2871,6 +2860,36 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
 	return error;
 }
 
+#ifdef __HAVE_PFNMAP_TRACKING
+static inline struct pfnmap_track_ctx *pfnmap_track_ctx_alloc(unsigned long pfn,
+		unsigned long size, pgprot_t *prot)
+{
+	struct pfnmap_track_ctx *ctx;
+
+	if (pfnmap_track(pfn, size, prot))
+		return ERR_PTR(-EINVAL);
+
+	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+	if (unlikely(!ctx)) {
+		pfnmap_untrack(pfn, size);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	ctx->pfn = pfn;
+	ctx->size = size;
+	kref_init(&ctx->kref);
+	return ctx;
+}
+
+void pfnmap_track_ctx_release(struct kref *ref)
+{
+	struct pfnmap_track_ctx *ctx = container_of(ref, struct pfnmap_track_ctx, kref);
+
+	pfnmap_untrack(ctx->pfn, ctx->size);
+	kfree(ctx);
+}
+#endif /* __HAVE_PFNMAP_TRACKING */
+
 /**
  * remap_pfn_range - remap kernel memory to userspace
  * @vma: user vma to map to
@@ -2883,20 +2902,50 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
  *
  * Return: %0 on success, negative error code otherwise.
  */
+#ifdef __HAVE_PFNMAP_TRACKING
 int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 		    unsigned long pfn, unsigned long size, pgprot_t prot)
 {
+	struct pfnmap_track_ctx *ctx = NULL;
 	int err;
 
-	err = track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
-	if (err)
+	size = PAGE_ALIGN(size);
+
+	/*
+	 * If we cover the full VMA, we'll perform actual tracking, and
+	 * remember to untrack when the last reference to our tracking
+	 * context from a VMA goes away.
+	 *
+	 * If we only cover parts of the VMA, we'll only sanitize the
+	 * pgprot.
+	 */
+	if (addr == vma->vm_start && addr + size == vma->vm_end) {
+		if (vma->pfnmap_track_ctx)
+			return -EINVAL;
+		ctx = pfnmap_track_ctx_alloc(pfn, size, &prot);
+		if (IS_ERR(ctx))
+			return PTR_ERR(ctx);
+	} else if (pfnmap_sanitize_pgprot(pfn, size, &prot)) {
 		return -EINVAL;
+	}
 
 	err = remap_pfn_range_notrack(vma, addr, pfn, size, prot);
-	if (err)
-		untrack_pfn(vma, pfn, PAGE_ALIGN(size), true);
+	if (ctx) {
+		if (err)
+			kref_put(&ctx->kref, pfnmap_track_ctx_release);
+		else
+			vma->pfnmap_track_ctx = ctx;
+	}
 	return err;
 }
+
+#else
+int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
+		    unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+	return remap_pfn_range_notrack(vma, addr, pfn, size, prot);
+}
+#endif
 EXPORT_SYMBOL(remap_pfn_range);
 
 /**
diff --git a/mm/mremap.c b/mm/mremap.c
index 7db9da609c84f..6e78e02f74bd3 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -1191,10 +1191,6 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
 	if (is_vm_hugetlb_page(vma))
 		clear_vma_resv_huge_pages(vma);
 
-	/* Tell pfnmap has moved from this vma */
-	if (unlikely(vma->vm_flags & VM_PFNMAP))
-		untrack_pfn_clear(vma);
-
 	*new_vma_ptr = new_vma;
 	return err;
 }
-- 
2.49.0
Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
Posted by Lorenzo Stoakes 9 months, 2 weeks ago
On Fri, Apr 25, 2025 at 10:17:09AM +0200, David Hildenbrand wrote:
> Let's use our new interface. In remap_pfn_range(), we'll now decide
> whether we have to track (full VMA covered) or only sanitize the pgprot
> (partial VMA covered).

Yeah, I do agree with Peter that 'sanitize' is not great here, but naming is
hard :) Anyway, this was discussed elsewhere in that thread...

>
> Remember what we have to untrack by linking it from the VMA. When
> duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similar
> to anon VMA names, and use a kref to share the tracking.

Yes this is sensible.

>
> Once the last VMA un-refs our tracking data, we'll do the untracking,
> which simplifies things a lot and should sort our various issues we saw
> recently, for example, when partially unmapping/zapping a tracked VMA.

Lovely!

>
> This change implies that we'll keep tracking the original PFN range even
> after splitting + partially unmapping it: not too bad, because it was
> not working reliably before. The only thing that kind-of worked before
> was shrinking such a mapping using mremap(): we managed to adjust the
> reservation in a hacky way, now we won't adjust the reservation but
> leave it around until all involved VMAs are gone.

Hm, but what if we shrink a VMA, then map another one, might it be
incorrectly storing PAT attributes for part of the range that is now mapped
elsewhere?

Also my god re: the 'kind of working' aspects of PAT, so frustrating.

>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Generally looking good, afaict, but maybe let's get some input from Suren
on VMA size.

Are there actually any PAT tests out there? I had a quick glance in
tools/testing/selftests/x86,mm and couldn't find any, but didn't look
_that_ hard.

Thanks in general for tackling this, this is a big improvement!

> ---
>  include/linux/mm_inline.h |  2 +
>  include/linux/mm_types.h  | 11 ++++++
>  kernel/fork.c             | 54 ++++++++++++++++++++++++--
>  mm/memory.c               | 81 +++++++++++++++++++++++++++++++--------
>  mm/mremap.c               |  4 --
>  5 files changed, 128 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index f9157a0c42a5c..89b518ff097e6 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
>
>  #endif  /* CONFIG_ANON_VMA_NAME */
>
> +void pfnmap_track_ctx_release(struct kref *ref);
> +
>  static inline void init_tlb_flush_pending(struct mm_struct *mm)
>  {
>  	atomic_set(&mm->tlb_flush_pending, 0);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 56d07edd01f91..91124761cfda8 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -764,6 +764,14 @@ struct vma_numab_state {
>  	int prev_scan_seq;
>  };
>
> +#ifdef __HAVE_PFNMAP_TRACKING
> +struct pfnmap_track_ctx {
> +	struct kref kref;
> +	unsigned long pfn;
> +	unsigned long size;

Again, (super) nitty, but we really should express units. I suppose 'size'
implies bytes to be honest, as you'd be unlikely to say 'size' for a number
of pages (you'd go with nr_pages or something). But maybe a trailing
/* in bytes */ would help.

Not a big deal though!

> +};
> +#endif
> +
>  /*
>   * This struct describes a virtual memory area. There is one of these
>   * per VM-area/task. A VM area is any part of the process virtual memory
> @@ -877,6 +885,9 @@ struct vm_area_struct {
>  	struct anon_vma_name *anon_name;
>  #endif
>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +#ifdef __HAVE_PFNMAP_TRACKING

An aside, but I absolutely hate '__HAVE_PFNMAP_TRACKING' as a name here. But
you didn't create it, and it's not really sensible to change it in this
series. Just a grumble...

> +	struct pfnmap_track_ctx *pfnmap_track_ctx;
> +#endif
>  } __randomize_layout;
>
>  #ifdef CONFIG_NUMA
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 168681fc4b25a..ae518b8fe752c 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -481,7 +481,51 @@ static void vm_area_init_from(const struct vm_area_struct *src,
>  #ifdef CONFIG_NUMA
>  	dest->vm_policy = src->vm_policy;
>  #endif
> +#ifdef __HAVE_PFNMAP_TRACKING
> +	dest->pfnmap_track_ctx = NULL;
> +#endif
> +}
> +
> +#ifdef __HAVE_PFNMAP_TRACKING
> +static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
> +		struct vm_area_struct *new)
> +{
> +	struct pfnmap_track_ctx *ctx = orig->pfnmap_track_ctx;
> +
> +	if (likely(!ctx))
> +		return 0;
> +
> +	/*
> +	 * We don't expect to ever hit this. If ever required, we would have
> +	 * to duplicate the tracking.
> +	 */
> +	if (unlikely(kref_read(&ctx->kref) >= REFCOUNT_MAX))
> +		return -ENOMEM;
> +	kref_get(&ctx->kref);
> +	new->pfnmap_track_ctx = ctx;
> +	return 0;
> +}
> +
> +static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
> +{
> +	struct pfnmap_track_ctx *ctx = vma->pfnmap_track_ctx;
> +
> +	if (likely(!ctx))
> +		return;
> +
> +	kref_put(&ctx->kref, pfnmap_track_ctx_release);
> +	vma->pfnmap_track_ctx = NULL;
> +}
> +#else
> +static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
> +		struct vm_area_struct *new)
> +{
> +	return 0;
>  }
> +static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
> +{
> +}
> +#endif
>
>  struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>  {

Obviously my series will break this, but it should be _fairly_ trivial to
update.

You will however have to make sure to update tools/testing/vma/* to handle
the new functions in userland testing (they need to be stubbed out).

If it makes life easier, you can even send it to me off-list, or just send
it without changing this in a respin and I can fix it up fairly quick for
you.

> @@ -493,6 +537,11 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>  	ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
>  	ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
>  	vm_area_init_from(orig, new);
> +
> +	if (vma_pfnmap_track_ctx_dup(orig, new)) {
> +		kmem_cache_free(vm_area_cachep, new);
> +		return NULL;
> +	}
>  	vma_lock_init(new, true);
>  	INIT_LIST_HEAD(&new->anon_vma_chain);
>  	vma_numab_state_init(new);
> @@ -507,6 +556,7 @@ void vm_area_free(struct vm_area_struct *vma)
>  	vma_assert_detached(vma);
>  	vma_numab_state_free(vma);
>  	free_anon_vma_name(vma);
> +	vma_pfnmap_track_ctx_release(vma);
>  	kmem_cache_free(vm_area_cachep, vma);
>  }
>
> @@ -669,10 +719,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  		if (!tmp)
>  			goto fail_nomem;
>
> -		/* track_pfn_copy() will later take care of copying internal state. */
> -		if (unlikely(tmp->vm_flags & VM_PFNMAP))
> -			untrack_pfn_clear(tmp);
> -
>  		retval = vma_dup_policy(mpnt, tmp);
>  		if (retval)
>  			goto fail_nomem_policy;
> diff --git a/mm/memory.c b/mm/memory.c
> index c737a8625866a..eb2b3f10a97ec 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1370,7 +1370,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  	struct mm_struct *dst_mm = dst_vma->vm_mm;
>  	struct mm_struct *src_mm = src_vma->vm_mm;
>  	struct mmu_notifier_range range;
> -	unsigned long next, pfn = 0;
> +	unsigned long next;
>  	bool is_cow;
>  	int ret;
>
> @@ -1380,12 +1380,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  	if (is_vm_hugetlb_page(src_vma))
>  		return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
>
> -	if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
> -		ret = track_pfn_copy(dst_vma, src_vma, &pfn);
> -		if (ret)
> -			return ret;
> -	}
> -

So lovely to see this kind of thing go...

>  	/*
>  	 * We need to invalidate the secondary MMU mappings only when
>  	 * there could be a permission downgrade on the ptes of the
> @@ -1427,8 +1421,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  		raw_write_seqcount_end(&src_mm->write_protect_seq);
>  		mmu_notifier_invalidate_range_end(&range);
>  	}
> -	if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
> -		untrack_pfn_copy(dst_vma, pfn);
>  	return ret;
>  }
>
> @@ -1923,9 +1915,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>  	if (vma->vm_file)
>  		uprobe_munmap(vma, start, end);
>
> -	if (unlikely(vma->vm_flags & VM_PFNMAP))
> -		untrack_pfn(vma, 0, 0, mm_wr_locked);
> -
>  	if (start != end) {
>  		if (unlikely(is_vm_hugetlb_page(vma))) {
>  			/*
> @@ -2871,6 +2860,36 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
>  	return error;
>  }
>
> +#ifdef __HAVE_PFNMAP_TRACKING
> +static inline struct pfnmap_track_ctx *pfnmap_track_ctx_alloc(unsigned long pfn,
> +		unsigned long size, pgprot_t *prot)
> +{
> +	struct pfnmap_track_ctx *ctx;
> +
> +	if (pfnmap_track(pfn, size, prot))
> +		return ERR_PTR(-EINVAL);
> +
> +	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
> +	if (unlikely(!ctx)) {
> +		pfnmap_untrack(pfn, size);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	ctx->pfn = pfn;
> +	ctx->size = size;
> +	kref_init(&ctx->kref);
> +	return ctx;
> +}
> +
> +void pfnmap_track_ctx_release(struct kref *ref)
> +{
> +	struct pfnmap_track_ctx *ctx = container_of(ref, struct pfnmap_track_ctx, kref);
> +
> +	pfnmap_untrack(ctx->pfn, ctx->size);
> +	kfree(ctx);
> +}
> +#endif /* __HAVE_PFNMAP_TRACKING */
> +
>  /**
>   * remap_pfn_range - remap kernel memory to userspace
>   * @vma: user vma to map to
> @@ -2883,20 +2902,50 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
>   *
>   * Return: %0 on success, negative error code otherwise.
>   */
> +#ifdef __HAVE_PFNMAP_TRACKING
>  int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
>  		    unsigned long pfn, unsigned long size, pgprot_t prot)

OK so to expose some of my lack-of-knowledge of PAT - is this the
'entrypoint' to PAT tracking?

So we have some kernel memory we remap to userland as PFN map, the kind
that very well might be sensible to use PAT to change cache behaviour for,
and each time this happens, it's mapped as PAT?

>  {
> +	struct pfnmap_track_ctx *ctx = NULL;
>  	int err;
>
> -	err = track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
> -	if (err)
> +	size = PAGE_ALIGN(size);
> +
> +	/*
> +	 * If we cover the full VMA, we'll perform actual tracking, and
> +	 * remember to untrack when the last reference to our tracking
> +	 * context from a VMA goes away.
> +	 *
> +	 * If we only cover parts of the VMA, we'll only sanitize the
> +	 * pgprot.
> +	 */
> +	if (addr == vma->vm_start && addr + size == vma->vm_end) {
> +		if (vma->pfnmap_track_ctx)
> +			return -EINVAL;
> +		ctx = pfnmap_track_ctx_alloc(pfn, size, &prot);
> +		if (IS_ERR(ctx))
> +			return PTR_ERR(ctx);
> +	} else if (pfnmap_sanitize_pgprot(pfn, size, &prot)) {
>  		return -EINVAL;
> +	}
>
>  	err = remap_pfn_range_notrack(vma, addr, pfn, size, prot);
> -	if (err)
> -		untrack_pfn(vma, pfn, PAGE_ALIGN(size), true);
> +	if (ctx) {
> +		if (err)
> +			kref_put(&ctx->kref, pfnmap_track_ctx_release);
> +		else
> +			vma->pfnmap_track_ctx = ctx;
> +	}
>  	return err;
>  }
> +
> +#else
> +int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
> +		    unsigned long pfn, unsigned long size, pgprot_t prot)
> +{
> +	return remap_pfn_range_notrack(vma, addr, pfn, size, prot);
> +}
> +#endif
>  EXPORT_SYMBOL(remap_pfn_range);
>
>  /**
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 7db9da609c84f..6e78e02f74bd3 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -1191,10 +1191,6 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
>  	if (is_vm_hugetlb_page(vma))
>  		clear_vma_resv_huge_pages(vma);
>
> -	/* Tell pfnmap has moved from this vma */
> -	if (unlikely(vma->vm_flags & VM_PFNMAP))
> -		untrack_pfn_clear(vma);
> -

Thanks! <3

>  	*new_vma_ptr = new_vma;
>  	return err;
>  }
> --
> 2.49.0
>
Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
Posted by David Hildenbrand 9 months, 1 week ago
>>
>> This change implies that we'll keep tracking the original PFN range even
>> after splitting + partially unmapping it: not too bad, because it was
>> not working reliably before. The only thing that kind-of worked before
>> was shrinking such a mapping using mremap(): we managed to adjust the
>> reservation in a hacky way, now we won't adjust the reservation but
>> leave it around until all involved VMAs are gone.
> 
> Hm, but what if we shrink a VMA, then map another one, might it be
> incorrectly storing PAT attributes for part of the range that is now mapped
> elsewhere?

Not "incorrectly". We'll simply undo the reservation of the cachemode 
for the original PFN range once everything of the original VMA is gone.

AFAIK, one can usually mmap() the "unmapped" part again after shrinking,
with the same cachemode, which should be the main use case.

Supporting partial un-tracking will require hooking into vma splitting 
code ... not something I am super happy about. :)

> 
> Also my god re: the 'kind of working' aspects of PAT, so frustrating.
> 
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Generally looking good, afaict, but maybe let's get some input from Suren
> on VMA size.
> 
> Are there actually any PAT tests out there? I had a quick glance in
> tools/testing/selftests/x86,mm and couldn't find any, but didn't look
> _that_ hard.

Heh, booting a simple VM gets PAT involved. I suspect because of /dev/mem
and BIOS/GPU/whatever hacks.

In the cover letter I have

"Briefly tested with some basic /dev/mem test I crafted. I want to 
convert them to selftests, but that might or might not require a bit of
more work (e.g., /dev/mem accessibility)."

> 
> Thanks in general for tackling this, this is a big improvement!
> 
>> ---
>>   include/linux/mm_inline.h |  2 +
>>   include/linux/mm_types.h  | 11 ++++++
>>   kernel/fork.c             | 54 ++++++++++++++++++++++++--
>>   mm/memory.c               | 81 +++++++++++++++++++++++++++++++--------
>>   mm/mremap.c               |  4 --
>>   5 files changed, 128 insertions(+), 24 deletions(-)
>>
>> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
>> index f9157a0c42a5c..89b518ff097e6 100644
>> --- a/include/linux/mm_inline.h
>> +++ b/include/linux/mm_inline.h
>> @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
>>
>>   #endif  /* CONFIG_ANON_VMA_NAME */
>>
>> +void pfnmap_track_ctx_release(struct kref *ref);
>> +
>>   static inline void init_tlb_flush_pending(struct mm_struct *mm)
>>   {
>>   	atomic_set(&mm->tlb_flush_pending, 0);
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 56d07edd01f91..91124761cfda8 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -764,6 +764,14 @@ struct vma_numab_state {
>>   	int prev_scan_seq;
>>   };
>>
>> +#ifdef __HAVE_PFNMAP_TRACKING
>> +struct pfnmap_track_ctx {
>> +	struct kref kref;
>> +	unsigned long pfn;
>> +	unsigned long size;
> 
> Again, (super) nitty, but we really should express units. I suppose 'size'
> implies bytes to be honest as you'd unlikely say 'size' for number of pages
> (you'd go with nr_pages or something). But maybe a trailing /* in bytes */
> would help.
> 
> Not a big deal though!

"size" in the kernel is usually bytes, never pages ... but I might be wrong.

Anyhow, I can use "/* in bytes*/" here, although I doubt that many will 
benefit from this comment :)

> 
>> +};
>> +#endif
>> +
>>   /*
>>    * This struct describes a virtual memory area. There is one of these
>>    * per VM-area/task. A VM area is any part of the process virtual memory
>> @@ -877,6 +885,9 @@ struct vm_area_struct {
>>   	struct anon_vma_name *anon_name;
>>   #endif
>>   	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>> +#ifdef __HAVE_PFNMAP_TRACKING
> 
> An aside, but absolutely hate '__HAVE_PFNMAP_TRACKING' as a name here. But
> you didn't create it, and it's not really sensible to change it in this
> series so. Just a rumble...

I cannot argue with that ... same here.

To be clear: I hate all of this with passion ;) With this series, I hate 
it a bit less.

[...]

> 
> Obviously my series will break this but should be _fairly_ trivial to
> update.
> 
> You will however have to make sure to update tools/testing/vma/* to handle
> the new functions in userland testing (they need to be stubbed out).

Ah, I was happy it compiled but looks like I'll have to mess with that 
as well.

> 
> If it makes life easier, you can even send it to me off-list, or just send
> it without changing this in a respin and I can fix it up fairly quick for
> you.

Let me give it a try first, I'll let you know if it takes me too long.

Thanks!

[...]

>>   /**
>>    * remap_pfn_range - remap kernel memory to userspace
>>    * @vma: user vma to map to
>> @@ -2883,20 +2902,50 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
>>    *
>>    * Return: %0 on success, negative error code otherwise.
>>    */
>> +#ifdef __HAVE_PFNMAP_TRACKING
>>   int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
>>   		    unsigned long pfn, unsigned long size, pgprot_t prot)
> 
> OK so to expose some of my lack-of-knowledge of PAT - is this the
> 'entrypoint' to PAT tracking?

Only if you're using remap_pfn_range() ... there is other low-level 
tracking/reservation using the memtype_reserve() interface and friends.

> 
> So we have some kernel memory we remap to userland as PFN map, the kind
> that very well might be sensible to use PAT the change cache behaviour for,
> and each time this happens, it's mapped as PAT?

Right, anytime someone uses remap_pfn_range() on the full VMA, we track 
it (depending on RAM vs. !RAM this "tracking" has different semantics).

For RAM, we seem to only look up the cachemode. For !RAM, we seem to 
reserve the memtype for the PFN range, which will fail if there already 
is an incompatible memtype reserved.

It's all ... very weird.
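
As a tiny illustration of that entry point (a hypothetical driver ->mmap
handler; demo_phys_base is made up and not from this series):

static int demo_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;
	unsigned long pfn = demo_phys_base >> PAGE_SHIFT;

	/*
	 * Covers [vm_start, vm_end), so with this series it takes the
	 * full-VMA path: pfnmap_track() plus a pfnmap_track_ctx linked
	 * from the VMA, untracked once the last VMA referencing the
	 * ctx is freed.
	 */
	return remap_pfn_range(vma, vma->vm_start, pfn, size,
			       vma->vm_page_prot);
}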

-- 
Cheers,

David / dhildenb
Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
Posted by David Hildenbrand 9 months, 1 week ago
>>
>> Obviously my series will break this but should be _fairly_ trivial to
>> update.
>>
>> You will however have to make sure to update tools/testing/vma/* to handle
>> the new functions in userland testing (they need to be stubbed out).

Hmm, seems to compile. I guess that's because we won't have
"__HAVE_PFNMAP_TRACKING" defined in the test environment, so
the existing stubs in there already do the trick.


+#else
+static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
+              struct vm_area_struct *new)
+{
+      return 0;
  }
+static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
+{
+}
+#endif


-- 
Cheers,

David / dhildenb
Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
Posted by Lorenzo Stoakes 9 months, 1 week ago
On Wed, May 07, 2025 at 03:25:42PM +0200, David Hildenbrand wrote:
> > >
> > > Obviously my series will break this but should be _fairly_ trivial to
> > > update.
> > >
> > > You will however have to make sure to update tools/testing/vma/* to handle
> > > the new functions in userland testing (they need to be stubbed out).
>
> Hmm, seems to compile. I guess, because we won't have
> "__HAVE_PFNMAP_TRACKING" defined in the test environment, so
> the existing stubs in there already seem to do the trick.
>
>
> +#else
> +static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
> +              struct vm_area_struct *new)
> +{
> +      return 0;
>  }
> +static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
> +{
> +}
> +#endif
>

OK cool! Then we're good :>)

>
> --
> Cheers,
>
> David / dhildenb
>
Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
Posted by Lorenzo Stoakes 9 months, 2 weeks ago
+cc Suren, who has worked HEAVILY on VMA field manipulation and such :)

Suren - David is proposing adding a new field. AFAICT this does not add a
new cache line so I think we're all good.

But FYI!

On Fri, Apr 25, 2025 at 10:17:09AM +0200, David Hildenbrand wrote:
[...]
Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
Posted by Suren Baghdasaryan 9 months, 2 weeks ago
On Mon, Apr 28, 2025 at 12:47 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> +cc Suren, who has worked HEAVILY on VMA field manipulation and such :)
>
> Suren - David is proposing adding a new field. AFAICT this does not add a
> new cache line so I think we're all good.
>
> But FYI!

Thanks! Yes, there should be some space in the last cacheline after my
last field reshuffling.

[...]
Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
Posted by David Hildenbrand 9 months, 2 weeks ago
On 28.04.25 22:00, Suren Baghdasaryan wrote:
> On Mon, Apr 28, 2025 at 12:47 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
>>
>> +cc Suren, who has worked HEAVILY on VMA field manipulation and such :)
>>
>> Suren - David is proposing adding a new field. AFAICT this does not add a
>> new cache line so I think we're all good.
>>
>> But FYI!
> 
> Thanks! Yes, there should be some space in the last cacheline after my
> last field reshuffling.

That explains why -- the last time I looked at this -- there was no 
easy space available. Thanks for that!

-- 
Cheers,

David / dhildenb

Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
Posted by Peter Xu 9 months, 2 weeks ago
On Fri, Apr 25, 2025 at 10:17:09AM +0200, David Hildenbrand wrote:
[...]
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 56d07edd01f91..91124761cfda8 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -764,6 +764,14 @@ struct vma_numab_state {
>  	int prev_scan_seq;
>  };
>  
> +#ifdef __HAVE_PFNMAP_TRACKING
> +struct pfnmap_track_ctx {
> +	struct kref kref;
> +	unsigned long pfn;
> +	unsigned long size;
> +};
> +#endif
> +
>  /*
>   * This struct describes a virtual memory area. There is one of these
>   * per VM-area/task. A VM area is any part of the process virtual memory
> @@ -877,6 +885,9 @@ struct vm_area_struct {
>  	struct anon_vma_name *anon_name;
>  #endif
>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +#ifdef __HAVE_PFNMAP_TRACKING
> +	struct pfnmap_track_ctx *pfnmap_track_ctx;
> +#endif

So this was originally the small concern (or is it small?) that this will
grow every vma on x86, am I right?

After all, pfnmap VMAs are the minority, so I was wondering whether we
could work this out without extending the vma struct.

I had a quick thought quite a while ago, but never tried it out (it went
almost off-track since vfio switched away from remap_pfn_range..), which is
to have x86 maintain its own vma <-> pfn tracking mapping using a global
structure.  After all, the memtype code already does something like that
with memtype_rbroot, so I was wondering whether the vma info could be
remembered there as well, so as to get rid of get_pat_info() too.

Maybe it would also need a 2nd layer like what you did with the track ctx,
but with the tree maintaining the mapping instead of adding the ctx pointer
to the vma.
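
Something like the following, very roughly (all names are made up and
untested, just to sketch the idea):

/*
 * Hypothetical: a global tree of tracked ranges instead of a per-VMA
 * pointer, so vm_area_struct doesn't grow on x86.
 */
struct pfnmap_track_entry {
	struct rb_node node;	/* e.g. keyed by pfn, like memtype_rbroot */
	unsigned long pfn;
	unsigned long size;
	struct kref kref;
};

/*
 * Looked up on fork/split/unmap instead of following
 * vma->pfnmap_track_ctx; would need its own lock.
 */
struct pfnmap_track_entry *pfnmap_track_lookup(unsigned long pfn,
					       unsigned long size);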

Maybe it could work by squashing the two layers (or, say, extending the
memtype rbtree), but maybe not..

It could make things slightly slower than a vma->pfnmap_track_ctx deref
when looking up the pfn while holding a vma ref, but I assume that's OK
considering that track/untrack should be a slow path for pfnmaps, and
pfnmaps shouldn't be all that common.

I didn't think further, but if that'll work it'll definitely avoid the
additional fields on x86 vmas.  I'm curious whether you explored that
direction, or maybe it's a known decision that the 8 bytes isn't a concern.
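
A very rough, untested sketch of the direction I mean (all names below are
invented, locking is ignored, and fork/split handling is left out): key a
global interval tree by the tracked pfn range, similar to what memtype_rbroot
does, instead of hanging anything off the VMA:

#include <linux/interval_tree.h>
#include <linux/mm.h>
#include <linux/slab.h>

struct pfnmap_track_node {
	struct interval_tree_node it;	/* it.start / it.last = tracked pfn range */
	pgprot_t prot;			/* roughly what get_pat_info() digs out today */
};

static struct rb_root_cached pfnmap_track_root = RB_ROOT_CACHED;

static int pfnmap_track_remember(unsigned long pfn, unsigned long npages,
				 pgprot_t prot)
{
	struct pfnmap_track_node *node = kmalloc(sizeof(*node), GFP_KERNEL);

	if (!node)
		return -ENOMEM;
	node->it.start = pfn;
	node->it.last = pfn + npages - 1;
	node->prot = prot;
	interval_tree_insert(&node->it, &pfnmap_track_root);
	return 0;
}

static struct pfnmap_track_node *pfnmap_track_lookup(unsigned long pfn)
{
	struct interval_tree_node *it =
		interval_tree_iter_first(&pfnmap_track_root, pfn, pfn);

	return it ? container_of(it, struct pfnmap_track_node, it) : NULL;
}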

Thanks,

-- 
Peter Xu
Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
Posted by David Hildenbrand 9 months, 2 weeks ago
On 25.04.25 22:23, Peter Xu wrote:
> On Fri, Apr 25, 2025 at 10:17:09AM +0200, David Hildenbrand wrote:
>> Let's use our new interface. In remap_pfn_range(), we'll now decide
>> whether we have to track (full VMA covered) or only sanitize the pgprot
>> (partial VMA covered).
>>
>> Remember what we have to untrack by linking it from the VMA. When
>> duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similar
>> to anon VMA names, and use a kref to share the tracking.
>>
>> Once the last VMA un-refs our tracking data, we'll do the untracking,
>> which simplifies things a lot and should sort our various issues we saw
>> recently, for example, when partially unmapping/zapping a tracked VMA.
>>
>> This change implies that we'll keep tracking the original PFN range even
>> after splitting + partially unmapping it: not too bad, because it was
>> not working reliably before. The only thing that kind-of worked before
>> was shrinking such a mapping using mremap(): we managed to adjust the
>> reservation in a hacky way, now we won't adjust the reservation but
>> leave it around until all involved VMAs are gone.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   include/linux/mm_inline.h |  2 +
>>   include/linux/mm_types.h  | 11 ++++++
>>   kernel/fork.c             | 54 ++++++++++++++++++++++++--
>>   mm/memory.c               | 81 +++++++++++++++++++++++++++++++--------
>>   mm/mremap.c               |  4 --
>>   5 files changed, 128 insertions(+), 24 deletions(-)
>>
>> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
>> index f9157a0c42a5c..89b518ff097e6 100644
>> --- a/include/linux/mm_inline.h
>> +++ b/include/linux/mm_inline.h
>> @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
>>   
>>   #endif  /* CONFIG_ANON_VMA_NAME */
>>   
>> +void pfnmap_track_ctx_release(struct kref *ref);
>> +
>>   static inline void init_tlb_flush_pending(struct mm_struct *mm)
>>   {
>>   	atomic_set(&mm->tlb_flush_pending, 0);
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 56d07edd01f91..91124761cfda8 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -764,6 +764,14 @@ struct vma_numab_state {
>>   	int prev_scan_seq;
>>   };
>>   
>> +#ifdef __HAVE_PFNMAP_TRACKING
>> +struct pfnmap_track_ctx {
>> +	struct kref kref;
>> +	unsigned long pfn;
>> +	unsigned long size;
>> +};
>> +#endif
>> +
>>   /*
>>    * This struct describes a virtual memory area. There is one of these
>>    * per VM-area/task. A VM area is any part of the process virtual memory
>> @@ -877,6 +885,9 @@ struct vm_area_struct {
>>   	struct anon_vma_name *anon_name;
>>   #endif
>>   	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>> +#ifdef __HAVE_PFNMAP_TRACKING
>> +	struct pfnmap_track_ctx *pfnmap_track_ctx;
>> +#endif
> 
> So this was originally the small concern (or is it small?) that this will
> grow every vma on x86, am I right?

Yeah, and last time I looked into this, it would have grown it such that it would
require a bigger slab. Right now:

Before this change:

struct vm_area_struct {
	union {
		struct {
			long unsigned int vm_start;      /*     0     8 */
			long unsigned int vm_end;        /*     8     8 */
		};                                       /*     0    16 */
		freeptr_t          vm_freeptr;           /*     0     8 */
	};                                               /*     0    16 */
	struct mm_struct *         vm_mm;                /*    16     8 */
	pgprot_t                   vm_page_prot;         /*    24     8 */
	union {
		const vm_flags_t   vm_flags;             /*    32     8 */
		vm_flags_t         __vm_flags;           /*    32     8 */
	};                                               /*    32     8 */
	unsigned int               vm_lock_seq;          /*    40     4 */

	/* XXX 4 bytes hole, try to pack */

	struct list_head           anon_vma_chain;       /*    48    16 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct anon_vma *          anon_vma;             /*    64     8 */
	const struct vm_operations_struct  * vm_ops;     /*    72     8 */
	long unsigned int          vm_pgoff;             /*    80     8 */
	struct file *              vm_file;              /*    88     8 */
	void *                     vm_private_data;      /*    96     8 */
	atomic_long_t              swap_readahead_info;  /*   104     8 */
	struct mempolicy *         vm_policy;            /*   112     8 */
	struct vma_numab_state *   numab_state;          /*   120     8 */
	/* --- cacheline 2 boundary (128 bytes) --- */
	refcount_t                 vm_refcnt __attribute__((__aligned__(64))); /*   128     4 */

	/* XXX 4 bytes hole, try to pack */

	struct {
		struct rb_node     rb __attribute__((__aligned__(8))); /*   136    24 */
		long unsigned int  rb_subtree_last;      /*   160     8 */
	} __attribute__((__aligned__(8))) shared __attribute__((__aligned__(8)));        /*   136    32 */
	struct anon_vma_name *     anon_name;            /*   168     8 */
	struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     0 */

	/* size: 192, cachelines: 3, members: 18 */
	/* sum members: 168, holes: 2, sum holes: 8 */
	/* padding: 16 */
	/* forced alignments: 2, forced holes: 1, sum forced holes: 4 */
} __attribute__((__aligned__(64)));

After this change:

struct vm_area_struct {
	union {
		struct {
			long unsigned int vm_start;      /*     0     8 */
			long unsigned int vm_end;        /*     8     8 */
		};                                       /*     0    16 */
		freeptr_t          vm_freeptr;           /*     0     8 */
	};                                               /*     0    16 */
	struct mm_struct *         vm_mm;                /*    16     8 */
	pgprot_t                   vm_page_prot;         /*    24     8 */
	union {
		const vm_flags_t   vm_flags;             /*    32     8 */
		vm_flags_t         __vm_flags;           /*    32     8 */
	};                                               /*    32     8 */
	unsigned int               vm_lock_seq;          /*    40     4 */

	/* XXX 4 bytes hole, try to pack */

	struct list_head           anon_vma_chain;       /*    48    16 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct anon_vma *          anon_vma;             /*    64     8 */
	const struct vm_operations_struct  * vm_ops;     /*    72     8 */
	long unsigned int          vm_pgoff;             /*    80     8 */
	struct file *              vm_file;              /*    88     8 */
	void *                     vm_private_data;      /*    96     8 */
	atomic_long_t              swap_readahead_info;  /*   104     8 */
	struct mempolicy *         vm_policy;            /*   112     8 */
	struct vma_numab_state *   numab_state;          /*   120     8 */
	/* --- cacheline 2 boundary (128 bytes) --- */
	refcount_t                 vm_refcnt __attribute__((__aligned__(64))); /*   128     4 */

	/* XXX 4 bytes hole, try to pack */

	struct {
		struct rb_node     rb __attribute__((__aligned__(8))); /*   136    24 */
		long unsigned int  rb_subtree_last;      /*   160     8 */
	} __attribute__((__aligned__(8))) shared __attribute__((__aligned__(8)));        /*   136    32 */
	struct anon_vma_name *     anon_name;            /*   168     8 */
	struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     0 */
	struct pfnmap_track_ctx *  pfnmap_track_ctx;     /*   176     8 */

	/* size: 192, cachelines: 3, members: 19 */
	/* sum members: 176, holes: 2, sum holes: 8 */
	/* padding: 8 */
	/* forced alignments: 2, forced holes: 1, sum forced holes: 4 */
} __attribute__((__aligned__(64)));

Observe that we allocate 192 bytes with or without pfnmap_track_ctx. (IIRC,
slab sizes are ... 128, 192, 256, 512, ...)

> 
> After all pfnmap vmas are the minority, I was thinking whether we could
> work it out without extending vma struct.

Heh, similar to userfaultfd on most systems, or ones with a mempolicy, or
anon vma names, ... :)

But yeah, pfnmap is certainly a minority as well.

> 
> I had a quick thought quite a while ago, but never tried out (it was almost
> off-track since vfio switched away from remap_pfn_range..), which is to
> have x86 maintain its own mapping of vma <-> pfn tracking using a global
> stucture.  After all, the memtype code did it already with the
> memtype_rbroot, so I was thinking if vma info can be memorized as well, so
> as to get rid of get_pat_info() too.
> 
> Maybe it also needs the 2nd layer like what you did with the track ctx, but
> the tree maintains the mapping instead of adding the ctx pointer into vma.
> 
> Maybe it could work with squashing the two layers (or say, extending
> memtype rbtree), but maybe not..
> 
> It could make it slightly slower than vma->pfnmap_track_ctx ref when
> looking up pfn when holding a vma ref, but I assume it's ok considering
> that track/untrack should be slow path for pfnmaps, and pfnmaps shouldn't
> be a huge lot.
> 
> I didn't think further, but if that'll work it'll definitely avoids the
> additional fields on x86 vmas.  I'm curious whether you explored that
> direction, or maybe it's a known decision that the 8 bytes isn't a concern.

When discussing this approach with Lorenzo, I raised that we could simply
map the VMA to that structure using an xarray.

But then, if we're not effectively allocating any more space, it's probably
not worth adding more complexity right now.
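
FWIW, a minimal sketch of what that xarray variant could look like (untested;
the helper names are invented, and it would still need proper locking plus the
same kref handling on fork/split; it reuses the pfnmap_track_ctx bits from
this patch):

#include <linux/kref.h>
#include <linux/mm_types.h>
#include <linux/xarray.h>

static DEFINE_XARRAY(pfnmap_ctx_xa);	/* vma pointer -> struct pfnmap_track_ctx */

static int vma_set_pfnmap_ctx(struct vm_area_struct *vma,
			      struct pfnmap_track_ctx *ctx)
{
	/* xa_store() returns the old entry or an xa_err()-encoded pointer. */
	return xa_err(xa_store(&pfnmap_ctx_xa, (unsigned long)vma, ctx,
			       GFP_KERNEL));
}

static struct pfnmap_track_ctx *vma_get_pfnmap_ctx(struct vm_area_struct *vma)
{
	return xa_load(&pfnmap_ctx_xa, (unsigned long)vma);
}

static void vma_clear_pfnmap_ctx(struct vm_area_struct *vma)
{
	struct pfnmap_track_ctx *ctx = xa_erase(&pfnmap_ctx_xa, (unsigned long)vma);

	if (ctx)
		kref_put(&ctx->kref, pfnmap_track_ctx_release);
}

An xarray is tuned for dense indices, though, so keying it by a VMA pointer
wastes memory; a hashtable or rbtree would likely be a better fit.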

-- 
Cheers,

David / dhildenb
Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
Posted by Peter Xu 9 months, 2 weeks ago
On Fri, Apr 25, 2025 at 10:36:55PM +0200, David Hildenbrand wrote:
> On 25.04.25 22:23, Peter Xu wrote:
> > On Fri, Apr 25, 2025 at 10:17:09AM +0200, David Hildenbrand wrote:
> > > Let's use our new interface. In remap_pfn_range(), we'll now decide
> > > whether we have to track (full VMA covered) or only sanitize the pgprot
> > > (partial VMA covered).
> > > 
> > > Remember what we have to untrack by linking it from the VMA. When
> > > duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similar
> > > to anon VMA names, and use a kref to share the tracking.
> > > 
> > > Once the last VMA un-refs our tracking data, we'll do the untracking,
> > > which simplifies things a lot and should sort our various issues we saw
> > > recently, for example, when partially unmapping/zapping a tracked VMA.
> > > 
> > > This change implies that we'll keep tracking the original PFN range even
> > > after splitting + partially unmapping it: not too bad, because it was
> > > not working reliably before. The only thing that kind-of worked before
> > > was shrinking such a mapping using mremap(): we managed to adjust the
> > > reservation in a hacky way, now we won't adjust the reservation but
> > > leave it around until all involved VMAs are gone.
> > > 
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > > ---
> > >   include/linux/mm_inline.h |  2 +
> > >   include/linux/mm_types.h  | 11 ++++++
> > >   kernel/fork.c             | 54 ++++++++++++++++++++++++--
> > >   mm/memory.c               | 81 +++++++++++++++++++++++++++++++--------
> > >   mm/mremap.c               |  4 --
> > >   5 files changed, 128 insertions(+), 24 deletions(-)
> > > 
> > > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > > index f9157a0c42a5c..89b518ff097e6 100644
> > > --- a/include/linux/mm_inline.h
> > > +++ b/include/linux/mm_inline.h
> > > @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
> > >   #endif  /* CONFIG_ANON_VMA_NAME */
> > > +void pfnmap_track_ctx_release(struct kref *ref);
> > > +
> > >   static inline void init_tlb_flush_pending(struct mm_struct *mm)
> > >   {
> > >   	atomic_set(&mm->tlb_flush_pending, 0);
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 56d07edd01f91..91124761cfda8 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -764,6 +764,14 @@ struct vma_numab_state {
> > >   	int prev_scan_seq;
> > >   };
> > > +#ifdef __HAVE_PFNMAP_TRACKING
> > > +struct pfnmap_track_ctx {
> > > +	struct kref kref;
> > > +	unsigned long pfn;
> > > +	unsigned long size;
> > > +};
> > > +#endif
> > > +
> > >   /*
> > >    * This struct describes a virtual memory area. There is one of these
> > >    * per VM-area/task. A VM area is any part of the process virtual memory
> > > @@ -877,6 +885,9 @@ struct vm_area_struct {
> > >   	struct anon_vma_name *anon_name;
> > >   #endif
> > >   	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> > > +#ifdef __HAVE_PFNMAP_TRACKING
> > > +	struct pfnmap_track_ctx *pfnmap_track_ctx;
> > > +#endif
> > 
> > So this was originally the small concern (or is it small?) that this will
> > grow every vma on x86, am I right?
> 
> Yeah, and last time I looked into this, it would have grown it such that it would
> require a bigger slab. Right now:

Probably due to the config you have.  E.g., when I look at mine it's much
bigger, already consuming 256B, but that's because I enabled more things
(userfaultfd, lockdep, etc.).

> 
> Before this change:
> 
> struct vm_area_struct {
> 	union {
> 		struct {
> 			long unsigned int vm_start;      /*     0     8 */
> 			long unsigned int vm_end;        /*     8     8 */
> 		};                                       /*     0    16 */
> 		freeptr_t          vm_freeptr;           /*     0     8 */
> 	};                                               /*     0    16 */
> 	struct mm_struct *         vm_mm;                /*    16     8 */
> 	pgprot_t                   vm_page_prot;         /*    24     8 */
> 	union {
> 		const vm_flags_t   vm_flags;             /*    32     8 */
> 		vm_flags_t         __vm_flags;           /*    32     8 */
> 	};                                               /*    32     8 */
> 	unsigned int               vm_lock_seq;          /*    40     4 */
> 
> 	/* XXX 4 bytes hole, try to pack */
> 
> 	struct list_head           anon_vma_chain;       /*    48    16 */
> 	/* --- cacheline 1 boundary (64 bytes) --- */
> 	struct anon_vma *          anon_vma;             /*    64     8 */
> 	const struct vm_operations_struct  * vm_ops;     /*    72     8 */
> 	long unsigned int          vm_pgoff;             /*    80     8 */
> 	struct file *              vm_file;              /*    88     8 */
> 	void *                     vm_private_data;      /*    96     8 */
> 	atomic_long_t              swap_readahead_info;  /*   104     8 */
> 	struct mempolicy *         vm_policy;            /*   112     8 */
> 	struct vma_numab_state *   numab_state;          /*   120     8 */
> 	/* --- cacheline 2 boundary (128 bytes) --- */
> 	refcount_t                 vm_refcnt __attribute__((__aligned__(64))); /*   128     4 */
> 
> 	/* XXX 4 bytes hole, try to pack */
> 
> 	struct {
> 		struct rb_node     rb __attribute__((__aligned__(8))); /*   136    24 */
> 		long unsigned int  rb_subtree_last;      /*   160     8 */
> 	} __attribute__((__aligned__(8))) shared __attribute__((__aligned__(8)));        /*   136    32 */
> 	struct anon_vma_name *     anon_name;            /*   168     8 */
> 	struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     0 */
> 
> 	/* size: 192, cachelines: 3, members: 18 */
> 	/* sum members: 168, holes: 2, sum holes: 8 */
> 	/* padding: 16 */
> 	/* forced alignments: 2, forced holes: 1, sum forced holes: 4 */
> } __attribute__((__aligned__(64)));
> 
> After this change:
> 
> struct vm_area_struct {
> 	union {
> 		struct {
> 			long unsigned int vm_start;      /*     0     8 */
> 			long unsigned int vm_end;        /*     8     8 */
> 		};                                       /*     0    16 */
> 		freeptr_t          vm_freeptr;           /*     0     8 */
> 	};                                               /*     0    16 */
> 	struct mm_struct *         vm_mm;                /*    16     8 */
> 	pgprot_t                   vm_page_prot;         /*    24     8 */
> 	union {
> 		const vm_flags_t   vm_flags;             /*    32     8 */
> 		vm_flags_t         __vm_flags;           /*    32     8 */
> 	};                                               /*    32     8 */
> 	unsigned int               vm_lock_seq;          /*    40     4 */
> 
> 	/* XXX 4 bytes hole, try to pack */
> 
> 	struct list_head           anon_vma_chain;       /*    48    16 */
> 	/* --- cacheline 1 boundary (64 bytes) --- */
> 	struct anon_vma *          anon_vma;             /*    64     8 */
> 	const struct vm_operations_struct  * vm_ops;     /*    72     8 */
> 	long unsigned int          vm_pgoff;             /*    80     8 */
> 	struct file *              vm_file;              /*    88     8 */
> 	void *                     vm_private_data;      /*    96     8 */
> 	atomic_long_t              swap_readahead_info;  /*   104     8 */
> 	struct mempolicy *         vm_policy;            /*   112     8 */
> 	struct vma_numab_state *   numab_state;          /*   120     8 */
> 	/* --- cacheline 2 boundary (128 bytes) --- */
> 	refcount_t                 vm_refcnt __attribute__((__aligned__(64))); /*   128     4 */
> 
> 	/* XXX 4 bytes hole, try to pack */
> 
> 	struct {
> 		struct rb_node     rb __attribute__((__aligned__(8))); /*   136    24 */
> 		long unsigned int  rb_subtree_last;      /*   160     8 */
> 	} __attribute__((__aligned__(8))) shared __attribute__((__aligned__(8)));        /*   136    32 */
> 	struct anon_vma_name *     anon_name;            /*   168     8 */
> 	struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     0 */
> 	struct pfnmap_track_ctx *  pfnmap_track_ctx;     /*   176     8 */
> 
> 	/* size: 192, cachelines: 3, members: 19 */
> 	/* sum members: 176, holes: 2, sum holes: 8 */
> 	/* padding: 8 */
> 	/* forced alignments: 2, forced holes: 1, sum forced holes: 4 */
> } __attribute__((__aligned__(64)));
> 
> Observe that we allocate 192 bytes with or without pfnmap_track_ctx. (IIRC,
> slab sizes are ... 128, 192, 256, 512, ...)

True. I just double checked: vm_area_cachep has SLAB_HWCACHE_ALIGN set, so I
think it does indeed work like that on x86_64 at least.  So it looks like the
new field isn't an immediate concern.
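
To spell out the effect (illustrative only, not how vm_area_cachep is actually
set up; see kernel/fork.c for the real thing): SLAB_HWCACHE_ALIGN rounds each
object up to the cacheline size, so on x86-64 anything between 129 and 192
bytes ends up consuming the same 192-byte slot:

#include <linux/init.h>
#include <linux/slab.h>

static struct kmem_cache *demo_cachep;

static int __init demo_cache_init(void)
{
	/*
	 * A 184-byte object and a 192-byte object have the same per-object
	 * footprint here: SLAB_HWCACHE_ALIGN aligns objects to the 64-byte
	 * cacheline, so both round up to 192 bytes.
	 */
	demo_cachep = kmem_cache_create("demo_vma_sized", 184, 0,
					SLAB_HWCACHE_ALIGN, NULL);
	return demo_cachep ? 0 : -ENOMEM;
}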

Thanks,

-- 
Peter Xu
Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
Posted by David Hildenbrand 9 months, 2 weeks ago
On 28.04.25 18:08, Peter Xu wrote:
> On Fri, Apr 25, 2025 at 10:36:55PM +0200, David Hildenbrand wrote:
>> On 25.04.25 22:23, Peter Xu wrote:
>>> On Fri, Apr 25, 2025 at 10:17:09AM +0200, David Hildenbrand wrote:
>>>> Let's use our new interface. In remap_pfn_range(), we'll now decide
>>>> whether we have to track (full VMA covered) or only sanitize the pgprot
>>>> (partial VMA covered).
>>>>
>>>> Remember what we have to untrack by linking it from the VMA. When
>>>> duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similar
>>>> to anon VMA names, and use a kref to share the tracking.
>>>>
>>>> Once the last VMA un-refs our tracking data, we'll do the untracking,
>>>> which simplifies things a lot and should sort our various issues we saw
>>>> recently, for example, when partially unmapping/zapping a tracked VMA.
>>>>
>>>> This change implies that we'll keep tracking the original PFN range even
>>>> after splitting + partially unmapping it: not too bad, because it was
>>>> not working reliably before. The only thing that kind-of worked before
>>>> was shrinking such a mapping using mremap(): we managed to adjust the
>>>> reservation in a hacky way, now we won't adjust the reservation but
>>>> leave it around until all involved VMAs are gone.
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>    include/linux/mm_inline.h |  2 +
>>>>    include/linux/mm_types.h  | 11 ++++++
>>>>    kernel/fork.c             | 54 ++++++++++++++++++++++++--
>>>>    mm/memory.c               | 81 +++++++++++++++++++++++++++++++--------
>>>>    mm/mremap.c               |  4 --
>>>>    5 files changed, 128 insertions(+), 24 deletions(-)
>>>>
>>>> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
>>>> index f9157a0c42a5c..89b518ff097e6 100644
>>>> --- a/include/linux/mm_inline.h
>>>> +++ b/include/linux/mm_inline.h
>>>> @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
>>>>    #endif  /* CONFIG_ANON_VMA_NAME */
>>>> +void pfnmap_track_ctx_release(struct kref *ref);
>>>> +
>>>>    static inline void init_tlb_flush_pending(struct mm_struct *mm)
>>>>    {
>>>>    	atomic_set(&mm->tlb_flush_pending, 0);
>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>> index 56d07edd01f91..91124761cfda8 100644
>>>> --- a/include/linux/mm_types.h
>>>> +++ b/include/linux/mm_types.h
>>>> @@ -764,6 +764,14 @@ struct vma_numab_state {
>>>>    	int prev_scan_seq;
>>>>    };
>>>> +#ifdef __HAVE_PFNMAP_TRACKING
>>>> +struct pfnmap_track_ctx {
>>>> +	struct kref kref;
>>>> +	unsigned long pfn;
>>>> +	unsigned long size;
>>>> +};
>>>> +#endif
>>>> +
>>>>    /*
>>>>     * This struct describes a virtual memory area. There is one of these
>>>>     * per VM-area/task. A VM area is any part of the process virtual memory
>>>> @@ -877,6 +885,9 @@ struct vm_area_struct {
>>>>    	struct anon_vma_name *anon_name;
>>>>    #endif
>>>>    	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>>>> +#ifdef __HAVE_PFNMAP_TRACKING
>>>> +	struct pfnmap_track_ctx *pfnmap_track_ctx;
>>>> +#endif
>>>
>>> So this was originally the small concern (or is it small?) that this will
>>> grow every vma on x86, am I right?
>>
>> Yeah, and last time I looked into this, it would have grown it such that it would
>> require a bigger slab. Right now:
> 
> Probably due to what config you have.  E.g., when I'm looking mine it's
> much bigger and already consuming 256B, but it's because I enabled more
> things (userfaultfd, lockdep, etc.).

Note that I enabled everything that you would expect on a production 
system (incl. userfaultfd, mempolicy, per-vma locks), so I didn't 
enable lockdep.

Thanks for verifying!

-- 
Cheers,

David / dhildenb
Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
Posted by Peter Xu 9 months, 2 weeks ago
On Mon, Apr 28, 2025 at 06:16:21PM +0200, David Hildenbrand wrote:
> > Probably due to what config you have.  E.g., when I'm looking mine it's
> > much bigger and already consuming 256B, but it's because I enabled more
> > things (userfaultfd, lockdep, etc.).
> 
> Note that I enabled everything that you would expect on a production system
> (incld. userfaultfd, mempolicy, per-vma locks), so I didn't enable lockep.

I still doubt whether you at least enabled userfaultfd, e.g., your previous
paste has:

  struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     0 */

Not something that matters.. but just in case you didn't use the expected
config file you wanted to use..

-- 
Peter Xu
Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
Posted by David Hildenbrand 9 months, 2 weeks ago
On 28.04.25 18:24, Peter Xu wrote:
> On Mon, Apr 28, 2025 at 06:16:21PM +0200, David Hildenbrand wrote:
>>> Probably due to what config you have.  E.g., when I'm looking mine it's
>>> much bigger and already consuming 256B, but it's because I enabled more
>>> things (userfaultfd, lockdep, etc.).
>>
>> Note that I enabled everything that you would expect on a production system
>> (incld. userfaultfd, mempolicy, per-vma locks), so I didn't enable lockep.
> 
> I still doubt whether you at least enabled userfaultfd, e.g., your previous
> paste has:
> 
>    struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     0 */
> 
> Not something that matters.. but just in case you didn't use the expected
> config file you wanted to use..

You're absolutely right. I only briefly rechecked for this purpose here 
on my notebook, and only looked for the existence of members, not 
expecting that we have confusing stuff like vm_userfaultfd_ctx.

I checked again and the size stays at 192 with allyesconfig and then 
disabling debug options.

-- 
Cheers,

David / dhildenb
Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
Posted by Lorenzo Stoakes 9 months, 2 weeks ago
On Mon, Apr 28, 2025 at 07:23:18PM +0200, David Hildenbrand wrote:
> On 28.04.25 18:24, Peter Xu wrote:
> > On Mon, Apr 28, 2025 at 06:16:21PM +0200, David Hildenbrand wrote:
> > > > Probably due to what config you have.  E.g., when I'm looking mine it's
> > > > much bigger and already consuming 256B, but it's because I enabled more
> > > > things (userfaultfd, lockdep, etc.).
> > >
> > > Note that I enabled everything that you would expect on a production system
> > > (incld. userfaultfd, mempolicy, per-vma locks), so I didn't enable lockep.
> >
> > I still doubt whether you at least enabled userfaultfd, e.g., your previous
> > paste has:
> >
> >    struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     0 */
> >
> > Not something that matters.. but just in case you didn't use the expected
> > config file you wanted to use..
>
> You're absolutely right. I only briefly rechecked for this purpose here on
> my notebook, and only looked for the existence of members, not expecting
> that we have confusing stuff like vm_userfaultfd_ctx.
>
> I checked again and the size stays at 192 with allyesconfig and then
> disabling debug options.

I think a reasonable case is everything on, except CONFIG_DEBUG_LOCK_ALLOC and I
don't care about nommu.

So:

CONFIG_PER_VMA_LOCK
CONFIG_SWAP
CONFIG_MMU (exclude the nommu vm_region field)
CONFIG_NUMA
CONFIG_NUMA_BALANCING
CONFIG_PER_VMA_LOCK
CONFIG_ANON_VMA_NAME
__HAVE_PFNMAP_TRACKING

So to be clear - allyesconfig w/o debug gives us this yes? And we don't add a
cache line? In which case all good :)


>
> --
> Cheers,
>
> David / dhildenb
>
Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
Posted by David Hildenbrand 9 months, 2 weeks ago
On 28.04.25 21:37, Lorenzo Stoakes wrote:
> On Mon, Apr 28, 2025 at 07:23:18PM +0200, David Hildenbrand wrote:
>> On 28.04.25 18:24, Peter Xu wrote:
>>> On Mon, Apr 28, 2025 at 06:16:21PM +0200, David Hildenbrand wrote:
>>>>> Probably due to what config you have.  E.g., when I'm looking mine it's
>>>>> much bigger and already consuming 256B, but it's because I enabled more
>>>>> things (userfaultfd, lockdep, etc.).
>>>>
>>>> Note that I enabled everything that you would expect on a production system
>>>> (incld. userfaultfd, mempolicy, per-vma locks), so I didn't enable lockep.
>>>
>>> I still doubt whether you at least enabled userfaultfd, e.g., your previous
>>> paste has:
>>>
>>>     struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     0 */
>>>
>>> Not something that matters.. but just in case you didn't use the expected
>>> config file you wanted to use..
>>
>> You're absolutely right. I only briefly rechecked for this purpose here on
>> my notebook, and only looked for the existence of members, not expecting
>> that we have confusing stuff like vm_userfaultfd_ctx.
>>
>> I checked again and the size stays at 192 with allyesconfig and then
>> disabling debug options.
> 
> I think a reasonable case is everything on, except CONFIG_DEBUG_LOCK_ALLOC and I
> don't care about nommu.
> 
> So:
> 
> CONFIG_PER_VMA_LOCK
> CONFIG_SWAP
> CONFIG_MMU (exclude the nommu vm_region field)
> CONFIG_NUMA
> CONFIG_NUMA_BALANCING
> CONFIG_PER_VMA_LOCK
> CONFIG_ANON_VMA_NAME
> __HAVE_PFNMAP_TRACKING

Yes.

And our ugly friend CONFIG_USERFAULTFD

that is

#ifdef CONFIG_USERFAULTFD
struct vm_userfaultfd_ctx {
	struct userfaultfd_ctx *ctx;
};
#else /* CONFIG_USERFAULTFD */
#define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) {})
struct vm_userfaultfd_ctx {};
#endif /* CONFIG_USERFAULTFD */

(yes, you made the same mistake as I did when skimming whether everything 
relevant was enabled)

> 
> So to be clear - allyesconfig w/o debug gives us this yes? And we don't add a
> cache line? In which case all good :)

Looks like it!

-- 
Cheers,

David / dhildenb
Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
Posted by Suren Baghdasaryan 9 months, 2 weeks ago
On Mon, Apr 28, 2025 at 12:37 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Apr 28, 2025 at 07:23:18PM +0200, David Hildenbrand wrote:
> > On 28.04.25 18:24, Peter Xu wrote:
> > > On Mon, Apr 28, 2025 at 06:16:21PM +0200, David Hildenbrand wrote:
> > > > > Probably due to what config you have.  E.g., when I'm looking mine it's
> > > > > much bigger and already consuming 256B, but it's because I enabled more
> > > > > things (userfaultfd, lockdep, etc.).
> > > >
> > > > Note that I enabled everything that you would expect on a production system
> > > > (incld. userfaultfd, mempolicy, per-vma locks), so I didn't enable lockep.
> > >
> > > I still doubt whether you at least enabled userfaultfd, e.g., your previous
> > > paste has:
> > >
> > >    struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     0 */
> > >
> > > Not something that matters.. but just in case you didn't use the expected
> > > config file you wanted to use..
> >
> > You're absolutely right. I only briefly rechecked for this purpose here on
> > my notebook, and only looked for the existence of members, not expecting
> > that we have confusing stuff like vm_userfaultfd_ctx.
> >
> > I checked again and the size stays at 192 with allyesconfig and then
> > disabling debug options.
>
> I think a reasonable case is everything on, except CONFIG_DEBUG_LOCK_ALLOC and I
> don't care about nommu.

I think it's safe to assume that production systems would disable
lockdep due to the performance overhead. At least that's what we do on
Android - enable it on development branches but disable in production.

>
> So:
>
> CONFIG_PER_VMA_LOCK
> CONFIG_SWAP
> CONFIG_MMU (exclude the nommu vm_region field)
> CONFIG_NUMA
> CONFIG_NUMA_BALANCING
> CONFIG_PER_VMA_LOCK
> CONFIG_ANON_VMA_NAME
> __HAVE_PFNMAP_TRACKING
>
> So to be clear - allyesconfig w/o debug gives us this yes? And we don't add a
> cache line? In which case all good :)
>
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
>
Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
Posted by David Hildenbrand 9 months, 2 weeks ago
On 28.04.25 21:57, Suren Baghdasaryan wrote:
> On Mon, Apr 28, 2025 at 12:37 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
>>
>> On Mon, Apr 28, 2025 at 07:23:18PM +0200, David Hildenbrand wrote:
>>> On 28.04.25 18:24, Peter Xu wrote:
>>>> On Mon, Apr 28, 2025 at 06:16:21PM +0200, David Hildenbrand wrote:
>>>>>> Probably due to what config you have.  E.g., when I'm looking mine it's
>>>>>> much bigger and already consuming 256B, but it's because I enabled more
>>>>>> things (userfaultfd, lockdep, etc.).
>>>>>
>>>>> Note that I enabled everything that you would expect on a production system
>>>>> (incld. userfaultfd, mempolicy, per-vma locks), so I didn't enable lockep.
>>>>
>>>> I still doubt whether you at least enabled userfaultfd, e.g., your previous
>>>> paste has:
>>>>
>>>>     struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     0 */
>>>>
>>>> Not something that matters.. but just in case you didn't use the expected
>>>> config file you wanted to use..
>>>
>>> You're absolutely right. I only briefly rechecked for this purpose here on
>>> my notebook, and only looked for the existence of members, not expecting
>>> that we have confusing stuff like vm_userfaultfd_ctx.
>>>
>>> I checked again and the size stays at 192 with allyesconfig and then
>>> disabling debug options.
>>
>> I think a reasonable case is everything on, except CONFIG_DEBUG_LOCK_ALLOC and I
>> don't care about nommu.
> 
> I think it's safe to assume that production systems would disable
> lockdep due to the performance overhead. At least that's what we do on
> Android - enable it on development branches but disable in production.

Right, and "struct lockdep_map" is ... significantly larger than 8 
bytes. With that enabled, one is already paying for extra VMA space ...

-- 
Cheers,

David / dhildenb