[PATCH v4 2/5] mm/mseal: update madvise() logic

Lorenzo Stoakes posted 5 patches 2 months, 1 week ago
[PATCH v4 2/5] mm/mseal: update madvise() logic
Posted by Lorenzo Stoakes 2 months, 1 week ago
The madvise() logic is inexplicably performed in mm/mseal.c - this ought
to be located in mm/madvise.c.

Additionally can_modify_vma_madv() is inconsistently named and, in
combination with is_ro_anon(), is very confusing logic.

Put a static function in mm/madvise.c instead - can_madvise_modify() -
that spells out exactly what's happening.  Also explicitly check for an
anon VMA.

Also add commentary to explain what's going on.

Essentially - we disallow discarding of data in mseal()'d mappings in
instances where the user couldn't otherwise write to that data.

We retain the existing behaviour here regarding MAP_PRIVATE mappings of
file-backed mappings, which entails some complexity - while this, strictly
speaking - appears to violate mseal() semantics, it may interact badly with
users which expect to be able to madvise(MADV_DONTNEED) .text mappings for
instance.

We may revisit this at a later date.

No functional change intended.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
---
 mm/madvise.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 mm/mseal.c   | 49 ------------------------------------
 mm/vma.h     |  7 ------
 3 files changed, 70 insertions(+), 57 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index bb80fc5ea08f..7f9af2dbd044 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -19,6 +19,7 @@
 #include <linux/sched.h>
 #include <linux/sched/mm.h>
 #include <linux/mm_inline.h>
+#include <linux/mmu_context.h>
 #include <linux/string.h>
 #include <linux/uio.h>
 #include <linux/ksm.h>
@@ -1256,6 +1257,74 @@ static long madvise_guard_remove(struct madvise_behavior *madv_behavior)
 			       &guard_remove_walk_ops, NULL);
 }

+#ifdef CONFIG_64BIT
+/* Does the madvise operation result in discarding of mapped data? */
+static bool is_discard(int behavior)
+{
+	switch (behavior) {
+	case MADV_FREE:
+	case MADV_DONTNEED:
+	case MADV_DONTNEED_LOCKED:
+	case MADV_REMOVE:
+	case MADV_DONTFORK:
+	case MADV_WIPEONFORK:
+	case MADV_GUARD_INSTALL:
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * We are restricted from madvise()'ing mseal()'d VMAs only in very particular
+ * circumstances - discarding of data from read-only anonymous SEALED mappings.
+ *
+ * This is because users cannot trivally discard data from these VMAs, and may
+ * only do so via an appropriate madvise() call.
+ */
+static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
+{
+	struct vm_area_struct *vma = madv_behavior->vma;
+
+	/* If the VMA isn't sealed we're good. */
+	if (can_modify_vma(vma))
+		return true;
+
+	/* For a sealed VMA, we only care about discard operations. */
+	if (!is_discard(madv_behavior->behavior))
+		return true;
+
+	/*
+	 * We explicitly permit all file-backed mappings, whether MAP_SHARED or
+	 * MAP_PRIVATE.
+	 *
+	 * The latter causes some complications. Because now, one can mmap()
+	 * read/write a MAP_PRIVATE mapping, write to it, then mprotect()
+	 * read-only, mseal() and a discard will be permitted.
+	 *
+	 * However, in order to avoid issues with potential use of madvise(...,
+	 * MADV_DONTNEED) of mseal()'d .text mappings we, for the time being,
+	 * permit this.
+	 */
+	if (!vma_is_anonymous(vma))
+		return true;
+
+	/* If the user could write to the mapping anyway, then this is fine. */
+	if ((vma->vm_flags & VM_WRITE) &&
+	    arch_vma_access_permitted(vma, /* write= */ true,
+			/* execute= */ false, /* foreign= */ false))
+		return true;
+
+	/* Otherwise, we are not permitted to perform this operation. */
+	return false;
+}
+#else
+static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
+{
+	return true;
+}
+#endif
+
 /*
  * Apply an madvise behavior to a region of a vma.  madvise_update_vma
  * will handle splitting a vm area into separate areas, each area with its own
@@ -1269,7 +1338,7 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
 	struct madvise_behavior_range *range = &madv_behavior->range;
 	int error;

-	if (unlikely(!can_modify_vma_madv(madv_behavior->vma, behavior)))
+	if (unlikely(!can_madvise_modify(madv_behavior)))
 		return -EPERM;

 	switch (behavior) {
diff --git a/mm/mseal.c b/mm/mseal.c
index c27197ac04e8..1308e88ab184 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -11,7 +11,6 @@
 #include <linux/mman.h>
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
-#include <linux/mmu_context.h>
 #include <linux/syscalls.h>
 #include <linux/sched.h>
 #include "internal.h"
@@ -21,54 +20,6 @@ static inline void set_vma_sealed(struct vm_area_struct *vma)
 	vm_flags_set(vma, VM_SEALED);
 }

-static bool is_madv_discard(int behavior)
-{
-	switch (behavior) {
-	case MADV_FREE:
-	case MADV_DONTNEED:
-	case MADV_DONTNEED_LOCKED:
-	case MADV_REMOVE:
-	case MADV_DONTFORK:
-	case MADV_WIPEONFORK:
-	case MADV_GUARD_INSTALL:
-		return true;
-	}
-
-	return false;
-}
-
-static bool is_ro_anon(struct vm_area_struct *vma)
-{
-	/* check anonymous mapping. */
-	if (vma->vm_file || vma->vm_flags & VM_SHARED)
-		return false;
-
-	/*
-	 * check for non-writable:
-	 * PROT=RO or PKRU is not writeable.
-	 */
-	if (!(vma->vm_flags & VM_WRITE) ||
-		!arch_vma_access_permitted(vma, true, false, false))
-		return true;
-
-	return false;
-}
-
-/*
- * Check if a vma is allowed to be modified by madvise.
- */
-bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
-{
-	if (!is_madv_discard(behavior))
-		return true;
-
-	if (unlikely(!can_modify_vma(vma) && is_ro_anon(vma)))
-		return false;
-
-	/* Allow by default. */
-	return true;
-}
-
 static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		struct vm_area_struct **prev, unsigned long start,
 		unsigned long end, vm_flags_t newflags)
diff --git a/mm/vma.h b/mm/vma.h
index acdcc515c459..85db5e880fcc 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -577,8 +577,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
 	return true;
 }

-bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior);
-
 #else

 static inline bool can_modify_vma(struct vm_area_struct *vma)
@@ -586,11 +584,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
 	return true;
 }

-static inline bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
-{
-	return true;
-}
-
 #endif

 #if defined(CONFIG_STACK_GROWSUP)
--
2.50.1
Re: [PATCH v4 2/5] mm/mseal: update madvise() logic
Posted by Jeff Xu 2 months, 1 week ago
Hi Lorenzo

On Fri, Jul 25, 2025 at 1:30 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> The madvise() logic is inexplicably performed in mm/mseal.c - this ought
> to be located in mm/madvise.c.
>
> Additionally can_modify_vma_madv() is inconsistently named and, in
> combination with is_ro_anon(), is very confusing logic.
>
> Put a static function in mm/madvise.c instead - can_madvise_modify() -
> that spells out exactly what's happening.  Also explicitly check for an
> anon VMA.
>
> Also add commentary to explain what's going on.
>
> Essentially - we disallow discarding of data in mseal()'d mappings in
> instances where the user couldn't otherwise write to that data.
>
> We retain the existing behaviour here regarding MAP_PRIVATE mappings of
> file-backed mappings, which entails some complexity - while this, strictly
> speaking - appears to violate mseal() semantics, it may interact badly with
> users which expect to be able to madvise(MADV_DONTNEED) .text mappings for
> instance.
>
> We may revisit this at a later date.
>
> No functional change intended.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Reviewed-by: Pedro Falcato <pfalcato@suse.de>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/madvise.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  mm/mseal.c   | 49 ------------------------------------
>  mm/vma.h     |  7 ------
>  3 files changed, 70 insertions(+), 57 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index bb80fc5ea08f..7f9af2dbd044 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -19,6 +19,7 @@
>  #include <linux/sched.h>
>  #include <linux/sched/mm.h>
>  #include <linux/mm_inline.h>
> +#include <linux/mmu_context.h>
>  #include <linux/string.h>
>  #include <linux/uio.h>
>  #include <linux/ksm.h>
> @@ -1256,6 +1257,74 @@ static long madvise_guard_remove(struct madvise_behavior *madv_behavior)
>                                &guard_remove_walk_ops, NULL);
>  }
>
> +#ifdef CONFIG_64BIT
> +/* Does the madvise operation result in discarding of mapped data? */
> +static bool is_discard(int behavior)
> +{
> +       switch (behavior) {
> +       case MADV_FREE:
> +       case MADV_DONTNEED:
> +       case MADV_DONTNEED_LOCKED:
> +       case MADV_REMOVE:
> +       case MADV_DONTFORK:
> +       case MADV_WIPEONFORK:
> +       case MADV_GUARD_INSTALL:
> +               return true;
> +       }
> +
> +       return false;
> +}
> +
> +/*
> + * We are restricted from madvise()'ing mseal()'d VMAs only in very particular
> + * circumstances - discarding of data from read-only anonymous SEALED mappings.
> + *
> + * This is because users cannot trivally discard data from these VMAs, and may
> + * only do so via an appropriate madvise() call.
> + */
> +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> +{
> +       struct vm_area_struct *vma = madv_behavior->vma;
> +
> +       /* If the VMA isn't sealed we're good. */
> +       if (can_modify_vma(vma))
> +               return true;
> +
> +       /* For a sealed VMA, we only care about discard operations. */
> +       if (!is_discard(madv_behavior->behavior))
> +               return true;
> +
> +       /*
> +        * We explicitly permit all file-backed mappings, whether MAP_SHARED or
> +        * MAP_PRIVATE.
> +        *
> +        * The latter causes some complications. Because now, one can mmap()
> +        * read/write a MAP_PRIVATE mapping, write to it, then mprotect()
> +        * read-only, mseal() and a discard will be permitted.
> +        *
> +        * However, in order to avoid issues with potential use of madvise(...,
> +        * MADV_DONTNEED) of mseal()'d .text mappings we, for the time being,
> +        * permit this.
> +        */
> +       if (!vma_is_anonymous(vma))
> +               return true;
> +
> +       /* If the user could write to the mapping anyway, then this is fine. */
> +       if ((vma->vm_flags & VM_WRITE) &&
> +           arch_vma_access_permitted(vma, /* write= */ true,
> +                       /* execute= */ false, /* foreign= */ false))
> +               return true;
> +
> +       /* Otherwise, we are not permitted to perform this operation. */
> +       return false;
> +}
> +#else
> +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> +{
> +       return true;
> +}
> +#endif
> +
>  /*
>   * Apply an madvise behavior to a region of a vma.  madvise_update_vma
>   * will handle splitting a vm area into separate areas, each area with its own
> @@ -1269,7 +1338,7 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
>         struct madvise_behavior_range *range = &madv_behavior->range;
>         int error;
>
> -       if (unlikely(!can_modify_vma_madv(madv_behavior->vma, behavior)))
> +       if (unlikely(!can_madvise_modify(madv_behavior)))
>                 return -EPERM;
>
>         switch (behavior) {
> diff --git a/mm/mseal.c b/mm/mseal.c
> index c27197ac04e8..1308e88ab184 100644
> --- a/mm/mseal.c
> +++ b/mm/mseal.c
> @@ -11,7 +11,6 @@
>  #include <linux/mman.h>
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
> -#include <linux/mmu_context.h>
>  #include <linux/syscalls.h>
>  #include <linux/sched.h>
>  #include "internal.h"
> @@ -21,54 +20,6 @@ static inline void set_vma_sealed(struct vm_area_struct *vma)
>         vm_flags_set(vma, VM_SEALED);
>  }
>
> -static bool is_madv_discard(int behavior)
> -{
> -       switch (behavior) {
> -       case MADV_FREE:
> -       case MADV_DONTNEED:
> -       case MADV_DONTNEED_LOCKED:
> -       case MADV_REMOVE:
> -       case MADV_DONTFORK:
> -       case MADV_WIPEONFORK:
> -       case MADV_GUARD_INSTALL:
> -               return true;
> -       }
> -
> -       return false;
> -}
> -
> -static bool is_ro_anon(struct vm_area_struct *vma)
> -{
> -       /* check anonymous mapping. */
> -       if (vma->vm_file || vma->vm_flags & VM_SHARED)
> -               return false;

In this patch, the check for anonymous mapping are replaced with:

 if (!vma_is_anonymous(vma))
      return true;

vma_is_anonymous()  is implemented as following:
static inline bool vma_is_anonymous(struct vm_area_struct *vma)
{
   return !vma->vm_ops;
}

I'm curious to know if those two checks have the exact same scope.

The original intention is only file-backed mapping can allow
destructive madvise while sealed. I want to make sure that we don't
accidentally increase the scope.

Thanks and regards,
-Jeff



> -
> -       /*
> -        * check for non-writable:
> -        * PROT=RO or PKRU is not writeable.
> -        */
> -       if (!(vma->vm_flags & VM_WRITE) ||
> -               !arch_vma_access_permitted(vma, true, false, false))
> -               return true;
> -
> -       return false;
> -}
> -
> -/*
> - * Check if a vma is allowed to be modified by madvise.
> - */
> -bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
> -{
> -       if (!is_madv_discard(behavior))
> -               return true;
> -
> -       if (unlikely(!can_modify_vma(vma) && is_ro_anon(vma)))
> -               return false;
> -
> -       /* Allow by default. */
> -       return true;
> -}
> -
>  static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
>                 struct vm_area_struct **prev, unsigned long start,
>                 unsigned long end, vm_flags_t newflags)
> diff --git a/mm/vma.h b/mm/vma.h
> index acdcc515c459..85db5e880fcc 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -577,8 +577,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
>         return true;
>  }
>
> -bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior);
> -
>  #else
>
>  static inline bool can_modify_vma(struct vm_area_struct *vma)
> @@ -586,11 +584,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
>         return true;
>  }
>
> -static inline bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
> -{
> -       return true;
> -}
> -
>  #endif
>
>  #if defined(CONFIG_STACK_GROWSUP)
> --
> 2.50.1
Re: [PATCH v4 2/5] mm/mseal: update madvise() logic
Posted by Lorenzo Stoakes 2 months, 1 week ago
On Fri, Jul 25, 2025 at 10:28:57AM -0700, Jeff Xu wrote:

> > -static bool is_ro_anon(struct vm_area_struct *vma)
> > -{
> > -       /* check anonymous mapping. */
> > -       if (vma->vm_file || vma->vm_flags & VM_SHARED)
> > -               return false;
>
> In this patch, the check for anonymous mapping are replaced with:
>
>  if (!vma_is_anonymous(vma))
>       return true;
>
> vma_is_anonymous()  is implemented as following:
> static inline bool vma_is_anonymous(struct vm_area_struct *vma)
> {
>    return !vma->vm_ops;
> }
>
> I'm curious to know if those two checks have the exact same scope.
>
> The original intention is only file-backed mapping can allow
> destructive madvise while sealed. I want to make sure that we don't
> accidentally increase the scope.
>
> Thanks and regards,
> -Jeff

Thanks, that's a good question.

So for a function to be mmap()'d and file-backed, vm_ops _must_ be
supplied.

You can see this in the fault-handler:

do_pte_mising()
-> do_fault()
if anon -> fault anon otherwise fault file-backed

So if this were not the case, you'd have file-backed mappings going into
the the anonymous fault handler logic.

This covers off MAP_PRIVATE mappings of file-backed mappings too, as this
is handled in do_fault() by:

	} else if (!(vmf->flags & FAULT_FLAG_WRITE))
		ret = do_read_fault(vmf);
	else if (!(vma->vm_flags & VM_SHARED))
		ret = do_cow_fault(vmf);

That does the CoW fault handling.

So the vma_is_anonymous_check() here should have the same semantics.

Cheers, Lorenzo
Re: [PATCH v4 2/5] mm/mseal: update madvise() logic
Posted by Jeff Xu 2 months, 1 week ago
Hi Lorenzo,


On Fri, Jul 25, 2025 at 10:54 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Fri, Jul 25, 2025 at 10:28:57AM -0700, Jeff Xu wrote:
>
> > > -static bool is_ro_anon(struct vm_area_struct *vma)
> > > -{
> > > -       /* check anonymous mapping. */
> > > -       if (vma->vm_file || vma->vm_flags & VM_SHARED)
> > > -               return false;
> >
> > In this patch, the check for anonymous mapping are replaced with:
> >
> >  if (!vma_is_anonymous(vma))
> >       return true;
> >
> > vma_is_anonymous()  is implemented as following:
> > static inline bool vma_is_anonymous(struct vm_area_struct *vma)
> > {
> >    return !vma->vm_ops;
> > }
> >
> > I'm curious to know if those two checks have the exact same scope.
> >
> > The original intention is only file-backed mapping can allow
> > destructive madvise while sealed. I want to make sure that we don't
> > accidentally increase the scope.
> >
> > Thanks and regards,
> > -Jeff
>
> Thanks, that's a good question.
>
> So for a function to be mmap()'d and file-backed, vm_ops _must_ be
> supplied.
>
This says that all file-backed mappings must have vm_ops set, but what
about the reverse? Are mappings with vm_ops always file-backed?

> You can see this in the fault-handler:
>
> do_pte_mising()
> -> do_fault()
> if anon -> fault anon otherwise fault file-backed
>
> So if this were not the case, you'd have file-backed mappings going into
> the the anonymous fault handler logic.
>
> This covers off MAP_PRIVATE mappings of file-backed mappings too, as this
> is handled in do_fault() by:
>
>         } else if (!(vmf->flags & FAULT_FLAG_WRITE))
>                 ret = do_read_fault(vmf);
>         else if (!(vma->vm_flags & VM_SHARED))
>                 ret = do_cow_fault(vmf);
>
> That does the CoW fault handling.
>
> So the vma_is_anonymous_check() here should have the same semantics.
>
Just to be extra careful, does the reverse hold true as well?

In anycase, if you are confident about this, please do state this
change in the commit description that vma->vm_file and VM_SHARED flag
check is replaced by vma_is_anonymous_check(), which is expected to be
a non-functional change.

Thanks and regards,
-Jeff

> Cheers, Lorenzo
Re: [PATCH v4 2/5] mm/mseal: update madvise() logic
Posted by Lorenzo Stoakes 2 months, 1 week ago
On Fri, Jul 25, 2025 at 11:41:15AM -0700, Jeff Xu wrote:
> Hi Lorenzo,
>
>
> On Fri, Jul 25, 2025 at 10:54 AM Lorenzo Stoakes
> > Thanks, that's a good question.
> >
> > So for a function to be mmap()'d and file-backed, vm_ops _must_ be
> > supplied.
> >
> This says that all file-backed mappings must have vm_ops set, but what
> about the reverse? Are mappings with vm_ops always file-backed?

Yes? Otherwise they'd get treated as anonymous?

We call this vma_is_anonymous() for a reason ;)

>
> > You can see this in the fault-handler:
> >
> > do_pte_mising()
> > -> do_fault()
> > if anon -> fault anon otherwise fault file-backed
> >
> > So if this were not the case, you'd have file-backed mappings going into
> > the the anonymous fault handler logic.
> >
> > This covers off MAP_PRIVATE mappings of file-backed mappings too, as this
> > is handled in do_fault() by:
> >
> >         } else if (!(vmf->flags & FAULT_FLAG_WRITE))
> >                 ret = do_read_fault(vmf);
> >         else if (!(vma->vm_flags & VM_SHARED))
> >                 ret = do_cow_fault(vmf);
> >
> > That does the CoW fault handling.
> >
> > So the vma_is_anonymous_check() here should have the same semantics.
> >
> Just to be extra careful, does the reverse hold true as well?
>
> In anycase, if you are confident about this, please do state this
> change in the commit description that vma->vm_file and VM_SHARED flag
> check is replaced by vma_is_anonymous_check(), which is expected to be
> a non-functional change.

It's functionally equivalent and can be seen in the diff so I don't think
this is necessary.