[PATCH v3 2/5] mm/mseal: update madvise() logic

Lorenzo Stoakes posted 5 patches 2 months, 3 weeks ago
[PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by Lorenzo Stoakes 2 months, 3 weeks ago
The madvise() logic is inexplicably performed in mm/mseal.c - this ought
to be located in mm/madvise.c.

Additionally can_modify_vma_madv() is inconsistently named and, in
combination with is_ro_anon(), is very confusing logic.

Put a static function in mm/madvise.c instead - can_madvise_modify() -
that spells out exactly what's happening.  Also explicitly check for an
anon VMA.

Also add commentary to explain what's going on.

Essentially - we disallow discarding of data in mseal()'d mappings in
instances where the user couldn't otherwise write to that data.

Shared mappings are always backed, so no discard will actually truly
discard the data.  Read-only anonymous and MAP_PRIVATE file-backed
mappings are the ones we are interested in.

We make a change to the logic here to correct a mistake - we must disallow
discard of read-only MAP_PRIVATE file-backed mappings, which previously we
were not.

The justification for this change is to account for the case where:

1. A MAP_PRIVATE R/W file-backed mapping is established.
2. The mapping is written to, which backs it with anonymous memory.
3. The mapping is mprotect()'d read-only.
4. The mapping is mseal()'d.

If we were to now allow discard of this data, it would mean mseal() would
not prevent the unrecoverable discarding of data and it was thus violate
the semantics of sealed VMAs.

Finally, update the mseal tests, which were asserting previously that a
read-only MAP_PRIVATE file-backed mapping could be discarded.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
---
 mm/madvise.c                            | 63 ++++++++++++++++++++++++-
 mm/mseal.c                              | 49 -------------------
 mm/vma.h                                |  7 ---
 tools/testing/selftests/mm/mseal_test.c |  3 +-
 4 files changed, 63 insertions(+), 59 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 2bf80989d5b6..dc3d8497b0f4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -19,6 +19,7 @@
 #include <linux/sched.h>
 #include <linux/sched/mm.h>
 #include <linux/mm_inline.h>
+#include <linux/mmu_context.h>
 #include <linux/string.h>
 #include <linux/uio.h>
 #include <linux/ksm.h>
@@ -1255,6 +1256,66 @@ static long madvise_guard_remove(struct madvise_behavior *madv_behavior)
 			       &guard_remove_walk_ops, NULL);
 }
 
+#ifdef CONFIG_64BIT
+/* Does the madvise operation result in discarding of mapped data? */
+static bool is_discard(int behavior)
+{
+	switch (behavior) {
+	case MADV_FREE:
+	case MADV_DONTNEED:
+	case MADV_DONTNEED_LOCKED:
+	case MADV_REMOVE:
+	case MADV_DONTFORK:
+	case MADV_WIPEONFORK:
+	case MADV_GUARD_INSTALL:
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * We are restricted from madvise()'ing mseal()'d VMAs only in very particular
+ * circumstances - discarding of data from read-only anonymous SEALED mappings.
+ *
+ * This is because users cannot trivally discard data from these VMAs, and may
+ * only do so via an appropriate madvise() call.
+ */
+static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
+{
+	struct vm_area_struct *vma = madv_behavior->vma;
+
+	/* If the VMA isn't sealed we're good. */
+	if (can_modify_vma(vma))
+		return true;
+
+	/* For a sealed VMA, we only care about discard operations. */
+	if (!is_discard(madv_behavior->behavior))
+		return true;
+
+	/*
+	 * But shared mappings are fine, as dirty pages will be written to a
+	 * backing store regardless of discard.
+	 */
+	if (vma->vm_flags & VM_SHARED)
+		return true;
+
+	/* If the user could write to the mapping anyway, then this is fine. */
+	if ((vma->vm_flags & VM_WRITE) &&
+	    arch_vma_access_permitted(vma, /* write= */ true,
+			/* execute= */ false, /* foreign= */ false))
+		return true;
+
+	/* Otherwise, we are not permitted to perform this operation. */
+	return false;
+}
+#else
+static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
+{
+	return true;
+}
+#endif
+
 /*
  * Apply an madvise behavior to a region of a vma.  madvise_update_vma
  * will handle splitting a vm area into separate areas, each area with its own
@@ -1268,7 +1329,7 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
 	struct madvise_behavior_range *range = &madv_behavior->range;
 	int error;
 
-	if (unlikely(!can_modify_vma_madv(madv_behavior->vma, behavior)))
+	if (unlikely(!can_madvise_modify(madv_behavior)))
 		return -EPERM;
 
 	switch (behavior) {
diff --git a/mm/mseal.c b/mm/mseal.c
index c27197ac04e8..1308e88ab184 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -11,7 +11,6 @@
 #include <linux/mman.h>
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
-#include <linux/mmu_context.h>
 #include <linux/syscalls.h>
 #include <linux/sched.h>
 #include "internal.h"
@@ -21,54 +20,6 @@ static inline void set_vma_sealed(struct vm_area_struct *vma)
 	vm_flags_set(vma, VM_SEALED);
 }
 
-static bool is_madv_discard(int behavior)
-{
-	switch (behavior) {
-	case MADV_FREE:
-	case MADV_DONTNEED:
-	case MADV_DONTNEED_LOCKED:
-	case MADV_REMOVE:
-	case MADV_DONTFORK:
-	case MADV_WIPEONFORK:
-	case MADV_GUARD_INSTALL:
-		return true;
-	}
-
-	return false;
-}
-
-static bool is_ro_anon(struct vm_area_struct *vma)
-{
-	/* check anonymous mapping. */
-	if (vma->vm_file || vma->vm_flags & VM_SHARED)
-		return false;
-
-	/*
-	 * check for non-writable:
-	 * PROT=RO or PKRU is not writeable.
-	 */
-	if (!(vma->vm_flags & VM_WRITE) ||
-		!arch_vma_access_permitted(vma, true, false, false))
-		return true;
-
-	return false;
-}
-
-/*
- * Check if a vma is allowed to be modified by madvise.
- */
-bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
-{
-	if (!is_madv_discard(behavior))
-		return true;
-
-	if (unlikely(!can_modify_vma(vma) && is_ro_anon(vma)))
-		return false;
-
-	/* Allow by default. */
-	return true;
-}
-
 static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		struct vm_area_struct **prev, unsigned long start,
 		unsigned long end, vm_flags_t newflags)
diff --git a/mm/vma.h b/mm/vma.h
index acdcc515c459..85db5e880fcc 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -577,8 +577,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
 	return true;
 }
 
-bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior);
-
 #else
 
 static inline bool can_modify_vma(struct vm_area_struct *vma)
@@ -586,11 +584,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
 	return true;
 }
 
-static inline bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
-{
-	return true;
-}
-
 #endif
 
 #if defined(CONFIG_STACK_GROWSUP)
diff --git a/tools/testing/selftests/mm/mseal_test.c b/tools/testing/selftests/mm/mseal_test.c
index 005f29c86484..34c042da4de2 100644
--- a/tools/testing/selftests/mm/mseal_test.c
+++ b/tools/testing/selftests/mm/mseal_test.c
@@ -1712,7 +1712,6 @@ static void test_seal_discard_ro_anon_on_filebacked(bool seal)
 	unsigned long size = 4 * page_size;
 	int ret;
 	int fd;
-	unsigned long mapflags = MAP_PRIVATE;
 
 	fd = memfd_create("test", 0);
 	FAIL_TEST_IF_FALSE(fd > 0);
@@ -1720,7 +1719,7 @@ static void test_seal_discard_ro_anon_on_filebacked(bool seal)
 	ret = fallocate(fd, 0, 0, size);
 	FAIL_TEST_IF_FALSE(!ret);
 
-	ptr = mmap(NULL, size, PROT_READ, mapflags, fd, 0);
+	ptr = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
 	FAIL_TEST_IF_FALSE(ptr != MAP_FAILED);
 
 	if (seal) {
-- 
2.50.1
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by David Hildenbrand 2 months, 1 week ago
On 16.07.25 19:38, Lorenzo Stoakes wrote:
> The madvise() logic is inexplicably performed in mm/mseal.c - this ought
> to be located in mm/madvise.c.
> 
> Additionally can_modify_vma_madv() is inconsistently named and, in
> combination with is_ro_anon(), is very confusing logic.
> 
> Put a static function in mm/madvise.c instead - can_madvise_modify() -
> that spells out exactly what's happening.  Also explicitly check for an
> anon VMA.
> 
> Also add commentary to explain what's going on.
> 
> Essentially - we disallow discarding of data in mseal()'d mappings in
> instances where the user couldn't otherwise write to that data.
> 
> Shared mappings are always backed, so no discard will actually truly
> discard the data.  Read-only anonymous and MAP_PRIVATE file-backed
> mappings are the ones we are interested in.
> 
> We make a change to the logic here to correct a mistake - we must disallow
> discard of read-only MAP_PRIVATE file-backed mappings, which previously we
> were not.
> 
> The justification for this change is to account for the case where:
> 
> 1. A MAP_PRIVATE R/W file-backed mapping is established.
> 2. The mapping is written to, which backs it with anonymous memory.
> 3. The mapping is mprotect()'d read-only.
> 4. The mapping is mseal()'d.

Thinking about this a bit (should have realized this implication 
earlier) ... assuming we have:

1. A MAP_PRIVATE R/O file-backed mapping.
2. The mapping is mseal()'d.

We only really have anon folios in there with things like (a) uprobe (b) 
debugger access (c) similarly weird FOLL_FORCE stuff.

Now, most executables/libraries are mapped that way. If someone would 
rely on MADV_DONTNEED to zap pages in there (to free up memory), that 
would get rejected.

Does something like that rely on MADV_DONTNEED working? Good question.

Checking for anon_vma in addition, ad mentioned in the other thread, 
would be a "cheap" check to rule out that there are currently anon vmas 
in there.

Well, not 100% reliable, because MADV_DONTNEED can race with page faults ...

-- 
Cheers,

David / dhildenb
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by Lorenzo Stoakes 2 months, 1 week ago
On Fri, Jul 25, 2025 at 12:12:54AM +0200, David Hildenbrand wrote:
> On 16.07.25 19:38, Lorenzo Stoakes wrote:
> > The madvise() logic is inexplicably performed in mm/mseal.c - this ought
> > to be located in mm/madvise.c.
> >
> > Additionally can_modify_vma_madv() is inconsistently named and, in
> > combination with is_ro_anon(), is very confusing logic.
> >
> > Put a static function in mm/madvise.c instead - can_madvise_modify() -
> > that spells out exactly what's happening.  Also explicitly check for an
> > anon VMA.
> >
> > Also add commentary to explain what's going on.
> >
> > Essentially - we disallow discarding of data in mseal()'d mappings in
> > instances where the user couldn't otherwise write to that data.
> >
> > Shared mappings are always backed, so no discard will actually truly
> > discard the data.  Read-only anonymous and MAP_PRIVATE file-backed
> > mappings are the ones we are interested in.
> >
> > We make a change to the logic here to correct a mistake - we must disallow
> > discard of read-only MAP_PRIVATE file-backed mappings, which previously we
> > were not.
> >
> > The justification for this change is to account for the case where:
> >
> > 1. A MAP_PRIVATE R/W file-backed mapping is established.
> > 2. The mapping is written to, which backs it with anonymous memory.
> > 3. The mapping is mprotect()'d read-only.
> > 4. The mapping is mseal()'d.
>
> Thinking about this a bit (should have realized this implication earlier)

Well none of us did...

> ... assuming we have:
>
> 1. A MAP_PRIVATE R/O file-backed mapping.
> 2. The mapping is mseal()'d.
>
> We only really have anon folios in there with things like (a) uprobe (b)
> debugger access (c) similarly weird FOLL_FORCE stuff.
>
> Now, most executables/libraries are mapped that way. If someone would rely
> on MADV_DONTNEED to zap pages in there (to free up memory), that would get
> rejected.

Right, yes.

This is odd behaviour to me. But I guess this is what Jeff meant by 'detecting
this' in android.

The documentation is really not specific enough, we need to fix that. It's
effectively stating any anon mappings are sealed, which is just not true with
existing semantics.

However I see:

	Memory sealing can automatically be applied by the runtime loader to
	seal .text and .rodata pages and applications can additionally seal
	security critical data at runtime.

So yes, we're going to break MADV_DONTNEED of this mappings.

BUT.

Would you really want to MADV_DONTNEED away uprobes etc.?? That seems... very
strange and broken behaviour no?

Note that, also, mappings of read-only files have VM_SHARED stripped. So they
become read-only (With ~VM_MAYWRITE).

To be clear this is where the mode of the file is read-only, not that the
mapping is read-only alone.

So with this change, we'd disallow discard of this.

It'd be pretty odd to mseal() a read-only file-backed mapping and then try to
discard, but maybe somebody would weirdly rely upon this?

It's inconsistent, as a person MAP_SHARED mapping a file that is read/write but
mapped read-only (or r/w of course), can discard fine eve if sealed, but if the
file happens to be read-only can't.

But we could add a VM_MAYWRITE check also.

OK maybe I"m softening on the anon_vma thing see below.

So we could combine these checks to avoid these issues.


>
> Does something like that rely on MADV_DONTNEED working? Good question.

Kees/Jeff? Can you check if android relies on this?

>
> Checking for anon_vma in addition, ad mentioned in the other thread, would
> be a "cheap" check to rule out that there are currently anon vmas in there.
>
> Well, not 100% reliable, because MADV_DONTNEED can race with page faults ...

But hang on, it's read-only so we shouldn't get racing faults... right?

Hmm maybe I'll soften on this anon_vma idea then. Maybe it is a 'cheap fix' to
rule out the _usual_ cases.

We're not changing zapping logic for this though, sorry. That's just a crazy
length to go to.

>
> --
> Cheers,
>
> David / dhildenb
>

In any case, I'm going to send a version of this with the controversial bit
stripped so we (hopefully) land the refactorings for 6.17 and can change
semantics if necessary for 6.18.

Cheers, Lorenzo
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by David Hildenbrand 2 months, 1 week ago
>>>
>>> The justification for this change is to account for the case where:
>>>
>>> 1. A MAP_PRIVATE R/W file-backed mapping is established.
>>> 2. The mapping is written to, which backs it with anonymous memory.
>>> 3. The mapping is mprotect()'d read-only.
>>> 4. The mapping is mseal()'d.
>>
>> Thinking about this a bit (should have realized this implication earlier)
> 
> Well none of us did...
 > >> ... assuming we have:
>>
>> 1. A MAP_PRIVATE R/O file-backed mapping.
>> 2. The mapping is mseal()'d.
>>
>> We only really have anon folios in there with things like (a) uprobe (b)
>> debugger access (c) similarly weird FOLL_FORCE stuff.
>>
>> Now, most executables/libraries are mapped that way. If someone would rely
>> on MADV_DONTNEED to zap pages in there (to free up memory), that would get
>> rejected.
> 
> Right, yes.
> 
> This is odd behaviour to me. But I guess this is what Jeff meant by 'detecting
> this' in android.

It's rather weird usage of MADV_DONTNEED, but maybe, for some R/O 
buffers ...

> 
> The documentation is really not specific enough, we need to fix that. It's
> effectively stating any anon mappings are sealed, which is just not true with
> existing semantics.
> 
> However I see:
> 
> 	Memory sealing can automatically be applied by the runtime loader to
> 	seal .text and .rodata pages and applications can additionally seal
> 	security critical data at runtime.
> 
> So yes, we're going to break MADV_DONTNEED of this mappings.
 > > BUT.
> 
> Would you really want to MADV_DONTNEED away uprobes etc.?? That seems... very
> strange and broken behaviour no?
> 
> Note that, also, mappings of read-only files have VM_SHARED stripped. So they
> become read-only (With ~VM_MAYWRITE).
> 
> To be clear this is where the mode of the file is read-only, not that the
> mapping is read-only alone.
> 
> So with this change, we'd disallow discard of this.
> 
> It'd be pretty odd to mseal() a read-only file-backed mapping and then try to
> discard, but maybe somebody would weirdly rely upon this?
> 
> It's inconsistent, as a person MAP_SHARED mapping a file that is read/write but
> mapped read-only (or r/w of course), can discard fine eve if sealed, but if the
> file happens to be read-only can't.
> 
> But we could add a VM_MAYWRITE check also.
> 
> OK maybe I"m softening on the anon_vma thing see below.
> 
> So we could combine these checks to avoid these issues.
> 
> 
>>
>> Does something like that rely on MADV_DONTNEED working? Good question.
> 
> Kees/Jeff? Can you check if android relies on this?
> 
>>
>> Checking for anon_vma in addition, ad mentioned in the other thread, would
>> be a "cheap" check to rule out that there are currently anon vmas in there.
>>
>> Well, not 100% reliable, because MADV_DONTNEED can race with page faults ...
> 
> But hang on, it's read-only so we shouldn't get racing faults... right?

You mean, ones that populate anon folios.

Well, there is long-term pinning that can break COW and other weird 
stuff like FOLL_FORCE. Most of the latter probably holds the mmap lock 
in write mode. Probably.

> 
> Hmm maybe I'll soften on this anon_vma idea then. Maybe it is a 'cheap fix' to
> rule out the _usual_ cases.

Yeah, something to evaluate.

-- 
Cheers,

David / dhildenb
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by Lorenzo Stoakes 2 months, 1 week ago
On Fri, Jul 25, 2025 at 09:38:14AM +0200, David Hildenbrand wrote:
> > >> ... assuming we have:
> > >
> > > 1. A MAP_PRIVATE R/O file-backed mapping.
> > > 2. The mapping is mseal()'d.
> > >
> > > We only really have anon folios in there with things like (a) uprobe (b)
> > > debugger access (c) similarly weird FOLL_FORCE stuff.
> > >
> > > Now, most executables/libraries are mapped that way. If someone would rely
> > > on MADV_DONTNEED to zap pages in there (to free up memory), that would get
> > > rejected.
> >
> > Right, yes.
> >
> > This is odd behaviour to me. But I guess this is what Jeff meant by 'detecting
> > this' in android.
>
> It's rather weird usage of MADV_DONTNEED, but maybe, for some R/O buffers
> ...

Yeah, curious if anybody is actually doing this.

> > >
> > > Checking for anon_vma in addition, ad mentioned in the other thread, would
> > > be a "cheap" check to rule out that there are currently anon vmas in there.
> > >
> > > Well, not 100% reliable, because MADV_DONTNEED can race with page faults ...
> >
> > But hang on, it's read-only so we shouldn't get racing faults... right?
>
> You mean, ones that populate anon folios.

Right, but these are the only ones we care about right? file-backed mappings
won't change vma->anon_vma.

Changes to that field from NULL use mmap read lock and... page_table_lock :P
fun.

>
> Well, there is long-term pinning that can break COW and other weird stuff
> like FOLL_FORCE. Most of the latter probably holds the mmap lock in write
> mode. Probably.

Well GUP uses read lock.

FOLL_FORCE won't override anything as we have this check in check_vma_flags():

	if (write) {
		if (!vma_anon &&
		    !writable_file_mapping_allowed(vma, gup_flags))
			return -EFAULT;

		if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
			if (!(gup_flags & FOLL_FORCE))
				return -EFAULT;
			/*
			 * We used to let the write,force case do COW in a
			 * VM_MAYWRITE VM_SHARED !VM_WRITE vma, so ptrace could
			 * set a breakpoint in a read-only mapping of an
			 * executable, without corrupting the file (yet only
			 * when that file had been opened for writing!).
			 * Anon pages in shared mappings are surprising: now
			 * just reject it.
			 */
			if (!is_cow_mapping(vm_flags))
				return -EFAULT;
		}
	}

With:

static inline bool is_cow_mapping(vm_flags_t flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

So - we explicitly disallow FOLL_FORCE write override for CoW file-backed
mappings.

Obviously if FOLL_FORCE is not set, then we're ALSO not allowed to get past a
FOLL_WRITE and !VM_WRITE situation.

>
> >
> > Hmm maybe I'll soften on this anon_vma idea then. Maybe it is a 'cheap fix' to
> > rule out the _usual_ cases.
>
> Yeah, something to evaluate.

I'm thinking more and more we're probably actually safe with !vma->anon_vma ||
!(vma->vm_flags & VM_MAYWRITE).

>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by David Hildenbrand 2 months, 1 week ago
>>
>> Well, there is long-term pinning that can break COW and other weird stuff
>> like FOLL_FORCE. Most of the latter probably holds the mmap lock in write
>> mode. Probably.
> 
> Well GUP uses read lock.

Right, so it can race with MADV_DONTNEED.

> 
> FOLL_FORCE won't override anything as we have this check in check_vma_flags():
> 
> 	if (write) {
> 		if (!vma_anon &&
> 		    !writable_file_mapping_allowed(vma, gup_flags))
> 			return -EFAULT;
> 
> 		if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
> 			if (!(gup_flags & FOLL_FORCE))
> 				return -EFAULT;
> 			/*
> 			 * We used to let the write,force case do COW in a
> 			 * VM_MAYWRITE VM_SHARED !VM_WRITE vma, so ptrace could
> 			 * set a breakpoint in a read-only mapping of an
> 			 * executable, without corrupting the file (yet only
> 			 * when that file had been opened for writing!).
> 			 * Anon pages in shared mappings are surprising: now
> 			 * just reject it.
> 			 */
> 			if (!is_cow_mapping(vm_flags))
> 				return -EFAULT;
> 		}
> 	}
> 
> With:
> 
> static inline bool is_cow_mapping(vm_flags_t flags)
> {
> 	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
> }
> 

Not sure what you mean. Using FOLL_FORCE you can write into MAP_PRIVATE 
R/O mappings. Particular useful for installing breakpoints into loaded 
executables etc.

is_cow_mapping() tells you exactly that: the only place where we can 
have anon folios is when we have a MAP_PRIVATE mapping (!VM_SHARED) that 
can be writable, for example, through mprotect(PROT_WRITE) (VM_MAYWRITE).

A MAP_PRIVATE R/O file mapping matches is_cow_mapping().

> So - we explicitly disallow FOLL_FORCE write override for CoW file-backed
> mappings.
> 
> Obviously if FOLL_FORCE is not set, then we're ALSO not allowed to get past a
> FOLL_WRITE and !VM_WRITE situation.
> 
>>
>>>
>>> Hmm maybe I'll soften on this anon_vma idea then. Maybe it is a 'cheap fix' to
>>> rule out the _usual_ cases.
>>
>> Yeah, something to evaluate.
> 
> I'm thinking more and more we're probably actually safe with !vma->anon_vma ||
> !(vma->vm_flags & VM_MAYWRITE).

I think there are possible races, the question is how much you care 
about them.

In case of CoW-unsharing, you're not actually discarding data, because 
the page in the anon folio is to maintain a copy of the pagecache page 
(of course, they can go out of sync, but that's a different discussion).

-- 
Cheers,

David / dhildenb
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by Lorenzo Stoakes 2 months, 1 week ago
On Fri, Jul 25, 2025 at 11:46:13AM +0200, David Hildenbrand wrote:
> > >
> > > Well, there is long-term pinning that can break COW and other weird stuff
> > > like FOLL_FORCE. Most of the latter probably holds the mmap lock in write
> > > mode. Probably.
> >
> > Well GUP uses read lock.
>
> Right, so it can race with MADV_DONTNEED.
>
> >
> > FOLL_FORCE won't override anything as we have this check in check_vma_flags():
> >
> > 	if (write) {
> > 		if (!vma_anon &&
> > 		    !writable_file_mapping_allowed(vma, gup_flags))
> > 			return -EFAULT;
> >
> > 		if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
> > 			if (!(gup_flags & FOLL_FORCE))
> > 				return -EFAULT;
> > 			/*
> > 			 * We used to let the write,force case do COW in a
> > 			 * VM_MAYWRITE VM_SHARED !VM_WRITE vma, so ptrace could
> > 			 * set a breakpoint in a read-only mapping of an
> > 			 * executable, without corrupting the file (yet only
> > 			 * when that file had been opened for writing!).
> > 			 * Anon pages in shared mappings are surprising: now
> > 			 * just reject it.
> > 			 */
> > 			if (!is_cow_mapping(vm_flags))
> > 				return -EFAULT;
> > 		}
> > 	}
> >
> > With:
> >
> > static inline bool is_cow_mapping(vm_flags_t flags)
> > {
> > 	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
> > }
> >
>
> Not sure what you mean. Using FOLL_FORCE you can write into MAP_PRIVATE R/O
> mappings. Particular useful for installing breakpoints into loaded
> executables etc.

Sigh. Really sorry, I'm having a terrible week, my brain just isn't working at
the moment.

Yes it proves the opposite of what I said, I misread it foolishly.

Please disregard.

>
> is_cow_mapping() tells you exactly that: the only place where we can have
> anon folios is when we have a MAP_PRIVATE mapping (!VM_SHARED) that can be
> writable, for example, through mprotect(PROT_WRITE) (VM_MAYWRITE).
>
> A MAP_PRIVATE R/O file mapping matches is_cow_mapping().

Yup.

>
> > So - we explicitly disallow FOLL_FORCE write override for CoW file-backed
> > mappings.
> >
> > Obviously if FOLL_FORCE is not set, then we're ALSO not allowed to get past a
> > FOLL_WRITE and !VM_WRITE situation.
> >
> > >
> > > >
> > > > Hmm maybe I'll soften on this anon_vma idea then. Maybe it is a 'cheap fix' to
> > > > rule out the _usual_ cases.
> > >
> > > Yeah, something to evaluate.
> >
> > I'm thinking more and more we're probably actually safe with !vma->anon_vma ||
> > !(vma->vm_flags & VM_MAYWRITE).
>
> I think there are possible races, the question is how much you care about
> them.

Yes I was just wrong. Please just disregard.

I mean racing with MADV_POPULATE_WRITE seems a niche thing to worry about, and
so what if you did, it's writing a... copy of the underlying file-backed folios
no?

Equally long-term GUP, assuming it breaks CoW for migration, is what, populating
unchanged folios, so what's the issue?


>
> In case of CoW-unsharing, you're not actually discarding data, because the
> page in the anon folio is to maintain a copy of the pagecache page (of
> course, they can go out of sync, but that's a different discussion).

Yes I know.
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by David Hildenbrand 2 months, 1 week ago
>>> So - we explicitly disallow FOLL_FORCE write override for CoW file-backed
>>> mappings.
>>>
>>> Obviously if FOLL_FORCE is not set, then we're ALSO not allowed to get past a
>>> FOLL_WRITE and !VM_WRITE situation.
>>>
>>>>
>>>>>
>>>>> Hmm maybe I'll soften on this anon_vma idea then. Maybe it is a 'cheap fix' to
>>>>> rule out the _usual_ cases.
>>>>
>>>> Yeah, something to evaluate.
>>>
>>> I'm thinking more and more we're probably actually safe with !vma->anon_vma ||
>>> !(vma->vm_flags & VM_MAYWRITE).
>>
>> I think there are possible races, the question is how much you care about
>> them.
> 
> Yes I was just wrong. Please just disregard.
> 
> I mean racing with MADV_POPULATE_WRITE seems a niche thing to worry about, and
> so what if you did, it's writing a... copy of the underlying file-backed folios
> no?

MADV_POPULATE_WRITE does not apply. It's only racing with FOLL_FORCE, 
like debugger access.

Again, a race one probably shouldn't worry about in the context of mseal.

-- 
Cheers,

David / dhildenb
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by Lorenzo Stoakes 2 months, 1 week ago
On Fri, Jul 25, 2025 at 12:10:07PM +0200, David Hildenbrand wrote:
> > > > So - we explicitly disallow FOLL_FORCE write override for CoW file-backed
> > > > mappings.
> > > >
> > > > Obviously if FOLL_FORCE is not set, then we're ALSO not allowed to get past a
> > > > FOLL_WRITE and !VM_WRITE situation.
> > > >
> > > > >
> > > > > >
> > > > > > Hmm maybe I'll soften on this anon_vma idea then. Maybe it is a 'cheap fix' to
> > > > > > rule out the _usual_ cases.
> > > > >
> > > > > Yeah, something to evaluate.
> > > >
> > > > I'm thinking more and more we're probably actually safe with !vma->anon_vma ||
> > > > !(vma->vm_flags & VM_MAYWRITE).
> > >
> > > I think there are possible races, the question is how much you care about
> > > them.
> >
> > Yes I was just wrong. Please just disregard.
> >
> > I mean racing with MADV_POPULATE_WRITE seems a niche thing to worry about, and
> > so what if you did, it's writing a... copy of the underlying file-backed folios
> > no?
>
> MADV_POPULATE_WRITE does not apply. It's only racing with FOLL_FORCE, like
> debugger access.

Yeah right, that too. OK so we might have some breakpoint or changed some data,
then discard, I don't think this is _too_ problematic right? The debugger is
naturally racey anyway.

I think we can accept these.

>
> Again, a race one probably shouldn't worry about in the context of mseal.

Yeah agreed, so I think this is still sensible.

>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by Kees Cook 2 months, 1 week ago
On Wed, Jul 16, 2025 at 06:38:03PM +0100, Lorenzo Stoakes wrote:
> We make a change to the logic here to correct a mistake - we must disallow
> discard of read-only MAP_PRIVATE file-backed mappings, which previously we
> were not.
> The justification for this change is to account for the case where:
> 
> 1. A MAP_PRIVATE R/W file-backed mapping is established.
> 2. The mapping is written to, which backs it with anonymous memory.
> 3. The mapping is mprotect()'d read-only.
> 4. The mapping is mseal()'d.
> 
> If we were to now allow discard of this data, it would mean mseal() would
> not prevent the unrecoverable discarding of data and it was thus violate
> the semantics of sealed VMAs.

I want to make sure I'm understanding this right:

Was the old behavior to allow discard? (If so, that seems like it wasn't
doing what Linus asked for[1], but it's not clear to me if that was
the behavior Chrome wanted.) The test doesn't appear to validate which
contents end up being visible after the discard, only whether or not
madvise() succeeds.

As an aside, why should discard work in this case even without step 4?
Wouldn't setting "read-only" imply you don't want the memory to change
out from under you? I guess I'm not clear on the semantics: how do memory
protection bits map to madvise actions like this?

-Kees

[1] https://lore.kernel.org/lkml/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/

-- 
Kees Cook
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by Jeff Xu 2 months, 1 week ago
On Thu, Jul 24, 2025 at 2:15 PM Kees Cook <kees@kernel.org> wrote:
>
> On Wed, Jul 16, 2025 at 06:38:03PM +0100, Lorenzo Stoakes wrote:
> > We make a change to the logic here to correct a mistake - we must disallow
> > discard of read-only MAP_PRIVATE file-backed mappings, which previously we
> > were not.
> > The justification for this change is to account for the case where:
> >
> > 1. A MAP_PRIVATE R/W file-backed mapping is established.
> > 2. The mapping is written to, which backs it with anonymous memory.
> > 3. The mapping is mprotect()'d read-only.
> > 4. The mapping is mseal()'d.
> >
> > If we were to now allow discard of this data, it would mean mseal() would
> > not prevent the unrecoverable discarding of data and it was thus violate
> > the semantics of sealed VMAs.
>
> I want to make sure I'm understanding this right:
>
> Was the old behavior to allow discard? (If so, that seems like it wasn't
> doing what Linus asked for[1], but it's not clear to me if that was
> the behavior Chrome wanted.)
Chrome V8 JIT engine only cares about  anonymous mapping, not file backed.
I'm not sure about all of Chrome though.

> The test doesn't appear to validate which
> contents end up being visible after the discard, only whether or not
> madvise() succeeds.
>
Agree the test can be improved to validate the read-back after discard.
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by Lorenzo Stoakes 2 months, 1 week ago
On Thu, Jul 24, 2025 at 02:15:26PM -0700, Kees Cook wrote:
> On Wed, Jul 16, 2025 at 06:38:03PM +0100, Lorenzo Stoakes wrote:
> > We make a change to the logic here to correct a mistake - we must disallow
> > discard of read-only MAP_PRIVATE file-backed mappings, which previously we
> > were not.
> > The justification for this change is to account for the case where:
> >
> > 1. A MAP_PRIVATE R/W file-backed mapping is established.
> > 2. The mapping is written to, which backs it with anonymous memory.
> > 3. The mapping is mprotect()'d read-only.
> > 4. The mapping is mseal()'d.
> >
> > If we were to now allow discard of this data, it would mean mseal() would
> > not prevent the unrecoverable discarding of data and it was thus violate
> > the semantics of sealed VMAs.
>
> I want to make sure I'm understanding this right:
>
> Was the old behavior to allow discard? (If so, that seems like it wasn't
> doing what Linus asked for[1], but it's not clear to me if that was
> the behavior Chrome wanted.) The test doesn't appear to validate which
> contents end up being visible after the discard, only whether or not
> madvise() succeeds.

Yes the old behaviour allowed discard in this case, because:

	/* check anonymous mapping. */
	if (vma->vm_file || vma->vm_flags & VM_SHARED)
		return false;

In is_ro_anon() would return false (we have vma->vm_file), and in
can_modify_vma_madv() we'd hit:

	if (unlikely(!can_modify_vma(vma) && is_ro_anon(vma)))
		return false;

	/* Allow by default. */
	return true;

The fix is to check vma->vm_files & VM_SHARED only in effect.

>
> As an aside, why should discard work in this case even without step 4?
> Wouldn't setting "read-only" imply you don't want the memory to change
> out from under you? I guess I'm not clear on the semantics: how do memory
> protection bits map to madvise actions like this?

I mean this is uAPI so it's moot, we can't change this.

I think you're thinking read-only is stronger than you think it is in the
general case.

VM_MAYWRITE is the key thing here.

In do_mmap() in mm/mmap.c:

	if (file) {
		struct inode *inode = file_inode(file);
		unsigned long flags_mask;
		int err;

		...

		switch (flags & MAP_TYPE) {
		case MAP_SHARED:
			...
			fallthrough;
		case MAP_SHARED_VALIDATE:
			...
			if (!(file->f_mode & FMODE_WRITE))
				vm_flags &= ~(VM_MAYWRITE | VM_SHARED);

			...
		}
		...
	}

So we're only actually prevented VM_MAYWRITE if the _file_ itself doesn't have
write permission.

Otherwise we might at any time mprotect() the mapping to be writable in any
csae.

mseal() changes things, as it's a stronger requirement. You're explicitly saying
'I don't want this data to be discarded', which is why we should be firmer here.

I disagree this needs to be changed more broadly, but in any case, it'd break
uAPI so it's moot.

And wrt this series, it's further moot.

>
> -Kees
>
> [1] https://lore.kernel.org/lkml/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/
>
> --
> Kees Cook
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by David Hildenbrand 2 months, 1 week ago
> As an aside, why should discard work in this case even without step 4?
> Wouldn't setting "read-only" imply you don't want the memory to change
> out from under you? I guess I'm not clear on the semantics: how do memory
> protection bits map to madvise actions like this?

They generally don't affect MADV_DONTNEED behavior. The only documented 
(man page) reason for EPERM in the man page is related to MADV_HWPOISON.

-- 
Cheers,

David / dhildenb
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by David Hildenbrand 2 months, 1 week ago
On 24.07.25 23:32, David Hildenbrand wrote:
>> As an aside, why should discard work in this case even without step 4?
>> Wouldn't setting "read-only" imply you don't want the memory to change
>> out from under you? I guess I'm not clear on the semantics: how do memory
>> protection bits map to madvise actions like this?
> 
> They generally don't affect MADV_DONTNEED behavior. The only documented
> (man page) reason for EPERM in the man page is related to MADV_HWPOISON.
> 

(Exception: MADV_POPULATE_READ/MADV_POPULATE_WRITE requires 
corresponding permissions)

-- 
Cheers,

David / dhildenb
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by Kees Cook 2 months, 1 week ago
On Thu, Jul 24, 2025 at 11:41:04PM +0200, David Hildenbrand wrote:
> On 24.07.25 23:32, David Hildenbrand wrote:
> > > As an aside, why should discard work in this case even without step 4?
> > > Wouldn't setting "read-only" imply you don't want the memory to change
> > > out from under you? I guess I'm not clear on the semantics: how do memory
> > > protection bits map to madvise actions like this?
> > 
> > They generally don't affect MADV_DONTNEED behavior. The only documented
> > (man page) reason for EPERM in the man page is related to MADV_HWPOISON.
> > 
> 
> (Exception: MADV_POPULATE_READ/MADV_POPULATE_WRITE requires corresponding
> permissions)

Shouldn't an MADV action that changes memory contents require the W bit
though? I mean, I assume the ship may have sailed on this, but it feels
mismatched to me.

-- 
Kees Cook
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by David Hildenbrand 2 months, 1 week ago
On 25.07.25 00:29, Kees Cook wrote:
> On Thu, Jul 24, 2025 at 11:41:04PM +0200, David Hildenbrand wrote:
>> On 24.07.25 23:32, David Hildenbrand wrote:
>>>> As an aside, why should discard work in this case even without step 4?
>>>> Wouldn't setting "read-only" imply you don't want the memory to change
>>>> out from under you? I guess I'm not clear on the semantics: how do memory
>>>> protection bits map to madvise actions like this?
>>>
>>> They generally don't affect MADV_DONTNEED behavior. The only documented
>>> (man page) reason for EPERM in the man page is related to MADV_HWPOISON.
>>>
>>
>> (Exception: MADV_POPULATE_READ/MADV_POPULATE_WRITE requires corresponding
>> permissions)
> 
> Shouldn't an MADV action that changes memory contents require the W bit
> though?

In a MAP_RPIVATE file mapping, to know whether you are actually 
modifying memory ("discarding pages" ...) would require checking the 
actually mapped pages (mixture of anon and !anon folios). Only zapping 
anon folios is the problematic bit, really.

It could be implemented (e.g., fail halfway through while actually 
walking the page tables and zap).

But, yeah ...

> I mean, I assume the ship may have sailed on this, but it feels
> mismatched to me.

... I think that is that case, unfortunately.

I remember that userfaultfd can do some really nasty stuff with 
UFFDIO_COPY and MADV_DONTNEED on memory areas that don't have write 
permissions ... or even read permissions. Not sure if CRIU or other use 
cases depend on that in some weird way.

-- 
Cheers,

David / dhildenb
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by David Hildenbrand 2 months, 1 week ago
On 25.07.25 00:47, David Hildenbrand wrote:
> On 25.07.25 00:29, Kees Cook wrote:
>> On Thu, Jul 24, 2025 at 11:41:04PM +0200, David Hildenbrand wrote:
>>> On 24.07.25 23:32, David Hildenbrand wrote:
>>>>> As an aside, why should discard work in this case even without step 4?
>>>>> Wouldn't setting "read-only" imply you don't want the memory to change
>>>>> out from under you? I guess I'm not clear on the semantics: how do memory
>>>>> protection bits map to madvise actions like this?
>>>>
>>>> They generally don't affect MADV_DONTNEED behavior. The only documented
>>>> (man page) reason for EPERM in the man page is related to MADV_HWPOISON.
>>>>
>>>
>>> (Exception: MADV_POPULATE_READ/MADV_POPULATE_WRITE requires corresponding
>>> permissions)
>>
>> Shouldn't an MADV action that changes memory contents require the W bit
>> though?
> 

Pondering about this some more, at least MADV_DONTNEED is mostly a 
cheaper way of doing mmap(MAP_FIXED): in other word, zap everything but 
leave the original mapping unchanged.

So if you allow for mmap(MAP_FIXED) -- ignore any permissions bits, of 
course -- nothing really wrong about allowing MADV_DONTNEED.

With mseal(), it got all weird I am afraid, because we have this 
exception list, and apparently, it has holes. :(

-- 
Cheers,

David / dhildenb
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by Jeff Xu 2 months, 2 weeks ago
Hi Lorenzo,

This change has two parts: a non-functional refactoring work of moving
function from mseal.c to madvise.c, and a functional change to the
behavior of madvise under mseal. Would you consider splitting the
change into two parts? which seems to be a common practice at
supplying patches in lkml.

On Wed, Jul 16, 2025 at 10:38 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> The madvise() logic is inexplicably performed in mm/mseal.c - this ought
> to be located in mm/madvise.c.
>
This is part one of a non-functional refactoring work to move a
function from mseal.c to madvise.c.

There are two main reasons why I initially wanted to keep all
mseal-related functions in mseal.c:

1 Keeping all mseal related logic in mseal.c makes it easier for
developers to find all the impacted areas of mseal.
2 mseal is not supported in 32 bits, and mseal.c is excluded from 32
bits build (see makefile).This would prevent accidentally using mseal
in code paths shared between 32-bit and 64-bit architectures. It also
avoids adding #Ifdef in the .c file, which is recommended by the mm
coding standard (I received comments about avoiding  #ifdef in .c  in
the past).

Point 2 can go aways if 32 bits mseal support is coming soon, same as
my other comments for patch 1/5.

Point 1 remains. However, I want to focus on the functional change
part of this patch, rather than the other aspects.

> Additionally can_modify_vma_madv() is inconsistently named and, in
> combination with is_ro_anon(), is very confusing logic.
>
> Put a static function in mm/madvise.c instead - can_madvise_modify() -
> that spells out exactly what's happening.  Also explicitly check for an
> anon VMA.
>
> Also add commentary to explain what's going on.
>
> Essentially - we disallow discarding of data in mseal()'d mappings in
> instances where the user couldn't otherwise write to that data.
>
> Shared mappings are always backed, so no discard will actually truly
> discard the data.  Read-only anonymous and MAP_PRIVATE file-backed
> mappings are the ones we are interested in.
>
> We make a change to the logic here to correct a mistake - we must disallow
> discard of read-only MAP_PRIVATE file-backed mappings, which previously we
> were not.
>
> The justification for this change is to account for the case where:
>
> 1. A MAP_PRIVATE R/W file-backed mapping is established.
> 2. The mapping is written to, which backs it with anonymous memory.
> 3. The mapping is mprotect()'d read-only.
> 4. The mapping is mseal()'d.
>
> If we were to now allow discard of this data, it would mean mseal() would
> not prevent the unrecoverable discarding of data and it was thus violate
> the semantics of sealed VMAs.
>
This is the functional change and the most important area that I would
like to discuss in this patch series.

Since Jann Horn first highlighted [1] the problematic behavior of
destructive madvise for anonymous mapping during initial discussions
of mseal(), the proper solution has been open to discussion and
exploration. Linus Torvalds has expressed interest [2] in fixing this
within madvise itself, without requiring mseal, and I copied it here
for ease of reference:

“Hmm. I actually would be happier if we just made that change in
general. Maybe even without sealing, but I agree that it *definitely*
makes sense in general as a sealing thing.”

After mseal was merged, Pedro Falcato raised a valid concern regarding
file-backed private mappings. The issue is that these mappings can be
written to, and subsequently changed to read-only, which is precisely
the problem this patch aims to fix. I attempted to address this in
[3]. However, that patch was rejected due to the introduction of a new
vm_flags, which was a valid rejection as it wasn't the ideal solution.
Nevertheless, it sparked an interesting discussion, with me raising a
point that userspace might use this feature to free up RAM for
file-backed private mapping that is never written to,  and input about
this topic from Vlastimil Babka [4] is about MADV_COLD/MADV_PAGEOUT.

A detail about userspace calling those madvise for file-backed private
mapping. Previously, I added extra logging in the kernel, and observed
many instances of those calls  during Android phone startup, although
I haven’t delved into the reason behind why user space calls those,
e.g. if it is from an individual app or Android platform.

Incidentally, recently while I was studying selinux code, particularly
exemod permission [5] , I learned that selinux utilizes vma->anon_vma
to identify file-backed mappings that have been written to.  Jann
pointed out to me that the kernel creates a COW mapping when a private
file-backed mapping is written. Now I'm wondering if this could be the
answer to our problem. We could try having destructive madvise check
vma->anon_vma and reject the call if it's true. I haven't had a chance
to test this theory yet, though.

To summarize all the discussion points so far:
1. It's questionable behavior for madvise to allow destructive
behavior for read-only anonymous mappings, regardless of mseal state.
2. We could potentially fix point 1 within madvise itself, without
involving mseal, as Linus desires.
3. Android userspace uses destructive madvise to free up RAM, but I
need to take a closer look at the patterns and usage to understand why
they do that.
4. We could ask applications to switch to non-destructive madvise,
like MADV_COLD or MADV_PAGEOUT. Or, another option is that we could
switch the kernel to use non-destructive madvise implicitly for
destructive madvise in suitable situations.
5. We could investigate more based on vma->anon_vma

Link: https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com/
[1]
Link: https://lore.kernel.org/lkml/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/
[2]
Link: https://lore.kernel.org/all/20241017005105.3047458-2-jeffxu@chromium.org/
[3]
Link: https://lore.kernel.org/all/8f68ad82-2f60-49f8-b150-0cf183c9cc71@suse.cz/
[4]
Link: https://elixir.bootlin.com/linux/v6.15.7/source/security/selinux/hooks.c#L3878
[5]

> Finally, update the mseal tests, which were asserting previously that a
> read-only MAP_PRIVATE file-backed mapping could be discarded.
>
The test you are updating is not intended for the scenario this patch
is aimed to fix: i.e. the scenario:
1. A MAP_PRIVATE R/W file-backed mapping is established.
2. The mapping is written to, which backs it with anonymous memory.
3. The mapping is mprotect()'d read-only.
4. The mapping is mseal()'d.

The test case doesn't include steps 1, 2, and 3, is it possible to
keep the existing one and create a new test case? But I don't think
that's the main discussion point. We can revisit test cases once we've
fully discussed the design.

Thanks and regards,
-Jeff
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Reviewed-by: Pedro Falcato <pfalcato@suse.de>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/madvise.c                            | 63 ++++++++++++++++++++++++-
>  mm/mseal.c                              | 49 -------------------
>  mm/vma.h                                |  7 ---
>  tools/testing/selftests/mm/mseal_test.c |  3 +-
>  4 files changed, 63 insertions(+), 59 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 2bf80989d5b6..dc3d8497b0f4 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -19,6 +19,7 @@
>  #include <linux/sched.h>
>  #include <linux/sched/mm.h>
>  #include <linux/mm_inline.h>
> +#include <linux/mmu_context.h>
>  #include <linux/string.h>
>  #include <linux/uio.h>
>  #include <linux/ksm.h>
> @@ -1255,6 +1256,66 @@ static long madvise_guard_remove(struct madvise_behavior *madv_behavior)
>                                &guard_remove_walk_ops, NULL);
>  }
>
> +#ifdef CONFIG_64BIT
> +/* Does the madvise operation result in discarding of mapped data? */
> +static bool is_discard(int behavior)
> +{
> +       switch (behavior) {
> +       case MADV_FREE:
> +       case MADV_DONTNEED:
> +       case MADV_DONTNEED_LOCKED:
> +       case MADV_REMOVE:
> +       case MADV_DONTFORK:
> +       case MADV_WIPEONFORK:
> +       case MADV_GUARD_INSTALL:
> +               return true;
> +       }
> +
> +       return false;
> +}
> +
> +/*
> + * We are restricted from madvise()'ing mseal()'d VMAs only in very particular
> + * circumstances - discarding of data from read-only anonymous SEALED mappings.
> + *
> + * This is because users cannot trivally discard data from these VMAs, and may
> + * only do so via an appropriate madvise() call.
> + */
> +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> +{
> +       struct vm_area_struct *vma = madv_behavior->vma;
> +
> +       /* If the VMA isn't sealed we're good. */
> +       if (can_modify_vma(vma))
> +               return true;
> +
> +       /* For a sealed VMA, we only care about discard operations. */
> +       if (!is_discard(madv_behavior->behavior))
> +               return true;
> +
> +       /*
> +        * But shared mappings are fine, as dirty pages will be written to a
> +        * backing store regardless of discard.
> +        */
> +       if (vma->vm_flags & VM_SHARED)
> +               return true;
> +
> +       /* If the user could write to the mapping anyway, then this is fine. */
> +       if ((vma->vm_flags & VM_WRITE) &&
> +           arch_vma_access_permitted(vma, /* write= */ true,
> +                       /* execute= */ false, /* foreign= */ false))
> +               return true;
> +
> +       /* Otherwise, we are not permitted to perform this operation. */
> +       return false;
> +}
> +#else
> +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> +{
> +       return true;
> +}
> +#endif
> +
>  /*
>   * Apply an madvise behavior to a region of a vma.  madvise_update_vma
>   * will handle splitting a vm area into separate areas, each area with its own
> @@ -1268,7 +1329,7 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
>         struct madvise_behavior_range *range = &madv_behavior->range;
>         int error;
>
> -       if (unlikely(!can_modify_vma_madv(madv_behavior->vma, behavior)))
> +       if (unlikely(!can_madvise_modify(madv_behavior)))
>                 return -EPERM;
>
>         switch (behavior) {
> diff --git a/mm/mseal.c b/mm/mseal.c
> index c27197ac04e8..1308e88ab184 100644
> --- a/mm/mseal.c
> +++ b/mm/mseal.c
> @@ -11,7 +11,6 @@
>  #include <linux/mman.h>
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
> -#include <linux/mmu_context.h>
>  #include <linux/syscalls.h>
>  #include <linux/sched.h>
>  #include "internal.h"
> @@ -21,54 +20,6 @@ static inline void set_vma_sealed(struct vm_area_struct *vma)
>         vm_flags_set(vma, VM_SEALED);
>  }
>
> -static bool is_madv_discard(int behavior)
> -{
> -       switch (behavior) {
> -       case MADV_FREE:
> -       case MADV_DONTNEED:
> -       case MADV_DONTNEED_LOCKED:
> -       case MADV_REMOVE:
> -       case MADV_DONTFORK:
> -       case MADV_WIPEONFORK:
> -       case MADV_GUARD_INSTALL:
> -               return true;
> -       }
> -
> -       return false;
> -}
> -
> -static bool is_ro_anon(struct vm_area_struct *vma)
> -{
> -       /* check anonymous mapping. */
> -       if (vma->vm_file || vma->vm_flags & VM_SHARED)
> -               return false;
> -
> -       /*
> -        * check for non-writable:
> -        * PROT=RO or PKRU is not writeable.
> -        */
> -       if (!(vma->vm_flags & VM_WRITE) ||
> -               !arch_vma_access_permitted(vma, true, false, false))
> -               return true;
> -
> -       return false;
> -}
> -
> -/*
> - * Check if a vma is allowed to be modified by madvise.
> - */
> -bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
> -{
> -       if (!is_madv_discard(behavior))
> -               return true;
> -
> -       if (unlikely(!can_modify_vma(vma) && is_ro_anon(vma)))
> -               return false;
> -
> -       /* Allow by default. */
> -       return true;
> -}
> -
>  static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
>                 struct vm_area_struct **prev, unsigned long start,
>                 unsigned long end, vm_flags_t newflags)
> diff --git a/mm/vma.h b/mm/vma.h
> index acdcc515c459..85db5e880fcc 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -577,8 +577,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
>         return true;
>  }
>
> -bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior);
> -
>  #else
>
>  static inline bool can_modify_vma(struct vm_area_struct *vma)
> @@ -586,11 +584,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
>         return true;
>  }
>
> -static inline bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
> -{
> -       return true;
> -}
> -
>  #endif
>
>  #if defined(CONFIG_STACK_GROWSUP)
> diff --git a/tools/testing/selftests/mm/mseal_test.c b/tools/testing/selftests/mm/mseal_test.c
> index 005f29c86484..34c042da4de2 100644
> --- a/tools/testing/selftests/mm/mseal_test.c
> +++ b/tools/testing/selftests/mm/mseal_test.c
> @@ -1712,7 +1712,6 @@ static void test_seal_discard_ro_anon_on_filebacked(bool seal)
>         unsigned long size = 4 * page_size;
>         int ret;
>         int fd;
> -       unsigned long mapflags = MAP_PRIVATE;
>
>         fd = memfd_create("test", 0);
>         FAIL_TEST_IF_FALSE(fd > 0);
> @@ -1720,7 +1719,7 @@ static void test_seal_discard_ro_anon_on_filebacked(bool seal)
>         ret = fallocate(fd, 0, 0, size);
>         FAIL_TEST_IF_FALSE(!ret);
>
> -       ptr = mmap(NULL, size, PROT_READ, mapflags, fd, 0);
> +       ptr = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
>         FAIL_TEST_IF_FALSE(ptr != MAP_FAILED);
>
>         if (seal) {
> --
> 2.50.1
>
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by Lorenzo Stoakes 2 months, 2 weeks ago
On Thu, Jul 24, 2025 at 11:39:05AM -0700, Jeff Xu wrote:
> Hi Lorenzo,
>
> This change has two parts: a non-functional refactoring work of moving
> function from mseal.c to madvise.c, and a functional change to the
> behavior of madvise under mseal. Would you consider splitting the
> change into two parts? which seems to be a common practice at
> supplying patches in lkml.

No I won't do that.

it's a very small change, and it was intentionally bundled so we correct
the oddly implemented version while moving.

>
> On Wed, Jul 16, 2025 at 10:38 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > The madvise() logic is inexplicably performed in mm/mseal.c - this ought
> > to be located in mm/madvise.c.
> >
> This is part one of a non-functional refactoring work to move a
> function from mseal.c to madvise.c.
>
> There are two main reasons why I initially wanted to keep all
> mseal-related functions in mseal.c:
>
> 1 Keeping all mseal related logic in mseal.c makes it easier for
> developers to find all the impacted areas of mseal.

That's very silly, sorry but no.

You're putting stuff entirely specific to madvise() away from madvise(),
to bundle things up for no really good reason.

Again, you have a desire to do things that are at odds with how everything
else in mm is done.

> 2 mseal is not supported in 32 bits, and mseal.c is excluded from 32
> bits build (see makefile).This would prevent accidentally using mseal
> in code paths shared between 32-bit and 64-bit architectures. It also
> avoids adding #Ifdef in the .c file, which is recommended by the mm
> coding standard (I received comments about avoiding  #ifdef in .c  in
> the past).

You're doing something strictly worse by putting madvise() stuff to bitrot
in another file.

It's always a trade-off.

There's no set 'coding standard', there's what maintainers accept.

>
> Point 2 can go aways if 32 bits mseal support is coming soon, same as
> my other comments for patch 1/5.

But that's not the reason, as commit message states.

>
> Point 1 remains. However, I want to focus on the functional change
> part of this patch, rather than the other aspects.

No point 1 is dimsissed as is point 2.

>
> > Additionally can_modify_vma_madv() is inconsistently named and, in
> > combination with is_ro_anon(), is very confusing logic.
> >
> > Put a static function in mm/madvise.c instead - can_madvise_modify() -
> > that spells out exactly what's happening.  Also explicitly check for an
> > anon VMA.
> >
> > Also add commentary to explain what's going on.
> >
> > Essentially - we disallow discarding of data in mseal()'d mappings in
> > instances where the user couldn't otherwise write to that data.
> >
> > Shared mappings are always backed, so no discard will actually truly
> > discard the data.  Read-only anonymous and MAP_PRIVATE file-backed
> > mappings are the ones we are interested in.
> >
> > We make a change to the logic here to correct a mistake - we must disallow
> > discard of read-only MAP_PRIVATE file-backed mappings, which previously we
> > were not.
> >
> > The justification for this change is to account for the case where:
> >
> > 1. A MAP_PRIVATE R/W file-backed mapping is established.
> > 2. The mapping is written to, which backs it with anonymous memory.
> > 3. The mapping is mprotect()'d read-only.
> > 4. The mapping is mseal()'d.
> >
> > If we were to now allow discard of this data, it would mean mseal() would
> > not prevent the unrecoverable discarding of data and it was thus violate
> > the semantics of sealed VMAs.
> >
> This is the functional change and the most important area that I would
> like to discuss in this patch series.

OK.

>
> Since Jann Horn first highlighted [1] the problematic behavior of
> destructive madvise for anonymous mapping during initial discussions
> of mseal(), the proper solution has been open to discussion and
> exploration. Linus Torvalds has expressed interest [2] in fixing this
> within madvise itself, without requiring mseal, and I copied it here
> for ease of reference:
>
> “Hmm. I actually would be happier if we just made that change in
> general. Maybe even without sealing, but I agree that it *definitely*
> makes sense in general as a sealing thing.”
>
> After mseal was merged, Pedro Falcato raised a valid concern regarding
> file-backed private mappings. The issue is that these mappings can be
> written to, and subsequently changed to read-only, which is precisely
> the problem this patch aims to fix. I attempted to address this in
> [3]. However, that patch was rejected due to the introduction of a new
> vm_flags, which was a valid rejection as it wasn't the ideal solution.
> Nevertheless, it sparked an interesting discussion, with me raising a
> point that userspace might use this feature to free up RAM for
> file-backed private mapping that is never written to,  and input about
> this topic from Vlastimil Babka [4] is about MADV_COLD/MADV_PAGEOUT.

OK not sure what point you're making yet?

>
> A detail about userspace calling those madvise for file-backed private
> mapping. Previously, I added extra logging in the kernel, and observed
> many instances of those calls  during Android phone startup, although
> I haven’t delved into the reason behind why user space calls those,
> e.g. if it is from an individual app or Android platform.

Hang on, sorry, are you saying that you intentionally allowed destruction
of mseal()'d VMAs to serve android?

I hope I'm misunderstanding you here.

Either way I don't know what your point is? Don't mseal them if you want to
perform destructive operations on them?

You have to argue as to why this change is not valid in _upstream_ linux.

>
> Incidentally, recently while I was studying selinux code, particularly
> exemod permission [5] , I learned that selinux utilizes vma->anon_vma
> to identify file-backed mappings that have been written to.  Jann
> pointed out to me that the kernel creates a COW mapping when a private
> file-backed mapping is written. Now I'm wondering if this could be the
> answer to our problem. We could try having destructive madvise check
> vma->anon_vma and reject the call if it's true. I haven't had a chance
> to test this theory yet, though.

Umm what? Why? What are you talking about?

A MAP_PRIVATE mapping will not have VM_SHARED set. This is why I changed
the check to this.

I really don't understand what point you're trying to make here.

The check I've provided works correctly to disallow the previously
_incorrectly allowed_ MAP_PRIVATE mapping of a file-backed mapping.

This was something you missed, and is now corrected.

>
> To summarize all the discussion points so far:
> 1. It's questionable behavior for madvise to allow destructive
> behavior for read-only anonymous mappings, regardless of mseal state.

Umm, ok. Well maybe, but it's essentially uAPI at this point. This feels
irrelevant to this series.

> 2. We could potentially fix point 1 within madvise itself, without
> involving mseal, as Linus desires.

No, we will not do that.

> 3. Android userspace uses destructive madvise to free up RAM, but I
> need to take a closer look at the patterns and usage to understand why
> they do that.

OK so what?

> 4. We could ask applications to switch to non-destructive madvise,
> like MADV_COLD or MADV_PAGEOUT. Or, another option is that we could
> switch the kernel to use non-destructive madvise implicitly for
> destructive madvise in suitable situations.

Umm what? I don't understand your point.

> 5. We could investigate more based on vma->anon_vma

I have no idea what you mean by this. I am an rmap maintainer and have
worked extensively with anon_vma, what's the point exactly?

>
> Link: https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com/
> [1]
> Link: https://lore.kernel.org/lkml/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/
> [2]
> Link: https://lore.kernel.org/all/20241017005105.3047458-2-jeffxu@chromium.org/
> [3]
> Link: https://lore.kernel.org/all/8f68ad82-2f60-49f8-b150-0cf183c9cc71@suse.cz/
> [4]
> Link: https://elixir.bootlin.com/linux/v6.15.7/source/security/selinux/hooks.c#L3878
> [5]
>
> > Finally, update the mseal tests, which were asserting previously that a
> > read-only MAP_PRIVATE file-backed mapping could be discarded.
> >
> The test you are updating is not intended for the scenario this patch
> is aimed to fix: i.e. the scenario:
> 1. A MAP_PRIVATE R/W file-backed mapping is established.
> 2. The mapping is written to, which backs it with anonymous memory.
> 3. The mapping is mprotect()'d read-only.
> 4. The mapping is mseal()'d.
>
> The test case doesn't include steps 1, 2, and 3, is it possible to
> keep the existing one and create a new test case? But I don't think
> that's the main discussion point. We can revisit test cases once we've
> fully discussed the design.

I do not know why you put this here. Can you please put review along side
the code you're reviewing.

You're making my life unnecessarily hard here.

But yes, you're right, I messed up the test here, I'll send a fix-patch.

Incidentally, it seems like the test _explicitly_ asserted that this
behaviour was the opposite of what it should be which makes me think again
this might be intentional... Concerning.

>
> Thanks and regards,
> -Jeff
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > Reviewed-by: Pedro Falcato <pfalcato@suse.de>
> > Acked-by: David Hildenbrand <david@redhat.com>
> > ---
> >  mm/madvise.c                            | 63 ++++++++++++++++++++++++-
> >  mm/mseal.c                              | 49 -------------------
> >  mm/vma.h                                |  7 ---
> >  tools/testing/selftests/mm/mseal_test.c |  3 +-
> >  4 files changed, 63 insertions(+), 59 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 2bf80989d5b6..dc3d8497b0f4 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -19,6 +19,7 @@
> >  #include <linux/sched.h>
> >  #include <linux/sched/mm.h>
> >  #include <linux/mm_inline.h>
> > +#include <linux/mmu_context.h>
> >  #include <linux/string.h>
> >  #include <linux/uio.h>
> >  #include <linux/ksm.h>
> > @@ -1255,6 +1256,66 @@ static long madvise_guard_remove(struct madvise_behavior *madv_behavior)
> >                                &guard_remove_walk_ops, NULL);
> >  }
> >
> > +#ifdef CONFIG_64BIT
> > +/* Does the madvise operation result in discarding of mapped data? */
> > +static bool is_discard(int behavior)
> > +{
> > +       switch (behavior) {
> > +       case MADV_FREE:
> > +       case MADV_DONTNEED:
> > +       case MADV_DONTNEED_LOCKED:
> > +       case MADV_REMOVE:
> > +       case MADV_DONTFORK:
> > +       case MADV_WIPEONFORK:
> > +       case MADV_GUARD_INSTALL:
> > +               return true;
> > +       }
> > +
> > +       return false;
> > +}
> > +
> > +/*
> > + * We are restricted from madvise()'ing mseal()'d VMAs only in very particular
> > + * circumstances - discarding of data from read-only anonymous SEALED mappings.
> > + *
> > + * This is because users cannot trivally discard data from these VMAs, and may
> > + * only do so via an appropriate madvise() call.
> > + */
> > +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> > +{
> > +       struct vm_area_struct *vma = madv_behavior->vma;
> > +
> > +       /* If the VMA isn't sealed we're good. */
> > +       if (can_modify_vma(vma))
> > +               return true;
> > +
> > +       /* For a sealed VMA, we only care about discard operations. */
> > +       if (!is_discard(madv_behavior->behavior))
> > +               return true;
> > +
> > +       /*
> > +        * But shared mappings are fine, as dirty pages will be written to a
> > +        * backing store regardless of discard.
> > +        */
> > +       if (vma->vm_flags & VM_SHARED)
> > +               return true;
> > +
> > +       /* If the user could write to the mapping anyway, then this is fine. */
> > +       if ((vma->vm_flags & VM_WRITE) &&
> > +           arch_vma_access_permitted(vma, /* write= */ true,
> > +                       /* execute= */ false, /* foreign= */ false))
> > +               return true;
> > +
> > +       /* Otherwise, we are not permitted to perform this operation. */
> > +       return false;
> > +}
> > +#else
> > +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> > +{
> > +       return true;
> > +}
> > +#endif
> > +
> >  /*
> >   * Apply an madvise behavior to a region of a vma.  madvise_update_vma
> >   * will handle splitting a vm area into separate areas, each area with its own
> > @@ -1268,7 +1329,7 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
> >         struct madvise_behavior_range *range = &madv_behavior->range;
> >         int error;
> >
> > -       if (unlikely(!can_modify_vma_madv(madv_behavior->vma, behavior)))
> > +       if (unlikely(!can_madvise_modify(madv_behavior)))
> >                 return -EPERM;
> >
> >         switch (behavior) {
> > diff --git a/mm/mseal.c b/mm/mseal.c
> > index c27197ac04e8..1308e88ab184 100644
> > --- a/mm/mseal.c
> > +++ b/mm/mseal.c
> > @@ -11,7 +11,6 @@
> >  #include <linux/mman.h>
> >  #include <linux/mm.h>
> >  #include <linux/mm_inline.h>
> > -#include <linux/mmu_context.h>
> >  #include <linux/syscalls.h>
> >  #include <linux/sched.h>
> >  #include "internal.h"
> > @@ -21,54 +20,6 @@ static inline void set_vma_sealed(struct vm_area_struct *vma)
> >         vm_flags_set(vma, VM_SEALED);
> >  }
> >
> > -static bool is_madv_discard(int behavior)
> > -{
> > -       switch (behavior) {
> > -       case MADV_FREE:
> > -       case MADV_DONTNEED:
> > -       case MADV_DONTNEED_LOCKED:
> > -       case MADV_REMOVE:
> > -       case MADV_DONTFORK:
> > -       case MADV_WIPEONFORK:
> > -       case MADV_GUARD_INSTALL:
> > -               return true;
> > -       }
> > -
> > -       return false;
> > -}
> > -
> > -static bool is_ro_anon(struct vm_area_struct *vma)
> > -{
> > -       /* check anonymous mapping. */
> > -       if (vma->vm_file || vma->vm_flags & VM_SHARED)
> > -               return false;
> > -
> > -       /*
> > -        * check for non-writable:
> > -        * PROT=RO or PKRU is not writeable.
> > -        */
> > -       if (!(vma->vm_flags & VM_WRITE) ||
> > -               !arch_vma_access_permitted(vma, true, false, false))
> > -               return true;
> > -
> > -       return false;
> > -}
> > -
> > -/*
> > - * Check if a vma is allowed to be modified by madvise.
> > - */
> > -bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
> > -{
> > -       if (!is_madv_discard(behavior))
> > -               return true;
> > -
> > -       if (unlikely(!can_modify_vma(vma) && is_ro_anon(vma)))
> > -               return false;
> > -
> > -       /* Allow by default. */
> > -       return true;
> > -}
> > -
> >  static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >                 struct vm_area_struct **prev, unsigned long start,
> >                 unsigned long end, vm_flags_t newflags)
> > diff --git a/mm/vma.h b/mm/vma.h
> > index acdcc515c459..85db5e880fcc 100644
> > --- a/mm/vma.h
> > +++ b/mm/vma.h
> > @@ -577,8 +577,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
> >         return true;
> >  }
> >
> > -bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior);
> > -
> >  #else
> >
> >  static inline bool can_modify_vma(struct vm_area_struct *vma)
> > @@ -586,11 +584,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
> >         return true;
> >  }
> >
> > -static inline bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
> > -{
> > -       return true;
> > -}
> > -
> >  #endif
> >
> >  #if defined(CONFIG_STACK_GROWSUP)
> > diff --git a/tools/testing/selftests/mm/mseal_test.c b/tools/testing/selftests/mm/mseal_test.c
> > index 005f29c86484..34c042da4de2 100644
> > --- a/tools/testing/selftests/mm/mseal_test.c
> > +++ b/tools/testing/selftests/mm/mseal_test.c
> > @@ -1712,7 +1712,6 @@ static void test_seal_discard_ro_anon_on_filebacked(bool seal)
> >         unsigned long size = 4 * page_size;
> >         int ret;
> >         int fd;
> > -       unsigned long mapflags = MAP_PRIVATE;
> >
> >         fd = memfd_create("test", 0);
> >         FAIL_TEST_IF_FALSE(fd > 0);
> > @@ -1720,7 +1719,7 @@ static void test_seal_discard_ro_anon_on_filebacked(bool seal)
> >         ret = fallocate(fd, 0, 0, size);
> >         FAIL_TEST_IF_FALSE(!ret);
> >
> > -       ptr = mmap(NULL, size, PROT_READ, mapflags, fd, 0);
> > +       ptr = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
> >         FAIL_TEST_IF_FALSE(ptr != MAP_FAILED);
> >
> >         if (seal) {
> > --
> > 2.50.1
> >
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by David Hildenbrand 2 months, 1 week ago
> 
>> 4. We could ask applications to switch to non-destructive madvise,
>> like MADV_COLD or MADV_PAGEOUT. Or, another option is that we could
>> switch the kernel to use non-destructive madvise implicitly for
>> destructive madvise in suitable situations.
> 
> Umm what? I don't understand your point.
> 
>> 5. We could investigate more based on vma->anon_vma
> 
> I have no idea what you mean by this. I am an rmap maintainer and have
> worked extensively with anon_vma, what's the point exactly?

I think, the idea would be to add an additional anon_vma check: so if 
you have a MAP_PRIVATE file mapping, you could still allow for 
MADV_DONTNEED if you are sure that there are no anon folios in there.

If there is an anon_vma, the only way to find out is actually looking at 
the page tables.

To be completely precise, one would have to enlighten the zap logic to 
refuse to zap if there is any anon folio there, and bail out.

-- 
Cheers,

David / dhildenb
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by Jeff Xu 2 months, 1 week ago
On Thu, Jul 24, 2025 at 2:53 PM David Hildenbrand <david@redhat.com> wrote:
>
> >
> >> 4. We could ask applications to switch to non-destructive madvise,
> >> like MADV_COLD or MADV_PAGEOUT. Or, another option is that we could
> >> switch the kernel to use non-destructive madvise implicitly for
> >> destructive madvise in suitable situations.
> >
> > Umm what? I don't understand your point.
> >
> >> 5. We could investigate more based on vma->anon_vma
> >
> > I have no idea what you mean by this. I am an rmap maintainer and have
> > worked extensively with anon_vma, what's the point exactly?
>
> I think, the idea would be to add an additional anon_vma check: so if
> you have a MAP_PRIVATE file mapping, you could still allow for
> MADV_DONTNEED if you are sure that there are no anon folios in there.
>
Yes. That is the theory, thanks for clarifying it for me.
This is exactly what I was trying to say in my previous response:

"We could try having destructive madvise check
vma->anon_vma and reject the call if it's true. I haven't had a chance
to test this theory yet, though."


> If there is an anon_vma, the only way to find out is actually looking at
> the page tables.
>
> To be completely precise, one would have to enlighten the zap logic to
> refuse to zap if there is any anon folio there, and bail out.
>
> --
> Cheers,
>
> David / dhildenb
>
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by Lorenzo Stoakes 2 months, 1 week ago
On Thu, Jul 24, 2025 at 11:53:52PM +0200, David Hildenbrand wrote:
> >
> > > 4. We could ask applications to switch to non-destructive madvise,
> > > like MADV_COLD or MADV_PAGEOUT. Or, another option is that we could
> > > switch the kernel to use non-destructive madvise implicitly for
> > > destructive madvise in suitable situations.
> >
> > Umm what? I don't understand your point.
> >
> > > 5. We could investigate more based on vma->anon_vma
> >
> > I have no idea what you mean by this. I am an rmap maintainer and have
> > worked extensively with anon_vma, what's the point exactly?
>
> I think, the idea would be to add an additional anon_vma check: so if you
> have a MAP_PRIVATE file mapping, you could still allow for MADV_DONTNEED if
> you are sure that there are no anon folios in there.

OK this is a more coherent explanation of what this means, thanks.

In no other case are we checking if there is data there that is different from
post-discard, so this would be inconsistent with other disallowed madvise()
modes.

Equally, to me setting mprotect(PROT_READ) then mseal()'ing is a contract, and
adding a 'but we let you discard if we go check and it's fine' feels like really
inconsistent semantics.

We're dealing with a real edge-case scenario here of a MAP_PRIVATE mapping
(which means you are essentially asking for anon) being intentionally marked
read-only then sealed.

I think it's _better_ to be clearer on this.

>
> If there is an anon_vma, the only way to find out is actually looking at the
> page tables.
>
> To be completely precise, one would have to enlighten the zap logic to
> refuse to zap if there is any anon folio there, and bail out.

Yeah absolutely not this would be crazy, especially for such an edge case.

I'm sure you agree :)

>
> --
> Cheers,
>
> David / dhildenb
>
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by David Hildenbrand 2 months, 2 weeks ago
> 
> To summarize all the discussion points so far:
> 1. It's questionable behavior for madvise to allow destructive
> behavior for read-only anonymous mappings, regardless of mseal state.
 > 2. We could potentially fix point 1 within madvise itself, without> 
involving mseal, as Linus desires.

IIUC: disallow madvise(MADV_DONTNEED) without PROT_WRITE.

I am 99.99999% sure that that would break user case, unfortunately.

> 3. Android userspace uses destructive madvise to free up RAM, but I
> need to take a closer look at the patterns and usage to understand why
> they do that.

I am shocked that you question why they would use MADV_DONTNEED instead 
of ...

 > 4. We could ask applications to switch to non-destructive madvise,> 
like MADV_COLD or MADV_PAGEOUT. Or, another option is that we could
> switch the kernel to use non-destructive madvise implicitly for
> destructive madvise in suitable situations.

... MADV_COLD / MADV_PAGEOUT.

I am also shocked that you think asking apps to switch would not make us 
break user space.

> 5. We could investigate more based on vma->anon_vma

Or we do what sealing is supposed to do.

With the hope that this sealing fix here would not break user space.

-- 
Cheers,

David / dhildenb
Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
Posted by David Hildenbrand 2 months, 1 week ago
On 24.07.25 20:56, David Hildenbrand wrote:
>>
>> To summarize all the discussion points so far:
>> 1. It's questionable behavior for madvise to allow destructive
>> behavior for read-only anonymous mappings, regardless of mseal state.
>   > 2. We could potentially fix point 1 within madvise itself, without>
> involving mseal, as Linus desires.
> 
> IIUC: disallow madvise(MADV_DONTNEED) without PROT_WRITE.
> 
> I am 99.99999% sure that that would break user case, unfortunately.
> 
>> 3. Android userspace uses destructive madvise to free up RAM, but I
>> need to take a closer look at the patterns and usage to understand why
>> they do that.
> 
> I am shocked that you question why they would use MADV_DONTNEED instead
> of ...
> 
>   > 4. We could ask applications to switch to non-destructive madvise,>
> like MADV_COLD or MADV_PAGEOUT. Or, another option is that we could
>> switch the kernel to use non-destructive madvise implicitly for
>> destructive madvise in suitable situations.
> 
> ... MADV_COLD / MADV_PAGEOUT.
> 
> I am also shocked that you think asking apps to switch would not make us
> break user space.
 > >> 5. We could investigate more based on vma->anon_vma
> 
> Or we do what sealing is supposed to do.

Sorry for the rather hard replies, I was not understanding at all what 
you were getting at really.

> 
> With the hope that this sealing fix here would not break user space.

Is your concern that something (in Chrome?) would be relying on 
MADV_DONTNEED working in case we had a MAP_PRIVATE R/O file mapping?

Again, disallowing that completely (even without mseal()) would break 
user space, I am very sure.

Whether we should allow zapping *anonymous folios* in MAP_PRIVATE R/O 
file mapping is a good  question, hard to tell if that would break anything.

For zapping *anonymous folios* in MAP_PRIVATE R/O anon mappings, I am 
sure there are use cases around userfaultfd, I'm afraid ...

-- 
Cheers,

David / dhildenb