[PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions

Posted by Lorenzo Stoakes 3 months, 1 week ago
Currently, if a user needs to determine if guard regions are present in a
range, they have to scan all VMAs (or have knowledge of which ones might
have guard regions).

Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
pagemap") and the related commit a516403787e0 ("fs/proc: extend the
PAGEMAP_SCAN ioctl to report guard regions"), users can use either
/proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
operation at a virtual address level.
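
For instance, a PAGEMAP_SCAN query for guard regions looks roughly like the
below (illustrative sketch only, error handling omitted; 'start'/'end' are the
range of interest and the definitions come from the linux/fs.h uapi header):

	struct page_region regions[16];
	struct pm_scan_arg arg = {
		.size          = sizeof(arg),
		.start         = start,
		.end           = end,
		.vec           = (unsigned long)regions,
		.vec_len       = 16,
		.category_mask = PAGE_IS_GUARD,	/* only match guard marker ranges */
		.return_mask   = PAGE_IS_GUARD,
	};
	int fd = open("/proc/self/pagemap", O_RDONLY);
	int nr = ioctl(fd, PAGEMAP_SCAN, &arg);	/* > 0: number of matching regions */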

This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
that guard regions exist in ranges.

This patch remedies the situation by establishing a new VMA flag,
VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
uncertain because we cannot reasonably determine whether a
MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
additionally VMAs may change across merge/split).

We utilise 0x800 for this flag which makes it available to 32-bit
architectures also, a flag that was previously used by VM_DENYWRITE, which
was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
bee reused yet.

The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
lock (and also VMA write lock) whereas previously it did not, but this
seems a reasonable overhead.

We also update the smaps logic and documentation to identify these VMAs.

Another major use of this functionality is that we can use it to identify
that we ought to copy page tables on fork.

For anonymous mappings this is inherent, however since commit f807123d578d
("mm: allow guard regions in file-backed and read-only mappings"), which
allowed file-backed guard regions, we have unfortunately had to enforce
this behaviour by setting vma->anon_vma to force page table copying.

The existence of this flag removes the need for this, so we simply update
vma_needs_copy() to check for this flag instead.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 Documentation/filesystems/proc.rst |  1 +
 fs/proc/task_mmu.c                 |  1 +
 include/linux/mm.h                 |  1 +
 include/trace/events/mmflags.h     |  1 +
 mm/madvise.c                       | 22 ++++++++++++++--------
 mm/memory.c                        |  4 ++++
 tools/testing/vma/vma_internal.h   |  1 +
 7 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 0b86a8022fa1..b8a423ca590a 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -591,6 +591,7 @@ encoded manner. The codes are the following:
     sl    sealed
     lf    lock on fault pages
     dp    always lazily freeable mapping
+    gu    maybe contains guard regions (if not set, definitely doesn't)
     ==    =======================================
 
 Note that there is no guarantee that every flag and associated mnemonic will
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fc35a0543f01..db16ed91c269 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1146,6 +1146,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_MAYSHARE)]	= "ms",
 		[ilog2(VM_GROWSDOWN)]	= "gd",
 		[ilog2(VM_PFNMAP)]	= "pf",
+		[ilog2(VM_MAYBE_GUARD)]	= "gu",
 		[ilog2(VM_LOCKED)]	= "lo",
 		[ilog2(VM_IO)]		= "io",
 		[ilog2(VM_SEQ_READ)]	= "sr",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index aada935c4950..f963afa1b9de 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -296,6 +296,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_UFFD_MISSING	0
 #endif /* CONFIG_MMU */
 #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
+#define VM_MAYBE_GUARD	0x00000800	/* The VMA maybe contains guard regions. */
 #define VM_UFFD_WP	0x00001000	/* wrprotect pages tracking */
 
 #define VM_LOCKED	0x00002000
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index aa441f593e9a..a6e5a44c9b42 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -213,6 +213,7 @@ IF_HAVE_PG_ARCH_3(arch_3)
 	{VM_UFFD_MISSING,		"uffd_missing"	},		\
 IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR,	"uffd_minor"	)		\
 	{VM_PFNMAP,			"pfnmap"	},		\
+	{VM_MAYBE_GUARD,		"maybe_guard"	},		\
 	{VM_UFFD_WP,			"uffd_wp"	},		\
 	{VM_LOCKED,			"locked"	},		\
 	{VM_IO,				"io"		},		\
diff --git a/mm/madvise.c b/mm/madvise.c
index fb1c86e630b6..216ae6ed344e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1141,15 +1141,22 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
 		return -EINVAL;
 
 	/*
-	 * If we install guard markers, then the range is no longer
-	 * empty from a page table perspective and therefore it's
-	 * appropriate to have an anon_vma.
+	 * It would be confusing for anonymous mappings to have page table
+	 * entries but no anon_vma established, so ensure that it is.
+	 */
+	if (vma_is_anonymous(vma))
+		anon_vma_prepare(vma);
+
+	/*
+	 * Indicate that the VMA may contain guard regions, making it visible to
+	 * the user that a VMA may contain these, narrowing down the range which
+	 * must be scanned in order to detect them.
 	 *
-	 * This ensures that on fork, we copy page tables correctly.
+	 * This additionally causes page tables to be copied on fork regardless
+	 * of whether the VMA is anonymous or not, correctly preserving the
+	 * guard region page table entries.
 	 */
-	err = anon_vma_prepare(vma);
-	if (err)
-		return err;
+	vm_flags_set(vma, VM_MAYBE_GUARD);
 
 	/*
 	 * Optimistically try to install the guard marker pages first. If any
@@ -1709,7 +1716,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
 	case MADV_COLLAPSE:
-	case MADV_GUARD_INSTALL:
 	case MADV_GUARD_REMOVE:
 		return MADVISE_MMAP_READ_LOCK;
 	case MADV_DONTNEED:
diff --git a/mm/memory.c b/mm/memory.c
index 4c3a7e09a159..a2c79ee43d68 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1478,6 +1478,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 	if (src_vma->anon_vma)
 		return true;
 
+	/* Guard regions have modified page tables that require copying. */
+	if (src_vma->vm_flags & VM_MAYBE_GUARD)
+		return true;
+
 	/*
 	 * Don't copy ptes where a page fault will fill them correctly.  Fork
 	 * becomes much lighter when there are big shared or private readonly
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index d873667704e8..e40c93edc5a7 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -56,6 +56,7 @@ extern unsigned long dac_mmap_min_addr;
 #define VM_MAYEXEC	0x00000040
 #define VM_GROWSDOWN	0x00000100
 #define VM_PFNMAP	0x00000400
+#define VM_MAYBE_GUARD	0x00000800
 #define VM_LOCKED	0x00002000
 #define VM_IO           0x00004000
 #define VM_SEQ_READ	0x00008000	/* App will access data sequentially */
-- 
2.51.0
Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
Posted by Pedro Falcato 3 months, 1 week ago
On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
> Currently, if a user needs to determine if guard regions are present in a
> range, they have to scan all VMAs (or have knowledge of which ones might
> have guard regions).
> 
> Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
> pagemap") and the related commit a516403787e0 ("fs/proc: extend the
> PAGEMAP_SCAN ioctl to report guard regions"), users can use either
> /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
> operation at a virtual address level.
> 
> This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
> that guard regions exist in ranges.
> 
> This patch remedies the situation by establishing a new VMA flag,
> VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
> uncertain because we cannot reasonably determine whether a
> MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
> additionally VMAs may change across merge/split).
> 
> We utilise 0x800 for this flag which makes it available to 32-bit
> architectures also, a flag that was previously used by VM_DENYWRITE, which
> was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
> bee reused yet.
> 
> The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
> lock (and also VMA write lock) whereas previously it did not, but this
> seems a reasonable overhead.

Do you though? Could it be possible to simply atomically set the flag with
the read lock held? This would make it so we can't split the VMA (and tightly
define what "may have a guard page"), but it sounds much better than introducing
lock contention. I don't think it is reasonable to add a write lock to a feature
that may be used by such things as thread stack allocation, malloc, etc.

-- 
Pedro
Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
Posted by Lorenzo Stoakes 3 months, 1 week ago
On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:
> On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
> > Currently, if a user needs to determine if guard regions are present in a
> > range, they have to scan all VMAs (or have knowledge of which ones might
> > have guard regions).
> >
> > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
> > pagemap") and the related commit a516403787e0 ("fs/proc: extend the
> > PAGEMAP_SCAN ioctl to report guard regions"), users can use either
> > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
> > operation at a virtual address level.
> >
> > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
> > that guard regions exist in ranges.
> >
> > This patch remedies the situation by establishing a new VMA flag,
> > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
> > uncertain because we cannot reasonably determine whether a
> > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
> > additionally VMAs may change across merge/split).
> >
> > We utilise 0x800 for this flag which makes it available to 32-bit
> > architectures also, a flag that was previously used by VM_DENYWRITE, which
> > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
> > bee reused yet.
> >
> > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
> > lock (and also VMA write lock) whereas previously it did not, but this
> > seems a reasonable overhead.
>
> Do you though? Could it be possible to simply atomically set the flag with
> the read lock held? This would make it so we can't split the VMA (and tightly

VMA flags are not accessed atomically so no I don't think we can do that in any
workable way.

I also don't think it's at all necessary, see below.

> define what "may have a guard page"), but it sounds much better than introducing
> lock contention. I don't think it is reasonable to add a write lock to a feature
> that may be used by such things as thread stack allocation, malloc, etc.

What lock contention? It's per-VMA so the contention is limited to the VMA in
question, and only over the span of time you are setting the guard region.

When allocating thread stacks you'll be mapping things into memory which... take
the write lock. malloc() if it goes to the kernel will also take the write lock.

So I think you're overly worried about an operation that a. isn't going to be
something that happens all that often, b. when it does, it's at a time when
you'd be taking write locks anyway and c. won't contend important stuff like
page faults for any VMA other than the one having the guard region
installed.

>
> --
> Pedro

Thanks, Lorenzo
Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
Posted by Pedro Falcato 3 months, 1 week ago
On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:
> On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:
> > On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
> > > Currently, if a user needs to determine if guard regions are present in a
> > > range, they have to scan all VMAs (or have knowledge of which ones might
> > > have guard regions).
> > >
> > > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
> > > pagemap") and the related commit a516403787e0 ("fs/proc: extend the
> > > PAGEMAP_SCAN ioctl to report guard regions"), users can use either
> > > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
> > > operation at a virtual address level.
> > >
> > > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
> > > that guard regions exist in ranges.
> > >
> > > This patch remedies the situation by establishing a new VMA flag,
> > > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
> > > uncertain because we cannot reasonably determine whether a
> > > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
> > > additionally VMAs may change across merge/split).
> > >
> > > We utilise 0x800 for this flag which makes it available to 32-bit
> > > architectures also, a flag that was previously used by VM_DENYWRITE, which
> > > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
> > > bee reused yet.
> > >
> > > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
> > > lock (and also VMA write lock) whereas previously it did not, but this
> > > seems a reasonable overhead.
> >
> > Do you though? Could it be possible to simply atomically set the flag with
> > the read lock held? This would make it so we can't split the VMA (and tightly
> 
> VMA flags are not accessed atomically so no I don't think we can do that in any
> workable way.
>

FWIW I think you could work it as an atomic flag and treat those races as benign
(this one, at least).

> I also don't think it's at all necessary, see below.
> 
> > define what "may have a guard page"), but it sounds much better than introducing
> > lock contention. I don't think it is reasonable to add a write lock to a feature
> > that may be used by such things as thread stack allocation, malloc, etc.
> 
> What lock contention? It's per-VMA so the contention is limited to the VMA in
> question, and only over the span of time you are setting the gaurd region.

Don't we always need to take the mmap write lock when grabbing a VMA write
lock as well?

> When allocating thread stacks you'll be mapping things into memory which... take
> the write lock. malloc() if it goes to the kernel will also take the write lock.
>

But yes, good point, you're already serializing anyway. I don't think this is
a big deal.

> So I think you're overly worried about an operation that a. isn't going to be
> something that happens all that often, b. when it does, it's at a time when
> you'd be taking write locks anyway and c. won't contend important stuff like
> page faults for any VMA other than the one having the the guard region
> installed.

Yep, thanks.

-- 
Pedro
Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
Posted by Lorenzo Stoakes 3 months, 1 week ago
On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:
> On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:
> > On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:
> > > On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
> > > > Currently, if a user needs to determine if guard regions are present in a
> > > > range, they have to scan all VMAs (or have knowledge of which ones might
> > > > have guard regions).
> > > >
> > > > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
> > > > pagemap") and the related commit a516403787e0 ("fs/proc: extend the
> > > > PAGEMAP_SCAN ioctl to report guard regions"), users can use either
> > > > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
> > > > operation at a virtual address level.
> > > >
> > > > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
> > > > that guard regions exist in ranges.
> > > >
> > > > This patch remedies the situation by establishing a new VMA flag,
> > > > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
> > > > uncertain because we cannot reasonably determine whether a
> > > > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
> > > > additionally VMAs may change across merge/split).
> > > >
> > > > We utilise 0x800 for this flag which makes it available to 32-bit
> > > > architectures also, a flag that was previously used by VM_DENYWRITE, which
> > > > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
> > > > bee reused yet.
> > > >
> > > > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
> > > > lock (and also VMA write lock) whereas previously it did not, but this
> > > > seems a reasonable overhead.
> > >
> > > Do you though? Could it be possible to simply atomically set the flag with
> > > the read lock held? This would make it so we can't split the VMA (and tightly
> >
> > VMA flags are not accessed atomically so no I don't think we can do that in any
> > workable way.
> >
>
> FWIW I think you could work it as an atomic flag and treat those races as benign
> (this one, at least).

It's not benign as we need to ensure that page tables are correctly propagated
on fork.

>
> > I also don't think it's at all necessary, see below.
> >
> > > define what "may have a guard page"), but it sounds much better than introducing
> > > lock contention. I don't think it is reasonable to add a write lock to a feature
> > > that may be used by such things as thread stack allocation, malloc, etc.
> >
> > What lock contention? It's per-VMA so the contention is limited to the VMA in
> > question, and only over the span of time you are setting the gaurd region.
>
> Don't we always need to take the mmap write lock when grabbing a VMA write
> lock as well?

Yup. But at the same time you're doing the kind of operations that'd use this
you'd already be taking the lock anyway.

You don't hold it for long and you won't be doing this any more often than you'd
be doing other write operations, which you're also not going to be holding up
faults on other VMAs either (they can access other VMAs despite mmap write lock
being held), so I don't think there's any issue here.

>
> > When allocating thread stacks you'll be mapping things into memory which... take
> > the write lock. malloc() if it goes to the kernel will also take the write lock.
> >
>
> But yes, good point, you're already serializing anyway. I don't think this is
> a big deal.

Indeed

>
> > So I think you're overly worried about an operation that a. isn't going to be
> > something that happens all that often, b. when it does, it's at a time when
> > you'd be taking write locks anyway and c. won't contend important stuff like
> > page faults for any VMA other than the one having the the guard region
> > installed.
>
> Yep, thanks.

No problemo, you can get yourself into sticky situations with lock contention
but I think this is not one! :)

>
> --
> Pedro

Cheers, Lorenzo
Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
Posted by Vlastimil Babka 3 months, 1 week ago
On 10/30/25 17:43, Lorenzo Stoakes wrote:
> On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:
>> On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:
>> > On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:
>> > > On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
>> > > > Currently, if a user needs to determine if guard regions are present in a
>> > > > range, they have to scan all VMAs (or have knowledge of which ones might
>> > > > have guard regions).
>> > > >
>> > > > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
>> > > > pagemap") and the related commit a516403787e0 ("fs/proc: extend the
>> > > > PAGEMAP_SCAN ioctl to report guard regions"), users can use either
>> > > > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
>> > > > operation at a virtual address level.
>> > > >
>> > > > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
>> > > > that guard regions exist in ranges.
>> > > >
>> > > > This patch remedies the situation by establishing a new VMA flag,
>> > > > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
>> > > > uncertain because we cannot reasonably determine whether a
>> > > > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
>> > > > additionally VMAs may change across merge/split).
>> > > >
>> > > > We utilise 0x800 for this flag which makes it available to 32-bit
>> > > > architectures also, a flag that was previously used by VM_DENYWRITE, which
>> > > > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
>> > > > bee reused yet.
>> > > >
>> > > > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
>> > > > lock (and also VMA write lock) whereas previously it did not, but this
>> > > > seems a reasonable overhead.
>> > >
>> > > Do you though? Could it be possible to simply atomically set the flag with
>> > > the read lock held? This would make it so we can't split the VMA (and tightly
>> >
>> > VMA flags are not accessed atomically so no I don't think we can do that in any
>> > workable way.
>> >
>>
>> FWIW I think you could work it as an atomic flag and treat those races as benign
>> (this one, at least).
> 
> It's not benign as we need to ensure that page tables are correctly propagated
> on fork.

Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement
over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag
setting? I think the places that could race with us to cause RMW use vma
write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for
the oldmm) and it probably wouldn't make sense to start doing it. Maybe we
could think of something to deal with this special case...

>>
>> > I also don't think it's at all necessary, see below.
>> >
>> > > define what "may have a guard page"), but it sounds much better than introducing
>> > > lock contention. I don't think it is reasonable to add a write lock to a feature
>> > > that may be used by such things as thread stack allocation, malloc, etc.
>> >
>> > What lock contention? It's per-VMA so the contention is limited to the VMA in
>> > question, and only over the span of time you are setting the gaurd region.
>>
>> Don't we always need to take the mmap write lock when grabbing a VMA write
>> lock as well?
> 
> Yup. But at the same time you're doing the kind of operations that'd use this
> you'd already be taking the lock anyway.
> 
> You don't hold it for long and you won't be doing this any more often than you'd
> be doing other write operations, which you're also not going to be holding up
> faults on other VMAs either (they can access other VMAs despite mmap write lock
> being held), so I don't think there's ay issue here.
> 
>>
>> > When allocating thread stacks you'll be mapping things into memory which... take
>> > the write lock. malloc() if it goes to the kernel will also take the write lock.
>> >
>>
>> But yes, good point, you're already serializing anyway. I don't think this is
>> a big deal.
> 
> Indeed

Besides thread stacks, I'm thinking of the userspace allocators usecase
(jemalloc etc) where guard regions were supposed to allow a cheap
use-after-free protection by replacing their existing
MADV_DONTNEED/MADV_FREE use with MADV_GUARD_INSTALL.
Now MADV_DONTNEED/MADV_FREE use MADVISE_VMA_READ_LOCK, and guard region
install moves from (worse but still reasonable) MADVISE_MMAP_READ_LOCK to the
heavy MADVISE_MMAP_WRITE_LOCK. I'm afraid this might be too heavy a price for
that usecase :(
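
Roughly the allocator pattern I mean above (purely illustrative):

	/*
	 * free(): poison the chunk so any use-after-free access faults,
	 * rather than just MADV_DONTNEED/MADV_FREE-ing it:
	 */
	madvise(chunk, chunk_size, MADV_GUARD_INSTALL);

	/* ...and on reuse, lift the guard before handing the chunk out again: */
	madvise(chunk, chunk_size, MADV_GUARD_REMOVE);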

>>
>> > So I think you're overly worried about an operation that a. isn't going to be
>> > something that happens all that often, b. when it does, it's at a time when
>> > you'd be taking write locks anyway and c. won't contend important stuff like
>> > page faults for any VMA other than the one having the the guard region
>> > installed.
>>
>> Yep, thanks.
> 
> No problemo, you can get yourself into sticky situations with lock contention
> but I think this is not one! :)
> 
>>
>> --
>> Pedro
> 
> Cheers, Lorenzo
Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
Posted by Lorenzo Stoakes 3 months, 1 week ago
On Thu, Oct 30, 2025 at 07:31:26PM +0100, Vlastimil Babka wrote:
> On 10/30/25 17:43, Lorenzo Stoakes wrote:
> > On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:
> >> On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:
> >> > On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:
> >> > > On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
> >> > > > Currently, if a user needs to determine if guard regions are present in a
> >> > > > range, they have to scan all VMAs (or have knowledge of which ones might
> >> > > > have guard regions).
> >> > > >
> >> > > > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
> >> > > > pagemap") and the related commit a516403787e0 ("fs/proc: extend the
> >> > > > PAGEMAP_SCAN ioctl to report guard regions"), users can use either
> >> > > > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
> >> > > > operation at a virtual address level.
> >> > > >
> >> > > > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
> >> > > > that guard regions exist in ranges.
> >> > > >
> >> > > > This patch remedies the situation by establishing a new VMA flag,
> >> > > > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
> >> > > > uncertain because we cannot reasonably determine whether a
> >> > > > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
> >> > > > additionally VMAs may change across merge/split).
> >> > > >
> >> > > > We utilise 0x800 for this flag which makes it available to 32-bit
> >> > > > architectures also, a flag that was previously used by VM_DENYWRITE, which
> >> > > > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
> >> > > > bee reused yet.
> >> > > >
> >> > > > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
> >> > > > lock (and also VMA write lock) whereas previously it did not, but this
> >> > > > seems a reasonable overhead.
> >> > >
> >> > > Do you though? Could it be possible to simply atomically set the flag with
> >> > > the read lock held? This would make it so we can't split the VMA (and tightly
> >> >
> >> > VMA flags are not accessed atomically so no I don't think we can do that in any
> >> > workable way.
> >> >
> >>
> >> FWIW I think you could work it as an atomic flag and treat those races as benign
> >> (this one, at least).
> >
> > It's not benign as we need to ensure that page tables are correctly propagated
> > on fork.
>
> Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement
> over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag
> setting? I think the places that could race with us to cause RMW use vma

I mean, I just spoke about why I didn't think introducing an entirely new
(afaik) one-sided atomic VMA flag write was workable, so maybe deal with that
first before proposing something new...

We probably can use VMA read lock (actually introduced _after_ I introduced
guard regions I believe) if we do not need the write lock.

But I need to hear a compelling non-hand waving reason for introducing an
entirely new mechanism like that which is going to violate an assumption we have
about VMA flags everywhere...

> write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for
> the oldmm) and it probably would't make sense to start doing it. Maybe we

I mean, the entire point is fork behaviour, so that makes this a non-starter...

> could think of something to deal with this special case...

If we implemented that I guess we'd have to atomically read the VMA flag,
but a race between the two would be very problematic.

Note that we'd also have to have a new mechanism for writing a VMA flag
(this is entirely novel now) because vm_flags_set() itself _literally_ has
vma_start_write() in it.
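
i.e. it currently looks roughly like this:

	static inline void vm_flags_set(struct vm_area_struct *vma,
					vm_flags_t flags)
	{
		vma_start_write(vma);
		ACCESS_PRIVATE(vma, __vm_flags) |= flags;
	}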

We'd also have to audit any and all cases where VMA flags are accessed to
ensure that this new novel method does not cause any issues.

We're also inviting others to do this kind of thing (e.g. drivers).

Seems dangerous in itself.

In any case, since you're now asking for something entirely novel, you
really have to provide a compelling argument as to why we can't write lock.

We can't allow races there or we break forking.

>
> >>
> >> > I also don't think it's at all necessary, see below.
> >> >
> >> > > define what "may have a guard page"), but it sounds much better than introducing
> >> > > lock contention. I don't think it is reasonable to add a write lock to a feature
> >> > > that may be used by such things as thread stack allocation, malloc, etc.
> >> >
> >> > What lock contention? It's per-VMA so the contention is limited to the VMA in
> >> > question, and only over the span of time you are setting the gaurd region.
> >>
> >> Don't we always need to take the mmap write lock when grabbing a VMA write
> >> lock as well?
> >
> > Yup. But at the same time you're doing the kind of operations that'd use this
> > you'd already be taking the lock anyway.
> >
> > You don't hold it for long and you won't be doing this any more often than you'd
> > be doing other write operations, which you're also not going to be holding up
> > faults on other VMAs either (they can access other VMAs despite mmap write lock
> > being held), so I don't think there's ay issue here.
> >
> >>
> >> > When allocating thread stacks you'll be mapping things into memory which... take
> >> > the write lock. malloc() if it goes to the kernel will also take the write lock.
> >> >
> >>
> >> But yes, good point, you're already serializing anyway. I don't think this is
> >> a big deal.
> >
> > Indeed
>
> Besides thread stacks, I'm thinking of the userspace allocators usecase

Thread stacks require mmap/VMA write lock to establish no?

> (jemalloc etc) where guard regions were supposed to allow a cheap
> use-after-free protection by replacing their existing
> MADV_DONTNEED/MADV_FREE use with MADV_GUARD_INSTALL.

First I'm hearing of this as far as I can recall... maybe worth mentioning
at the time?

> Now MADV_DONTNEED/MADV_FREE use MADVISE_VMA_READ_LOCK, and guard regions
> moves from (worse but still reasonable) MADVISE_MMAP_READ_LOCK to the heavy

I think you're hand waving a lot here. Do you have data to back anything up?

Be good to get some timings with the different locks and some indication that
the contention is meaningful or that this installation is in a hot-path.

Are these users not in any way manipulating VMAs (seems odd for a malloc but
anyway)? Because then you're already having to take the write lock to do so.

process_madvise() can be used to hold the lock over the whole operation
more efficiently for multiple guard region installations.

> MADVISE_MMAP_WRITE_LOCK. I'm afraid this might be too heavy price for that
> usecase :(

Based on what? Provide some data or argument to back this up please. Since
you're essentially asking me to either throw away my series or implement a
risky novel means of manipulating VMA flags I think you're going to have to
do better than hand waving.

Thanks.
Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
Posted by Lorenzo Stoakes 3 months, 1 week ago
On Thu, Oct 30, 2025 at 07:16:56PM +0000, Lorenzo Stoakes wrote:
> On Thu, Oct 30, 2025 at 07:31:26PM +0100, Vlastimil Babka wrote:
> > On 10/30/25 17:43, Lorenzo Stoakes wrote:
> > > On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:
> > >> On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:
> > >> > On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:
> > >> > > On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
> > >> > > > Currently, if a user needs to determine if guard regions are present in a
> > >> > > > range, they have to scan all VMAs (or have knowledge of which ones might
> > >> > > > have guard regions).
> > >> > > >
> > >> > > > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
> > >> > > > pagemap") and the related commit a516403787e0 ("fs/proc: extend the
> > >> > > > PAGEMAP_SCAN ioctl to report guard regions"), users can use either
> > >> > > > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
> > >> > > > operation at a virtual address level.
> > >> > > >
> > >> > > > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
> > >> > > > that guard regions exist in ranges.
> > >> > > >
> > >> > > > This patch remedies the situation by establishing a new VMA flag,
> > >> > > > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
> > >> > > > uncertain because we cannot reasonably determine whether a
> > >> > > > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
> > >> > > > additionally VMAs may change across merge/split).
> > >> > > >
> > >> > > > We utilise 0x800 for this flag which makes it available to 32-bit
> > >> > > > architectures also, a flag that was previously used by VM_DENYWRITE, which
> > >> > > > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
> > >> > > > bee reused yet.
> > >> > > >
> > >> > > > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
> > >> > > > lock (and also VMA write lock) whereas previously it did not, but this
> > >> > > > seems a reasonable overhead.
> > >> > >
> > >> > > Do you though? Could it be possible to simply atomically set the flag with
> > >> > > the read lock held? This would make it so we can't split the VMA (and tightly
> > >> >
> > >> > VMA flags are not accessed atomically so no I don't think we can do that in any
> > >> > workable way.
> > >> >
> > >>
> > >> FWIW I think you could work it as an atomic flag and treat those races as benign
> > >> (this one, at least).
> > >
> > > It's not benign as we need to ensure that page tables are correctly propagated
> > > on fork.
> >
> > Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement
> > over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag
> > setting? I think the places that could race with us to cause RMW use vma
>
> I mean, I just spoke about why I didn't think introducing an entirely new
> (afaik) one-sided atomic VMA flag write, so maybe deal with that first before
> proposing something new...

On the other hand, it's going to be difficult to get compelling data either
way as it will always be workload dependent etc.

So since you and Pedro both bring this up, and it'd be a pity to establish more
stringent locking requirements here, let me look into an atomic write situation.

We'll need to tread carefully here but if we can achieve that it would obviously
be entirely preferable to requiring write lock.

I'll dig into it some :)

BTW I do think a VMA read lock is entirely possible here as-is, so we should try
to shift to that if we can make atomic VMA flag write here work.

Thanks, Lorenzo
Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
Posted by Vlastimil Babka 3 months, 1 week ago
On 10/30/25 19:31, Vlastimil Babka wrote:
> On 10/30/25 17:43, Lorenzo Stoakes wrote:
>> On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:
>>> On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:
>>> > On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:
>>> > > On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
>>> > > > Currently, if a user needs to determine if guard regions are present in a
>>> > > > range, they have to scan all VMAs (or have knowledge of which ones might
>>> > > > have guard regions).
>>> > > >
>>> > > > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
>>> > > > pagemap") and the related commit a516403787e0 ("fs/proc: extend the
>>> > > > PAGEMAP_SCAN ioctl to report guard regions"), users can use either
>>> > > > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
>>> > > > operation at a virtual address level.
>>> > > >
>>> > > > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
>>> > > > that guard regions exist in ranges.
>>> > > >
>>> > > > This patch remedies the situation by establishing a new VMA flag,
>>> > > > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
>>> > > > uncertain because we cannot reasonably determine whether a
>>> > > > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
>>> > > > additionally VMAs may change across merge/split).
>>> > > >
>>> > > > We utilise 0x800 for this flag which makes it available to 32-bit
>>> > > > architectures also, a flag that was previously used by VM_DENYWRITE, which
>>> > > > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
>>> > > > bee reused yet.
>>> > > >
>>> > > > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
>>> > > > lock (and also VMA write lock) whereas previously it did not, but this
>>> > > > seems a reasonable overhead.
>>> > >
>>> > > Do you though? Could it be possible to simply atomically set the flag with
>>> > > the read lock held? This would make it so we can't split the VMA (and tightly
>>> >
>>> > VMA flags are not accessed atomically so no I don't think we can do that in any
>>> > workable way.
>>> >
>>>
>>> FWIW I think you could work it as an atomic flag and treat those races as benign
>>> (this one, at least).
>> 
>> It's not benign as we need to ensure that page tables are correctly propagated
>> on fork.
> 
> Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement
> over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag
> setting? I think the places that could race with us to cause RMW use vma
> write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for
> the oldmm) and it probably would't make sense to start doing it. Maybe we
> could think of something to deal with this special case...

During discussion with Pedro off-list I realized fork takes mmap lock for
write on the old mm, so if we kept taking mmap sem for read, then vma lock
for read in addition (which should be cheap enough, also we'd only need it
in case VM_MAYBE_GUARD is not yet set), and set the flag atomically, perhaps
that would cover all non-benign races?
Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
Posted by Lorenzo Stoakes 3 months ago
On Thu, Oct 30, 2025 at 07:47:34PM +0100, Vlastimil Babka wrote:
> On 10/30/25 19:31, Vlastimil Babka wrote:
> > On 10/30/25 17:43, Lorenzo Stoakes wrote:
> >> On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:
> >>> On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:
> >>> > On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:
> >>> > > On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
> >>> > > > Currently, if a user needs to determine if guard regions are present in a
> >>> > > > range, they have to scan all VMAs (or have knowledge of which ones might
> >>> > > > have guard regions).
> >>> > > >
> >>> > > > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
> >>> > > > pagemap") and the related commit a516403787e0 ("fs/proc: extend the
> >>> > > > PAGEMAP_SCAN ioctl to report guard regions"), users can use either
> >>> > > > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
> >>> > > > operation at a virtual address level.
> >>> > > >
> >>> > > > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
> >>> > > > that guard regions exist in ranges.
> >>> > > >
> >>> > > > This patch remedies the situation by establishing a new VMA flag,
> >>> > > > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
> >>> > > > uncertain because we cannot reasonably determine whether a
> >>> > > > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
> >>> > > > additionally VMAs may change across merge/split).
> >>> > > >
> >>> > > > We utilise 0x800 for this flag which makes it available to 32-bit
> >>> > > > architectures also, a flag that was previously used by VM_DENYWRITE, which
> >>> > > > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
> >>> > > > bee reused yet.
> >>> > > >
> >>> > > > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
> >>> > > > lock (and also VMA write lock) whereas previously it did not, but this
> >>> > > > seems a reasonable overhead.
> >>> > >
> >>> > > Do you though? Could it be possible to simply atomically set the flag with
> >>> > > the read lock held? This would make it so we can't split the VMA (and tightly
> >>> >
> >>> > VMA flags are not accessed atomically so no I don't think we can do that in any
> >>> > workable way.
> >>> >
> >>>
> >>> FWIW I think you could work it as an atomic flag and treat those races as benign
> >>> (this one, at least).
> >>
> >> It's not benign as we need to ensure that page tables are correctly propagated
> >> on fork.
> >
> > Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement
> > over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag
> > setting? I think the places that could race with us to cause RMW use vma
> > write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for
> > the oldmm) and it probably would't make sense to start doing it. Maybe we
> > could think of something to deal with this special case...
>
> During discussion with Pedro off-list I realized fork takes mmap lock for
> write on the old mm, so if we kept taking mmap sem for read, then vma lock
> for read in addition (which should be cheap enough, also we'd only need it
> in case VM_MAYBE_GUARD is not yet set), and set the flag atomicaly, perhaps
> that would cover all non-bening races?
>
>

So thinking about this again, taking mmap read lock will exclude any VMA write
locks (which must hold mmap write lock), so no need to additionally take VMA
read lock.

Also as mentioned later in thread, the invocation of vma_needs_copy() is
performed under the VMA write lock (and thus the mmap write lock) so no need
to read atomically there either.

As per Pedro, we can treat other races as benign.

Merge/Split etc. are done under VMA write lock so there's no race that matters
there.

The only other place we even care about this flag is /proc/$pid/smaps, but
there it'd be benign, as you'd simply not observe the flag being set at the
point at which a concurrent guard region install is happening - something
that you could race with _anyway_.

As Pedro says, the remaining cases where you read flags would be benign, as
readers should never observe a meaningful torn read (this being a bitmap), and
given this flag only matters on fork and smaps.

So I think just having something that sets atomically with an allow list is
fine, but making that very strict so literally just this flag is allowed
(currently).
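
Something like the below is what I have in mind (completely untested sketch,
name invented):

	/*
	 * Sketch: set a single 'benign' VMA flag atomically without requiring
	 * the VMA write lock. Strictly ring-fenced to VM_MAYBE_GUARD for now.
	 */
	static inline void vm_flags_set_atomic(struct vm_area_struct *vma,
					       vm_flags_t flag)
	{
		VM_WARN_ON_ONCE(flag != VM_MAYBE_GUARD);
		set_bit(ilog2(flag), &ACCESS_PRIVATE(vma, __vm_flags));
	}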
Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
Posted by Vlastimil Babka 3 months ago
On 11/5/25 20:48, Lorenzo Stoakes wrote:
> On Thu, Oct 30, 2025 at 07:47:34PM +0100, Vlastimil Babka wrote:
>> On 10/30/25 19:31, Vlastimil Babka wrote:
>> > On 10/30/25 17:43, Lorenzo Stoakes wrote:
>> >
>> > Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement
>> > over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag
>> > setting? I think the places that could race with us to cause RMW use vma
>> > write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for
>> > the oldmm) and it probably would't make sense to start doing it. Maybe we
>> > could think of something to deal with this special case...
>>
>> During discussion with Pedro off-list I realized fork takes mmap lock for
>> write on the old mm, so if we kept taking mmap sem for read, then vma lock
>> for read in addition (which should be cheap enough, also we'd only need it
>> in case VM_MAYBE_GUARD is not yet set), and set the flag atomicaly, perhaps
>> that would cover all non-bening races?
>>
>>
> 
> So thinking about this again, taking mmap read lock will exclude any VMA write
> locks (which must hold mmap write lock), so no need to additionally take VMA
> read lock.

Right. Of course it would be nice if we could later also replace the mmap
read lock with the vma read lock. Hopefully that should be feasible because
MADV_DONTNEED can do it and guard ranges perform a similar kind of operation
(with more complex page table handling necessary IIRC, but that should
hopefully still be fine with the vma read lock).

It's obviously not in scope of the series here - maybe just something to keep
in mind, so that the next step remains compatible with anything done here.

> Also as mentioned later in thread, the invocation of vma_needs_copy() is
> performed under VMA write lock (and this mmap write lock) so no need to read
> atomically there either.
> 
> As per Pedro, we can treat other races as benign.
> 
> Merge/Split etc. are done under VMA write lock so there's no race that matters
> there.
> 
> The only other place we even care about this flag in is for /proc/$pid/smaps,
> but there it'd be benign as you'd happen not to observe the flag being set up at
> the point at which a concurrent guard region install is happening - something
> that you could race with _anyway_.

Right.

> As Pedro says, remaining cases where you read flags would be benign, as the
> readers should never observe a meaningful torn read being a bitmap and given
> this flag only matters on fork and smaps.
> 
> So I think just having something that sets atomically with an allow list is
> fine, but making that very strict so literally just this flag is allowed
> (currently).
Ack.
Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
Posted by Lorenzo Stoakes 3 months, 1 week ago
On Thu, Oct 30, 2025 at 07:47:34PM +0100, Vlastimil Babka wrote:
> On 10/30/25 19:31, Vlastimil Babka wrote:
> > On 10/30/25 17:43, Lorenzo Stoakes wrote:
> >> On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:
> >>> On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:
> >>> > On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:
> >>> > > On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
> >>> > > > Currently, if a user needs to determine if guard regions are present in a
> >>> > > > range, they have to scan all VMAs (or have knowledge of which ones might
> >>> > > > have guard regions).
> >>> > > >
> >>> > > > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
> >>> > > > pagemap") and the related commit a516403787e0 ("fs/proc: extend the
> >>> > > > PAGEMAP_SCAN ioctl to report guard regions"), users can use either
> >>> > > > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
> >>> > > > operation at a virtual address level.
> >>> > > >
> >>> > > > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
> >>> > > > that guard regions exist in ranges.
> >>> > > >
> >>> > > > This patch remedies the situation by establishing a new VMA flag,
> >>> > > > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
> >>> > > > uncertain because we cannot reasonably determine whether a
> >>> > > > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
> >>> > > > additionally VMAs may change across merge/split).
> >>> > > >
> >>> > > > We utilise 0x800 for this flag which makes it available to 32-bit
> >>> > > > architectures also, a flag that was previously used by VM_DENYWRITE, which
> >>> > > > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
> >>> > > > bee reused yet.
> >>> > > >
> >>> > > > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
> >>> > > > lock (and also VMA write lock) whereas previously it did not, but this
> >>> > > > seems a reasonable overhead.
> >>> > >
> >>> > > Do you though? Could it be possible to simply atomically set the flag with
> >>> > > the read lock held? This would make it so we can't split the VMA (and tightly
> >>> >
> >>> > VMA flags are not accessed atomically so no I don't think we can do that in any
> >>> > workable way.
> >>> >
> >>>
> >>> FWIW I think you could work it as an atomic flag and treat those races as benign
> >>> (this one, at least).
> >>
> >> It's not benign as we need to ensure that page tables are correctly propagated
> >> on fork.
> >
> > Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement
> > over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag
> > setting? I think the places that could race with us to cause RMW use vma
> > write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for
> > the oldmm) and it probably would't make sense to start doing it. Maybe we
> > could think of something to deal with this special case...
>
> During discussion with Pedro off-list I realized fork takes mmap lock for
> write on the old mm, so if we kept taking mmap sem for read, then vma lock
> for read in addition (which should be cheap enough, also we'd only need it
> in case VM_MAYBE_GUARD is not yet set), and set the flag atomicaly, perhaps
> that would cover all non-bening races?
>
>

We take VMA write lock in dup_mmap() on each mpnt (old VMA).

We take the VMA write lock (vma_start_write()) for each mpnt.

We then vm_area_dup() the mpnt to the new VMA before calling:

copy_page_range()
-> vma_needs_copy()

Which is where the check is done.

So we are holding the VMA write lock, so a VMA read lock should suffice, no?

For belts + braces we could atomically read the flag in vma_needs_copy(),
though note it's intended that VM_COPY_ON_FORK could have more than one flag.

We could drop that for now and be explicit.
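
i.e. something like (sketch only):

	/* In vma_needs_copy(), belt-and-braces: don't assume the writer's lock. */
	if (READ_ONCE(src_vma->vm_flags) & VM_MAYBE_GUARD)
		return true;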
Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
Posted by Vlastimil Babka 3 months, 1 week ago
On 10/30/25 20:47, Lorenzo Stoakes wrote:
> On Thu, Oct 30, 2025 at 07:47:34PM +0100, Vlastimil Babka wrote:
>> >
>> > Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement
>> > over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag
>> > setting? I think the places that could race with us to cause RMW use vma
>> > write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for
>> > the oldmm) and it probably would't make sense to start doing it. Maybe we
>> > could think of something to deal with this special case...
>>
>> During discussion with Pedro off-list I realized fork takes mmap lock for
>> write on the old mm, so if we kept taking mmap sem for read, then vma lock
>> for read in addition (which should be cheap enough, also we'd only need it
>> in case VM_MAYBE_GUARD is not yet set), and set the flag atomicaly, perhaps
>> that would cover all non-bening races?
>>
>>
> 
> We take VMA write lock in dup_mmap() on each mpnt (old VMA).

Ah yes I thought it was the new one.

> We take the VMA write lock (vma_start_write()) for each mpnt.
> 
> We then vm_area_dup() the mpnt to the new VMA before calling:
> 
> copy_page_range()
> -> vma_needs_copy()
> 
> Which is where the check is done.
> 
> So we are holding the VMA write lock, so a VMA read lock should suffice no?

Yeah, even better!

> For belts + braces we could atomically read the flag in vma_needs_copy(),
> though note it's intended VM_COPY_ON_FORK could have more than one flag.
> 
> We could drop that for now and be explicit.

Great!
Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
Posted by Suren Baghdasaryan 3 months, 1 week ago
On Thu, Oct 30, 2025 at 2:48 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 10/30/25 20:47, Lorenzo Stoakes wrote:
> > On Thu, Oct 30, 2025 at 07:47:34PM +0100, Vlastimil Babka wrote:
> >> >
> >> > Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement
> >> > over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag
> >> > setting? I think the places that could race with us to cause RMW use vma
> >> > write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for
> >> > the oldmm) and it probably would't make sense to start doing it. Maybe we
> >> > could think of something to deal with this special case...
> >>
> >> During discussion with Pedro off-list I realized fork takes mmap lock for
> >> write on the old mm, so if we kept taking mmap sem for read, then vma lock
> >> for read in addition (which should be cheap enough, also we'd only need it
> >> in case VM_MAYBE_GUARD is not yet set), and set the flag atomicaly, perhaps
> >> that would cover all non-bening races?
> >>
> >>
> >
> > We take VMA write lock in dup_mmap() on each mpnt (old VMA).
>
> Ah yes I thought it was the new one.
>
> > We take the VMA write lock (vma_start_write()) for each mpnt.
> >
> > We then vm_area_dup() the mpnt to the new VMA before calling:
> >
> > copy_page_range()
> > -> vma_needs_copy()
> >
> > Which is where the check is done.
> >
> > So we are holding the VMA write lock, so a VMA read lock should suffice no?
>
> Yeah, even better!
>
> > For belts + braces we could atomically read the flag in vma_needs_copy(),
> > though note it's intended VM_COPY_ON_FORK could have more than one flag.
> >
> > We could drop that for now and be explicit.
>
> Great!

Overall, I think it should be possible to set this flag atomically
under VMA read-lock. However, if you introduce new vm_flags
manipulation functions, please make sure they can't be used for other
vm_flags. In Android I've seen several "interesting" attempts to
update vm_flags under a read-lock (specifically in the page-fault
path) and had to explain why that's a bad idea.
Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
Posted by Lorenzo Stoakes 3 months, 1 week ago
On Fri, Oct 31, 2025 at 04:12:51PM -0700, Suren Baghdasaryan wrote:
>
> Overall, I think it should be possible to set this flag atomically
> under VMA read-lock. However, if you introduce new vm_flags
> manipulation functions, please make sure they can't be used for other
> vm_flags. In Android I've seen several "interesting" attempts to
> update vm_flags under a read-lock (specifically in the page-fault
> path) and had to explain why that's a bad idea.

Yeah agreed, so the idea would be to absolutely ring-fence any flag we do
this with entirely. Probably a VM_WARN_ON_ONCE() for anybody trying it with
other flags so bots can catch anybody being naughty.

That kind of 'interesting' stuff is another reason I prefer to really limit
what drivers can do btw ;)

Will have a respin with this idea applied relatively soon hopefully.

Cheers, Lorenzo
Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
Posted by Suren Baghdasaryan 3 months, 1 week ago
On Wed, Oct 29, 2025 at 9:51 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Currently, if a user needs to determine if guard regions are present in a
> range, they have to scan all VMAs (or have knowledge of which ones might
> have guard regions).
>
> Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
> pagemap") and the related commit a516403787e0 ("fs/proc: extend the
> PAGEMAP_SCAN ioctl to report guard regions"), users can use either
> /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
> operation at a virtual address level.
>
> This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
> that guard regions exist in ranges.
>
> This patch remedies the situation by establishing a new VMA flag,
> VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
> uncertain because we cannot reasonably determine whether a
> MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
> additionally VMAs may change across merge/split).

nit: I know I suck at naming but I think VM_MAY_HAVE_GUARDS would
better represent the meaning.

>
> We utilise 0x800 for this flag which makes it available to 32-bit
> architectures also, a flag that was previously used by VM_DENYWRITE, which
> was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
> bee reused yet.

s/bee/been
but I'm not even sure the above paragraph has to be included in the
changelog. It's a technical detail IMHO.

>
> The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
> lock (and also VMA write lock) whereas previously it did not, but this
> seems a reasonable overhead.

I guess this is because it is modifying vm_flags now?

>
> We also update the smaps logic and documentation to identify these VMAs.
>
> Another major use of this functionality is that we can use it to identify
> that we ought to copy page tables on fork.
>
> For anonymous mappings this is inherent, however since commit f807123d578d
>  ("mm: allow guard regions in file-backed and read-only mappings") which
>  allowed file-backed guard regions, we have unfortunately had to enforce
> this behaviour by settings vma->anon_vma to force page table copying.
>
> The existence of this flag removes the need for this, so we simply update
> vma_needs_copy() to check for this flag instead.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Overall, makes sense to me and I think we could use it.

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

It would be nice to have a way for userspace to reset this flag if it
confirms that the VMA does not really have any guards (using say
PAGEMAP_SCAN) but I think such an API can be abused.
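
For reference, the userspace confirmation step I have in mind would look
roughly like this (just a sketch; assumes the PAGE_IS_GUARD category bit
added by the pagemap patches referenced above):

#include <linux/fs.h>
#include <stdint.h>
#include <sys/ioctl.h>

/*
 * Returns 1 if [start, end) of the target contains guard regions, 0 if not,
 * -1 on error. pagemap_fd is an open /proc/<pid>/pagemap.
 */
static int range_has_guard_regions(int pagemap_fd, uint64_t start, uint64_t end)
{
	struct page_region region;
	struct pm_scan_arg arg = {
		.size		= sizeof(arg),
		.start		= start,
		.end		= end,
		.vec		= (uintptr_t)&region,
		.vec_len	= 1,
		/* Only report pages that carry the guard category. */
		.category_mask	= PAGE_IS_GUARD,
		.return_mask	= PAGE_IS_GUARD,
	};
	long nr_regions = ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);

	if (nr_regions < 0)
		return -1;
	return nr_regions > 0;
}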

> ---
>  Documentation/filesystems/proc.rst |  1 +
>  fs/proc/task_mmu.c                 |  1 +
>  include/linux/mm.h                 |  1 +
>  include/trace/events/mmflags.h     |  1 +
>  mm/madvise.c                       | 22 ++++++++++++++--------
>  mm/memory.c                        |  4 ++++
>  tools/testing/vma/vma_internal.h   |  1 +
>  7 files changed, 23 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 0b86a8022fa1..b8a423ca590a 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -591,6 +591,7 @@ encoded manner. The codes are the following:
>      sl    sealed
>      lf    lock on fault pages
>      dp    always lazily freeable mapping
> +    gu    maybe contains guard regions (if not set, definitely doesn't)
>      ==    =======================================
>
>  Note that there is no guarantee that every flag and associated mnemonic will
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index fc35a0543f01..db16ed91c269 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1146,6 +1146,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>                 [ilog2(VM_MAYSHARE)]    = "ms",
>                 [ilog2(VM_GROWSDOWN)]   = "gd",
>                 [ilog2(VM_PFNMAP)]      = "pf",
> +               [ilog2(VM_MAYBE_GUARD)] = "gu",
>                 [ilog2(VM_LOCKED)]      = "lo",
>                 [ilog2(VM_IO)]          = "io",
>                 [ilog2(VM_SEQ_READ)]    = "sr",
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index aada935c4950..f963afa1b9de 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -296,6 +296,7 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_UFFD_MISSING        0
>  #endif /* CONFIG_MMU */
>  #define VM_PFNMAP      0x00000400      /* Page-ranges managed without "struct page", just pure PFN */
> +#define VM_MAYBE_GUARD 0x00000800      /* The VMA maybe contains guard regions. */
>  #define VM_UFFD_WP     0x00001000      /* wrprotect pages tracking */
>
>  #define VM_LOCKED      0x00002000
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index aa441f593e9a..a6e5a44c9b42 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -213,6 +213,7 @@ IF_HAVE_PG_ARCH_3(arch_3)
>         {VM_UFFD_MISSING,               "uffd_missing"  },              \
>  IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR,      "uffd_minor"    )               \
>         {VM_PFNMAP,                     "pfnmap"        },              \
> +       {VM_MAYBE_GUARD,                "maybe_guard"   },              \
>         {VM_UFFD_WP,                    "uffd_wp"       },              \
>         {VM_LOCKED,                     "locked"        },              \
>         {VM_IO,                         "io"            },              \
> diff --git a/mm/madvise.c b/mm/madvise.c
> index fb1c86e630b6..216ae6ed344e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1141,15 +1141,22 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
>                 return -EINVAL;
>
>         /*
> -        * If we install guard markers, then the range is no longer
> -        * empty from a page table perspective and therefore it's
> -        * appropriate to have an anon_vma.
> +        * It would be confusing for anonymous mappings to have page table
> +        * entries but no anon_vma established, so ensure that it is.
> +        */
> +       if (vma_is_anonymous(vma))
> +               anon_vma_prepare(vma);
> +
> +       /*
> +        * Indicate that the VMA may contain guard regions, making it visible to
> +        * the user that a VMA may contain these, narrowing down the range which
> +        * must be scanned in order to detect them.
>          *
> -        * This ensures that on fork, we copy page tables correctly.
> +        * This additionally causes page tables to be copied on fork regardless
> +        * of whether the VMA is anonymous or not, correctly preserving the
> +        * guard region page table entries.
>          */
> -       err = anon_vma_prepare(vma);
> -       if (err)
> -               return err;
> +       vm_flags_set(vma, VM_MAYBE_GUARD);
>
>         /*
>          * Optimistically try to install the guard marker pages first. If any
> @@ -1709,7 +1716,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
>         case MADV_POPULATE_READ:
>         case MADV_POPULATE_WRITE:
>         case MADV_COLLAPSE:
> -       case MADV_GUARD_INSTALL:
>         case MADV_GUARD_REMOVE:
>                 return MADVISE_MMAP_READ_LOCK;
>         case MADV_DONTNEED:
> diff --git a/mm/memory.c b/mm/memory.c
> index 4c3a7e09a159..a2c79ee43d68 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1478,6 +1478,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>         if (src_vma->anon_vma)
>                 return true;
>
> +       /* Guard regions have momdified page tables that require copying. */
> +       if (src_vma->vm_flags & VM_MAYBE_GUARD)
> +               return true;
> +
>         /*
>          * Don't copy ptes where a page fault will fill them correctly.  Fork
>          * becomes much lighter when there are big shared or private readonly
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index d873667704e8..e40c93edc5a7 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -56,6 +56,7 @@ extern unsigned long dac_mmap_min_addr;
>  #define VM_MAYEXEC     0x00000040
>  #define VM_GROWSDOWN   0x00000100
>  #define VM_PFNMAP      0x00000400
> +#define VM_MAYBE_GUARD 0x00000800
>  #define VM_LOCKED      0x00002000
>  #define VM_IO           0x00004000
>  #define VM_SEQ_READ    0x00008000      /* App will access data sequentially */
> --
> 2.51.0
>
Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
Posted by Lorenzo Stoakes 3 months, 1 week ago
On Wed, Oct 29, 2025 at 06:05:54PM -0700, Suren Baghdasaryan wrote:
> On Wed, Oct 29, 2025 at 9:51 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > Currently, if a user needs to determine if guard regions are present in a
> > range, they have to scan all VMAs (or have knowledge of which ones might
> > have guard regions).
> >
> > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
> > pagemap") and the related commit a516403787e0 ("fs/proc: extend the
> > PAGEMAP_SCAN ioctl to report guard regions"), users can use either
> > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
> > operation at a virtual address level.
> >
> > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
> > that guard regions exist in ranges.
> >
> > This patch remedies the situation by establishing a new VMA flag,
> > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
> > uncertain because we cannot reasonably determine whether a
> > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
> > additionally VMAs may change across merge/split).
>
> nit: I know I suck at naming but I think VM_MAY_HAVE_GUARDS would
> better represent the meaning.

We all suck at naming :) it's the hardest bit! :P

Hm, I don't love that, it's a bit overwrought; I do think 'maybe guard' is a
better shorthand for this flag name.

I am open to other suggestions but I think the original wins on
succinctness here!

>
> >
> > We utilise 0x800 for this flag which makes it available to 32-bit
> > architectures also, a flag that was previously used by VM_DENYWRITE, which
> > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
> > bee reused yet.
>
> s/bee/been

Yeah this series appears to be a bonanza of typos... not sure why :) will
fix.

> but I'm not even sure the above paragraph has to be included in the
> changelog. It's a technical detail IMHO.

Well, I think it's actually important to highlight that we have a VMA flag
free and why. I know it's bordering on extraneous, but I don't think
there's any harm in mentioning it.

Otherwise people might wonder 'oh is this flag used elsewhere somehow' etc.

>
> >
> > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
> > lock (and also VMA write lock) whereas previously it did not, but this
> > seems a reasonable overhead.
>
> I guess this is because it is modifying vm_flags now?

Yes

>
> >
> > We also update the smaps logic and documentation to identify these VMAs.
> >
> > Another major use of this functionality is that we can use it to identify
> > that we ought to copy page tables on fork.
> >
> > For anonymous mappings this is inherent, however since commit f807123d578d
> >  ("mm: allow guard regions in file-backed and read-only mappings") which
> >  allowed file-backed guard regions, we have unfortunately had to enforce
> > this behaviour by settings vma->anon_vma to force page table copying.
> >
> > The existence of this flag removes the need for this, so we simply update
> > vma_needs_copy() to check for this flag instead.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Overall, makes sense to me and I think we could use it.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>

Thanks!

>
> It would be nice to have a way for userspace to reset this flag if it
> confirms that the VMA does not really have any guards (using say
> PAGEMAP_SCAN) but I think such an API can be abused.

Yeah, I'd rather not for that reason.

Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
Posted by Randy Dunlap 3 months, 1 week ago
On 10/29/25 9:50 AM, Lorenzo Stoakes wrote:

> diff --git a/mm/memory.c b/mm/memory.c
> index 4c3a7e09a159..a2c79ee43d68 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1478,6 +1478,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  	if (src_vma->anon_vma)
>  		return true;
>  
> +	/* Guard regions have momdified page tables that require copying. */

	                      modified

> +	if (src_vma->vm_flags & VM_MAYBE_GUARD)
> +		return true;
> +
>  	/*
>  	 * Don't copy ptes where a page fault will fill them correctly.  Fork
>  	 * becomes much lighter when there are big shared or private readonly
-- 
~Randy
Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
Posted by Lorenzo Stoakes 3 months, 1 week ago
On Wed, Oct 29, 2025 at 12:50:18PM -0700, Randy Dunlap wrote:
>
> On 10/29/25 9:50 AM, Lorenzo Stoakes wrote:
>
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 4c3a7e09a159..a2c79ee43d68 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1478,6 +1478,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> >  	if (src_vma->anon_vma)
> >  		return true;
> >
> > +	/* Guard regions have momdified page tables that require copying. */
>
> 	                      modified

I appear to have entered some kind of typo nirvana in this series :)

Thanks, will fix!

>
> > +	if (src_vma->vm_flags & VM_MAYBE_GUARD)
> > +		return true;
> > +
> >  	/*
> >  	 * Don't copy ptes where a page fault will fill them correctly.  Fork
> >  	 * becomes much lighter when there are big shared or private readonly
> --
> ~Randy
>

Cheers, Lorenzo