This patch adds the ability to atomically set VMA flags with only the mmap
read/VMA read lock held.
As this could be hugely problematic for VMA flags in general given that all
other accesses are non-atomic and serialised by the mmap/VMA locks, we
implement this with a strict allow-list - that is, only designated flags
are allowed to do this.
We make VM_MAYBE_GUARD one of these flags, and then set it under the mmap
read lock upon guard region installation.
The places where this flag is currently used and matters are:
* VMA merge - performed under mmap/VMA write lock, therefore excluding
racing writes.
* /proc/$pid/smaps - can race the write, however this isn't meaningful as
the flag write is performed at the point of the guard region being
established, and thus an smaps reader can't reasonably expect to avoid
races. Due to atomicity, a reader will observe either the flag being set
or not. Therefore consistency will be maintained.
In all other cases the flag being set is irrelevant and atomicity
guarantees other flags will be read correctly.
We additionally update madvise_guard_install() to invoke anon_vma_prepare()
only for anonymous VMAs. This maintains consistency with the assumption that
any anonymous VMA with page tables established will have an anon_vma set,
and that any anonymous VMA without an anon_vma will have no page tables
established.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
include/linux/mm.h | 23 +++++++++++++++++++++++
mm/madvise.c | 22 ++++++++++++++--------
2 files changed, 37 insertions(+), 8 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2a5516bff75a..2ea65c646212 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -518,6 +518,9 @@ extern unsigned int kobjsize(const void *objp);
/* This mask represents all the VMA flag bits used by mlock */
#define VM_LOCKED_MASK (VM_LOCKED | VM_LOCKONFAULT)
+/* These flags can be updated atomically via VMA/mmap read lock. */
+#define VM_ATOMIC_SET_ALLOWED VM_MAYBE_GUARD
+
/* Arch-specific flags to clear when updating VM flags on protection change */
#ifndef VM_ARCH_CLEAR
# define VM_ARCH_CLEAR VM_NONE
@@ -860,6 +863,26 @@ static inline void vm_flags_mod(struct vm_area_struct *vma,
__vm_flags_mod(vma, set, clear);
}
+/*
+ * Set VMA flag atomically. Requires only VMA/mmap read lock. Only specific
+ * valid flags are allowed to do this.
+ */
+static inline void vma_flag_set_atomic(struct vm_area_struct *vma,
+ int bit)
+{
+ const vm_flags_t mask = BIT(bit);
+
+ /* mmap read lock/VMA read lock must be held. */
+ if (!rwsem_is_locked(&vma->vm_mm->mmap_lock))
+ vma_assert_locked(vma);
+
+ /* Only specific flags are permitted */
+ if (WARN_ON_ONCE(!(mask & VM_ATOMIC_SET_ALLOWED)))
+ return;
+
+ set_bit(bit, &vma->__vm_flags);
+}
+
static inline void vma_set_anonymous(struct vm_area_struct *vma)
{
vma->vm_ops = NULL;
diff --git a/mm/madvise.c b/mm/madvise.c
index 67bdfcb315b3..de918b107cfc 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1139,15 +1139,21 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
return -EINVAL;
/*
- * If we install guard markers, then the range is no longer
- * empty from a page table perspective and therefore it's
- * appropriate to have an anon_vma.
- *
- * This ensures that on fork, we copy page tables correctly.
+ * Set atomically under read lock. All pertinent readers will need to
+ * acquire an mmap/VMA write lock to read it. All remaining readers may
+ * or may not see the flag set, but we don't care.
+ */
+ vma_flag_set_atomic(vma, VM_MAYBE_GUARD_BIT);
+
+ /*
+ * If anonymous and we are establishing page tables the VMA ought to
+ * have an anon_vma associated with it.
*/
- err = anon_vma_prepare(vma);
- if (err)
- return err;
+ if (vma_is_anonymous(vma)) {
+ err = anon_vma_prepare(vma);
+ if (err)
+ return err;
+ }
/*
* Optimistically try to install the guard marker pages first. If any
--
2.51.0
On Thu, Nov 06, 2025 at 10:46:13AM +0000, Lorenzo Stoakes wrote:
> This patch adds the ability to atomically set VMA flags with only the mmap
> read/VMA read lock held.
>
> As this could be hugely problematic for VMA flags in general given that all
> other accesses are non-atomic and serialised by the mmap/VMA locks, we
> implement this with a strict allow-list - that is, only designated flags
> are allowed to do this.
>
> We make VM_MAYBE_GUARD one of these flags, and then set it under the mmap
> read flag upon guard region installation.
>
> The places where this flag is used currently and matter are:
>
> * VMA merge - performed under mmap/VMA write lock, therefore excluding
> racing writes.
>
> * /proc/$pid/smaps - can race the write, however this isn't meaningful as
> the flag write is performed at the point of the guard region being
> established, and thus an smaps reader can't reasonably expect to avoid
> races. Due to atomicity, a reader will observe either the flag being set
> or not. Therefore consistency will be maintained.
>
> In all other cases the flag being set is irrelevant and atomicity
> guarantees other flags will be read correctly.
Probably important to write down that the only reason why this doesn't make
KCSAN have a small stroke is that we are only changing one bit, i.e. we can
only have one bit of atomic flags before annotating every reader.
(Source: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/kcsan/permissive.h#n51)
> We additionally update madvise_guard_install() to ensure that
> anon_vma_prepare() is set for anonymous VMAs to maintain consistency with
> the assumption that any anonymous VMA with page tables will have an
> anon_vma set, and any with an anon_vma unset will not have page tables
> established.
Isn't that what we already had? Or do you mean "*only* set for anonymous VMAs"?
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
With the nits below and above addressed:
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
> ---
> include/linux/mm.h | 23 +++++++++++++++++++++++
> mm/madvise.c | 22 ++++++++++++++--------
> 2 files changed, 37 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2a5516bff75a..2ea65c646212 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -518,6 +518,9 @@ extern unsigned int kobjsize(const void *objp);
> /* This mask represents all the VMA flag bits used by mlock */
> #define VM_LOCKED_MASK (VM_LOCKED | VM_LOCKONFAULT)
>
> +/* These flags can be updated atomically via VMA/mmap read lock. */
> +#define VM_ATOMIC_SET_ALLOWED VM_MAYBE_GUARD
> +
> /* Arch-specific flags to clear when updating VM flags on protection change */
> #ifndef VM_ARCH_CLEAR
> # define VM_ARCH_CLEAR VM_NONE
> @@ -860,6 +863,26 @@ static inline void vm_flags_mod(struct vm_area_struct *vma,
> __vm_flags_mod(vma, set, clear);
> }
>
> +/*
> + * Set VMA flag atomically. Requires only VMA/mmap read lock. Only specific
> + * valid flags are allowed to do this.
> + */
> +static inline void vma_flag_set_atomic(struct vm_area_struct *vma,
> + int bit)
> +{
> + const vm_flags_t mask = BIT(bit);
> +
> + /* mmap read lock/VMA read lock must be held. */
> + if (!rwsem_is_locked(&vma->vm_mm->mmap_lock))
> + vma_assert_locked(vma);
> +
> + /* Only specific flags are permitted */
> + if (WARN_ON_ONCE(!(mask & VM_ATOMIC_SET_ALLOWED)))
> + return;
VM_WARN_ON_ONCE?
--
Pedro
On Thu, Nov 06, 2025 at 02:45:06PM +0000, Pedro Falcato wrote:
> On Thu, Nov 06, 2025 at 10:46:13AM +0000, Lorenzo Stoakes wrote:
> > This patch adds the ability to atomically set VMA flags with only the mmap
> > read/VMA read lock held.
> >
> > As this could be hugely problematic for VMA flags in general given that all
> > other accesses are non-atomic and serialised by the mmap/VMA locks, we
> > implement this with a strict allow-list - that is, only designated flags
> > are allowed to do this.
> >
> > We make VM_MAYBE_GUARD one of these flags, and then set it under the mmap
> > read flag upon guard region installation.
> >
> > The places where this flag is used currently and matter are:
> >
> > * VMA merge - performed under mmap/VMA write lock, therefore excluding
> > racing writes.
> >
> > * /proc/$pid/smaps - can race the write, however this isn't meaningful as
> > the flag write is performed at the point of the guard region being
> > established, and thus an smaps reader can't reasonably expect to avoid
> > races. Due to atomicity, a reader will observe either the flag being set
> > or not. Therefore consistency will be maintained.
> >
> > In all other cases the flag being set is irrelevant and atomicity
> > guarantees other flags will be read correctly.
>
> Probably important to write down that the only reason why this doesn't make
> KCSAN have a small stroke is that we are only changing one bit. i.e we can
> only have one bit of atomic flags before annotating every reader.
>
> (Source: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/kcsan/permissive.h#n51)
That seems a bit specific and technical though? I guess since Vlasta is asking
for maximum commit message pedantry here the more the merrier...
>
> > We additionally update madvise_guard_install() to ensure that
> > anon_vma_prepare() is set for anonymous VMAs to maintain consistency with
> > the assumption that any anonymous VMA with page tables will have an
> > anon_vma set, and any with an anon_vma unset will not have page tables
> > established.
>
> Isn't that what we already had? Or do you mean "*only* set for anonymous VMAs"?
Yes... I'm going to expand on this explanation as per Vlasta to make it
extremely clear anyway.
>
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> With the nits below and above addressed:
> Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Thanks, though I disagree with nit below.
>
> > ---
> > include/linux/mm.h | 23 +++++++++++++++++++++++
> > mm/madvise.c | 22 ++++++++++++++--------
> > 2 files changed, 37 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 2a5516bff75a..2ea65c646212 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -518,6 +518,9 @@ extern unsigned int kobjsize(const void *objp);
> > /* This mask represents all the VMA flag bits used by mlock */
> > #define VM_LOCKED_MASK (VM_LOCKED | VM_LOCKONFAULT)
> >
> > +/* These flags can be updated atomically via VMA/mmap read lock. */
> > +#define VM_ATOMIC_SET_ALLOWED VM_MAYBE_GUARD
> > +
> > /* Arch-specific flags to clear when updating VM flags on protection change */
> > #ifndef VM_ARCH_CLEAR
> > # define VM_ARCH_CLEAR VM_NONE
> > @@ -860,6 +863,26 @@ static inline void vm_flags_mod(struct vm_area_struct *vma,
> > __vm_flags_mod(vma, set, clear);
> > }
> >
> > +/*
> > + * Set VMA flag atomically. Requires only VMA/mmap read lock. Only specific
> > + * valid flags are allowed to do this.
> > + */
> > +static inline void vma_flag_set_atomic(struct vm_area_struct *vma,
> > + int bit)
> > +{
> > + const vm_flags_t mask = BIT(bit);
> > +
> > + /* mmap read lock/VMA read lock must be held. */
> > + if (!rwsem_is_locked(&vma->vm_mm->mmap_lock))
> > + vma_assert_locked(vma);
> > +
> > + /* Only specific flags are permitted */
> > + if (WARN_ON_ONCE(!(mask & VM_ATOMIC_SET_ALLOWED)))
> > + return;
>
> VM_WARN_ON_ONCE?
No, this was on purpose - I don't want drivers (incl. out of tree) abusing this,
so I think this should be runtime and explicitly clear. See Suren's comment on
the last revision of the series.
Obviously we should never be giving drivers naked vma pointers where this
matters (and actually not sure exactly where it would), my mmap_prepare series
is working to mitigate this though it's in a situation where the locking doesn't
matter.
Also you can't use VM_WARN_ON_ONCE() that way, for some reason we don't have it
return a value, go figure.
>
> --
> Pedro
Thanks, Lorenzo
On 11/6/25 11:46, Lorenzo Stoakes wrote:
> This patch adds the ability to atomically set VMA flags with only the mmap
> read/VMA read lock held.
>
> As this could be hugely problematic for VMA flags in general given that all
> other accesses are non-atomic and serialised by the mmap/VMA locks, we
> implement this with a strict allow-list - that is, only designated flags
> are allowed to do this.
>
> We make VM_MAYBE_GUARD one of these flags, and then set it under the mmap
> read flag upon guard region installation.
>
> The places where this flag is used currently and matter are:
>
> * VMA merge - performed under mmap/VMA write lock, therefore excluding
>   racing writes.
>
> * /proc/$pid/smaps - can race the write, however this isn't meaningful as
>   the flag write is performed at the point of the guard region being
>   established, and thus an smaps reader can't reasonably expect to avoid
>   races. Due to atomicity, a reader will observe either the flag being set
>   or not. Therefore consistency will be maintained.
>
> In all other cases the flag being set is irrelevant and atomicity
> guarantees other flags will be read correctly.

Could we maybe also spell out that we rely on the read mmap/VMA lock to
exclude with writers that have write lock and then use non-atomic updates to
update completely different flags than VM_MAYBE_GUARD? Those non-atomic
updates could cause RMW races when only our side uses an atomic update, but
the trick is that the read lock excludes with the write lock.

> We additionally update madvise_guard_install() to ensure that
> anon_vma_prepare() is set for anonymous VMAs to maintain consistency with
> the assumption that any anonymous VMA with page tables will have an
> anon_vma set, and any with an anon_vma unset will not have page tables
> established.

Could we more obviously say that we did anon_vma_prepare() unconditionally
before this patch to trigger the page table copying in fork, but it's not
needed anymore because fork now checks also VM_MAYBE_GUARD that we're
setting here. Maybe it would be even more obvious to move that
vma_needs_copy() hunk from previous patch to this one, but doesn't matter
that much.

Also we could mention that this patch alone will prevent merging of VMAs in
some situations, but that's addressed next. I don't think it's such a bisect
hazard to need reordering or combining changes, just mention perhaps.

> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Otherwise LGTM.

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
On Thu, Nov 06, 2025 at 12:31:29PM +0100, Vlastimil Babka wrote:
> On 11/6/25 11:46, Lorenzo Stoakes wrote:
> > This patch adds the ability to atomically set VMA flags with only the mmap
> > read/VMA read lock held.
> >
> > As this could be hugely problematic for VMA flags in general given that all
> > other accesses are non-atomic and serialised by the mmap/VMA locks, we
> > implement this with a strict allow-list - that is, only designated flags
> > are allowed to do this.
> >
> > We make VM_MAYBE_GUARD one of these flags, and then set it under the mmap
> > read flag upon guard region installation.
> >
> > The places where this flag is used currently and matter are:
> >
> > * VMA merge - performed under mmap/VMA write lock, therefore excluding
> >   racing writes.
> >
> > * /proc/$pid/smaps - can race the write, however this isn't meaningful as
> >   the flag write is performed at the point of the guard region being
> >   established, and thus an smaps reader can't reasonably expect to avoid
> >   races. Due to atomicity, a reader will observe either the flag being set
> >   or not. Therefore consistency will be maintained.
> >
> > In all other cases the flag being set is irrelevant and atomicity
> > guarantees other flags will be read correctly.
>
> Could we maybe also spell out that we rely on the read mmap/VMA lock to
> exclude with writers that have write lock and then use non-atomic updates to
> update completely different flags than VM_MAYBE_GUARD? Those non-atomic
> updates could cause RMW races when only our side uses an atomic update, but
> the trick is that the read lock excludes with the write lock.

I thought this was implicit, I guess I can spell that out.

> > We additionally update madvise_guard_install() to ensure that
> > anon_vma_prepare() is set for anonymous VMAs to maintain consistency with
> > the assumption that any anonymous VMA with page tables will have an
> > anon_vma set, and any with an anon_vma unset will not have page tables
> > established.
>
> Could we more obviously say that we did anon_vma_prepare() unconditionally
> before this patch to trigger the page table copying in fork, but it's not
> needed anymore because fork now checks also VM_MAYBE_GUARD that we're
> setting here. Maybe it would be even more obvious to move that
> vma_needs_copy() hunk from previous patch to this one, but doesn't matter
> that much.

I thought that was covered between the comment, the previous patch and this
but I can spell it out also.

> Also we could mention that this patch alone will prevent merging of VMAs in
> some situations, but that's addressed next. I don't think it's such a bisect
> hazard to need reordering or combining changes, just mention perhaps.

A little pedantic but sure :)

> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Otherwise LGTM.
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!