Add kdoc comments, describe exactly what these functions are used for in
detail, pointing out importantly that the anon_vma_clone() !dst->anon_vma
&& src->anon_vma dance is ONLY for fork.

Both are confusing functions that will be refactored in a subsequent patch
but the first stage is establishing documentation and some invariants.

Add some basic CONFIG_DEBUG_VM asserts that help document expected state,
specifically:

anon_vma_clone()
- mmap write lock held.
- We do nothing if src VMA is not faulted.
- The destination VMA has no anon_vma_chain yet.
- We are always operating on the same active VMA (i.e. vma->anon-vma).
- If not forking, must operate on the same mm_struct.

unlink_anon_vmas()
- mmap lock held (read on unmap downgraded).
- That unfaulted VMAs are no-ops.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/rmap.c | 82 +++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 64 insertions(+), 18 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index d6799afe1114..0e34c0a69fbc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -257,30 +257,60 @@ static inline void unlock_anon_vma_root(struct anon_vma *root)
up_write(&root->rwsem);
}
-/*
- * Attach the anon_vmas from src to dst.
- * Returns 0 on success, -ENOMEM on failure.
- *
- * anon_vma_clone() is called by vma_expand(), vma_merge(), __split_vma(),
- * copy_vma() and anon_vma_fork(). The first four want an exact copy of src,
- * while the last one, anon_vma_fork(), may try to reuse an existing anon_vma to
- * prevent endless growth of anon_vma. Since dst->anon_vma is set to NULL before
- * call, we can identify this case by checking (!dst->anon_vma &&
- * src->anon_vma).
- *
- * If (!dst->anon_vma && src->anon_vma) is true, this function tries to find
- * and reuse existing anon_vma which has no vmas and only one child anon_vma.
- * This prevents degradation of anon_vma hierarchy to endless linear chain in
- * case of constantly forking task. On the other hand, an anon_vma with more
- * than one child isn't reused even if there was no alive vma, thus rmap
- * walker has a good chance of avoiding scanning the whole hierarchy when it
- * searches where page is mapped.
+static void check_anon_vma_clone(struct vm_area_struct *dst,
+ struct vm_area_struct *src)
+{
+ /* The write lock must be held. */
+ mmap_assert_write_locked(src->vm_mm);
+ /* If not a fork (implied by dst->anon_vma) then must be on same mm. */
+ VM_WARN_ON_ONCE(dst->anon_vma && dst->vm_mm != src->vm_mm);
+
+ /* No source anon_vma is a no-op. */
+ VM_WARN_ON_ONCE(!src->anon_vma && !list_empty(&src->anon_vma_chain));
+ VM_WARN_ON_ONCE(!src->anon_vma && dst->anon_vma);
+ /* We are establishing a new anon_vma_chain. */
+ VM_WARN_ON_ONCE(!list_empty(&dst->anon_vma_chain));
+ /*
+ * On fork, dst->anon_vma is set NULL (temporarily). Otherwise, anon_vma
+ * must be the same across dst and src.
+ */
+ VM_WARN_ON_ONCE(dst->anon_vma && dst->anon_vma != src->anon_vma);
+}
+
+/**
+ * anon_vma_clone - Establishes new anon_vma_chain objects in @dst linking to
+ * all of the anon_vma objects contained within @src anon_vma_chain's.
+ * @dst: The destination VMA with an empty anon_vma_chain.
+ * @src: The source VMA we wish to duplicate.
+ *
+ * This is the heart of the VMA side of the anon_vma implementation - we invoke
+ * this function whenever we need to set up a new VMA's anon_vma state.
+ *
+ * This is invoked for:
+ *
+ * - VMA Merge, but only when @dst is unfaulted and @src is faulted - meaning we
+ * clone @src into @dst.
+ * - VMA split.
+ * - VMA (m)remap.
+ * - Fork of faulted VMA.
+ *
+ * In all cases other than fork this is simply a duplication. Fork additionally
+ * adds a new active anon_vma.
+ *
+ * ONLY in the case of fork do we try to 'reuse' existing anon_vma's in an
+ * anon_vma hierarchy, reusing anon_vma's which have no VMA associated with them
+ * but do have a single child. This is to avoid waste of memory when repeatedly
+ * forking.
+ *
+ * Returns: 0 on success, -ENOMEM on failure.
*/
int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
{
struct anon_vma_chain *avc, *pavc;
struct anon_vma *root = NULL;
+ check_anon_vma_clone(dst, src);
+
list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
struct anon_vma *anon_vma;
@@ -392,11 +422,27 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
return -ENOMEM;
}
+/**
+ * unlink_anon_vmas() - remove all links between a VMA and anon_vma's, freeing
+ * anon_vma_chain objects.
+ * @vma: The VMA whose links to anon_vma objects is to be severed.
+ *
+ * As part of the process anon_vma_chain's are freed,
+ * anon_vma->num_children,num_active_vmas is updated as required and, if the
+ * relevant anon_vma references no further VMAs, its reference count is
+ * decremented.
+ */
void unlink_anon_vmas(struct vm_area_struct *vma)
{
struct anon_vma_chain *avc, *next;
struct anon_vma *root = NULL;
+ /* Always hold mmap lock, read-lock on unmap possibly. */
+ mmap_assert_locked(vma->vm_mm);
+
+ /* Unfaulted is a no-op. */
+ VM_WARN_ON_ONCE(!vma->anon_vma && !list_empty(&vma->anon_vma_chain));
+
/*
* Unlink each anon_vma chained to the VMA. This list is ordered
* from newest to oldest, ensuring the root anon_vma gets freed last.
--
2.52.0
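
For readers following the fork-only reuse rule in the anon_vma_clone() kdoc above, here is a minimal userspace sketch of that decision. It is not the kernel code: struct model_anon_vma and model_can_reuse() are hypothetical names, and the exact comparison (no active VMAs and fewer than two children) is my assumption about how "no VMA associated but a single child" is expressed. The field names num_children and num_active_vmas are taken from the unlink_anon_vmas() kdoc in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's struct anon_vma. */
struct model_anon_vma {
	unsigned long num_children;    /* child anon_vmas in the hierarchy */
	unsigned long num_active_vmas; /* VMAs using this as their anon_vma */
};

/*
 * Would a clone reuse @candidate as the destination's active anon_vma?
 * Assumption: reuse needs zero active VMAs and fewer than two children.
 */
static bool model_can_reuse(bool dst_has_anon_vma, bool src_has_anon_vma,
			    const struct model_anon_vma *candidate)
{
	assert(candidate != NULL);

	/* Only the fork case (dst unset, src faulted) may reuse. */
	if (dst_has_anon_vma || !src_has_anon_vma)
		return false;

	return candidate->num_active_vmas == 0 &&
	       candidate->num_children < 2;
}
```

In this model a merge, split or mremap (dst already has an anon_vma) never reuses anything, which is the point the kdoc makes with "ONLY in the case of fork".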
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [251217 07:27]:
> Add kdoc comments, describe exactly what these functions are used for in
> detail, pointing out importantly that the anon_vma_clone() !dst->anon_vma
> && src->anon_vma dance is ONLY for fork.
>
> Both are confusing functions that will be refactored in a subsequent patch
> but the first stage is establishing documentation and some invariants.
>
> Add some basic CONFIG_DEBUG_VM asserts that help document expected state,
> specifically:
>
> anon_vma_clone()
> - mmap write lock held.
> - We do nothing if src VMA is not faulted.
> - The destination VMA has no anon_vma_chain yet.
> - We are always operating on the same active VMA (i.e. vma->anon-vma).
> - If not forking, must operate on the same mm_struct.
>
> unlink_anon_vmas()
> - mmap lock held (read on unmap downgraded).
> - That unfaulted VMAs are no-ops.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> mm/rmap.c | 82 +++++++++++++++++++++++++++++++++++++++++++------------
> 1 file changed, 64 insertions(+), 18 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d6799afe1114..0e34c0a69fbc 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -257,30 +257,60 @@ static inline void unlock_anon_vma_root(struct anon_vma *root)
> up_write(&root->rwsem);
> }
>
> -/*
> - * Attach the anon_vmas from src to dst.
> - * Returns 0 on success, -ENOMEM on failure.
> - *
> - * anon_vma_clone() is called by vma_expand(), vma_merge(), __split_vma(),
> - * copy_vma() and anon_vma_fork(). The first four want an exact copy of src,
> - * while the last one, anon_vma_fork(), may try to reuse an existing anon_vma to
> - * prevent endless growth of anon_vma. Since dst->anon_vma is set to NULL before
> - * call, we can identify this case by checking (!dst->anon_vma &&
> - * src->anon_vma).
> - *
> - * If (!dst->anon_vma && src->anon_vma) is true, this function tries to find
> - * and reuse existing anon_vma which has no vmas and only one child anon_vma.
> - * This prevents degradation of anon_vma hierarchy to endless linear chain in
> - * case of constantly forking task. On the other hand, an anon_vma with more
> - * than one child isn't reused even if there was no alive vma, thus rmap
> - * walker has a good chance of avoiding scanning the whole hierarchy when it
> - * searches where page is mapped.
> +static void check_anon_vma_clone(struct vm_area_struct *dst,
> + struct vm_area_struct *src)
> +{
> + /* The write lock must be held. */
> + mmap_assert_write_locked(src->vm_mm);
> + /* If not a fork (implied by dst->anon_vma) then must be on same mm. */
> + VM_WARN_ON_ONCE(dst->anon_vma && dst->vm_mm != src->vm_mm);
> +
> + /* No source anon_vma is a no-op. */
> + VM_WARN_ON_ONCE(!src->anon_vma && !list_empty(&src->anon_vma_chain));
> + VM_WARN_ON_ONCE(!src->anon_vma && dst->anon_vma);
> + /* We are establishing a new anon_vma_chain. */
> + VM_WARN_ON_ONCE(!list_empty(&dst->anon_vma_chain));
> + /*
> + * On fork, dst->anon_vma is set NULL (temporarily). Otherwise, anon_vma
> + * must be the same across dst and src.
> + */
> + VM_WARN_ON_ONCE(dst->anon_vma && dst->anon_vma != src->anon_vma);
> +}
> +
> +/**
> + * anon_vma_clone - Establishes new anon_vma_chain objects in @dst linking to
> + * all of the anon_vma objects contained within @src anon_vma_chain's.
> + * @dst: The destination VMA with an empty anon_vma_chain.
> + * @src: The source VMA we wish to duplicate.
> + *
> + * This is the heart of the VMA side of the anon_vma implementation - we invoke
> + * this function whenever we need to set up a new VMA's anon_vma state.
> + *
> + * This is invoked for:
> + *
> + * - VMA Merge, but only when @dst is unfaulted and @src is faulted - meaning we
> + * clone @src into @dst.
> + * - VMA split.
> + * - VMA (m)remap.
> + * - Fork of faulted VMA.
> + *
> + * In all cases other than fork this is simply a duplication. Fork additionally
> + * adds a new active anon_vma.
> + *
> + * ONLY in the case of fork do we try to 'reuse' existing anon_vma's in an
> + * anon_vma hierarchy, reusing anon_vma's which have no VMA associated with them
> + * but do have a single child. This is to avoid waste of memory when repeatedly
> + * forking.
> + *
> + * Returns: 0 on success, -ENOMEM on failure.
> */
> int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
> {
> struct anon_vma_chain *avc, *pavc;
> struct anon_vma *root = NULL;
>
> + check_anon_vma_clone(dst, src);
> +
> list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
> struct anon_vma *anon_vma;
>
> @@ -392,11 +422,27 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
> return -ENOMEM;
> }
>
> +/**
> + * unlink_anon_vmas() - remove all links between a VMA and anon_vma's, freeing
> + * anon_vma_chain objects.
> + * @vma: The VMA whose links to anon_vma objects is to be severed.
> + *
> + * As part of the process anon_vma_chain's are freed,
> + * anon_vma->num_children,num_active_vmas is updated as required and, if the
> + * relevant anon_vma references no further VMAs, its reference count is
> + * decremented.
> + */
> void unlink_anon_vmas(struct vm_area_struct *vma)
> {
> struct anon_vma_chain *avc, *next;
> struct anon_vma *root = NULL;
>
> + /* Always hold mmap lock, read-lock on unmap possibly. */
> + mmap_assert_locked(vma->vm_mm);
> +
> + /* Unfaulted is a no-op. */
> + VM_WARN_ON_ONCE(!vma->anon_vma && !list_empty(&vma->anon_vma_chain));
> +
> /*
> * Unlink each anon_vma chained to the VMA. This list is ordered
> * from newest to oldest, ensuring the root anon_vma gets freed last.
> --
> 2.52.0
>
On Fri, Dec 19, 2025 at 01:22:02PM -0500, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [251217 07:27]:
> > Add kdoc comments, describe exactly what these functions are used for in
> > detail, pointing out importantly that the anon_vma_clone() !dst->anon_vma
> > && src->anon_vma dance is ONLY for fork.
> >
> > Both are confusing functions that will be refactored in a subsequent patch
> > but the first stage is establishing documentation and some invariants.
> >
> > Add some basic CONFIG_DEBUG_VM asserts that help document expected state,
> > specifically:
> >
> > anon_vma_clone()
> > - mmap write lock held.
> > - We do nothing if src VMA is not faulted.
> > - The destination VMA has no anon_vma_chain yet.
> > - We are always operating on the same active VMA (i.e. vma->anon-vma).
> > - If not forking, must operate on the same mm_struct.
> >
> > unlink_anon_vmas()
> > - mmap lock held (read on unmap downgraded).
> > - That unfaulted VMAs are no-ops.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>

Thanks!
On Fri, Dec 19, 2025 at 10:22 AM Liam R. Howlett
<Liam.Howlett@oracle.com> wrote:
>
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [251217 07:27]:
> > Add kdoc comments, describe exactly what these functions are used for in
> > detail, pointing out importantly that the anon_vma_clone() !dst->anon_vma
> > && src->anon_vma dance is ONLY for fork.
> >
> > Both are confusing functions that will be refactored in a subsequent patch
> > but the first stage is establishing documentation and some invariants.
> >
> > Add some basic CONFIG_DEBUG_VM asserts that help document expected state,
> > specifically:
> >
> > anon_vma_clone()
> > - mmap write lock held.
> > - We do nothing if src VMA is not faulted.
> > - The destination VMA has no anon_vma_chain yet.
> > - We are always operating on the same active VMA (i.e. vma->anon-vma).
nit: s/vma->anon-vma/vma->anon_vma
> > - If not forking, must operate on the same mm_struct.
> >
> > unlink_anon_vmas()
> > - mmap lock held (read on unmap downgraded).
Out of curiosity I looked for the place where unlink_anon_vmas() is
called with mmap_lock downgraded to read but could not find it. Could
you please point me to it?
> > - That unfaulted VMAs are no-ops.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
>
> > ---
> > mm/rmap.c | 82 +++++++++++++++++++++++++++++++++++++++++++------------
> > 1 file changed, 64 insertions(+), 18 deletions(-)
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index d6799afe1114..0e34c0a69fbc 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -257,30 +257,60 @@ static inline void unlock_anon_vma_root(struct anon_vma *root)
> > up_write(&root->rwsem);
> > }
> >
> > -/*
> > - * Attach the anon_vmas from src to dst.
> > - * Returns 0 on success, -ENOMEM on failure.
> > - *
> > - * anon_vma_clone() is called by vma_expand(), vma_merge(), __split_vma(),
> > - * copy_vma() and anon_vma_fork(). The first four want an exact copy of src,
> > - * while the last one, anon_vma_fork(), may try to reuse an existing anon_vma to
> > - * prevent endless growth of anon_vma. Since dst->anon_vma is set to NULL before
> > - * call, we can identify this case by checking (!dst->anon_vma &&
> > - * src->anon_vma).
> > - *
> > - * If (!dst->anon_vma && src->anon_vma) is true, this function tries to find
> > - * and reuse existing anon_vma which has no vmas and only one child anon_vma.
> > - * This prevents degradation of anon_vma hierarchy to endless linear chain in
> > - * case of constantly forking task. On the other hand, an anon_vma with more
> > - * than one child isn't reused even if there was no alive vma, thus rmap
> > - * walker has a good chance of avoiding scanning the whole hierarchy when it
> > - * searches where page is mapped.
> > +static void check_anon_vma_clone(struct vm_area_struct *dst,
> > + struct vm_area_struct *src)
> > +{
> > + /* The write lock must be held. */
> > + mmap_assert_write_locked(src->vm_mm);
> > + /* If not a fork (implied by dst->anon_vma) then must be on same mm. */
> > + VM_WARN_ON_ONCE(dst->anon_vma && dst->vm_mm != src->vm_mm);
> > +
> > + /* No source anon_vma is a no-op. */
I'm confused about the above comment. Do you mean that if
!src->anon_vma then it's a no-op and therefore this function shouldn't
be called? If so, we could simply have VM_WARN_ON_ONCE(!src->anon_vma)
but checks below have more conditions. Can this comment be perhaps
expanded please so that the reader clearly understands what is allowed
and what is not. For example, combination (!src->anon_vma &&
!dst->anon_vma) is allowed and we are correctly not triggering a warning
here, however that's still a no-op IIUC.
> > + VM_WARN_ON_ONCE(!src->anon_vma && !list_empty(&src->anon_vma_chain));
> > + VM_WARN_ON_ONCE(!src->anon_vma && dst->anon_vma);
> > + /* We are establishing a new anon_vma_chain. */
> > + VM_WARN_ON_ONCE(!list_empty(&dst->anon_vma_chain));
> > + /*
> > + * On fork, dst->anon_vma is set NULL (temporarily). Otherwise, anon_vma
> > + * must be the same across dst and src.
This is the second time in this small function where we have to remind
that dst->anon_vma==NULL means that we are forking. Maybe it's better
to introduce a `bool forking = dst->anon_vma==NULL;` variable at the
beginning and use it in all these checks?
I know, I'm nitpicking but as you said, anon_vma code is very
complicated, so the more clarity we can bring to it the better.
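
To make the suggestion concrete, here is a rough userspace sketch of the checks with the fork case named once. It is only a model: struct model_vma and both functions are hypothetical, each function counts violated conditions instead of calling VM_WARN_ON_ONCE(), and the lock assertion is left out. The first function mirrors the conditions as posted, the second rewrites them around a single forking flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for struct vm_area_struct. */
struct model_vma {
	const void *vm_mm;    /* owning mm_struct (opaque here) */
	const void *anon_vma; /* active anon_vma, NULL if unset */
	bool chain_empty;     /* list_empty(&vma->anon_vma_chain) */
};

/* Count violated invariants, with conditions written as in the patch. */
static int model_check_posted(const struct model_vma *dst,
			      const struct model_vma *src)
{
	int bad = 0;

	assert(dst && src);
	bad += dst->anon_vma && dst->vm_mm != src->vm_mm;
	bad += !src->anon_vma && !src->chain_empty;
	bad += !src->anon_vma && dst->anon_vma;
	bad += !dst->chain_empty;
	bad += dst->anon_vma && dst->anon_vma != src->anon_vma;
	return bad;
}

/* The same conditions with the fork case named once up front. */
static int model_check_named(const struct model_vma *dst,
			     const struct model_vma *src)
{
	bool forking;
	int bad = 0;

	assert(dst && src);
	/* Caveat: dst->anon_vma is also NULL when src is unfaulted (no-op). */
	forking = dst->anon_vma == NULL;

	bad += !forking && dst->vm_mm != src->vm_mm;
	bad += !src->anon_vma && !src->chain_empty;
	bad += !src->anon_vma && !forking;
	bad += !dst->chain_empty;
	bad += !forking && dst->anon_vma != src->anon_vma;
	return bad;
}
```

The two forms agree on every input, so this is purely a readability question. The caveat in the comment is worth keeping in mind: a NULL dst->anon_vma does not strictly imply fork when src is unfaulted as well.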
> > + */
> > + VM_WARN_ON_ONCE(dst->anon_vma && dst->anon_vma != src->anon_vma);
> > +}
> > +
> > +/**
> > + * anon_vma_clone - Establishes new anon_vma_chain objects in @dst linking to
> > + * all of the anon_vma objects contained within @src anon_vma_chain's.
> > + * @dst: The destination VMA with an empty anon_vma_chain.
> > + * @src: The source VMA we wish to duplicate.
> > + *
> > + * This is the heart of the VMA side of the anon_vma implementation - we invoke
> > + * this function whenever we need to set up a new VMA's anon_vma state.
> > + *
> > + * This is invoked for:
> > + *
> > + * - VMA Merge, but only when @dst is unfaulted and @src is faulted - meaning we
> > + * clone @src into @dst.
> > + * - VMA split.
> > + * - VMA (m)remap.
> > + * - Fork of faulted VMA.
> > + *
> > + * In all cases other than fork this is simply a duplication. Fork additionally
> > + * adds a new active anon_vma.
> > + *
> > + * ONLY in the case of fork do we try to 'reuse' existing anon_vma's in an
> > + * anon_vma hierarchy, reusing anon_vma's which have no VMA associated with them
> > + * but do have a single child. This is to avoid waste of memory when repeatedly
> > + * forking.
> > + *
> > + * Returns: 0 on success, -ENOMEM on failure.
> > */
> > int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
> > {
> > struct anon_vma_chain *avc, *pavc;
> > struct anon_vma *root = NULL;
> >
> > + check_anon_vma_clone(dst, src);
> > +
> > list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
> > struct anon_vma *anon_vma;
> >
> > @@ -392,11 +422,27 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
> > return -ENOMEM;
> > }
> >
> > +/**
> > + * unlink_anon_vmas() - remove all links between a VMA and anon_vma's, freeing
> > + * anon_vma_chain objects.
> > + * @vma: The VMA whose links to anon_vma objects is to be severed.
> > + *
> > + * As part of the process anon_vma_chain's are freed,
> > + * anon_vma->num_children,num_active_vmas is updated as required and, if the
> > + * relevant anon_vma references no further VMAs, its reference count is
> > + * decremented.
> > + */
> > void unlink_anon_vmas(struct vm_area_struct *vma)
> > {
> > struct anon_vma_chain *avc, *next;
> > struct anon_vma *root = NULL;
> >
> > + /* Always hold mmap lock, read-lock on unmap possibly. */
> > + mmap_assert_locked(vma->vm_mm);
> > +
> > + /* Unfaulted is a no-op. */
> > + VM_WARN_ON_ONCE(!vma->anon_vma && !list_empty(&vma->anon_vma_chain));
Hmm. anon_vma_clone() calls unlink_anon_vmas() after setting
dst->anon_vma=NULL in the enomem_failure path. This warning would
imply that in such case dst->anon_vma_chain is always non-empty. But I
don't think we can always expect that... What if the very first call
to anon_vma_chain_alloc() in anon_vma_clone()'s loop failed, I think
this would result in dst->anon_vma_chain being empty, no?
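
To reason about this failure path, here is a small userspace model, under my own simplifying assumptions: struct model_dst and model_clone_trips_assert() are hypothetical, the loop links one chain entry per source entry, and an allocation failure at a chosen iteration jumps straight to the cleanup that clears dst->anon_vma and calls unlink. The function reports whether the condition checked by the proposed warning (anon_vma NULL while the chain is non-empty) holds at that point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the destination VMA state. */
struct model_dst {
	bool has_anon_vma; /* dst->anon_vma != NULL */
	size_t chain_len;  /* entries linked into dst->anon_vma_chain */
};

/*
 * Model of the clone loop: link one chain entry per source entry, but let
 * the allocation fail at iteration @fail_at (SIZE_MAX means no failure).
 * Returns true if (!anon_vma && chain non-empty) holds when cleanup runs.
 */
static bool model_clone_trips_assert(size_t src_entries, size_t fail_at)
{
	struct model_dst dst = { .has_anon_vma = true, .chain_len = 0 };
	size_t i;

	assert(fail_at == SIZE_MAX || fail_at < src_entries);

	for (i = 0; i < src_entries; i++) {
		if (i == fail_at) {
			/* Failure path: drop anon_vma, then unlink. */
			dst.has_anon_vma = false;
			return !dst.has_anon_vma && dst.chain_len > 0;
		}
		dst.chain_len++;
	}
	return false; /* success: the cleanup never runs */
}
```

In this model a failure on the very first allocation leaves the chain empty, while a failure on any later iteration reaches unlink with entries still linked and anon_vma already cleared. Either way, the state on this path depends on where the allocation failed, so the check cannot be treated as an invariant there.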
> > +
> > /*
> > * Unlink each anon_vma chained to the VMA. This list is ordered
> > * from newest to oldest, ensuring the root anon_vma gets freed last.
> > --
> > 2.52.0
> >
On Mon, Dec 29, 2025 at 01:18:04PM -0800, Suren Baghdasaryan wrote:
> On Fri, Dec 19, 2025 at 10:22 AM Liam R. Howlett
> <Liam.Howlett@oracle.com> wrote:
> >
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [251217 07:27]:
> > > > Add kdoc comments, describe exactly what these functions are used for in
> > > detail, pointing out importantly that the anon_vma_clone() !dst->anon_vma
> > > && src->anon_vma dance is ONLY for fork.
> > >
> > > Both are confusing functions that will be refactored in a subsequent patch
> > > > but the first stage is establishing documentation and some invariants.
> > >
> > > Add some basic CONFIG_DEBUG_VM asserts that help document expected state,
> > > specifically:
> > >
> > > anon_vma_clone()
> > > - mmap write lock held.
> > > - We do nothing if src VMA is not faulted.
> > > - The destination VMA has no anon_vma_chain yet.
> > > - We are always operating on the same active VMA (i.e. vma->anon-vma).
>
> nit: s/vma->anon-vma/vma->anon_vma
Thanks will correct.
>
> > > - If not forking, must operate on the same mm_struct.
> > >
> > > unlink_anon_vmas()
> > > - mmap lock held (read on unmap downgraded).
>
> Out of curiosity I looked for the place where unlink_anon_vmas() is
> called with mmap_lock downgraded to read but could not find it. Could
> you please point me to it?
In brk() we call:

-> do_vmi_align_munmap()
  -> ... (below)

On munmap() we call:

-> __vm_munmap()
  -> do_vmi_munmap()
    -> do_vmi_align_munmap()
      -> ... (below)

On mremap() when shrinking a VMA in place we call:

-> mremap_at()
  -> shrink_vma()
    -> do_vmi_munmap()
      -> do_vmi_align_munmap()
        -> ... (below)

And the ... is:

-> vms_complete_munmap_vmas() [ does downgrade since vms->unlock ]
  -> vms_clear_ptes()
    -> free_pgtables()

I've improved the comment anyway to make it a little clearer.
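
As a rough illustration of why unlink_anon_vmas() asserts mmap_assert_locked() rather than mmap_assert_write_locked(), here is a toy state model of the lock across a downgrade. Everything in it is hypothetical (enum model_lock and the model_* helpers are not kernel API); it only mirrors the intent of the two assertions and of mmap_write_downgrade().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the mmap lock's state for one mm. */
enum model_lock { MODEL_UNLOCKED, MODEL_READ, MODEL_WRITE };

/* Mirrors the intent of mmap_assert_locked(): any mode will do. */
static bool model_is_locked(enum model_lock s)
{
	return s != MODEL_UNLOCKED;
}

/* Mirrors the intent of mmap_assert_write_locked(): write mode only. */
static bool model_is_write_locked(enum model_lock s)
{
	return s == MODEL_WRITE;
}

/* Mirrors mmap_write_downgrade(): write -> read, never unlocked between. */
static enum model_lock model_downgrade(enum model_lock s)
{
	assert(s == MODEL_WRITE);
	return MODEL_READ;
}
```

After the downgrade the "locked" check still passes but the "write locked" check does not, which is the state the unmap paths listed above are in by the time free_pgtables() runs.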
>
> > > - That unfaulted VMAs are no-ops.
> > >
> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> >
> > > ---
> > > mm/rmap.c | 82 +++++++++++++++++++++++++++++++++++++++++++------------
> > > 1 file changed, 64 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index d6799afe1114..0e34c0a69fbc 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -257,30 +257,60 @@ static inline void unlock_anon_vma_root(struct anon_vma *root)
> > > up_write(&root->rwsem);
> > > }
> > >
> > > -/*
> > > - * Attach the anon_vmas from src to dst.
> > > - * Returns 0 on success, -ENOMEM on failure.
> > > - *
> > > - * anon_vma_clone() is called by vma_expand(), vma_merge(), __split_vma(),
> > > - * copy_vma() and anon_vma_fork(). The first four want an exact copy of src,
> > > - * while the last one, anon_vma_fork(), may try to reuse an existing anon_vma to
> > > - * prevent endless growth of anon_vma. Since dst->anon_vma is set to NULL before
> > > - * call, we can identify this case by checking (!dst->anon_vma &&
> > > - * src->anon_vma).
> > > - *
> > > - * If (!dst->anon_vma && src->anon_vma) is true, this function tries to find
> > > - * and reuse existing anon_vma which has no vmas and only one child anon_vma.
> > > - * This prevents degradation of anon_vma hierarchy to endless linear chain in
> > > - * case of constantly forking task. On the other hand, an anon_vma with more
> > > - * than one child isn't reused even if there was no alive vma, thus rmap
> > > - * walker has a good chance of avoiding scanning the whole hierarchy when it
> > > - * searches where page is mapped.
> > > +static void check_anon_vma_clone(struct vm_area_struct *dst,
> > > + struct vm_area_struct *src)
> > > +{
> > > + /* The write lock must be held. */
> > > + mmap_assert_write_locked(src->vm_mm);
> > > + /* If not a fork (implied by dst->anon_vma) then must be on same mm. */
> > > + VM_WARN_ON_ONCE(dst->anon_vma && dst->vm_mm != src->vm_mm);
> > > +
> > > + /* No source anon_vma is a no-op. */
>
> I'm confused about the above comment. Do you mean that if
> !src->anon_vma then it's a no-op and therefore this function shouldn't
> be called? If so, we could simply have VM_WARN_ON_ONCE(!src->anon_vma)
It's a no-op :) so it makes no sense to specify other fields. In a later commit
we literally bail out of anon_vma_clone() if it's not specified. In fact the
very next patch...
> but checks below have more conditions. Can this comment be perhaps
> expanded please so that the reader clearly understands what is allowed
> and what is not. For example, combination (!src->anon_vma &&
> !dst->anon_vma) is allowed and we are correctly not triggering a warning
> here, however that's still a no-op IIUC.
Yup it's correct and fine but it's a no-op, hence we have nothing to do, as you
say.
I thought it was self-documenting, given I literally spell out the expected
conditions in the asserts but obviously this isn't entirely clear. I'm trying
_not_ to write paragraphs here as that can actually make things _more_
confusing.
Will update the comment to say more:
/* If we have anything to do src->anon_vma must be provided. */
>
> > > + VM_WARN_ON_ONCE(!src->anon_vma && !list_empty(&src->anon_vma_chain));
> > > + VM_WARN_ON_ONCE(!src->anon_vma && dst->anon_vma);
> > > + /* We are establishing a new anon_vma_chain. */
> > > + VM_WARN_ON_ONCE(!list_empty(&dst->anon_vma_chain));
> > > + /*
> > > + * On fork, dst->anon_vma is set NULL (temporarily). Otherwise, anon_vma
> > > + * must be the same across dst and src.
>
> This is the second time in this small function where we have to remind
> that dst->anon_vma==NULL means that we are forking. Maybe it's better
> to introduce a `bool forking = dst->anon_vma==NULL;` variable at the
> beginning and use it in all these checks?
Later we make changes along these lines, so for the purposes of keeping things
broken up I'd rather not.
And yes, anon_vma is a complicated mess, this is why I'm trying to do things one
step at a time, so we document the things you'd have to go research to
understand, later we change the code.
>
> I know, I'm nitpicking but as you said, anon_vma code is very
> complicated, so the more clarity we can bring to it the better.
Right, sure, but it has to be one thing at a time.
>
> > > + */
> > > + VM_WARN_ON_ONCE(dst->anon_vma && dst->anon_vma != src->anon_vma);
> > > +}
> > > +
> > > +/**
> > > + * anon_vma_clone - Establishes new anon_vma_chain objects in @dst linking to
> > > + * all of the anon_vma objects contained within @src anon_vma_chain's.
> > > + * @dst: The destination VMA with an empty anon_vma_chain.
> > > + * @src: The source VMA we wish to duplicate.
> > > + *
> > > + * This is the heart of the VMA side of the anon_vma implementation - we invoke
> > > + * this function whenever we need to set up a new VMA's anon_vma state.
> > > + *
> > > + * This is invoked for:
> > > + *
> > > + * - VMA Merge, but only when @dst is unfaulted and @src is faulted - meaning we
> > > + * clone @src into @dst.
> > > + * - VMA split.
> > > + * - VMA (m)remap.
> > > + * - Fork of faulted VMA.
> > > + *
> > > + * In all cases other than fork this is simply a duplication. Fork additionally
> > > + * adds a new active anon_vma.
> > > + *
> > > + * ONLY in the case of fork do we try to 'reuse' existing anon_vma's in an
> > > + * anon_vma hierarchy, reusing anon_vma's which have no VMA associated with them
> > > + * but do have a single child. This is to avoid waste of memory when repeatedly
> > > + * forking.
> > > + *
> > > + * Returns: 0 on success, -ENOMEM on failure.
> > > */
> > > int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
> > > {
> > > struct anon_vma_chain *avc, *pavc;
> > > struct anon_vma *root = NULL;
> > >
> > > + check_anon_vma_clone(dst, src);
> > > +
> > > list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
> > > struct anon_vma *anon_vma;
> > >
> > > @@ -392,11 +422,27 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
> > > return -ENOMEM;
> > > }
> > >
> > > +/**
> > > + * unlink_anon_vmas() - remove all links between a VMA and anon_vma's, freeing
> > > + * anon_vma_chain objects.
> > > + * @vma: The VMA whose links to anon_vma objects is to be severed.
> > > + *
> > > + * As part of the process anon_vma_chain's are freed,
> > > + * anon_vma->num_children,num_active_vmas is updated as required and, if the
> > > + * relevant anon_vma references no further VMAs, its reference count is
> > > + * decremented.
> > > + */
> > > void unlink_anon_vmas(struct vm_area_struct *vma)
> > > {
> > > struct anon_vma_chain *avc, *next;
> > > struct anon_vma *root = NULL;
> > >
> > > + /* Always hold mmap lock, read-lock on unmap possibly. */
> > > + mmap_assert_locked(vma->vm_mm);
> > > +
> > > + /* Unfaulted is a no-op. */
> > > + VM_WARN_ON_ONCE(!vma->anon_vma && !list_empty(&vma->anon_vma_chain));
>
> Hmm. anon_vma_clone() calls unlink_anon_vmas() after setting
> dst->anon_vma=NULL in the enomem_failure path. This warning would
> imply that in such case dst->anon_vma_chain is always non-empty. But I
> don't think we can always expect that... What if the very first call
> to anon_vma_chain_alloc() in anon_vma_clone()'s loop failed, I think
> this would result in dst->anon_vma_chain being empty, no?
OK well that's a good spot, though this is never going to actually happen in
reality as an allocation failure here would really be 'too small to fail'.
It's a pity we have to give up a completely sensible invariant because of
terribly written code for an event that will never happen.
But sure will drop this then, that's awful to have to do though :/
Hey maybe we'd have bot reports on this (would require fault injection) if this
had been taken to any tree at any point. Ah well.
>
> > > +
> > > /*
> > > * Unlink each anon_vma chained to the VMA. This list is ordered
> > > * from newest to oldest, ensuring the root anon_vma gets freed last.
> > > --
> > > 2.52.0
> > >
Thanks, Lorenzo
On Tue, Jan 6, 2026 at 4:54 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Dec 29, 2025 at 01:18:04PM -0800, Suren Baghdasaryan wrote:
> > On Fri, Dec 19, 2025 at 10:22 AM Liam R. Howlett
> > <Liam.Howlett@oracle.com> wrote:
> > >
> > > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [251217 07:27]:
> > > > > Add kdoc comments, describe exactly what these functions are used for in
> > > > detail, pointing out importantly that the anon_vma_clone() !dst->anon_vma
> > > > && src->anon_vma dance is ONLY for fork.
> > > >
> > > > Both are confusing functions that will be refactored in a subsequent patch
> > > > > but the first stage is establishing documentation and some invariants.
> > > >
> > > > Add some basic CONFIG_DEBUG_VM asserts that help document expected state,
> > > > specifically:
> > > >
> > > > anon_vma_clone()
> > > > - mmap write lock held.
> > > > - We do nothing if src VMA is not faulted.
> > > > - The destination VMA has no anon_vma_chain yet.
> > > > - We are always operating on the same active VMA (i.e. vma->anon-vma).
> >
> > nit: s/vma->anon-vma/vma->anon_vma
>
> Thanks will correct.
>
> >
> > > > - If not forking, must operate on the same mm_struct.
> > > >
> > > > unlink_anon_vmas()
> > > > - mmap lock held (read on unmap downgraded).
> >
> > Out of curiosity I looked for the place where unlink_anon_vmas() is
> > called with mmap_lock downgraded to read but could not find it. Could
> > you please point me to it?
>
> In brk() we call:
>
> -> do_vmi_align_munmap()
>   -> ... (below)
>
> On munmap() we call:
>
> -> __vm_munmap()
>   -> do_vmi_munmap()
>     -> do_vmi_align_munmap()
>       -> ... (below)
>
> On mremap() when shrinking a VMA in place we call:
>
> -> mremap_at()
>   -> shrink_vma()
>     -> do_vmi_munmap()
>       -> do_vmi_align_munmap()
>         -> ... (below)
>
> And the ... is:
>
> -> vms_complete_munmap_vmas() [ does downgrade since vms->unlock ]
>   -> vms_clear_ptes()
>     -> free_pgtables()
>
> I've improved the comment anyway to make it a little clearer.
Ah, now I see. Thanks!
>
> >
> > > > - That unfaulted VMAs are no-ops.
> > > >
> > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > >
> > > Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > >
> > > > ---
> > > > mm/rmap.c | 82 +++++++++++++++++++++++++++++++++++++++++++------------
> > > > 1 file changed, 64 insertions(+), 18 deletions(-)
> > > >
> > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > index d6799afe1114..0e34c0a69fbc 100644
> > > > --- a/mm/rmap.c
> > > > +++ b/mm/rmap.c
> > > > @@ -257,30 +257,60 @@ static inline void unlock_anon_vma_root(struct anon_vma *root)
> > > > up_write(&root->rwsem);
> > > > }
> > > >
> > > > -/*
> > > > - * Attach the anon_vmas from src to dst.
> > > > - * Returns 0 on success, -ENOMEM on failure.
> > > > - *
> > > > - * anon_vma_clone() is called by vma_expand(), vma_merge(), __split_vma(),
> > > > - * copy_vma() and anon_vma_fork(). The first four want an exact copy of src,
> > > > - * while the last one, anon_vma_fork(), may try to reuse an existing anon_vma to
> > > > - * prevent endless growth of anon_vma. Since dst->anon_vma is set to NULL before
> > > > - * call, we can identify this case by checking (!dst->anon_vma &&
> > > > - * src->anon_vma).
> > > > - *
> > > > - * If (!dst->anon_vma && src->anon_vma) is true, this function tries to find
> > > > - * and reuse existing anon_vma which has no vmas and only one child anon_vma.
> > > > - * This prevents degradation of anon_vma hierarchy to endless linear chain in
> > > > - * case of constantly forking task. On the other hand, an anon_vma with more
> > > > - * than one child isn't reused even if there was no alive vma, thus rmap
> > > > - * walker has a good chance of avoiding scanning the whole hierarchy when it
> > > > - * searches where page is mapped.
> > > > +static void check_anon_vma_clone(struct vm_area_struct *dst,
> > > > + struct vm_area_struct *src)
> > > > +{
> > > > + /* The write lock must be held. */
> > > > + mmap_assert_write_locked(src->vm_mm);
> > > > + /* If not a fork (implied by dst->anon_vma) then must be on same mm. */
> > > > + VM_WARN_ON_ONCE(dst->anon_vma && dst->vm_mm != src->vm_mm);
> > > > +
> > > > + /* No source anon_vma is a no-op. */
> >
> > I'm confused about the above comment. Do you mean that if
> > !src->anon_vma then it's a no-op and therefore this function shouldn't
> > be called? If so, we could simply have VM_WARN_ON_ONCE(!src->anon_vma)
>
> It's a no-op :) so it makes no sense to specify other fields. In a later commit
> we literally bail out of anon_vma_clone() if it's not specified. In fact the
> very next patch...
>
> > but checks below have more conditions. Can this comment be perhaps
> > expanded please so that the reader clearly understands what is allowed
> > and what is not. For example, combination (!src->anon_vma &&
> > !dst->anon_vma) is allowed and we correctly do not trigger a warning
> > here, however that's still a no-op IIUC.
>
> Yup it's correct and fine but it's a no-op, hence we have nothing to do, as you
> say.
>
> I thought it was self-documenting, given I literally spell out the expected
> conditions in the asserts but obviously this isn't entirely clear. I'm trying
> _not_ to write paragraphs here as that can actually make things _more_
> confusing.
Yeah, that comment just confused me a bit. If it's a no-op then other
conditions should not matter, yet we are asserting them. Anyway, I
understand the intention, and either the new comment or no comment at all
is fine with me.
>
> Will update the comment to say more:
>
> /* If we have anything to do src->anon_vma must be provided. */
>
> >
> > > > + VM_WARN_ON_ONCE(!src->anon_vma && !list_empty(&src->anon_vma_chain));
> > > > + VM_WARN_ON_ONCE(!src->anon_vma && dst->anon_vma);
> > > > + /* We are establishing a new anon_vma_chain. */
> > > > + VM_WARN_ON_ONCE(!list_empty(&dst->anon_vma_chain));
> > > > + /*
> > > > + * On fork, dst->anon_vma is set NULL (temporarily). Otherwise, anon_vma
> > > > + * must be the same across dst and src.
> >
> > This is the second time in this small function where we have to remind
> > that dst->anon_vma==NULL means that we are forking. Maybe it's better
> > to introduce a `bool forking = dst->anon_vma==NULL;` variable at the
> > beginning and use it in all these checks?
>
> Later we make changes along these lines, so for the purposes of keeping things
> broken up I'd rather not.
>
> And yes, anon_vma is a complicated mess, this is why I'm trying to do things one
> step at a time, so we document the things you'd have to go research to
> understand, later we change the code.
>
> >
> > I know, I'm nitpicking but as you said, anon_vma code is very
> > complicated, so the more clarity we can bring to it the better.
>
> Right, sure, but it has to be one thing at a time.
Ack.
>
> >
> > > > + */
> > > > + VM_WARN_ON_ONCE(dst->anon_vma && dst->anon_vma != src->anon_vma);
> > > > +}
> > > > +
> > > > +/**
> > > > + * anon_vma_clone - Establishes new anon_vma_chain objects in @dst linking to
> > > > + * all of the anon_vma objects contained within @src anon_vma_chain's.
> > > > + * @dst: The destination VMA with an empty anon_vma_chain.
> > > > + * @src: The source VMA we wish to duplicate.
> > > > + *
> > > > + * This is the heart of the VMA side of the anon_vma implementation - we invoke
> > > > + * this function whenever we need to set up a new VMA's anon_vma state.
> > > > + *
> > > > + * This is invoked for:
> > > > + *
> > > > + * - VMA Merge, but only when @dst is unfaulted and @src is faulted - meaning we
> > > > + * clone @src into @dst.
> > > > + * - VMA split.
> > > > + * - VMA (m)remap.
> > > > + * - Fork of faulted VMA.
> > > > + *
> > > > + * In all cases other than fork this is simply a duplication. Fork additionally
> > > > + * adds a new active anon_vma.
> > > > + *
> > > > + * ONLY in the case of fork do we try to 'reuse' existing anon_vma's in an
> > > > + * anon_vma hierarchy, reusing anon_vma's which have no VMA associated with them
> > > > + * but do have a single child. This is to avoid waste of memory when repeatedly
> > > > + * forking.
> > > > + *
> > > > + * Returns: 0 on success, -ENOMEM on failure.
> > > > */
> > > > int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
> > > > {
> > > > struct anon_vma_chain *avc, *pavc;
> > > > struct anon_vma *root = NULL;
> > > >
> > > > + check_anon_vma_clone(dst, src);
> > > > +
> > > > list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
> > > > struct anon_vma *anon_vma;
> > > >
> > > > @@ -392,11 +422,27 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
> > > > return -ENOMEM;
> > > > }
> > > >
> > > > +/**
> > > > + * unlink_anon_vmas() - remove all links between a VMA and anon_vma's, freeing
> > > > + * anon_vma_chain objects.
> > > > + * @vma: The VMA whose links to anon_vma objects is to be severed.
> > > > + *
> > > > + * As part of the process anon_vma_chain's are freed,
> > > > + * anon_vma->num_children,num_active_vmas is updated as required and, if the
> > > > + * relevant anon_vma references no further VMAs, its reference count is
> > > > + * decremented.
> > > > + */
> > > > void unlink_anon_vmas(struct vm_area_struct *vma)
> > > > {
> > > > struct anon_vma_chain *avc, *next;
> > > > struct anon_vma *root = NULL;
> > > >
> > > > + /* Always hold mmap lock, read-lock on unmap possibly. */
> > > > + mmap_assert_locked(vma->vm_mm);
> > > > +
> > > > + /* Unfaulted is a no-op. */
> > > > + VM_WARN_ON_ONCE(!vma->anon_vma && !list_empty(&vma->anon_vma_chain));
> >
> > Hmm. anon_vma_clone() calls unlink_anon_vmas() after setting
> > dst->anon_vma=NULL in the enomem_failure path. This warning would
> > imply that in such case dst->anon_vma_chain is always non-empty. But I
> > don't think we can always expect that... What if the very first call
> > to anon_vma_chain_alloc() in anon_vma_clone()'s loop failed, I think
> > this would result in dst->anon_vma_chain being empty, no?
>
> OK well that's a good spot, though this is never going to actually happen in
> reality as an allocation failure here would really be 'too small to fail'.
>
> It's a pity we have to give up a completely sensible invariant because of
> terribly written code for an event that will never happen.
>
> But sure will drop this then, that's awful to have to do though :/
>
> Hey maybe we'd have bot reports on this (would require fault injection) if this
> had been taken to any tree at any point. Ah well.
I'll look into the new version to see the final result. Thanks!
>
> >
> > > > +
> > > > /*
> > > > * Unlink each anon_vma chained to the VMA. This list is ordered
> > > > * from newest to oldest, ensuring the root anon_vma gets freed last.
> > > > --
> > > > 2.52.0
> > > >
>
> Thanks, Lorenzo
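
For readers following the `bool forking` suggestion discussed above, here is a minimal userspace sketch of how the quoted checks might read with that variable. It is only a model under stated assumptions: `struct vma_model`, `struct mm_model`, `struct anon_vma_model`, `chain_len` and `check_clone_model()` are hypothetical stand-ins invented for illustration, not kernel definitions, and the function counts violations rather than calling VM_WARN_ON_ONCE(). It is not the form the series ended up taking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are NOT the kernel's structures. */
struct mm_model { int id; };
struct anon_vma_model { int id; };

struct vma_model {
	struct mm_model *vm_mm;
	struct anon_vma_model *anon_vma;
	size_t chain_len;	/* stands in for the anon_vma_chain list length */
};

/*
 * Returns the number of violated invariants instead of warning, so that the
 * model can be exercised in a test. The conditions mirror the quoted
 * VM_WARN_ON_ONCE() checks.
 */
static int check_clone_model(const struct vma_model *dst,
			     const struct vma_model *src)
{
	int violations = 0;
	bool forking;

	assert(dst && src);

	/*
	 * As stated in the thread, a NULL dst->anon_vma is taken to imply
	 * fork. Note it is also NULL in the unfaulted (no-op) case.
	 */
	forking = dst->anon_vma == NULL;

	/* If not forking, both VMAs must belong to the same mm. */
	if (!forking && dst->vm_mm != src->vm_mm)
		violations++;
	/* An unfaulted source has no chain and gives dst nothing to hold. */
	if (!src->anon_vma && src->chain_len != 0)
		violations++;
	if (!src->anon_vma && !forking)
		violations++;
	/* We are establishing a new chain in dst. */
	if (dst->chain_len != 0)
		violations++;
	/* If not forking, the anon_vma must match across dst and src. */
	if (!forking && dst->anon_vma != src->anon_vma)
		violations++;

	return violations;
}
```

In this model a fork-style destination (NULL anon_vma, different mm) and a split-style destination (same mm, same anon_vma) both pass, while a destination that shares the anon_vma but lives in a different mm trips exactly one check.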
On Tue, Jan 06, 2026 at 12:54:17PM +0000, Lorenzo Stoakes wrote:
> > > > + /* Unfaulted is a no-op. */
> > > > + VM_WARN_ON_ONCE(!vma->anon_vma && !list_empty(&vma->anon_vma_chain));
> >
> > Hmm. anon_vma_clone() calls unlink_anon_vmas() after setting
> > dst->anon_vma=NULL in the enomem_failure path. This warning would
> > imply that in such case dst->anon_vma_chain is always non-empty. But I
> > don't think we can always expect that... What if the very first call
> > to anon_vma_chain_alloc() in anon_vma_clone()'s loop failed, I think
> > this would result in dst->anon_vma_chain being empty, no?
>
> OK well that's a good spot, though this is never going to actually happen in
> reality as an allocation failure here would really be 'too small to fail'.
>
> It's a pity we have to give up a completely sensible invariant because of
> terribly written code for an event that will never happen.
>
> But sure will drop this then, that's awful to have to do though :/

Actually let me just update the stupid hack exit path code to increment
anon_vma->num_active_vmas in this case, and do it that way (so
unlink_anon_vmas() drops it again).

That actually makes the whole thing _less_ of a hack as it really makes zero
sense for the anon_vma not to be specified but to be working through
vma->anon_vma_chain, and that's a very important invariant.
On Tue, Jan 06, 2026 at 01:01:15PM +0000, Lorenzo Stoakes wrote:
> On Tue, Jan 06, 2026 at 12:54:17PM +0000, Lorenzo Stoakes wrote:
> > > > > + /* Unfaulted is a no-op. */
> > > > > + VM_WARN_ON_ONCE(!vma->anon_vma && !list_empty(&vma->anon_vma_chain));
> > >
> > > Hmm. anon_vma_clone() calls unlink_anon_vmas() after setting
> > > dst->anon_vma=NULL in the enomem_failure path. This warning would
> > > imply that in such case dst->anon_vma_chain is always non-empty. But I
> > > don't think we can always expect that... What if the very first call
> > > to anon_vma_chain_alloc() in anon_vma_clone()'s loop failed, I think
> > > this would result in dst->anon_vma_chain being empty, no?
> >
> > OK well that's a good spot, though this is never going to actually happen in
> > reality as an allocation failure here would really be 'too small to fail'.
> >
> > It's a pity we have to give up a completely sensible invariant because of
> > terribly written code for an event that will never happen.
> >
> > But sure will drop this then, that's awful to have to do though :/
>
> Actually let me just update the stupid hack exit path code to increment
> anon_vma->num_active_vmas in this case, and do it that way (so
> unlink_anon_vmas() drops it again).
>
> That actually makes the whole thing _less_ of a hack as it really makes zero
> sense for the anon_vma not to be specified but to be working through
> vma->anon_vma_chain, and that's a very important invariant.

Scratch that, this error path ruins the whole thing (we are assigning anon_vma
later, because of course we are). How I hate this code.

Will rethink...
On Tue, Jan 06, 2026 at 01:04:27PM +0000, Lorenzo Stoakes wrote:
> On Tue, Jan 06, 2026 at 01:01:15PM +0000, Lorenzo Stoakes wrote:
> > On Tue, Jan 06, 2026 at 12:54:17PM +0000, Lorenzo Stoakes wrote:
> > > > > > + /* Unfaulted is a no-op. */
> > > > > > + VM_WARN_ON_ONCE(!vma->anon_vma && !list_empty(&vma->anon_vma_chain));
> > > >
> > > > Hmm. anon_vma_clone() calls unlink_anon_vmas() after setting
> > > > dst->anon_vma=NULL in the enomem_failure path. This warning would
> > > > imply that in such case dst->anon_vma_chain is always non-empty. But I
> > > > don't think we can always expect that... What if the very first call
> > > > to anon_vma_chain_alloc() in anon_vma_clone()'s loop failed, I think
> > > > this would result in dst->anon_vma_chain being empty, no?
> > >
> > > OK well that's a good spot, though this is never going to actually happen in
> > > reality as an allocation failure here would really be 'too small to fail'.
> > >
> > > It's a pity we have to give up a completely sensible invariant because of
> > > terribly written code for an event that will never happen.
> > >
> > > But sure will drop this then, that's awful to have to do though :/
> >
> > Actually let me just update the stupid hack exit path code to increment
> > anon_vma->num_active_vmas in this case, and do it that way (so
> > unlink_anon_vmas() drops it again).
> >
> > That actually makes the whole thing _less_ of a hack as it really makes zero
> > sense for the anon_vma not to be specified but to be working through
> > vma->anon_vma_chain, and that's a very important invariant.
>
> Scratch that, this error path ruins the whole thing (we are assigning anon_vma
> later, because of course we are). How I hate this code.
>
> Will rethink...

OK I think will add a specific partial cleanup path. We know we are not going
to hit any empty anon_vma's because we hold the exclusive mmap write lock, so
this can be radically simplified.

That way we don't have VMAs with partially established anon_vma state (both
vma->anon_vma and vma->anon_vma_chain) being sent to the core unlink function
which is also an improvement.
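
To make the failure case and the "specific partial cleanup path" idea above concrete, here is a minimal userspace sketch. It is a model only and not the eventual kernel change: `struct avc_model`, `struct vma_chain_model`, `allocs_until_failure`, `avc_alloc_model()`, `cleanup_partial_model()` and `clone_chain_model()` are all hypothetical names invented for illustration, and the model ignores locking, anon_vma refcounts and the interval tree. It shows a clone loop that, on allocation failure, frees only the entries it has already linked into the destination, including the case where the very first allocation fails and the destination chain is still empty.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for one anon_vma_chain entry on a VMA's list. */
struct avc_model {
	int anon_vma_id;
	struct avc_model *next;
};

/* Hypothetical stand-in for the part of a VMA that owns the chain. */
struct vma_chain_model {
	struct avc_model *chain;
};

/* Fault-injection knob: number of allocations allowed before failing. */
static int allocs_until_failure = -1;	/* -1 means never fail */

static struct avc_model *avc_alloc_model(void)
{
	if (allocs_until_failure == 0)
		return NULL;
	if (allocs_until_failure > 0)
		allocs_until_failure--;
	return malloc(sizeof(struct avc_model));
}

/* Free only the entries this clone attempt managed to link into dst. */
static void cleanup_partial_model(struct vma_chain_model *dst)
{
	while (dst->chain) {
		struct avc_model *avc = dst->chain;

		dst->chain = avc->next;
		free(avc);
	}
}

/* Returns 0 on success, -1 on (injected) allocation failure. */
static int clone_chain_model(struct vma_chain_model *dst,
			     const struct vma_chain_model *src)
{
	const struct avc_model *pavc;

	/* Mirrors the invariant that dst has no chain yet. */
	assert(dst->chain == NULL);

	for (pavc = src->chain; pavc; pavc = pavc->next) {
		struct avc_model *avc = avc_alloc_model();

		if (!avc) {
			/* dst may hold zero or more entries at this point. */
			cleanup_partial_model(dst);
			return -1;
		}
		avc->anon_vma_id = pavc->anon_vma_id;
		avc->next = dst->chain;	/* push to the front of dst's chain */
		dst->chain = avc;
	}
	return 0;
}
```

Setting `allocs_until_failure` to 0 reproduces the case raised in review, where the first allocation fails and the destination chain is empty when cleanup runs; the dedicated cleanup handles that without relying on any assertion about a non-empty chain.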
On Mon, Dec 29, 2025 at 1:18 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Dec 19, 2025 at 10:22 AM Liam R. Howlett
> <Liam.Howlett@oracle.com> wrote:
> >
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [251217 07:27]:
> > > Add kdoc comments, describe exactly what these functions are used for in
> > > detail, pointing out importantly that the anon_vma_clone() !dst->anon_vma
> > > && src->anon_vma dance is ONLY for fork.
> > >
> > > Both are confusing functions that will be refactored in a subsequent patch
> > > but the first stage is establishing documentation and some invariants.
> > >
> > > Add some basic CONFIG_DEBUG_VM asserts that help document expected state,
> > > specifically:
> > >
> > > anon_vma_clone()
> > > - mmap write lock held.
> > > - We do nothing if src VMA is not faulted.
> > > - The destination VMA has no anon_vma_chain yet.
> > > - We are always operating on the same active VMA (i.e. vma->anon-vma).
>
> nit: s/vma->anon-vma/vma->anon_vma
>
> > > - If not forking, must operate on the same mm_struct.
> > >
> > > unlink_anon_vmas()
> > > - mmap lock held (read on unmap downgraded).
>
> Out of curiosity I looked for the place where unlink_anon_vmas() is
> called with mmap_lock downgraded to read but could not find it. Could
> you please point me to it?
>
> > > - That unfaulted VMAs are no-ops.
> > >
> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> >
> > > ---
> > > mm/rmap.c | 82 +++++++++++++++++++++++++++++++++++++++++++------------
> > > 1 file changed, 64 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index d6799afe1114..0e34c0a69fbc 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -257,30 +257,60 @@ static inline void unlock_anon_vma_root(struct anon_vma *root)
> > > up_write(&root->rwsem);
> > > }
> > >
> > > -/*
> > > - * Attach the anon_vmas from src to dst.
> > > - * Returns 0 on success, -ENOMEM on failure.
> > > - *
> > > - * anon_vma_clone() is called by vma_expand(), vma_merge(), __split_vma(),
> > > - * copy_vma() and anon_vma_fork(). The first four want an exact copy of src,
> > > - * while the last one, anon_vma_fork(), may try to reuse an existing anon_vma to
> > > - * prevent endless growth of anon_vma. Since dst->anon_vma is set to NULL before
> > > - * call, we can identify this case by checking (!dst->anon_vma &&
> > > - * src->anon_vma).
> > > - *
> > > - * If (!dst->anon_vma && src->anon_vma) is true, this function tries to find
> > > - * and reuse existing anon_vma which has no vmas and only one child anon_vma.
> > > - * This prevents degradation of anon_vma hierarchy to endless linear chain in
> > > - * case of constantly forking task. On the other hand, an anon_vma with more
> > > - * than one child isn't reused even if there was no alive vma, thus rmap
> > > - * walker has a good chance of avoiding scanning the whole hierarchy when it
> > > - * searches where page is mapped.
> > > +static void check_anon_vma_clone(struct vm_area_struct *dst,
> > > + struct vm_area_struct *src)
> > > +{
> > > + /* The write lock must be held. */
> > > + mmap_assert_write_locked(src->vm_mm);
As a side-note, I think we might be able to claim
vma_assert_locked(src) here if we move vma_start_write(vma) at
https://elixir.bootlin.com/linux/v6.19-rc3/source/mm/vma.c#L538 to
line 527. I looked over other places where we call anon_vma_clone()
and they seem to write-lock src VMA before cloning anon_vma.
mmap_assert_write_locked() is sufficient here because we do not change
anon_vma under VMA lock (see __vmf_anon_prepare() upgrading to mmap
lock if a new anon_vma has to be established during page fault) but I
think we can be more strict here.
> > > + /* If not a fork (implied by dst->anon_vma) then must be on same mm. */
> > > + VM_WARN_ON_ONCE(dst->anon_vma && dst->vm_mm != src->vm_mm);
> > > +
> > > + /* No source anon_vma is a no-op. */
>
> I'm confused about the above comment. Do you mean that if
> !src->anon_vma then it's a no-op and therefore this function shouldn't
> be called? If so, we could simply have VM_WARN_ON_ONCE(!src->anon_vma)
> but checks below have more conditions. Can this comment be perhaps
> expanded please so that the reader clearly understands what is allowed
> and what is not. For example, combination (!src->anon_vma &&
> !dst->anon_vma) is allowed and we correctly do not trigger a warning
> here, however that's still a no-op IIUC.
>
> > > + VM_WARN_ON_ONCE(!src->anon_vma && !list_empty(&src->anon_vma_chain));
> > > + VM_WARN_ON_ONCE(!src->anon_vma && dst->anon_vma);
> > > + /* We are establishing a new anon_vma_chain. */
> > > + VM_WARN_ON_ONCE(!list_empty(&dst->anon_vma_chain));
> > > + /*
> > > + * On fork, dst->anon_vma is set NULL (temporarily). Otherwise, anon_vma
> > > + * must be the same across dst and src.
>
> This is the second time in this small function where we have to remind
> that dst->anon_vma==NULL means that we are forking. Maybe it's better
> to introduce a `bool forking = dst->anon_vma==NULL;` variable at the
> beginning and use it in all these checks?
>
> I know, I'm nitpicking but as you said, anon_vma code is very
> complicated, so the more clarity we can bring to it the better.
>
> > > + */
> > > + VM_WARN_ON_ONCE(dst->anon_vma && dst->anon_vma != src->anon_vma);
> > > +}
> > > +
> > > +/**
> > > + * anon_vma_clone - Establishes new anon_vma_chain objects in @dst linking to
> > > + * all of the anon_vma objects contained within @src anon_vma_chain's.
> > > + * @dst: The destination VMA with an empty anon_vma_chain.
> > > + * @src: The source VMA we wish to duplicate.
> > > + *
> > > + * This is the heart of the VMA side of the anon_vma implementation - we invoke
> > > + * this function whenever we need to set up a new VMA's anon_vma state.
> > > + *
> > > + * This is invoked for:
> > > + *
> > > + * - VMA Merge, but only when @dst is unfaulted and @src is faulted - meaning we
> > > + * clone @src into @dst.
> > > + * - VMA split.
> > > + * - VMA (m)remap.
> > > + * - Fork of faulted VMA.
> > > + *
> > > + * In all cases other than fork this is simply a duplication. Fork additionally
> > > + * adds a new active anon_vma.
> > > + *
> > > + * ONLY in the case of fork do we try to 'reuse' existing anon_vma's in an
> > > + * anon_vma hierarchy, reusing anon_vma's which have no VMA associated with them
> > > + * but do have a single child. This is to avoid waste of memory when repeatedly
> > > + * forking.
> > > + *
> > > + * Returns: 0 on success, -ENOMEM on failure.
> > > */
> > > int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
> > > {
> > > struct anon_vma_chain *avc, *pavc;
> > > struct anon_vma *root = NULL;
> > >
> > > + check_anon_vma_clone(dst, src);
> > > +
> > > list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
> > > struct anon_vma *anon_vma;
> > >
> > > @@ -392,11 +422,27 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
> > > return -ENOMEM;
> > > }
> > >
> > > +/**
> > > + * unlink_anon_vmas() - remove all links between a VMA and anon_vma's, freeing
> > > + * anon_vma_chain objects.
> > > + * @vma: The VMA whose links to anon_vma objects is to be severed.
> > > + *
> > > + * As part of the process anon_vma_chain's are freed,
> > > + * anon_vma->num_children,num_active_vmas is updated as required and, if the
> > > + * relevant anon_vma references no further VMAs, its reference count is
> > > + * decremented.
> > > + */
> > > void unlink_anon_vmas(struct vm_area_struct *vma)
> > > {
> > > struct anon_vma_chain *avc, *next;
> > > struct anon_vma *root = NULL;
> > >
> > > + /* Always hold mmap lock, read-lock on unmap possibly. */
> > > + mmap_assert_locked(vma->vm_mm);
> > > +
> > > + /* Unfaulted is a no-op. */
> > > + VM_WARN_ON_ONCE(!vma->anon_vma && !list_empty(&vma->anon_vma_chain));
>
> Hmm. anon_vma_clone() calls unlink_anon_vmas() after setting
> dst->anon_vma=NULL in the enomem_failure path. This warning would
> imply that in such case dst->anon_vma_chain is always non-empty. But I
> don't think we can always expect that... What if the very first call
> to anon_vma_chain_alloc() in anon_vma_clone()'s loop failed, I think
> this would result in dst->anon_vma_chain being empty, no?
>
> > > +
> > > /*
> > > * Unlink each anon_vma chained to the VMA. This list is ordered
> > > * from newest to oldest, ensuring the root anon_vma gets freed last.
> > > --
> > > 2.52.0
> > >