[PATCH RESEND v3 03/10] mm/vma: rename is_vma_write_only(), separate out shared refcount put

Lorenzo Stoakes posted 10 patches 2 weeks, 3 days ago
There is a newer version of this series
[PATCH RESEND v3 03/10] mm/vma: rename is_vma_write_only(), separate out shared refcount put
Posted by Lorenzo Stoakes 2 weeks, 3 days ago
The is_vma_writer_only() function is misnamed - this isn't determining if
there is only a write lock, as it checks for the presence of the
VM_REFCNT_EXCLUDE_READERS_FLAG.

Really, it is checking to see whether readers are excluded, with a
possibility of a false positive in the case of a detachment (there we
expect the vma->vm_refcnt to eventually be set to
VM_REFCNT_EXCLUDE_READERS_FLAG, whereas for an attached VMA we expect it to
eventually be set to VM_REFCNT_EXCLUDE_READERS_FLAG + 1).

Rename the function accordingly.

Relatedly, we use a finnicky __refcount_dec_and_test() primitive directly
in vma_refcount_put(), using the old value to determine what the reference
count ought to be after the operation is complete (ignoring racing
reference count adjustments).

Wrap this into a __vma_refcount_put() function, which we can then utilise
in vma_mark_detached() and thus keep the refcount primitive usage
abstracted.

Also adjust comments, removing duplicative comments covered elsewhere and
adding more to aid understanding.

No functional change intended.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/mmap_lock.h | 62 +++++++++++++++++++++++++++++++--------
 mm/mmap_lock.c            | 18 +++++-------
 2 files changed, 57 insertions(+), 23 deletions(-)

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index a764439d0276..0b3614aadbb4 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -122,15 +122,27 @@ static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt)
 	vma->vm_lock_seq = UINT_MAX;
 }

-static inline bool is_vma_writer_only(int refcnt)
+/**
+ * are_readers_excluded() - Determine whether @refcnt describes a VMA which has
+ * excluded all VMA read locks.
+ * @refcnt: The VMA reference count obtained from vm_area_struct->vm_refcnt.
+ *
+ * We may be raced by other readers temporarily incrementing the reference
+ * count, though the race window is very small, this might cause spurious
+ * wakeups.
+ *
+ * In the case of a detached VMA, we may incorrectly indicate that readers are
+ * excluded when one remains, because in that scenario we target a refcount of
+ * VM_REFCNT_EXCLUDE_READERS_FLAG, rather than the attached target of
+ * VM_REFCNT_EXCLUDE_READERS_FLAG + 1.
+ *
+ * However, the race window for that is very small so it is unlikely.
+ *
+ * Returns: true if readers are excluded, false otherwise.
+ */
+static inline bool are_readers_excluded(int refcnt)
 {
 	/*
-	 * With a writer and no readers, refcnt is VM_REFCNT_EXCLUDE_READERS_FLAG
-	 * if the vma is detached and (VM_REFCNT_EXCLUDE_READERS_FLAG + 1) if it is
-	 * attached. Waiting on a detached vma happens only in
-	 * vma_mark_detached() and is a rare case, therefore most of the time
-	 * there will be no unnecessary wakeup.
-	 *
 	 * See the comment describing the vm_area_struct->vm_refcnt field for
 	 * details of possible refcnt values.
 	 */
@@ -138,18 +150,42 @@ static inline bool is_vma_writer_only(int refcnt)
 		refcnt <= VM_REFCNT_EXCLUDE_READERS_FLAG + 1;
 }

+static inline bool __vma_refcount_put(struct vm_area_struct *vma, int *refcnt)
+{
+	int oldcnt;
+	bool detached;
+
+	detached = __refcount_dec_and_test(&vma->vm_refcnt, &oldcnt);
+	if (refcnt)
+		*refcnt = oldcnt - 1;
+	return detached;
+}
+
+/**
+ * vma_refcount_put() - Drop reference count in VMA vm_refcnt field due to a
+ * read-lock being dropped.
+ * @vma: The VMA whose reference count we wish to decrement.
+ *
+ * If we were the last reader, wake up threads waiting to obtain an exclusive
+ * lock.
+ */
 static inline void vma_refcount_put(struct vm_area_struct *vma)
 {
-	/* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
+	/* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt. */
 	struct mm_struct *mm = vma->vm_mm;
-	int oldcnt;
+	int refcnt;
+	bool detached;

 	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
-	if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {

-		if (is_vma_writer_only(oldcnt - 1))
-			rcuwait_wake_up(&mm->vma_writer_wait);
-	}
+	detached = __vma_refcount_put(vma, &refcnt);
+	/*
+	 * __vma_enter_locked() may be sleeping waiting for readers to drop
+	 * their reference count, so wake it up if we were the last reader
+	 * blocking it from being acquired.
+	 */
+	if (!detached && are_readers_excluded(refcnt))
+		rcuwait_wake_up(&mm->vma_writer_wait);
 }

 /*
diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
index 75dc098aea14..ebacb57e5f16 100644
--- a/mm/mmap_lock.c
+++ b/mm/mmap_lock.c
@@ -130,25 +130,23 @@ EXPORT_SYMBOL_GPL(__vma_start_write);

 void vma_mark_detached(struct vm_area_struct *vma)
 {
+	bool detached;
+
 	vma_assert_write_locked(vma);
 	vma_assert_attached(vma);

 	/*
-	 * We are the only writer, so no need to use vma_refcount_put().
-	 * The condition below is unlikely because the vma has been already
-	 * write-locked and readers can increment vm_refcnt only temporarily
-	 * before they check vm_lock_seq, realize the vma is locked and drop
-	 * back the vm_refcnt. That is a narrow window for observing a raised
-	 * vm_refcnt.
-	 *
 	 * See the comment describing the vm_area_struct->vm_refcnt field for
 	 * details of possible refcnt values.
 	 */
-	if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
+	detached = __vma_refcount_put(vma, NULL);
+	if (unlikely(!detached)) {
 		/* Wait until vma is detached with no readers. */
 		if (__vma_enter_locked(vma, true, TASK_UNINTERRUPTIBLE)) {
-			bool detached;
-
+			/*
+			 * Once this is complete, no readers can increment the
+			 * reference count, and the VMA is marked detached.
+			 */
 			__vma_exit_locked(vma, &detached);
 			WARN_ON_ONCE(!detached);
 		}
--
2.52.0
Re: [PATCH RESEND v3 03/10] mm/vma: rename is_vma_write_only(), separate out shared refcount put
Posted by Vlastimil Babka 2 weeks, 3 days ago
On 1/22/26 14:01, Lorenzo Stoakes wrote:
> The is_vma_writer_only() function is misnamed - this isn't determining if
> there is only a write lock, as it checks for the presence of the
> VM_REFCNT_EXCLUDE_READERS_FLAG.
> 
> Really, it is checking to see whether readers are excluded, with a
> possibility of a false positive in the case of a detachment (there we
> expect the vma->vm_refcnt to eventually be set to
> VM_REFCNT_EXCLUDE_READERS_FLAG, whereas for an attached VMA we expect it to
> eventually be set to VM_REFCNT_EXCLUDE_READERS_FLAG + 1).
> 
> Rename the function accordingly.
> 
> Relatedly, we use a finnicky __refcount_dec_and_test() primitive directly
> in vma_refcount_put(), using the old value to determine what the reference
> count ought to be after the operation is complete (ignoring racing
> reference count adjustments).
> 
> Wrap this into a __vma_refcount_put() function, which we can then utilise
> in vma_mark_detached() and thus keep the refcount primitive usage
> abstracted.
> 
> Also adjust comments, removing duplicative comments covered elsewhere and
> adding more to aid understanding.
> 
> No functional change intended.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Again very useful, thanks!

> ---
>  include/linux/mmap_lock.h | 62 +++++++++++++++++++++++++++++++--------
>  mm/mmap_lock.c            | 18 +++++-------
>  2 files changed, 57 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index a764439d0276..0b3614aadbb4 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -122,15 +122,27 @@ static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt)
>  	vma->vm_lock_seq = UINT_MAX;
>  }
> 
> -static inline bool is_vma_writer_only(int refcnt)
> +/**
> + * are_readers_excluded() - Determine whether @refcnt describes a VMA which has
> + * excluded all VMA read locks.
> + * @refcnt: The VMA reference count obtained from vm_area_struct->vm_refcnt.
> + *
> + * We may be raced by other readers temporarily incrementing the reference
> + * count, though the race window is very small, this might cause spurious
> + * wakeups.

I think this part about spurious wakeups belongs more to the usage of the
function in vma_refcount_put()? Because there are no wakeups done here. So
it should be enough to explain how it can be false positive like in the
paragraph below.

> + *
> + * In the case of a detached VMA, we may incorrectly indicate that readers are
> + * excluded when one remains, because in that scenario we target a refcount of
> + * VM_REFCNT_EXCLUDE_READERS_FLAG, rather than the attached target of
> + * VM_REFCNT_EXCLUDE_READERS_FLAG + 1.
> + *
> + * However, the race window for that is very small so it is unlikely.
> + *
> + * Returns: true if readers are excluded, false otherwise.
> + */
> +static inline bool are_readers_excluded(int refcnt)

I wonder if a include/linux/ header should have such a generically named
function (I understand it's necessary for it to be here). Maybe prefix the
name and make the comment not a kerneldoc because it's going to be only the
vma locking implementation using it and not the vma locking end-users? (i.e.
it's "intermediate").

>  {
>  	/*
> -	 * With a writer and no readers, refcnt is VM_REFCNT_EXCLUDE_READERS_FLAG
> -	 * if the vma is detached and (VM_REFCNT_EXCLUDE_READERS_FLAG + 1) if it is
> -	 * attached. Waiting on a detached vma happens only in
> -	 * vma_mark_detached() and is a rare case, therefore most of the time
> -	 * there will be no unnecessary wakeup.
> -	 *
>  	 * See the comment describing the vm_area_struct->vm_refcnt field for
>  	 * details of possible refcnt values.
>  	 */
> @@ -138,18 +150,42 @@ static inline bool is_vma_writer_only(int refcnt)
>  		refcnt <= VM_REFCNT_EXCLUDE_READERS_FLAG + 1;
>  }
> 
> +static inline bool __vma_refcount_put(struct vm_area_struct *vma, int *refcnt)

Basically change are_readers_excluded() like this, with __vma prefix?

But this one could IMHO use use some comment (also not kerneldoc) saying
what the return value and *refcnt indicate?

> +{
> +	int oldcnt;
> +	bool detached;
> +
> +	detached = __refcount_dec_and_test(&vma->vm_refcnt, &oldcnt);
> +	if (refcnt)
> +		*refcnt = oldcnt - 1;
> +	return detached;
> +}
> +
> +/**
> + * vma_refcount_put() - Drop reference count in VMA vm_refcnt field due to a
> + * read-lock being dropped.
> + * @vma: The VMA whose reference count we wish to decrement.
> + *
> + * If we were the last reader, wake up threads waiting to obtain an exclusive
> + * lock.
> + */
>  static inline void vma_refcount_put(struct vm_area_struct *vma)
>  {
> -	/* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
> +	/* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt. */
>  	struct mm_struct *mm = vma->vm_mm;
> -	int oldcnt;
> +	int refcnt;
> +	bool detached;
> 
>  	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> -	if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
> 
> -		if (is_vma_writer_only(oldcnt - 1))
> -			rcuwait_wake_up(&mm->vma_writer_wait);
> -	}
> +	detached = __vma_refcount_put(vma, &refcnt);
> +	/*
> +	 * __vma_enter_locked() may be sleeping waiting for readers to drop
> +	 * their reference count, so wake it up if we were the last reader
> +	 * blocking it from being acquired.
> +	 */
> +	if (!detached && are_readers_excluded(refcnt))
> +		rcuwait_wake_up(&mm->vma_writer_wait);
>  }
> 
>  /*
> diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
> index 75dc098aea14..ebacb57e5f16 100644
> --- a/mm/mmap_lock.c
> +++ b/mm/mmap_lock.c
> @@ -130,25 +130,23 @@ EXPORT_SYMBOL_GPL(__vma_start_write);
> 
>  void vma_mark_detached(struct vm_area_struct *vma)
>  {
> +	bool detached;
> +
>  	vma_assert_write_locked(vma);
>  	vma_assert_attached(vma);
> 
>  	/*
> -	 * We are the only writer, so no need to use vma_refcount_put().
> -	 * The condition below is unlikely because the vma has been already
> -	 * write-locked and readers can increment vm_refcnt only temporarily
> -	 * before they check vm_lock_seq, realize the vma is locked and drop
> -	 * back the vm_refcnt. That is a narrow window for observing a raised
> -	 * vm_refcnt.
> -	 *
>  	 * See the comment describing the vm_area_struct->vm_refcnt field for
>  	 * details of possible refcnt values.
>  	 */
> -	if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
> +	detached = __vma_refcount_put(vma, NULL);
> +	if (unlikely(!detached)) {
>  		/* Wait until vma is detached with no readers. */
>  		if (__vma_enter_locked(vma, true, TASK_UNINTERRUPTIBLE)) {
> -			bool detached;
> -
> +			/*
> +			 * Once this is complete, no readers can increment the
> +			 * reference count, and the VMA is marked detached.
> +			 */
>  			__vma_exit_locked(vma, &detached);
>  			WARN_ON_ONCE(!detached);
>  		}
> --
> 2.52.0
Re: [PATCH RESEND v3 03/10] mm/vma: rename is_vma_write_only(), separate out shared refcount put
Posted by Lorenzo Stoakes 2 weeks, 2 days ago
On Thu, Jan 22, 2026 at 06:36:03PM +0100, Vlastimil Babka wrote:
> On 1/22/26 14:01, Lorenzo Stoakes wrote:
> > The is_vma_writer_only() function is misnamed - this isn't determining if
> > there is only a write lock, as it checks for the presence of the
> > VM_REFCNT_EXCLUDE_READERS_FLAG.
> >
> > Really, it is checking to see whether readers are excluded, with a
> > possibility of a false positive in the case of a detachment (there we
> > expect the vma->vm_refcnt to eventually be set to
> > VM_REFCNT_EXCLUDE_READERS_FLAG, whereas for an attached VMA we expect it to
> > eventually be set to VM_REFCNT_EXCLUDE_READERS_FLAG + 1).
> >
> > Rename the function accordingly.
> >
> > Relatedly, we use a finnicky __refcount_dec_and_test() primitive directly
> > in vma_refcount_put(), using the old value to determine what the reference
> > count ought to be after the operation is complete (ignoring racing
> > reference count adjustments).
> >
> > Wrap this into a __vma_refcount_put() function, which we can then utilise
> > in vma_mark_detached() and thus keep the refcount primitive usage
> > abstracted.
> >
> > Also adjust comments, removing duplicative comments covered elsewhere and
> > adding more to aid understanding.
> >
> > No functional change intended.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Again very useful, thanks!

Thanks.

>
> > ---
> >  include/linux/mmap_lock.h | 62 +++++++++++++++++++++++++++++++--------
> >  mm/mmap_lock.c            | 18 +++++-------
> >  2 files changed, 57 insertions(+), 23 deletions(-)
> >
> > diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> > index a764439d0276..0b3614aadbb4 100644
> > --- a/include/linux/mmap_lock.h
> > +++ b/include/linux/mmap_lock.h
> > @@ -122,15 +122,27 @@ static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt)
> >  	vma->vm_lock_seq = UINT_MAX;
> >  }
> >
> > -static inline bool is_vma_writer_only(int refcnt)
> > +/**
> > + * are_readers_excluded() - Determine whether @refcnt describes a VMA which has
> > + * excluded all VMA read locks.
> > + * @refcnt: The VMA reference count obtained from vm_area_struct->vm_refcnt.
> > + *
> > + * We may be raced by other readers temporarily incrementing the reference
> > + * count, though the race window is very small, this might cause spurious
> > + * wakeups.
>
> I think this part about spurious wakeups belongs more to the usage of the
> function in vma_refcount_put()? Because there are no wakeups done here. So
> it should be enough to explain how it can be false positive like in the
> paragraph below.

OK moved the paragraph to vma_refcount_put().

>
> > + *
> > + * In the case of a detached VMA, we may incorrectly indicate that readers are
> > + * excluded when one remains, because in that scenario we target a refcount of
> > + * VM_REFCNT_EXCLUDE_READERS_FLAG, rather than the attached target of
> > + * VM_REFCNT_EXCLUDE_READERS_FLAG + 1.
> > + *
> > + * However, the race window for that is very small so it is unlikely.
> > + *
> > + * Returns: true if readers are excluded, false otherwise.
> > + */
> > +static inline bool are_readers_excluded(int refcnt)
>
> I wonder if a include/linux/ header should have such a generically named
> function (I understand it's necessary for it to be here). Maybe prefix the
> name and make the comment not a kerneldoc because it's going to be only the
> vma locking implementation using it and not the vma locking end-users? (i.e.
> it's "intermediate").

OK, renamed to __vma_are_readers_excluded() and dropped the kdoc.

>
> >  {
> >  	/*
> > -	 * With a writer and no readers, refcnt is VM_REFCNT_EXCLUDE_READERS_FLAG
> > -	 * if the vma is detached and (VM_REFCNT_EXCLUDE_READERS_FLAG + 1) if it is
> > -	 * attached. Waiting on a detached vma happens only in
> > -	 * vma_mark_detached() and is a rare case, therefore most of the time
> > -	 * there will be no unnecessary wakeup.
> > -	 *
> >  	 * See the comment describing the vm_area_struct->vm_refcnt field for
> >  	 * details of possible refcnt values.
> >  	 */
> > @@ -138,18 +150,42 @@ static inline bool is_vma_writer_only(int refcnt)
> >  		refcnt <= VM_REFCNT_EXCLUDE_READERS_FLAG + 1;
> >  }
> >
> > +static inline bool __vma_refcount_put(struct vm_area_struct *vma, int *refcnt)
>
> Basically change are_readers_excluded() like this, with __vma prefix?

Yup did that already.

>
> But this one could IMHO use use some comment (also not kerneldoc) saying
> what the return value and *refcnt indicate?

I felt doing that would be overdocumenting... I've had people moan about that
before :) but sure makes sense.

Added:

/*
 * Actually decrement the VMA reference count.
 *
 * The functions sets *refcnt to the reference count immediately prior to the
 * decrement if refcnt is not NULL.
 *
 * Returns true if the decrement resulted in the VMA being detached
 * (i.e. reduced it to zero), or false otherwise.
 */

Cheers, Lorenzo
Re: [PATCH RESEND v3 03/10] mm/vma: rename is_vma_write_only(), separate out shared refcount put
Posted by Suren Baghdasaryan 2 weeks, 3 days ago
On Thu, Jan 22, 2026 at 9:36 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/22/26 14:01, Lorenzo Stoakes wrote:
> > The is_vma_writer_only() function is misnamed - this isn't determining if
> > there is only a write lock, as it checks for the presence of the
> > VM_REFCNT_EXCLUDE_READERS_FLAG.
> >
> > Really, it is checking to see whether readers are excluded, with a
> > possibility of a false positive in the case of a detachment (there we
> > expect the vma->vm_refcnt to eventually be set to
> > VM_REFCNT_EXCLUDE_READERS_FLAG, whereas for an attached VMA we expect it to
> > eventually be set to VM_REFCNT_EXCLUDE_READERS_FLAG + 1).
> >
> > Rename the function accordingly.
> >
> > Relatedly, we use a finnicky __refcount_dec_and_test() primitive directly
> > in vma_refcount_put(), using the old value to determine what the reference
> > count ought to be after the operation is complete (ignoring racing
> > reference count adjustments).

Sorry, by mistake I replied to an earlier version here:
https://lore.kernel.org/all/CAJuCfpF-tVr==bCf-PXJFKPn99yRjfONeDnDtPvTkGUfyuvtcw@mail.gmail.com/
Copying my comments here.

IIUC, __refcount_dec_and_test() can decrement the refcount by only 1
and the old value returned (oldcnt) will be the exact value that it
was before this decrement. Therefore oldcnt - 1 must reflect the
refcount value after the decrement. It's possible the refcount gets
manipulated after this operation but that does not make this operation
wrong. I don't quite understand why you think that's racy or finnicky.

> >
> > Wrap this into a __vma_refcount_put() function, which we can then utilise
> > in vma_mark_detached() and thus keep the refcount primitive usage
> > abstracted.
> >
> > Also adjust comments, removing duplicative comments covered elsewhere and
> > adding more to aid understanding.
> >
> > No functional change intended.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Again very useful, thanks!
>
> > ---
> >  include/linux/mmap_lock.h | 62 +++++++++++++++++++++++++++++++--------
> >  mm/mmap_lock.c            | 18 +++++-------
> >  2 files changed, 57 insertions(+), 23 deletions(-)
> >
> > diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> > index a764439d0276..0b3614aadbb4 100644
> > --- a/include/linux/mmap_lock.h
> > +++ b/include/linux/mmap_lock.h
> > @@ -122,15 +122,27 @@ static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt)
> >       vma->vm_lock_seq = UINT_MAX;
> >  }
> >
> > -static inline bool is_vma_writer_only(int refcnt)
> > +/**
> > + * are_readers_excluded() - Determine whether @refcnt describes a VMA which has
> > + * excluded all VMA read locks.
> > + * @refcnt: The VMA reference count obtained from vm_area_struct->vm_refcnt.
> > + *
> > + * We may be raced by other readers temporarily incrementing the reference
> > + * count, though the race window is very small, this might cause spurious
> > + * wakeups.
>
> I think this part about spurious wakeups belongs more to the usage of the
> function in vma_refcount_put()? Because there are no wakeups done here. So
> it should be enough to explain how it can be false positive like in the
> paragraph below.
>
> > + *
> > + * In the case of a detached VMA, we may incorrectly indicate that readers are
> > + * excluded when one remains, because in that scenario we target a refcount of
> > + * VM_REFCNT_EXCLUDE_READERS_FLAG, rather than the attached target of
> > + * VM_REFCNT_EXCLUDE_READERS_FLAG + 1.
> > + *
> > + * However, the race window for that is very small so it is unlikely.
> > + *
> > + * Returns: true if readers are excluded, false otherwise.
> > + */
> > +static inline bool are_readers_excluded(int refcnt)
>
> I wonder if a include/linux/ header should have such a generically named
> function (I understand it's necessary for it to be here). Maybe prefix the
> name and make the comment not a kerneldoc because it's going to be only the
> vma locking implementation using it and not the vma locking end-users? (i.e.
> it's "intermediate").
>
> >  {
> >       /*
> > -      * With a writer and no readers, refcnt is VM_REFCNT_EXCLUDE_READERS_FLAG
> > -      * if the vma is detached and (VM_REFCNT_EXCLUDE_READERS_FLAG + 1) if it is
> > -      * attached. Waiting on a detached vma happens only in
> > -      * vma_mark_detached() and is a rare case, therefore most of the time
> > -      * there will be no unnecessary wakeup.
> > -      *
> >        * See the comment describing the vm_area_struct->vm_refcnt field for
> >        * details of possible refcnt values.
> >        */
> > @@ -138,18 +150,42 @@ static inline bool is_vma_writer_only(int refcnt)
> >               refcnt <= VM_REFCNT_EXCLUDE_READERS_FLAG + 1;
> >  }
> >
> > +static inline bool __vma_refcount_put(struct vm_area_struct *vma, int *refcnt)
>
> Basically change are_readers_excluded() like this, with __vma prefix?
>
> But this one could IMHO use use some comment (also not kerneldoc) saying
> what the return value and *refcnt indicate?
>
> > +{
> > +     int oldcnt;
> > +     bool detached;
> > +
> > +     detached = __refcount_dec_and_test(&vma->vm_refcnt, &oldcnt);
> > +     if (refcnt)
> > +             *refcnt = oldcnt - 1;
> > +     return detached;

IIUC there is always a connection between detached and *refcnt
resulting value. If detached==true then the resulting *refcnt has to
be 0. If so, __vma_refcount_put() can simply return (oldcnt - 1) as
new count:

static inline int __vma_refcount_put(struct vm_area_struct *vma)
{
       int oldcnt;

       __refcount_dec_and_test(&vma->vm_refcnt, &oldcnt);
       return oldcnt - 1;
}

And later:

newcnt = __vma_refcount_put(&vma->vm_refcnt);
detached = newcnt == 0;

> > +}
> > +
> > +/**
> > + * vma_refcount_put() - Drop reference count in VMA vm_refcnt field due to a
> > + * read-lock being dropped.
> > + * @vma: The VMA whose reference count we wish to decrement.
> > + *
> > + * If we were the last reader, wake up threads waiting to obtain an exclusive
> > + * lock.
> > + */
> >  static inline void vma_refcount_put(struct vm_area_struct *vma)
> >  {
> > -     /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
> > +     /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt. */
> >       struct mm_struct *mm = vma->vm_mm;
> > -     int oldcnt;
> > +     int refcnt;
> > +     bool detached;
> >
> >       rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > -     if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
> >
> > -             if (is_vma_writer_only(oldcnt - 1))
> > -                     rcuwait_wake_up(&mm->vma_writer_wait);
> > -     }
> > +     detached = __vma_refcount_put(vma, &refcnt);
> > +     /*
> > +      * __vma_enter_locked() may be sleeping waiting for readers to drop
> > +      * their reference count, so wake it up if we were the last reader
> > +      * blocking it from being acquired.
> > +      */
> > +     if (!detached && are_readers_excluded(refcnt))
> > +             rcuwait_wake_up(&mm->vma_writer_wait);
> >  }
> >
> >  /*
> > diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
> > index 75dc098aea14..ebacb57e5f16 100644
> > --- a/mm/mmap_lock.c
> > +++ b/mm/mmap_lock.c
> > @@ -130,25 +130,23 @@ EXPORT_SYMBOL_GPL(__vma_start_write);
> >
> >  void vma_mark_detached(struct vm_area_struct *vma)
> >  {
> > +     bool detached;
> > +
> >       vma_assert_write_locked(vma);
> >       vma_assert_attached(vma);
> >
> >       /*
> > -      * We are the only writer, so no need to use vma_refcount_put().
> > -      * The condition below is unlikely because the vma has been already
> > -      * write-locked and readers can increment vm_refcnt only temporarily

I think the above part of the comment is still important and should be
kept intact.

> > -      * before they check vm_lock_seq, realize the vma is locked and drop
> > -      * back the vm_refcnt. That is a narrow window for observing a raised
> > -      * vm_refcnt.
> > -      *
> >        * See the comment describing the vm_area_struct->vm_refcnt field for
> >        * details of possible refcnt values.
> >        */
> > -     if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
> > +     detached = __vma_refcount_put(vma, NULL);
> > +     if (unlikely(!detached)) {
> >               /* Wait until vma is detached with no readers. */
> >               if (__vma_enter_locked(vma, true, TASK_UNINTERRUPTIBLE)) {
> > -                     bool detached;
> > -
> > +                     /*
> > +                      * Once this is complete, no readers can increment the
> > +                      * reference count, and the VMA is marked detached.
> > +                      */
> >                       __vma_exit_locked(vma, &detached);
> >                       WARN_ON_ONCE(!detached);
> >               }
> > --
> > 2.52.0
>
Re: [PATCH RESEND v3 03/10] mm/vma: rename is_vma_write_only(), separate out shared refcount put
Posted by Lorenzo Stoakes 2 weeks, 2 days ago
On Thu, Jan 22, 2026 at 11:31:05AM -0800, Suren Baghdasaryan wrote:
> On Thu, Jan 22, 2026 at 9:36 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 1/22/26 14:01, Lorenzo Stoakes wrote:
> > > The is_vma_writer_only() function is misnamed - this isn't determining if
> > > there is only a write lock, as it checks for the presence of the
> > > VM_REFCNT_EXCLUDE_READERS_FLAG.
> > >
> > > Really, it is checking to see whether readers are excluded, with a
> > > possibility of a false positive in the case of a detachment (there we
> > > expect the vma->vm_refcnt to eventually be set to
> > > VM_REFCNT_EXCLUDE_READERS_FLAG, whereas for an attached VMA we expect it to
> > > eventually be set to VM_REFCNT_EXCLUDE_READERS_FLAG + 1).
> > >
> > > Rename the function accordingly.
> > >
> > > Relatedly, we use a finnicky __refcount_dec_and_test() primitive directly
> > > in vma_refcount_put(), using the old value to determine what the reference
> > > count ought to be after the operation is complete (ignoring racing
> > > reference count adjustments).
>
> Sorry, by mistake I replied to an earlier version here:
> https://lore.kernel.org/all/CAJuCfpF-tVr==bCf-PXJFKPn99yRjfONeDnDtPvTkGUfyuvtcw@mail.gmail.com/
> Copying my comments here.
>
> IIUC, __refcount_dec_and_test() can decrement the refcount by only 1
> and the old value returned (oldcnt) will be the exact value that it
> was before this decrement. Therefore oldcnt - 1 must reflect the

Yes.

> refcount value after the decrement. It's possible the refcount gets

Well no...

> manipulated after this operation but that does not make this operation
> wrong. I don't quite understand why you think that's racy or finnicky.

...because of this.

I only mentioned in passing that you might get raced, which you agree with
so I think that's fine.

_I_ feel this function is _very_ finnicky given the various caveats
required to understand exactly what's happening here. But clearly that's
distracting, so I'll rephrase it.

Have updated commit msg to say:

"
Relatedly, we use a __refcount_dec_and_test() primitive directly in
vma_refcount_put(), using the old value to determine what the reference
count ought to be after the operation is complete (ignoring racing
reference count adjustments).

Wrap this into a __vma_refcount_put() function, which we can then utilise
in vma_mark_detached() and thus keep the refcount primitive usage
abstracted.

This reduces duplication in the two invocations of this function.
"

(Adding the point about duplication since it wasn't clear before).

> > > +{
> > > +     int oldcnt;
> > > +     bool detached;
> > > +
> > > +     detached = __refcount_dec_and_test(&vma->vm_refcnt, &oldcnt);
> > > +     if (refcnt)
> > > +             *refcnt = oldcnt - 1;
> > > +     return detached;
>
> IIUC there is always a connection between detached and *refcnt
> resulting value. If detached==true then the resulting *refcnt has to
> be 0. If so, __vma_refcount_put() can simply return (oldcnt - 1) as
> new count:
>
> static inline int __vma_refcount_put(struct vm_area_struct *vma)
> {
>        int oldcnt;
>
>        __refcount_dec_and_test(&vma->vm_refcnt, &oldcnt);

You can't do this as it's __must_check... :)

So have to replace with __refcount_dec(), which is a void function.

>        return oldcnt - 1;
> }
>
> And later:
>
> newcnt = __vma_refcount_put(&vma->vm_refcnt);
> detached = newcnt == 0;

This is kind of horrible though.

Maybe better to just do:

	newcnt = __vma_refcount_put(vma);
	...
	if (newcnt && __vma_are_readers_excluded(newcnt))
		...

And:

	if (unlikely(__vma_refcount_put(vma))) {
		...
	}

And now we have vma->vm_refcnt documented clearly we can safely assume
developers understand what this means.

>
> > > +}
> > > +
> > > +/**
> > > + * vma_refcount_put() - Drop reference count in VMA vm_refcnt field due to a
> > > + * read-lock being dropped.
> > > + * @vma: The VMA whose reference count we wish to decrement.
> > > + *
> > > + * If we were the last reader, wake up threads waiting to obtain an exclusive
> > > + * lock.
> > > + */
> > >  static inline void vma_refcount_put(struct vm_area_struct *vma)
> > >  {
> > > -     /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
> > > +     /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt. */
> > >       struct mm_struct *mm = vma->vm_mm;
> > > -     int oldcnt;
> > > +     int refcnt;
> > > +     bool detached;
> > >
> > >       rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > > -     if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
> > >
> > > -             if (is_vma_writer_only(oldcnt - 1))
> > > -                     rcuwait_wake_up(&mm->vma_writer_wait);
> > > -     }
> > > +     detached = __vma_refcount_put(vma, &refcnt);
> > > +     /*
> > > +      * __vma_enter_locked() may be sleeping waiting for readers to drop
> > > +      * their reference count, so wake it up if we were the last reader
> > > +      * blocking it from being acquired.
> > > +      */
> > > +     if (!detached && are_readers_excluded(refcnt))
> > > +             rcuwait_wake_up(&mm->vma_writer_wait);
> > >  }
> > >
> > >  /*
> > > diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
> > > index 75dc098aea14..ebacb57e5f16 100644
> > > --- a/mm/mmap_lock.c
> > > +++ b/mm/mmap_lock.c
> > > @@ -130,25 +130,23 @@ EXPORT_SYMBOL_GPL(__vma_start_write);
> > >
> > >  void vma_mark_detached(struct vm_area_struct *vma)
> > >  {
> > > +     bool detached;
> > > +
> > >       vma_assert_write_locked(vma);
> > >       vma_assert_attached(vma);
> > >
> > >       /*
> > > -      * We are the only writer, so no need to use vma_refcount_put().
> > > -      * The condition below is unlikely because the vma has been already
> > > -      * write-locked and readers can increment vm_refcnt only temporarily
>
> I think the above part of the comment is still important and should be
> kept intact.

I think it's confusing, because 'we are the only writer' is not clear as to
why it obviates the need to call vma_refcount_put().

In fact vma_refcount_put() explicitly invokes a lockdep read lock drop
which would be incorrect to invoke here. Also I guess the other point here
is we don't do wake ups because nobody else will be waiting.

I think updating this to actually explain the context would be hugely
distracting and not all that useful, but leaving it as-is is confusing.

So I think the best thing to do is to add back the thing about the
condition, which I've done but but placed it above unlikely(!detached) as:

	/*
	 * This condition - that the VMA is still attached (refcnt > 0) - is
	 * unlikely, because the vma has been already write-locked and readers
	 * can increment vm_refcnt only temporarily before they check
	 * vm_lock_seq, realize the vma is locked and drop back the
	 * vm_refcnt. That is a narrow window for observing a raised vm_refcnt.
	 *
	 * See the comment describing the vm_area_struct->vm_refcnt field for
	 * details of possible refcnt values.
	 */

>
> > > -      * before they check vm_lock_seq, realize the vma is locked and drop
> > > -      * back the vm_refcnt. That is a narrow window for observing a raised
> > > -      * vm_refcnt.
> > > -      *
> > >        * See the comment describing the vm_area_struct->vm_refcnt field for
> > >        * details of possible refcnt values.
> > >        */
> > > -     if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
> > > +     detached = __vma_refcount_put(vma, NULL);
> > > +     if (unlikely(!detached)) {
> > >               /* Wait until vma is detached with no readers. */
> > >               if (__vma_enter_locked(vma, true, TASK_UNINTERRUPTIBLE)) {
> > > -                     bool detached;
> > > -
> > > +                     /*
> > > +                      * Once this is complete, no readers can increment the
> > > +                      * reference count, and the VMA is marked detached.
> > > +                      */
> > >                       __vma_exit_locked(vma, &detached);
> > >                       WARN_ON_ONCE(!detached);
> > >               }
> > > --
> > > 2.52.0
> >
Re: [PATCH RESEND v3 03/10] mm/vma: rename is_vma_write_only(), separate out shared refcount put
Posted by Lorenzo Stoakes 1 week, 6 days ago
On Fri, Jan 23, 2026 at 02:41:42PM +0000, Lorenzo Stoakes wrote:
> > > > +{
> > > > +     int oldcnt;
> > > > +     bool detached;
> > > > +
> > > > +     detached = __refcount_dec_and_test(&vma->vm_refcnt, &oldcnt);
> > > > +     if (refcnt)
> > > > +             *refcnt = oldcnt - 1;
> > > > +     return detached;
> >
> > IIUC there is always a connection between detached and *refcnt
> > resulting value. If detached==true then the resulting *refcnt has to
> > be 0. If so, __vma_refcount_put() can simply return (oldcnt - 1) as
> > new count:
> >
> > static inline int __vma_refcount_put(struct vm_area_struct *vma)
> > {
> >        int oldcnt;
> >
> >        __refcount_dec_and_test(&vma->vm_refcnt, &oldcnt);
>
> You can't do this as it's __must_check... :)
>
> So have to replace with __refcount_dec(), which is a void function.
>
> >        return oldcnt - 1;

Actually this doesn't work as __refcount_dec() won't let you decrement to zero
and will flag a saturated error if you do.

In the end the code looks like this:

static inline __must_check unsigned int
__vma_refcount_put_return(struct vm_area_struct *vma)
{
	int oldcnt;

	if (__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt))
		return 0;

	return oldcnt - 1;
}

Which combines the __must_check, abstraction of oldcnt - 1 and xxx_return()
naming requested on review.

Cheers, Lorenzo
Re: [PATCH RESEND v3 03/10] mm/vma: rename is_vma_write_only(), separate out shared refcount put
Posted by Vlastimil Babka 2 weeks, 2 days ago
On 1/22/26 20:31, Suren Baghdasaryan wrote:
>> > +     int oldcnt;
>> > +     bool detached;
>> > +
>> > +     detached = __refcount_dec_and_test(&vma->vm_refcnt, &oldcnt);
>> > +     if (refcnt)
>> > +             *refcnt = oldcnt - 1;
>> > +     return detached;
> 
> IIUC there is always a connection between detached and *refcnt
> resulting value. If detached==true then the resulting *refcnt has to
> be 0. If so, __vma_refcount_put() can simply return (oldcnt - 1) as
> new count:
> 
> static inline int __vma_refcount_put(struct vm_area_struct *vma)
> {
>        int oldcnt;
> 
>        __refcount_dec_and_test(&vma->vm_refcnt, &oldcnt);
>        return oldcnt - 1;
> }
> 
> And later:
> 
> newcnt = __vma_refcount_put(&vma->vm_refcnt);
> detached = newcnt == 0;

If we go that way (both ways are fine with me) I'd suggest we rename the
function to __vma_refcount_put_return to make this more obvious. (c.f.
atomic_dec_return, lockref_put_return).
Re: [PATCH RESEND v3 03/10] mm/vma: rename is_vma_write_only(), separate out shared refcount put
Posted by Lorenzo Stoakes 2 weeks, 2 days ago
On Fri, Jan 23, 2026 at 09:24:54AM +0100, Vlastimil Babka wrote:
> On 1/22/26 20:31, Suren Baghdasaryan wrote:
> >> > +     int oldcnt;
> >> > +     bool detached;
> >> > +
> >> > +     detached = __refcount_dec_and_test(&vma->vm_refcnt, &oldcnt);
> >> > +     if (refcnt)
> >> > +             *refcnt = oldcnt - 1;
> >> > +     return detached;
> >
> > IIUC there is always a connection between detached and *refcnt
> > resulting value. If detached==true then the resulting *refcnt has to
> > be 0. If so, __vma_refcount_put() can simply return (oldcnt - 1) as
> > new count:
> >
> > static inline int __vma_refcount_put(struct vm_area_struct *vma)
> > {
> >        int oldcnt;
> >
> >        __refcount_dec_and_test(&vma->vm_refcnt, &oldcnt);
> >        return oldcnt - 1;
> > }
> >
> > And later:
> >
> > newcnt = __vma_refcount_put(&vma->vm_refcnt);
> > detached = newcnt == 0;
>
> If we go that way (both ways are fine with me) I'd suggest we rename the
> function to __vma_refcount_put_return to make this more obvious. (c.f.
> atomic_dec_return, lockref_put_return).
>

That's kind of horrible?

The lockref_put_return() seems to encode even more in it:

/**
 * lockref_put_return - Decrement reference count if possible
 * @lockref: pointer to lockref structure
 *
 * Decrement the reference count and return the new value.
 * If the lockref was dead or locked, return -1.
 */

But I guess it's still returning, it's just a weird convention, and not one
refcount uses, but perhaps because that uses output parameters.

I'll rename it I guess on the atomic basis but I just find the idea of
suffixing 'return' on a function that returns a value really... horrible.

Thanks, Lorenzo
Re: [PATCH RESEND v3 03/10] mm/vma: rename is_vma_write_only(), separate out shared refcount put
Posted by Vlastimil Babka 2 weeks, 2 days ago
On 1/23/26 15:52, Lorenzo Stoakes wrote:
> On Fri, Jan 23, 2026 at 09:24:54AM +0100, Vlastimil Babka wrote:
> 
> That's kind of horrible?
> 
> The lockref_put_return() seems to encode even more in it:
> 
> /**
>  * lockref_put_return - Decrement reference count if possible
>  * @lockref: pointer to lockref structure
>  *
>  * Decrement the reference count and return the new value.
>  * If the lockref was dead or locked, return -1.
>  */
> 
> But I guess it's still returning, it's just a weird convention, and not one
> refcount uses, but perhaps because that uses output parameters.

Oops I missed that -1 detail in that one, didn't mean to copy that part.

> I'll rename it I guess on the atomic basis but I just find the idea of
> suffixing 'return' on a function that returns a value really... horrible.

It's because I though it's common that things called _put() either return
nothing, or if they return 1/true it means "that was the last ref, we
removed" and this is returning something else.

But I admit it's just my feeling, there's e.g. kref_put() like this but I
haven't done a full research.

> Thanks, Lorenzo
Re: [PATCH RESEND v3 03/10] mm/vma: rename is_vma_write_only(), separate out shared refcount put
Posted by Lorenzo Stoakes 2 weeks, 2 days ago
On Fri, Jan 23, 2026 at 04:05:39PM +0100, Vlastimil Babka wrote:
> On 1/23/26 15:52, Lorenzo Stoakes wrote:
> > On Fri, Jan 23, 2026 at 09:24:54AM +0100, Vlastimil Babka wrote:
> >
> > That's kind of horrible?
> >
> > The lockref_put_return() seems to encode even more in it:
> >
> > /**
> >  * lockref_put_return - Decrement reference count if possible
> >  * @lockref: pointer to lockref structure
> >  *
> >  * Decrement the reference count and return the new value.
> >  * If the lockref was dead or locked, return -1.
> >  */
> >
> > But I guess it's still returning, it's just a weird convention, and not one
> > refcount uses, but perhaps because that uses output parameters.
>
> Oops I missed that -1 detail in that one, didn't mean to copy that part.
>
> > I'll rename it I guess on the atomic basis but I just find the idea of
> > suffixing 'return' on a function that returns a value really... horrible.
>
> It's because I though it's common that things called _put() either return
> nothing, or if they return 1/true it means "that was the last ref, we
> removed" and this is returning something else.
>
> But I admit it's just my feeling, there's e.g. kref_put() like this but I
> haven't done a full research.
>
> > Thanks, Lorenzo
>

Yup, on balance I thought probably best to add so did so.