[PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count

Posted by Suren Baghdasaryan 1 year ago
rw_semaphore is a sizable structure of 40 bytes and consumes
considerable space for each vm_area_struct. However vma_lock has
two important specifics which can be used to replace rw_semaphore
with a simpler structure:

1. Readers never wait. They try to take the vma_lock and fall back to
   mmap_lock if that fails.
2. Only one writer at a time will ever try to write-lock a vma_lock
   because writers first take mmap_lock in write mode.

Because of these requirements, full rw_semaphore functionality is not
needed and we can replace rw_semaphore and the vma->detached flag with
a refcount (vm_refcnt).

When vma is in detached state, vm_refcnt is 0 and only a call to
vma_mark_attached() can take it out of this state. Note that unlike
before, now we enforce both vma_mark_attached() and vma_mark_detached()
to be done only after vma has been write-locked. vma_mark_attached()
changes vm_refcnt to 1 to indicate that it has been attached to the vma
tree. When a reader takes read lock, it increments vm_refcnt, unless the
top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
a writer. When writer takes write lock, it sets the top usable bit to
indicate its presence. If there are readers, writer will wait using newly
introduced mm->vma_writer_wait. Since all writers take mmap_lock in write
mode first, there can be only one writer at a time. The last reader to
release the lock will signal the writer to wake up.

refcount might overflow if there are many competing readers, in which case
read-locking will fail. Readers are expected to handle such failures.

In summary:
1. all readers increment the vm_refcnt;
2. writer sets top usable (writer) bit of vm_refcnt;
3. readers cannot increment the vm_refcnt if the writer bit is set;
4. in the presence of readers, writer must wait for the vm_refcnt to drop
   to 1 (ignoring the writer bit), indicating an attached vma with no readers;
5. vm_refcnt overflow is handled by the readers.

While this vm_lock replacement does not yet result in a smaller
vm_area_struct (it stays at 256 bytes due to cacheline alignment), it
allows for further size optimization by structure member regrouping
to bring the size of vm_area_struct below 192 bytes.
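
For illustration only, below is a simplified pseudo-C sketch of the scheme
described above. It is not the code added by this patch: it uses a bare
atomic_t instead of refcount_t and omits the reader overflow limit, the
detached-vma handling and unlock path on the writer side, lockdep
annotations and the exact memory-ordering details.

#include <linux/atomic.h>
#include <linux/rcuwait.h>
#include <linux/sched.h>

#define VMA_LOCK_OFFSET	0x40000000	/* "writer present" bit */

struct vma_lock_sketch {
	atomic_t cnt;			/* 0 = detached, 1 = attached, >1 = readers */
	struct rcuwait *writer_wait;	/* stands in for mm->vma_writer_wait */
};

/* Reader: never waits; on failure the caller falls back to mmap_lock. */
static bool read_trylock_sketch(struct vma_lock_sketch *vl)
{
	int old = atomic_read(&vl->cnt);

	do {
		/* detached, or a writer has announced itself */
		if (old == 0 || old >= VMA_LOCK_OFFSET)
			return false;
	} while (!atomic_try_cmpxchg(&vl->cnt, &old, old + 1));

	return true;
}

/* Writer: callers hold mmap_lock in write mode, so at most one at a time. */
static void write_lock_sketch(struct vma_lock_sketch *vl)
{
	/* announce the writer; this stops new readers from incrementing */
	atomic_add(VMA_LOCK_OFFSET, &vl->cnt);
	/* wait for an attached vma with the writer bit set and no readers */
	rcuwait_wait_event(vl->writer_wait,
			   atomic_read(&vl->cnt) == VMA_LOCK_OFFSET + 1,
			   TASK_UNINTERRUPTIBLE);
}

/* Reader unlock: the last reader wakes up a waiting writer. */
static void read_unlock_sketch(struct vma_lock_sketch *vl)
{
	int new = atomic_sub_return(1, &vl->cnt);

	if (new == VMA_LOCK_OFFSET || new == VMA_LOCK_OFFSET + 1)
		rcuwait_wake_up(vl->writer_wait);
}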

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm.h               | 102 +++++++++++++++++++++----------
 include/linux/mm_types.h         |  22 +++----
 kernel/fork.c                    |  13 ++--
 mm/init-mm.c                     |   1 +
 mm/memory.c                      |  80 +++++++++++++++++++++---
 tools/testing/vma/linux/atomic.h |   5 ++
 tools/testing/vma/vma_internal.h |  66 +++++++++++---------
 7 files changed, 198 insertions(+), 91 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3432756d95e6..a99b11ee1f66 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -32,6 +32,7 @@
 #include <linux/memremap.h>
 #include <linux/slab.h>
 #include <linux/cacheinfo.h>
+#include <linux/rcuwait.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -697,12 +698,43 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_PER_VMA_LOCK
-static inline void vma_lock_init(struct vm_area_struct *vma)
+static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt)
 {
-	init_rwsem(&vma->vm_lock.lock);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	static struct lock_class_key lockdep_key;
+
+	lockdep_init_map(&vma->vmlock_dep_map, "vm_lock", &lockdep_key, 0);
+#endif
+	if (reset_refcnt)
+		refcount_set(&vma->vm_refcnt, 0);
 	vma->vm_lock_seq = UINT_MAX;
 }
 
+static inline bool is_vma_writer_only(int refcnt)
+{
+	/*
+	 * With a writer and no readers, refcnt is VMA_LOCK_OFFSET if the vma
+	 * is detached and (VMA_LOCK_OFFSET + 1) if it is attached. Waiting on
+	 * a detached vma happens only in vma_mark_detached() and is a rare
+	 * case, therefore most of the time there will be no unnecessary wakeup.
+	 */
+	return refcnt & VMA_LOCK_OFFSET && refcnt <= VMA_LOCK_OFFSET + 1;
+}
+
+static inline void vma_refcount_put(struct vm_area_struct *vma)
+{
+	/* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
+	struct mm_struct *mm = vma->vm_mm;
+	int oldcnt;
+
+	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
+	if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
+
+		if (is_vma_writer_only(oldcnt - 1))
+			rcuwait_wake_up(&mm->vma_writer_wait);
+	}
+}
+
 /*
  * Try to read-lock a vma. The function is allowed to occasionally yield false
  * locked result to avoid performance overhead, in which case we fall back to
@@ -710,6 +742,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
  */
 static inline bool vma_start_read(struct vm_area_struct *vma)
 {
+	int oldcnt;
+
 	/*
 	 * Check before locking. A race might cause false locked result.
 	 * We can use READ_ONCE() for the mm_lock_seq here, and don't need
@@ -720,13 +754,19 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
 		return false;
 
-	if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
+	/*
+	 * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited() will fail
+	 * because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
+	 */
+	if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
+						      VMA_REF_LIMIT)))
 		return false;
 
+	rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
 	/*
-	 * Overflow might produce false locked result.
+	 * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
 	 * False unlocked result is impossible because we modify and check
-	 * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
+	 * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
 	 * modification invalidates all existing locks.
 	 *
 	 * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
@@ -735,9 +775,10 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * This pairs with RELEASE semantics in vma_end_write_all().
 	 */
 	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
-		up_read(&vma->vm_lock.lock);
+		vma_refcount_put(vma);
 		return false;
 	}
+
 	return true;
 }
 
@@ -749,8 +790,14 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
  */
 static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
 {
+	int oldcnt;
+
 	mmap_assert_locked(vma->vm_mm);
-	down_read_nested(&vma->vm_lock.lock, subclass);
+	if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
+						      VMA_REF_LIMIT)))
+		return false;
+
+	rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
 	return true;
 }
 
@@ -762,16 +809,12 @@ static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int
  */
 static inline bool vma_start_read_locked(struct vm_area_struct *vma)
 {
-	mmap_assert_locked(vma->vm_mm);
-	down_read(&vma->vm_lock.lock);
-	return true;
+	return vma_start_read_locked_nested(vma, 0);
 }
 
 static inline void vma_end_read(struct vm_area_struct *vma)
 {
-	rcu_read_lock(); /* keeps vma alive till the end of up_read */
-	up_read(&vma->vm_lock.lock);
-	rcu_read_unlock();
+	vma_refcount_put(vma);
 }
 
 /* WARNING! Can only be used if mmap_lock is expected to be write-locked */
@@ -813,36 +856,33 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 
 static inline void vma_assert_locked(struct vm_area_struct *vma)
 {
-	if (!rwsem_is_locked(&vma->vm_lock.lock))
+	if (refcount_read(&vma->vm_refcnt) <= 1)
 		vma_assert_write_locked(vma);
 }
 
+/*
+ * WARNING: to avoid racing with vma_mark_attached()/vma_mark_detached(), these
+ * assertions should be made either under mmap_write_lock or when the object
+ * has been isolated under mmap_write_lock, ensuring no competing writers.
+ */
 static inline void vma_assert_attached(struct vm_area_struct *vma)
 {
-	VM_BUG_ON_VMA(vma->detached, vma);
+	VM_BUG_ON_VMA(!refcount_read(&vma->vm_refcnt), vma);
 }
 
 static inline void vma_assert_detached(struct vm_area_struct *vma)
 {
-	VM_BUG_ON_VMA(!vma->detached, vma);
+	VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt), vma);
 }
 
 static inline void vma_mark_attached(struct vm_area_struct *vma)
 {
-	vma->detached = false;
-}
-
-static inline void vma_mark_detached(struct vm_area_struct *vma)
-{
-	/* When detaching vma should be write-locked */
 	vma_assert_write_locked(vma);
-	vma->detached = true;
+	vma_assert_detached(vma);
+	refcount_set(&vma->vm_refcnt, 1);
 }
 
-static inline bool is_vma_detached(struct vm_area_struct *vma)
-{
-	return vma->detached;
-}
+void vma_mark_detached(struct vm_area_struct *vma);
 
 static inline void release_fault_lock(struct vm_fault *vmf)
 {
@@ -865,7 +905,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 
 #else /* CONFIG_PER_VMA_LOCK */
 
-static inline void vma_lock_init(struct vm_area_struct *vma) {}
+static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt) {}
 static inline bool vma_start_read(struct vm_area_struct *vma)
 		{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
@@ -908,12 +948,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &vma_dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
-#ifdef CONFIG_PER_VMA_LOCK
-	/* vma is not locked, can't use vma_mark_detached() */
-	vma->detached = true;
-#endif
 	vma_numab_state_init(vma);
-	vma_lock_init(vma);
+	vma_lock_init(vma, false);
 }
 
 /* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6573d95f1d1e..9228d19662c6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -19,6 +19,7 @@
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
 #include <linux/percpu_counter.h>
+#include <linux/types.h>
 
 #include <asm/mmu.h>
 
@@ -629,9 +630,8 @@ static inline struct anon_vma_name *anon_vma_name_alloc(const char *name)
 }
 #endif
 
-struct vma_lock {
-	struct rw_semaphore lock;
-};
+#define VMA_LOCK_OFFSET	0x40000000
+#define VMA_REF_LIMIT	(VMA_LOCK_OFFSET - 1)
 
 struct vma_numab_state {
 	/*
@@ -709,19 +709,13 @@ struct vm_area_struct {
 	};
 
 #ifdef CONFIG_PER_VMA_LOCK
-	/*
-	 * Flag to indicate areas detached from the mm->mm_mt tree.
-	 * Unstable RCU readers are allowed to read this.
-	 */
-	bool detached;
-
 	/*
 	 * Can only be written (using WRITE_ONCE()) while holding both:
 	 *  - mmap_lock (in write mode)
-	 *  - vm_lock->lock (in write mode)
+	 *  - vm_refcnt bit at VMA_LOCK_OFFSET is set
 	 * Can be read reliably while holding one of:
 	 *  - mmap_lock (in read or write mode)
-	 *  - vm_lock->lock (in read or write mode)
+	 *  - vm_refcnt bit at VMA_LOCK_OFFSET is set or vm_refcnt > 1
 	 * Can be read unreliably (using READ_ONCE()) for pessimistic bailout
 	 * while holding nothing (except RCU to keep the VMA struct allocated).
 	 *
@@ -784,7 +778,10 @@ struct vm_area_struct {
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 #ifdef CONFIG_PER_VMA_LOCK
 	/* Unstable RCU readers are allowed to read this. */
-	struct vma_lock vm_lock ____cacheline_aligned_in_smp;
+	refcount_t vm_refcnt ____cacheline_aligned_in_smp;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map vmlock_dep_map;
+#endif
 #endif
 } __randomize_layout;
 
@@ -919,6 +916,7 @@ struct mm_struct {
 					  * by mmlist_lock
 					  */
 #ifdef CONFIG_PER_VMA_LOCK
+		struct rcuwait vma_writer_wait;
 		/*
 		 * This field has lock-like semantics, meaning it is sometimes
 		 * accessed with ACQUIRE/RELEASE semantics.
diff --git a/kernel/fork.c b/kernel/fork.c
index d4c75428ccaf..9d9275783cf8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -463,12 +463,8 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	 * will be reinitialized.
 	 */
 	data_race(memcpy(new, orig, sizeof(*new)));
-	vma_lock_init(new);
+	vma_lock_init(new, true);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
-#ifdef CONFIG_PER_VMA_LOCK
-	/* vma is not locked, can't use vma_mark_detached() */
-	new->detached = true;
-#endif
 	vma_numab_state_init(new);
 	dup_anon_vma_name(orig, new);
 
@@ -477,6 +473,8 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 
 void __vm_area_free(struct vm_area_struct *vma)
 {
+	/* The vma should be detached while being destroyed. */
+	vma_assert_detached(vma);
 	vma_numab_state_free(vma);
 	free_anon_vma_name(vma);
 	kmem_cache_free(vm_area_cachep, vma);
@@ -488,8 +486,6 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
 	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
 						  vm_rcu);
 
-	/* The vma should not be locked while being destroyed. */
-	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
 	__vm_area_free(vma);
 }
 #endif
@@ -1223,6 +1219,9 @@ static inline void mmap_init_lock(struct mm_struct *mm)
 {
 	init_rwsem(&mm->mmap_lock);
 	mm_lock_seqcount_init(mm);
+#ifdef CONFIG_PER_VMA_LOCK
+	rcuwait_init(&mm->vma_writer_wait);
+#endif
 }
 
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 6af3ad675930..4600e7605cab 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -40,6 +40,7 @@ struct mm_struct init_mm = {
 	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 #ifdef CONFIG_PER_VMA_LOCK
+	.vma_writer_wait = __RCUWAIT_INITIALIZER(init_mm.vma_writer_wait),
 	.mm_lock_seq	= SEQCNT_ZERO(init_mm.mm_lock_seq),
 #endif
 	.user_ns	= &init_user_ns,
diff --git a/mm/memory.c b/mm/memory.c
index 236fdecd44d6..dc16b67beefa 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6328,9 +6328,47 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
 #endif
 
 #ifdef CONFIG_PER_VMA_LOCK
+static inline bool __vma_enter_locked(struct vm_area_struct *vma, bool detaching)
+{
+	unsigned int tgt_refcnt = VMA_LOCK_OFFSET;
+
+	/* Additional refcnt if the vma is attached. */
+	if (!detaching)
+		tgt_refcnt++;
+
+	/*
+	 * If vma is detached then only vma_mark_attached() can raise the
+	 * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
+	 */
+	if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
+		return false;
+
+	rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
+	rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
+		   refcount_read(&vma->vm_refcnt) == tgt_refcnt,
+		   TASK_UNINTERRUPTIBLE);
+	lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
+
+	return true;
+}
+
+static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
+{
+	*detached = refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt);
+	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
+}
+
 void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
 {
-	down_write(&vma->vm_lock.lock);
+	bool locked;
+
+	/*
+	 * __vma_enter_locked() returns false immediately if the vma is not
+	 * attached, otherwise it waits until refcnt is indicating that vma
+	 * is attached with no readers.
+	 */
+	locked = __vma_enter_locked(vma, false);
+
 	/*
 	 * We should use WRITE_ONCE() here because we can have concurrent reads
 	 * from the early lockless pessimistic check in vma_start_read().
@@ -6338,10 +6376,40 @@ void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
 	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
 	 */
 	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
-	up_write(&vma->vm_lock.lock);
+
+	if (locked) {
+		bool detached;
+
+		__vma_exit_locked(vma, &detached);
+		VM_BUG_ON_VMA(detached, vma); /* vma should remain attached */
+	}
 }
 EXPORT_SYMBOL_GPL(__vma_start_write);
 
+void vma_mark_detached(struct vm_area_struct *vma)
+{
+	vma_assert_write_locked(vma);
+	vma_assert_attached(vma);
+
+	/*
+	 * We are the only writer, so no need to use vma_refcount_put().
+	 * The condition below is unlikely because the vma has been already
+	 * write-locked and readers can increment vm_refcnt only temporarily
+	 * before they check vm_lock_seq, realize the vma is locked and drop
+	 * back the vm_refcnt. That is a narrow window for observing a raised
+	 * vm_refcnt.
+	 */
+	if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
+		/* Wait until vma is detached with no readers. */
+		if (__vma_enter_locked(vma, true)) {
+			bool detached;
+
+			__vma_exit_locked(vma, &detached);
+			VM_BUG_ON_VMA(!detached, vma);
+		}
+	}
+}
+
 /*
  * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
  * stable and not isolated. If the VMA is not found or is being modified the
@@ -6354,7 +6422,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	struct vm_area_struct *vma;
 
 	rcu_read_lock();
-retry:
 	vma = mas_walk(&mas);
 	if (!vma)
 		goto inval;
@@ -6362,13 +6429,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	if (!vma_start_read(vma))
 		goto inval;
 
-	/* Check if the VMA got isolated after we found it */
-	if (is_vma_detached(vma)) {
-		vma_end_read(vma);
-		count_vm_vma_lock_event(VMA_LOCK_MISS);
-		/* The area was replaced with another one */
-		goto retry;
-	}
 	/*
 	 * At this point, we have a stable reference to a VMA: The VMA is
 	 * locked and we know it hasn't already been isolated.
diff --git a/tools/testing/vma/linux/atomic.h b/tools/testing/vma/linux/atomic.h
index 3e1b6adc027b..788c597c4fde 100644
--- a/tools/testing/vma/linux/atomic.h
+++ b/tools/testing/vma/linux/atomic.h
@@ -9,4 +9,9 @@
 #define atomic_set(x, y) uatomic_set(x, y)
 #define U8_MAX UCHAR_MAX
 
+#ifndef atomic_cmpxchg_relaxed
+#define  atomic_cmpxchg_relaxed		uatomic_cmpxchg
+#define  atomic_cmpxchg_release         uatomic_cmpxchg
+#endif /* atomic_cmpxchg_relaxed */
+
 #endif	/* _LINUX_ATOMIC_H */
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 47c8b03ffbbd..2ce032943861 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -25,7 +25,7 @@
 #include <linux/maple_tree.h>
 #include <linux/mm.h>
 #include <linux/rbtree.h>
-#include <linux/rwsem.h>
+#include <linux/refcount.h>
 
 extern unsigned long stack_guard_gap;
 #ifdef CONFIG_MMU
@@ -134,10 +134,6 @@ typedef __bitwise unsigned int vm_fault_t;
  */
 #define pr_warn_once pr_err
 
-typedef struct refcount_struct {
-	atomic_t refs;
-} refcount_t;
-
 struct kref {
 	refcount_t refcount;
 };
@@ -232,15 +228,12 @@ struct mm_struct {
 	unsigned long flags; /* Must use atomic bitops to access */
 };
 
-struct vma_lock {
-	struct rw_semaphore lock;
-};
-
-
 struct file {
 	struct address_space	*f_mapping;
 };
 
+#define VMA_LOCK_OFFSET	0x40000000
+
 struct vm_area_struct {
 	/* The first cache line has the info for VMA tree walking. */
 
@@ -268,16 +261,13 @@ struct vm_area_struct {
 	};
 
 #ifdef CONFIG_PER_VMA_LOCK
-	/* Flag to indicate areas detached from the mm->mm_mt tree */
-	bool detached;
-
 	/*
 	 * Can only be written (using WRITE_ONCE()) while holding both:
 	 *  - mmap_lock (in write mode)
-	 *  - vm_lock.lock (in write mode)
+	 *  - vm_refcnt bit at VMA_LOCK_OFFSET is set
 	 * Can be read reliably while holding one of:
 	 *  - mmap_lock (in read or write mode)
-	 *  - vm_lock.lock (in read or write mode)
+	 *  - vm_refcnt bit at VMA_LOCK_OFFSET is set or vm_refcnt > 1
 	 * Can be read unreliably (using READ_ONCE()) for pessimistic bailout
 	 * while holding nothing (except RCU to keep the VMA struct allocated).
 	 *
@@ -286,7 +276,6 @@ struct vm_area_struct {
 	 * slowpath.
 	 */
 	unsigned int vm_lock_seq;
-	struct vma_lock vm_lock;
 #endif
 
 	/*
@@ -339,6 +328,10 @@ struct vm_area_struct {
 	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_PER_VMA_LOCK
+	/* Unstable RCU readers are allowed to read this. */
+	refcount_t vm_refcnt;
+#endif
 } __randomize_layout;
 
 struct vm_fault {};
@@ -463,23 +456,41 @@ static inline struct vm_area_struct *vma_next(struct vma_iterator *vmi)
 	return mas_find(&vmi->mas, ULONG_MAX);
 }
 
-static inline void vma_lock_init(struct vm_area_struct *vma)
+/*
+ * WARNING: to avoid racing with vma_mark_attached()/vma_mark_detached(), these
+ * assertions should be made either under mmap_write_lock or when the object
+ * has been isolated under mmap_write_lock, ensuring no competing writers.
+ */
+static inline void vma_assert_attached(struct vm_area_struct *vma)
 {
-	init_rwsem(&vma->vm_lock.lock);
-	vma->vm_lock_seq = UINT_MAX;
+	VM_BUG_ON_VMA(!refcount_read(&vma->vm_refcnt), vma);
 }
 
-static inline void vma_mark_attached(struct vm_area_struct *vma)
+static inline void vma_assert_detached(struct vm_area_struct *vma)
 {
-	vma->detached = false;
+	VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt), vma);
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *);
+static inline void vma_mark_attached(struct vm_area_struct *vma)
+{
+	vma_assert_write_locked(vma);
+	vma_assert_detached(vma);
+	refcount_set(&vma->vm_refcnt, 1);
+}
+
 static inline void vma_mark_detached(struct vm_area_struct *vma)
 {
-	/* When detaching vma should be write-locked */
 	vma_assert_write_locked(vma);
-	vma->detached = true;
+	vma_assert_attached(vma);
+
+	/* We are the only writer, so no need to use vma_refcount_put(). */
+	if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
+		/*
+		 * Reader must have temporarily raised vm_refcnt but it will
+		 * drop it without using the vma since vma is write-locked.
+		 */
+	}
 }
 
 extern const struct vm_operations_struct vma_dummy_vm_ops;
@@ -492,9 +503,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &vma_dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
-	/* vma is not locked, can't use vma_mark_detached() */
-	vma->detached = true;
-	vma_lock_init(vma);
+	vma->vm_lock_seq = UINT_MAX;
 }
 
 static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
@@ -517,10 +526,9 @@ static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 		return NULL;
 
 	memcpy(new, orig, sizeof(*new));
-	vma_lock_init(new);
+	refcount_set(&new->vm_refcnt, 0);
+	new->vm_lock_seq = UINT_MAX;
 	INIT_LIST_HEAD(&new->anon_vma_chain);
-	/* vma is not locked, can't use vma_mark_detached() */
-	new->detached = true;
 
 	return new;
 }
-- 
2.47.1.613.gc27f4b7a9f-goog
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Wei Yang 1 year ago
On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
>@@ -6354,7 +6422,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> 	struct vm_area_struct *vma;
> 
> 	rcu_read_lock();
>-retry:
> 	vma = mas_walk(&mas);
> 	if (!vma)
> 		goto inval;
>@@ -6362,13 +6429,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> 	if (!vma_start_read(vma))
> 		goto inval;
> 
>-	/* Check if the VMA got isolated after we found it */
>-	if (is_vma_detached(vma)) {
>-		vma_end_read(vma);
>-		count_vm_vma_lock_event(VMA_LOCK_MISS);
>-		/* The area was replaced with another one */
>-		goto retry;
>-	}

We have a little behavior change here.

Originally, if we found an detached vma, we may retry. But now, we would go to
the slow path directly. 

Maybe we can compare the event VMA_LOCK_MISS and VMA_LOCK_ABORT
to see the percentage of this case. If it shows this is a too rare
case to impact performance, we can ignore it.

Also the event VMA_LOCK_MISS recording is removed, but the definition is
there. We may record it in the vma_start_read() when oldcnt is 0.

BTW, the name of VMA_LOCK_SUCCESS confuse me a little. I thought it indicates
lock_vma_under_rcu() successfully get a valid vma. But seems not. Sounds we
don't have an overall success/failure statistic in vmstat.

> 	/*
> 	 * At this point, we have a stable reference to a VMA: The VMA is
> 	 * locked and we know it hasn't already been isolated.

-- 
Wei Yang
Help you, Help me
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Suren Baghdasaryan 1 year ago
On Tue, Jan 14, 2025 at 6:58 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
> >@@ -6354,7 +6422,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >       struct vm_area_struct *vma;
> >
> >       rcu_read_lock();
> >-retry:
> >       vma = mas_walk(&mas);
> >       if (!vma)
> >               goto inval;
> >@@ -6362,13 +6429,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >       if (!vma_start_read(vma))
> >               goto inval;
> >
> >-      /* Check if the VMA got isolated after we found it */
> >-      if (is_vma_detached(vma)) {
> >-              vma_end_read(vma);
> >-              count_vm_vma_lock_event(VMA_LOCK_MISS);
> >-              /* The area was replaced with another one */
> >-              goto retry;
> >-      }
>
> We have a little behavior change here.
>
> Originally, if we found an detached vma, we may retry. But now, we would go to
> the slow path directly.

Hmm. Good point. I think the easiest way to keep the same
functionality is to make vma_start_read() return vm_area_struct* on
success, NULL on locking failure and EAGAIN if vma was detached
(vm_refcnt==0). Then the same retry with VMA_LOCK_MISS can be done in
the case of EAGAIN.
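
Something like the following rough sketch, just to make the idea concrete
(the exact return convention and naming here are assumptions, not
necessarily what the next revision will do):

/* in vma_start_read(), only the failure path changes: */
	if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
						      VMA_REF_LIMIT)))
		/* oldcnt == 0 means the vma was detached under us */
		return oldcnt ? NULL : ERR_PTR(-EAGAIN);

/* in lock_vma_under_rcu(): */
retry:
	vma = mas_walk(&mas);
	if (!vma)
		goto inval;

	vma = vma_start_read(vma);
	if (IS_ERR_OR_NULL(vma)) {
		if (PTR_ERR(vma) == -EAGAIN) {
			/* the area was replaced with another one */
			count_vm_vma_lock_event(VMA_LOCK_MISS);
			goto retry;
		}
		goto inval;
	}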

>
> Maybe we can compare the event VMA_LOCK_MISS and VMA_LOCK_ABORT
> to see the percentage of this case. If it shows this is a too rare
> case to impact performance, we can ignore it.
>
> Also the event VMA_LOCK_MISS recording is removed, but the definition is
> there. We may record it in the vma_start_read() when oldcnt is 0.
>
> BTW, the name of VMA_LOCK_SUCCESS confuse me a little. I thought it indicates
> lock_vma_under_rcu() successfully get a valid vma. But seems not. Sounds we
> don't have an overall success/failure statistic in vmstat.

Are you referring to the fact that we do not increment
VMA_LOCK_SUCCESS if we successfully locked a vma but have to retry the
page fault (in which we increment VMA_LOCK_RETRY instead)?

>
> >       /*
> >        * At this point, we have a stable reference to a VMA: The VMA is
> >        * locked and we know it hasn't already been isolated.
>
> --
> Wei Yang
> Help you, Help me
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Wei Yang 1 year ago
On Tue, Jan 14, 2025 at 07:12:20PM -0800, Suren Baghdasaryan wrote:
>On Tue, Jan 14, 2025 at 6:58 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>>
>> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
>> >@@ -6354,7 +6422,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>> >       struct vm_area_struct *vma;
>> >
>> >       rcu_read_lock();
>> >-retry:
>> >       vma = mas_walk(&mas);
>> >       if (!vma)
>> >               goto inval;
>> >@@ -6362,13 +6429,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>> >       if (!vma_start_read(vma))
>> >               goto inval;
>> >
>> >-      /* Check if the VMA got isolated after we found it */
>> >-      if (is_vma_detached(vma)) {
>> >-              vma_end_read(vma);
>> >-              count_vm_vma_lock_event(VMA_LOCK_MISS);
>> >-              /* The area was replaced with another one */
>> >-              goto retry;
>> >-      }
>>
>> We have a little behavior change here.
>>
>> Originally, if we found an detached vma, we may retry. But now, we would go to
>> the slow path directly.
>
>Hmm. Good point. I think the easiest way to keep the same
>functionality is to make vma_start_read() return vm_area_struct* on
>success, NULL on locking failure and EAGAIN if vma was detached
>(vm_refcnt==0). Then the same retry with VMA_LOCK_MISS can be done in
>the case of EAGAIN.
>

Looks good to me.

>>
>> Maybe we can compare the event VMA_LOCK_MISS and VMA_LOCK_ABORT
>> to see the percentage of this case. If it shows this is a too rare
>> case to impact performance, we can ignore it.
>>
>> Also the event VMA_LOCK_MISS recording is removed, but the definition is
>> there. We may record it in the vma_start_read() when oldcnt is 0.
>>
>> BTW, the name of VMA_LOCK_SUCCESS confuse me a little. I thought it indicates
>> lock_vma_under_rcu() successfully get a valid vma. But seems not. Sounds we
>> don't have an overall success/failure statistic in vmstat.
>
>Are you referring to the fact that we do not increment
>VMA_LOCK_SUCCESS if we successfully locked a vma but have to retry the

Something like this. I thought we would increase VMA_LOCK_SUCCESS on success.

>page fault (in which we increment VMA_LOCK_RETRY instead)?
>

I don't follow this.

>>
>> >       /*
>> >        * At this point, we have a stable reference to a VMA: The VMA is
>> >        * locked and we know it hasn't already been isolated.
>>
>> --
>> Wei Yang
>> Help you, Help me

-- 
Wei Yang
Help you, Help me
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Suren Baghdasaryan 1 year ago
On Wed, Jan 15, 2025 at 4:05 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Tue, Jan 14, 2025 at 07:12:20PM -0800, Suren Baghdasaryan wrote:
> >On Tue, Jan 14, 2025 at 6:58 PM Wei Yang <richard.weiyang@gmail.com> wrote:
> >>
> >> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
> >> >@@ -6354,7 +6422,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >> >       struct vm_area_struct *vma;
> >> >
> >> >       rcu_read_lock();
> >> >-retry:
> >> >       vma = mas_walk(&mas);
> >> >       if (!vma)
> >> >               goto inval;
> >> >@@ -6362,13 +6429,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >> >       if (!vma_start_read(vma))
> >> >               goto inval;
> >> >
> >> >-      /* Check if the VMA got isolated after we found it */
> >> >-      if (is_vma_detached(vma)) {
> >> >-              vma_end_read(vma);
> >> >-              count_vm_vma_lock_event(VMA_LOCK_MISS);
> >> >-              /* The area was replaced with another one */
> >> >-              goto retry;
> >> >-      }
> >>
> >> We have a little behavior change here.
> >>
> >> Originally, if we found an detached vma, we may retry. But now, we would go to
> >> the slow path directly.
> >
> >Hmm. Good point. I think the easiest way to keep the same
> >functionality is to make vma_start_read() return vm_area_struct* on
> >success, NULL on locking failure and EAGAIN if vma was detached
> >(vm_refcnt==0). Then the same retry with VMA_LOCK_MISS can be done in
> >the case of EAGAIN.
> >
>
> Looks good to me.
>
> >>
> >> Maybe we can compare the event VMA_LOCK_MISS and VMA_LOCK_ABORT
> >> to see the percentage of this case. If it shows this is a too rare
> >> case to impact performance, we can ignore it.
> >>
> >> Also the event VMA_LOCK_MISS recording is removed, but the definition is
> >> there. We may record it in the vma_start_read() when oldcnt is 0.
> >>
> >> BTW, the name of VMA_LOCK_SUCCESS confuse me a little. I thought it indicates
> >> lock_vma_under_rcu() successfully get a valid vma. But seems not. Sounds we
> >> don't have an overall success/failure statistic in vmstat.
> >
> >Are you referring to the fact that we do not increment
> >VMA_LOCK_SUCCESS if we successfully locked a vma but have to retry the
>
> Something like this. I thought we would increase VMA_LOCK_SUCCESS on success.
>
> >page fault (in which we increment VMA_LOCK_RETRY instead)?
> >
>
> I don't follow this.

Sorry, I meant to say "in which case we increment VMA_LOCK_RETRY
instead". IOW, when we successfully lock the vma but have to retry the
pagefault, we increment VMA_LOCK_RETRY without incrementing
VMA_LOCK_SUCCESS.

>
> >>
> >> >       /*
> >> >        * At this point, we have a stable reference to a VMA: The VMA is
> >> >        * locked and we know it hasn't already been isolated.
> >>
> >> --
> >> Wei Yang
> >> Help you, Help me
>
> --
> Wei Yang
> Help you, Help me
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Wei Yang 1 year ago
On Wed, Jan 15, 2025 at 07:01:56AM -0800, Suren Baghdasaryan wrote:
>On Wed, Jan 15, 2025 at 4:05 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>>
>> On Tue, Jan 14, 2025 at 07:12:20PM -0800, Suren Baghdasaryan wrote:
>> >On Tue, Jan 14, 2025 at 6:58 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>> >>
>> >> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
>> >> >@@ -6354,7 +6422,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>> >> >       struct vm_area_struct *vma;
>> >> >
>> >> >       rcu_read_lock();
>> >> >-retry:
>> >> >       vma = mas_walk(&mas);
>> >> >       if (!vma)
>> >> >               goto inval;
>> >> >@@ -6362,13 +6429,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>> >> >       if (!vma_start_read(vma))
>> >> >               goto inval;
>> >> >
>> >> >-      /* Check if the VMA got isolated after we found it */
>> >> >-      if (is_vma_detached(vma)) {
>> >> >-              vma_end_read(vma);
>> >> >-              count_vm_vma_lock_event(VMA_LOCK_MISS);
>> >> >-              /* The area was replaced with another one */
>> >> >-              goto retry;
>> >> >-      }
>> >>
>> >> We have a little behavior change here.
>> >>
>> >> Originally, if we found an detached vma, we may retry. But now, we would go to
>> >> the slow path directly.
>> >
>> >Hmm. Good point. I think the easiest way to keep the same
>> >functionality is to make vma_start_read() return vm_area_struct* on
>> >success, NULL on locking failure and EAGAIN if vma was detached
>> >(vm_refcnt==0). Then the same retry with VMA_LOCK_MISS can be done in
>> >the case of EAGAIN.
>> >
>>
>> Looks good to me.
>>
>> >>
>> >> Maybe we can compare the event VMA_LOCK_MISS and VMA_LOCK_ABORT
>> >> to see the percentage of this case. If it shows this is a too rare
>> >> case to impact performance, we can ignore it.
>> >>
>> >> Also the event VMA_LOCK_MISS recording is removed, but the definition is
>> >> there. We may record it in the vma_start_read() when oldcnt is 0.
>> >>
>> >> BTW, the name of VMA_LOCK_SUCCESS confuse me a little. I thought it indicates
>> >> lock_vma_under_rcu() successfully get a valid vma. But seems not. Sounds we
>> >> don't have an overall success/failure statistic in vmstat.
>> >
>> >Are you referring to the fact that we do not increment
>> >VMA_LOCK_SUCCESS if we successfully locked a vma but have to retry the
>>
>> Something like this. I thought we would increase VMA_LOCK_SUCCESS on success.
>>
>> >page fault (in which we increment VMA_LOCK_RETRY instead)?
>> >
>>
>> I don't follow this.
>
>Sorry, I meant to say "in which case we increment VMA_LOCK_RETRY
>instead". IOW, when we successfully lock the vma but have to retry the
>pagefault, we increment VMA_LOCK_RETRY without incrementing
>VMA_LOCK_SUCCESS.
>

Yes, this makes me confused about what VMA_LOCK_SUCCESS represents.

>>
>> >>
>> >> >       /*
>> >> >        * At this point, we have a stable reference to a VMA: The VMA is
>> >> >        * locked and we know it hasn't already been isolated.
>> >>
>> >> --
>> >> Wei Yang
>> >> Help you, Help me
>>
>> --
>> Wei Yang
>> Help you, Help me

-- 
Wei Yang
Help you, Help me
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Suren Baghdasaryan 1 year ago
On Wed, Jan 15, 2025 at 5:37 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Wed, Jan 15, 2025 at 07:01:56AM -0800, Suren Baghdasaryan wrote:
> >On Wed, Jan 15, 2025 at 4:05 AM Wei Yang <richard.weiyang@gmail.com> wrote:
> >>
> >> On Tue, Jan 14, 2025 at 07:12:20PM -0800, Suren Baghdasaryan wrote:
> >> >On Tue, Jan 14, 2025 at 6:58 PM Wei Yang <richard.weiyang@gmail.com> wrote:
> >> >>
> >> >> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
> >> >> >@@ -6354,7 +6422,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >> >> >       struct vm_area_struct *vma;
> >> >> >
> >> >> >       rcu_read_lock();
> >> >> >-retry:
> >> >> >       vma = mas_walk(&mas);
> >> >> >       if (!vma)
> >> >> >               goto inval;
> >> >> >@@ -6362,13 +6429,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >> >> >       if (!vma_start_read(vma))
> >> >> >               goto inval;
> >> >> >
> >> >> >-      /* Check if the VMA got isolated after we found it */
> >> >> >-      if (is_vma_detached(vma)) {
> >> >> >-              vma_end_read(vma);
> >> >> >-              count_vm_vma_lock_event(VMA_LOCK_MISS);
> >> >> >-              /* The area was replaced with another one */
> >> >> >-              goto retry;
> >> >> >-      }
> >> >>
> >> >> We have a little behavior change here.
> >> >>
> >> >> Originally, if we found an detached vma, we may retry. But now, we would go to
> >> >> the slow path directly.
> >> >
> >> >Hmm. Good point. I think the easiest way to keep the same
> >> >functionality is to make vma_start_read() return vm_area_struct* on
> >> >success, NULL on locking failure and EAGAIN if vma was detached
> >> >(vm_refcnt==0). Then the same retry with VMA_LOCK_MISS can be done in
> >> >the case of EAGAIN.
> >> >
> >>
> >> Looks good to me.
> >>
> >> >>
> >> >> Maybe we can compare the event VMA_LOCK_MISS and VMA_LOCK_ABORT
> >> >> to see the percentage of this case. If it shows this is a too rare
> >> >> case to impact performance, we can ignore it.
> >> >>
> >> >> Also the event VMA_LOCK_MISS recording is removed, but the definition is
> >> >> there. We may record it in the vma_start_read() when oldcnt is 0.
> >> >>
> >> >> BTW, the name of VMA_LOCK_SUCCESS confuse me a little. I thought it indicates
> >> >> lock_vma_under_rcu() successfully get a valid vma. But seems not. Sounds we
> >> >> don't have an overall success/failure statistic in vmstat.
> >> >
> >> >Are you referring to the fact that we do not increment
> >> >VMA_LOCK_SUCCESS if we successfully locked a vma but have to retry the
> >>
> >> Something like this. I thought we would increase VMA_LOCK_SUCCESS on success.
> >>
> >> >page fault (in which we increment VMA_LOCK_RETRY instead)?
> >> >
> >>
> >> I don't follow this.
> >
> >Sorry, I meant to say "in which case we increment VMA_LOCK_RETRY
> >instead". IOW, when we successfully lock the vma but have to retry the
> >pagefault, we increment VMA_LOCK_RETRY without incrementing
> >VMA_LOCK_SUCCESS.
> >
>
> Yes, this makes me confused about what VMA_LOCK_SUCCESS represents.

I'll need to look into the history of why we account it this way but
this is out of scope for this patchset.

>
> >>
> >> >>
> >> >> >       /*
> >> >> >        * At this point, we have a stable reference to a VMA: The VMA is
> >> >> >        * locked and we know it hasn't already been isolated.
> >> >>
> >> >> --
> >> >> Wei Yang
> >> >> Help you, Help me
> >>
> >> --
> >> Wei Yang
> >> Help you, Help me
>
> --
> Wei Yang
> Help you, Help me
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Wei Yang 1 year ago
On Wed, Jan 15, 2025 at 05:41:27PM -0800, Suren Baghdasaryan wrote:
[...]
>> >> >the case of EAGAIN.
>> >> >
>> >>
>> >> Looks good to me.
>> >>
>> >> >>
>> >> >> Maybe we can compare the event VMA_LOCK_MISS and VMA_LOCK_ABORT
>> >> >> to see the percentage of this case. If it shows this is a too rare
>> >> >> case to impact performance, we can ignore it.
>> >> >>
>> >> >> Also the event VMA_LOCK_MISS recording is removed, but the definition is
>> >> >> there. We may record it in the vma_start_read() when oldcnt is 0.
>> >> >>
>> >> >> BTW, the name of VMA_LOCK_SUCCESS confuse me a little. I thought it indicates
>> >> >> lock_vma_under_rcu() successfully get a valid vma. But seems not. Sounds we
>> >> >> don't have an overall success/failure statistic in vmstat.
>> >> >
>> >> >Are you referring to the fact that we do not increment
>> >> >VMA_LOCK_SUCCESS if we successfully locked a vma but have to retry the
>> >>
>> >> Something like this. I thought we would increase VMA_LOCK_SUCCESS on success.
>> >>
>> >> >page fault (in which we increment VMA_LOCK_RETRY instead)?
>> >> >
>> >>
>> >> I don't follow this.
>> >
>> >Sorry, I meant to say "in which case we increment VMA_LOCK_RETRY
>> >instead". IOW, when we successfully lock the vma but have to retry the
>> >pagefault, we increment VMA_LOCK_RETRY without incrementing
>> >VMA_LOCK_SUCCESS.
>> >
>>
>> Yes, this makes me confused about what VMA_LOCK_SUCCESS represents.
>
>I'll need to look into the history of why we account it this way but
>this is out of scope for this patchset.
>

Agree.


-- 
Wei Yang
Help you, Help me
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Wei Yang 1 year ago
On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
[...]
> 
>+static inline bool is_vma_writer_only(int refcnt)
>+{
>+	/*
>+	 * With a writer and no readers, refcnt is VMA_LOCK_OFFSET if the vma
>+	 * is detached and (VMA_LOCK_OFFSET + 1) if it is attached. Waiting on
>+	 * a detached vma happens only in vma_mark_detached() and is a rare
>+	 * case, therefore most of the time there will be no unnecessary wakeup.
>+	 */
>+	return refcnt & VMA_LOCK_OFFSET && refcnt <= VMA_LOCK_OFFSET + 1;

It looks equivalent to 

	return (refcnt == VMA_LOCK_OFFSET) || (refcnt == VMA_LOCK_OFFSET + 1);

And its generated code looks a little simpler.

>+}
>+
>+static inline void vma_refcount_put(struct vm_area_struct *vma)
>+{
>+	/* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
>+	struct mm_struct *mm = vma->vm_mm;
>+	int oldcnt;
>+
>+	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
>+	if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
>+
>+		if (is_vma_writer_only(oldcnt - 1))
>+			rcuwait_wake_up(&mm->vma_writer_wait);
>+	}
>+}
>+

-- 
Wei Yang
Help you, Help me
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Suren Baghdasaryan 1 year ago
On Mon, Jan 13, 2025 at 1:36 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
> [...]
> >
> >+static inline bool is_vma_writer_only(int refcnt)
> >+{
> >+      /*
> >+       * With a writer and no readers, refcnt is VMA_LOCK_OFFSET if the vma
> >+       * is detached and (VMA_LOCK_OFFSET + 1) if it is attached. Waiting on
> >+       * a detached vma happens only in vma_mark_detached() and is a rare
> >+       * case, therefore most of the time there will be no unnecessary wakeup.
> >+       */
> >+      return refcnt & VMA_LOCK_OFFSET && refcnt <= VMA_LOCK_OFFSET + 1;
>
> It looks equivalent to
>
>         return (refcnt == VMA_LOCK_OFFSET) || (refcnt == VMA_LOCK_OFFSET + 1);
>
> And its generated code looks a little simpler.

Yeah, but I think the original version is a bit more descriptive,
checking that (1) there is a writer and (2) there are no readers.

>
> >+}
> >+
> >+static inline void vma_refcount_put(struct vm_area_struct *vma)
> >+{
> >+      /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
> >+      struct mm_struct *mm = vma->vm_mm;
> >+      int oldcnt;
> >+
> >+      rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> >+      if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
> >+
> >+              if (is_vma_writer_only(oldcnt - 1))
> >+                      rcuwait_wake_up(&mm->vma_writer_wait);
> >+      }
> >+}
> >+
>
> --
> Wei Yang
> Help you, Help me
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Wei Yang 1 year ago
On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
> static inline void vma_end_read(struct vm_area_struct *vma) {}
>@@ -908,12 +948,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> 	vma->vm_mm = mm;
> 	vma->vm_ops = &vma_dummy_vm_ops;
> 	INIT_LIST_HEAD(&vma->anon_vma_chain);
>-#ifdef CONFIG_PER_VMA_LOCK
>-	/* vma is not locked, can't use vma_mark_detached() */
>-	vma->detached = true;
>-#endif
> 	vma_numab_state_init(vma);
>-	vma_lock_init(vma);
>+	vma_lock_init(vma, false);

vma_init(vma, mm)
  memset(vma, 0, sizeof(*vma))
  ...
  vma_lock_init(vma, false);

It looks the vm_refcnt must be reset.

BTW, I don't figure out why we want to skip the reset of vm_refcnt. Is this
related to SLAB_TYPESAFE_BY_RCU?

> }
> 

-- 
Wei Yang
Help you, Help me
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Suren Baghdasaryan 1 year ago
On Sun, Jan 12, 2025 at 6:38 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
> > static inline void vma_end_read(struct vm_area_struct *vma) {}
> >@@ -908,12 +948,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> >       vma->vm_mm = mm;
> >       vma->vm_ops = &vma_dummy_vm_ops;
> >       INIT_LIST_HEAD(&vma->anon_vma_chain);
> >-#ifdef CONFIG_PER_VMA_LOCK
> >-      /* vma is not locked, can't use vma_mark_detached() */
> >-      vma->detached = true;
> >-#endif
> >       vma_numab_state_init(vma);
> >-      vma_lock_init(vma);
> >+      vma_lock_init(vma, false);
>
> vma_init(vma, mm)
>   memset(vma, 0, sizeof(*vma))
>   ...
>   vma_lock_init(vma, false);
>
> It looks the vm_refcnt must be reset.
>
> BTW, I don't figure out why we want to skip the reset of vm_refcnt. Is this
> related to SLAB_TYPESAFE_BY_RCU?

Earlier memset(vma, 0, sizeof(*vma)) already zeroes the entire
structure, so vm_refcnt is already 0 and does not need to be reset
again.
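
To make that concrete, condensed from the hunks in this patch (not
compilable as shown, bodies elided):

void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
{
	memset(vma, 0, sizeof(*vma));	/* vm_refcnt is already 0 == detached */
	/* ... */
	vma_lock_init(vma, false);	/* so skip the redundant refcount_set() */
}

struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
{
	/* ... */
	data_race(memcpy(new, orig, sizeof(*new)));	/* may copy a non-zero vm_refcnt */
	vma_lock_init(new, true);			/* reset it back to 0 (detached) */
	/* ... */
}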

>
> > }
> >
>
> --
> Wei Yang
> Help you, Help me
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Wei Yang 1 year ago
On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
>rw_semaphore is a sizable structure of 40 bytes and consumes
>considerable space for each vm_area_struct. However vma_lock has
>two important specifics which can be used to replace rw_semaphore
>with a simpler structure:
>1. Readers never wait. They try to take the vma_lock and fall back to
>mmap_lock if that fails.
>2. Only one writer at a time will ever try to write-lock a vma_lock
>because writers first take mmap_lock in write mode.
>Because of these requirements, full rw_semaphore functionality is not
>needed and we can replace rw_semaphore and the vma->detached flag with
>a refcount (vm_refcnt).

This paragraph is merged into the above one in the commit log, which may not
what you expect.

Just a format issue, not sure why they are not separated.

>When vma is in detached state, vm_refcnt is 0 and only a call to
>vma_mark_attached() can take it out of this state. Note that unlike
>before, now we enforce both vma_mark_attached() and vma_mark_detached()
>to be done only after vma has been write-locked. vma_mark_attached()
>changes vm_refcnt to 1 to indicate that it has been attached to the vma
>tree. When a reader takes read lock, it increments vm_refcnt, unless the
>top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
>a writer. When writer takes write lock, it sets the top usable bit to
>indicate its presence. If there are readers, writer will wait using newly
>introduced mm->vma_writer_wait. Since all writers take mmap_lock in write
>mode first, there can be only one writer at a time. The last reader to
>release the lock will signal the writer to wake up.
>refcount might overflow if there are many competing readers, in which case
>read-locking will fail. Readers are expected to handle such failures.
>In summary:
>1. all readers increment the vm_refcnt;
>2. writer sets top usable (writer) bit of vm_refcnt;
>3. readers cannot increment the vm_refcnt if the writer bit is set;
>4. in the presence of readers, writer must wait for the vm_refcnt to drop
>to 1 (ignoring the writer bit), indicating an attached vma with no readers;

It waits until to (VMA_LOCK_OFFSET + 1) as indicates in __vma_start_write(),
if I am right. 

>5. vm_refcnt overflow is handled by the readers.
>
>While this vm_lock replacement does not yet result in a smaller
>vm_area_struct (it stays at 256 bytes due to cacheline alignment), it
>allows for further size optimization by structure member regrouping
>to bring the size of vm_area_struct below 192 bytes.
>
-- 
Wei Yang
Help you, Help me
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Suren Baghdasaryan 1 year ago
On Sat, Jan 11, 2025 at 6:59 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
> >rw_semaphore is a sizable structure of 40 bytes and consumes
> >considerable space for each vm_area_struct. However vma_lock has
> >two important specifics which can be used to replace rw_semaphore
> >with a simpler structure:
> >1. Readers never wait. They try to take the vma_lock and fall back to
> >mmap_lock if that fails.
> >2. Only one writer at a time will ever try to write-lock a vma_lock
> >because writers first take mmap_lock in write mode.
> >Because of these requirements, full rw_semaphore functionality is not
> >needed and we can replace rw_semaphore and the vma->detached flag with
> >a refcount (vm_refcnt).
>
> This paragraph is merged into the above one in the commit log, which may not
> what you expect.
>
> Just a format issue, not sure why they are not separated.

I'll double-check the formatting. Thanks!

>
> >When vma is in detached state, vm_refcnt is 0 and only a call to
> >vma_mark_attached() can take it out of this state. Note that unlike
> >before, now we enforce both vma_mark_attached() and vma_mark_detached()
> >to be done only after vma has been write-locked. vma_mark_attached()
> >changes vm_refcnt to 1 to indicate that it has been attached to the vma
> >tree. When a reader takes read lock, it increments vm_refcnt, unless the
> >top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
> >a writer. When writer takes write lock, it sets the top usable bit to
> >indicate its presence. If there are readers, writer will wait using newly
> >introduced mm->vma_writer_wait. Since all writers take mmap_lock in write
> >mode first, there can be only one writer at a time. The last reader to
> >release the lock will signal the writer to wake up.
> >refcount might overflow if there are many competing readers, in which case
> >read-locking will fail. Readers are expected to handle such failures.
> >In summary:
> >1. all readers increment the vm_refcnt;
> >2. writer sets top usable (writer) bit of vm_refcnt;
> >3. readers cannot increment the vm_refcnt if the writer bit is set;
> >4. in the presence of readers, writer must wait for the vm_refcnt to drop
> >to 1 (ignoring the writer bit), indicating an attached vma with no readers;
>
> It waits until to (VMA_LOCK_OFFSET + 1) as indicates in __vma_start_write(),
> if I am right.

Yeah, that's why I mentioned "(ignoring the writer bit)" but maybe
that's too confusing. How about "drop to 1 (plus the VMA_LOCK_OFFSET
writer bit)?

>
> >5. vm_refcnt overflow is handled by the readers.
> >
> >While this vm_lock replacement does not yet result in a smaller
> >vm_area_struct (it stays at 256 bytes due to cacheline alignment), it
> >allows for further size optimization by structure member regrouping
> >to bring the size of vm_area_struct below 192 bytes.
> >
> --
> Wei Yang
> Help you, Help me
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Wei Yang 1 year ago
On Sun, Jan 12, 2025 at 09:35:25AM -0800, Suren Baghdasaryan wrote:
>On Sat, Jan 11, 2025 at 6:59 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>>
>> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
>> >rw_semaphore is a sizable structure of 40 bytes and consumes
>> >considerable space for each vm_area_struct. However vma_lock has
>> >two important specifics which can be used to replace rw_semaphore
>> >with a simpler structure:
>> >1. Readers never wait. They try to take the vma_lock and fall back to
>> >mmap_lock if that fails.
>> >2. Only one writer at a time will ever try to write-lock a vma_lock
>> >because writers first take mmap_lock in write mode.
>> >Because of these requirements, full rw_semaphore functionality is not
>> >needed and we can replace rw_semaphore and the vma->detached flag with
>> >a refcount (vm_refcnt).
>>
>> This paragraph is merged into the above one in the commit log, which may not
>> what you expect.
>>
>> Just a format issue, not sure why they are not separated.
>
>I'll double-check the formatting. Thanks!
>
>>
>> >When vma is in detached state, vm_refcnt is 0 and only a call to
>> >vma_mark_attached() can take it out of this state. Note that unlike
>> >before, now we enforce both vma_mark_attached() and vma_mark_detached()
>> >to be done only after vma has been write-locked. vma_mark_attached()
>> >changes vm_refcnt to 1 to indicate that it has been attached to the vma
>> >tree. When a reader takes read lock, it increments vm_refcnt, unless the
>> >top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
>> >a writer. When writer takes write lock, it sets the top usable bit to
>> >indicate its presence. If there are readers, writer will wait using newly
>> >introduced mm->vma_writer_wait. Since all writers take mmap_lock in write
>> >mode first, there can be only one writer at a time. The last reader to
>> >release the lock will signal the writer to wake up.
>> >refcount might overflow if there are many competing readers, in which case
>> >read-locking will fail. Readers are expected to handle such failures.
>> >In summary:
>> >1. all readers increment the vm_refcnt;
>> >2. writer sets top usable (writer) bit of vm_refcnt;
>> >3. readers cannot increment the vm_refcnt if the writer bit is set;
>> >4. in the presence of readers, writer must wait for the vm_refcnt to drop
>> >to 1 (ignoring the writer bit), indicating an attached vma with no readers;
>>
>> It waits until to (VMA_LOCK_OFFSET + 1) as indicates in __vma_start_write(),
>> if I am right.
>
>Yeah, that's why I mentioned "(ignoring the writer bit)" but maybe
>that's too confusing. How about "drop to 1 (plus the VMA_LOCK_OFFSET
>writer bit)?
>

Hmm.. hard to say. It is a little confusing, but I don't have a better one :-(

>>
>> >5. vm_refcnt overflow is handled by the readers.
>> >
>> >While this vm_lock replacement does not yet result in a smaller
>> >vm_area_struct (it stays at 256 bytes due to cacheline alignment), it
>> >allows for further size optimization by structure member regrouping
>> >to bring the size of vm_area_struct below 192 bytes.
>> >
>> --
>> Wei Yang
>> Help you, Help me

-- 
Wei Yang
Help you, Help me
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Mateusz Guzik 1 year ago
On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:

So there were quite a few iterations of the patch and I have not been
reading majority of the feedback, so it may be I missed something,
apologies upfront. :)

>  /*
>   * Try to read-lock a vma. The function is allowed to occasionally yield false
>   * locked result to avoid performance overhead, in which case we fall back to
> @@ -710,6 +742,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
>   */
>  static inline bool vma_start_read(struct vm_area_struct *vma)
>  {
> +	int oldcnt;
> +
>  	/*
>  	 * Check before locking. A race might cause false locked result.
>  	 * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> @@ -720,13 +754,19 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>  	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
>  		return false;
>  
> -	if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> +	/*
> +	 * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited() will fail
> +	 * because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
> +	 */
> +	if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> +						      VMA_REF_LIMIT)))
>  		return false;
>  

Replacing down_read_trylock() with the new routine loses an acquire
fence. That alone is not a problem, but see below.

> +	rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
>  	/*
> -	 * Overflow might produce false locked result.
> +	 * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
>  	 * False unlocked result is impossible because we modify and check
> -	 * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> +	 * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
>  	 * modification invalidates all existing locks.
>  	 *
>  	 * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
> @@ -735,9 +775,10 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>  	 * This pairs with RELEASE semantics in vma_end_write_all().
>  	 */
>  	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {

The previous modification of this spot to raw_read_seqcount loses the
acquire fence, making the above comment not line up with the code.

I don't know if the stock code (with down_read_trylock()) is correct as
is -- looks fine for cursory reading fwiw. However, if it indeed works,
the acquire fence stemming from the lock routine is a mandatory part of
it afaics.

I think the best way forward is to add a new refcount routine which
ships with an acquire fence.

Otherwise I would suggest:
1. a comment above __refcount_inc_not_zero_limited saying there is an
   acq fence issued later
2. smp_rmb() slapped between that and seq accesses

If the now removed fence is somehow not needed, I think a comment
explaining it is necessary.
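
To be concrete, option 2 would look something like this (untested sketch,
reusing the names from this patch; the smp_rmb() is only needed because
the increment itself is relaxed):

	if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
						      VMA_REF_LIMIT)))
		return false;

	/* Order the refcount increment against the vm_lock_seq re-check. */
	smp_rmb();

	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
		/* back off to the false-locked fallback path as before */
	}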

> @@ -813,36 +856,33 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
>  
>  static inline void vma_assert_locked(struct vm_area_struct *vma)
>  {
> -	if (!rwsem_is_locked(&vma->vm_lock.lock))
> +	if (refcount_read(&vma->vm_refcnt) <= 1)
>  		vma_assert_write_locked(vma);
>  }
>  

This now forces the compiler to emit a load from vm_refcnt even if
vma_assert_write_locked expands to nothing. iow this wants to hide
behind the same stuff as vma_assert_write_locked.
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Suren Baghdasaryan 1 year ago
On Sat, Jan 11, 2025 at 3:24 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
>
> So there were quite a few iterations of the patch and I have not been
> reading majority of the feedback, so it may be I missed something,
> apologies upfront. :)
>
> >  /*
> >   * Try to read-lock a vma. The function is allowed to occasionally yield false
> >   * locked result to avoid performance overhead, in which case we fall back to
> > @@ -710,6 +742,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> >   */
> >  static inline bool vma_start_read(struct vm_area_struct *vma)
> >  {
> > +     int oldcnt;
> > +
> >       /*
> >        * Check before locking. A race might cause false locked result.
> >        * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> > @@ -720,13 +754,19 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >       if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> >               return false;
> >
> > -     if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> > +     /*
> > +      * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited() will fail
> > +      * because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
> > +      */
> > +     if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> > +                                                   VMA_REF_LIMIT)))
> >               return false;
> >
>
> Replacing down_read_trylock() with the new routine loses an acquire
> fence. That alone is not a problem, but see below.

Hmm. I think this acquire fence is actually necessary. We don't want
the later vm_lock_seq check to be reordered and happen before we take
the refcount. Otherwise this might happen:

reader             writer
if (vm_lock_seq == mm_lock_seq) // check got reordered
        return false;
                       vm_refcnt += VMA_LOCK_OFFSET
                       vm_lock_seq == mm_lock_seq
                       vm_refcnt -= VMA_LOCK_OFFSET
if (!__refcount_inc_not_zero_limited())
        return false;

Both reader's checks will pass and the reader would read-lock a vma
that was write-locked.

>
> > +     rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
> >       /*
> > -      * Overflow might produce false locked result.
> > +      * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> >        * False unlocked result is impossible because we modify and check
> > -      * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > +      * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
> >        * modification invalidates all existing locks.
> >        *
> >        * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
> > @@ -735,9 +775,10 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >        * This pairs with RELEASE semantics in vma_end_write_all().
> >        */
> >       if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
>
> The previous modification of this spot to raw_read_seqcount loses the
> acquire fence, making the above comment not line up with the code.

Is it? From reading the seqcount code
(https://elixir.bootlin.com/linux/v6.13-rc3/source/include/linux/seqlock.h#L211):

raw_read_seqcount()
    seqprop_sequence()
        __seqprop(s, sequence)
            __seqprop_sequence()
                smp_load_acquire()

smp_load_acquire() still provides the acquire fence. Am I missing something?

>
> I don't know if the stock code (with down_read_trylock()) is correct as
> is -- looks fine for cursory reading fwiw. However, if it indeed works,
> the acquire fence stemming from the lock routine is a mandatory part of
> it afaics.
>
> I think the best way forward is to add a new refcount routine which
> ships with an acquire fence.

I plan on replacing the refcount_t usage here with an atomic_t since, as
Hillf noted, refcount is not designed to be used for locking. I will
also make sure the down_read_trylock() replacement provides an acquire
fence.

>
> Otherwise I would suggest:
> 1. a comment above __refcount_inc_not_zero_limited saying there is an
>    acq fence issued later
> 2. smp_rmb() slapped between that and seq accesses
>
> If the now removed fence is somehow not needed, I think a comment
> explaining it is necessary.
>
> > @@ -813,36 +856,33 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> >
> >  static inline void vma_assert_locked(struct vm_area_struct *vma)
> >  {
> > -     if (!rwsem_is_locked(&vma->vm_lock.lock))
> > +     if (refcount_read(&vma->vm_refcnt) <= 1)
> >               vma_assert_write_locked(vma);
> >  }
> >
>
> This now forces the compiler to emit a load from vm_refcnt even if
> vma_assert_write_locked expands to nothing. iow this wants to hide
> behind the same stuff as vma_assert_write_locked.

True. I guess I'll have to avoid calling vma_assert_write_locked() and
open-code the check instead, like this:

static inline void vma_assert_locked(struct vm_area_struct *vma)
{
        unsigned int mm_lock_seq;

        VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt) <= 1 &&
                      !__is_vma_write_locked(vma, &mm_lock_seq), vma);
}

Will make the change.

Thanks for the feedback!
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Peter Zijlstra 1 year ago
On Sat, Jan 11, 2025 at 12:14:47PM -0800, Suren Baghdasaryan wrote:

> > Replacing down_read_trylock() with the new routine loses an acquire
> > fence. That alone is not a problem, but see below.
> 
> Hmm. I think this acquire fence is actually necessary. We don't want
> the later vm_lock_seq check to be reordered and happen before we take
> the refcount. Otherwise this might happen:
> 
> reader             writer
> if (vm_lock_seq == mm_lock_seq) // check got reordered
>         return false;
>                        vm_refcnt += VMA_LOCK_OFFSET
>                        vm_lock_seq == mm_lock_seq
>                        vm_refcnt -= VMA_LOCK_OFFSET
> if (!__refcount_inc_not_zero_limited())
>         return false;
> 
> Both reader's checks will pass and the reader would read-lock a vma
> that was write-locked.

Hmm, you're right. That acquire does matter here.
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Peter Zijlstra 1 year ago
On Wed, Jan 15, 2025 at 11:48:41AM +0100, Peter Zijlstra wrote:
> On Sat, Jan 11, 2025 at 12:14:47PM -0800, Suren Baghdasaryan wrote:
> 
> > > Replacing down_read_trylock() with the new routine loses an acquire
> > > fence. That alone is not a problem, but see below.
> > 
> > Hmm. I think this acquire fence is actually necessary. We don't want
> > the later vm_lock_seq check to be reordered and happen before we take
> > the refcount. Otherwise this might happen:
> > 
> > reader             writer
> > if (vm_lock_seq == mm_lock_seq) // check got reordered
> >         return false;
> >                        vm_refcnt += VMA_LOCK_OFFSET
> >                        vm_lock_seq == mm_lock_seq
> >                        vm_refcnt -= VMA_LOCK_OFFSET
> > if (!__refcount_inc_not_zero_limited())
> >         return false;
> > 
> > Both reader's checks will pass and the reader would read-lock a vma
> > that was write-locked.
> 
> Hmm, you're right. That acquire does matter here.

Notably, it means refcount_t is entirely unsuitable for anything
SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
conditions after the refcount succeeds.

And this is probably fine, but let me ponder this all a little more.
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Suren Baghdasaryan 1 year ago
On Wed, Jan 15, 2025 at 3:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Jan 15, 2025 at 11:48:41AM +0100, Peter Zijlstra wrote:
> > On Sat, Jan 11, 2025 at 12:14:47PM -0800, Suren Baghdasaryan wrote:
> >
> > > > Replacing down_read_trylock() with the new routine loses an acquire
> > > > fence. That alone is not a problem, but see below.
> > >
> > > Hmm. I think this acquire fence is actually necessary. We don't want
> > > the later vm_lock_seq check to be reordered and happen before we take
> > > the refcount. Otherwise this might happen:
> > >
> > > reader             writer
> > > if (vm_lock_seq == mm_lock_seq) // check got reordered
> > >         return false;
> > >                        vm_refcnt += VMA_LOCK_OFFSET
> > >                        vm_lock_seq == mm_lock_seq
> > >                        vm_refcnt -= VMA_LOCK_OFFSET
> > > if (!__refcount_inc_not_zero_limited())
> > >         return false;
> > >
> > > Both reader's checks will pass and the reader would read-lock a vma
> > > that was write-locked.
> >
> > Hmm, you're right. That acquire does matter here.
>
> Notably, it means refcount_t is entirely unsuitable for anything
> SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
> conditions after the refcount succeeds.

Thanks for reviewing, Peter!
Yes, I'm changing the code to use atomic_t instead of refcount_t and
it comes out quite nicely I think. I had to add two small helper
functions:
vm_refcount_inc() - similar to refcount_add_not_zero() but with an
acquire fence.
vm_refcnt_sub() - similar to refcount_sub_and_test(). I could use
atomic_sub_and_test() but that would add an unnecessary acquire fence in
the pagefault path, so I'm using refcount_sub_and_test() logic
instead.
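
Roughly like this (untested sketch, assuming vm_refcnt becomes an
atomic_t; exact naming and the limit check may look different in the
next version):

static inline bool vm_refcount_inc(struct vm_area_struct *vma)
{
	int old = atomic_read(&vma->vm_refcnt);

	/* Fails if detached (0) or if a writer holds VMA_LOCK_OFFSET. */
	do {
		if (!old || old > VMA_REF_LIMIT)
			return false;
	} while (!atomic_try_cmpxchg_acquire(&vma->vm_refcnt, &old, old + 1));

	return true;
}

static inline bool vm_refcnt_sub(struct vm_area_struct *vma, int i)
{
	/* refcount_sub_and_test() logic: release on the drop, acquire only at 0. */
	int old = atomic_fetch_sub_release(i, &vma->vm_refcnt);

	if (old == i) {
		smp_acquire__after_ctrl_dep();
		return true;
	}
	return false;
}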

For SLAB_TYPESAFE_BY_RCU I think we are ok with the
__vma_enter_locked()/__vma_exit_locked() transition in
vma_mark_detached() before freeing the vma and would not need
secondary validation. In __vma_enter_locked(), vm_refcnt gets
VMA_LOCK_OFFSET set, which prevents readers from taking the refcount.
In __vma_exit_locked(), vm_refcnt transitions to 0, so again that
prevents readers from taking the refcount. IOW, the readers won't get
to the secondary validation and will fail early on
__refcount_inc_not_zero_limited(). I think this transition correctly
serves the purpose of waiting for current temporary readers to exit
and preventing new readers from read-locking and using the vma.

>
> And this is probably fine, but let me ponder this all a little more.

Thanks for taking the time!
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Peter Zijlstra 1 year ago
On Wed, Jan 15, 2025 at 07:00:37AM -0800, Suren Baghdasaryan wrote:
> On Wed, Jan 15, 2025 at 3:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Jan 15, 2025 at 11:48:41AM +0100, Peter Zijlstra wrote:
> > > On Sat, Jan 11, 2025 at 12:14:47PM -0800, Suren Baghdasaryan wrote:
> > >
> > > > > Replacing down_read_trylock() with the new routine loses an acquire
> > > > > fence. That alone is not a problem, but see below.
> > > >
> > > > Hmm. I think this acquire fence is actually necessary. We don't want
> > > > the later vm_lock_seq check to be reordered and happen before we take
> > > > the refcount. Otherwise this might happen:
> > > >
> > > > reader             writer
> > > > if (vm_lock_seq == mm_lock_seq) // check got reordered
> > > >         return false;
> > > >                        vm_refcnt += VMA_LOCK_OFFSET
> > > >                        vm_lock_seq == mm_lock_seq
> > > >                        vm_refcnt -= VMA_LOCK_OFFSET
> > > > if (!__refcount_inc_not_zero_limited())
> > > >         return false;
> > > >
> > > > Both reader's checks will pass and the reader would read-lock a vma
> > > > that was write-locked.
> > >
> > > Hmm, you're right. That acquire does matter here.
> >
> > Notably, it means refcount_t is entirely unsuitable for anything
> > SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
> > conditions after the refcount succeeds.
> 
> Thanks for reviewing, Peter!
> Yes, I'm changing the code to use atomic_t instead of refcount_t and
> it comes out quite nicely I think. I had to add two small helper
> functions:
> vm_refcount_inc() - similar to refcount_add_not_zero() but with an
> acquired fence.
> vm_refcnt_sub() - similar to refcount_sub_and_test(). I could use
> atomic_sub_and_test() but that would add unnecessary acquire fence in
> the pagefault path, so I'm using refcount_sub_and_test() logic
> instead.

Right.

> For SLAB_TYPESAFE_BY_RCU I think we are ok with the
> __vma_enter_locked()/__vma_exit_locked() transition in the
> vma_mark_detached() before freeing the vma and would not need
> secondary validation. In __vma_enter_locked(), vm_refcount gets
> VMA_LOCK_OFFSET set, which prevents readers from taking the refcount.
> In __vma_exit_locked() vm_refcnt transitions to 0, so again that
> prevents readers from taking the refcount. IOW, the readers won't get
> to the secondary validation and will fail early on
> __refcount_inc_not_zero_limited(). I think this transition correctly
> serves the purpose of waiting for current temporary readers to exit
> and preventing new readers from read-locking and using the vma.

Consider:

    CPU0				CPU1

    rcu_read_lock();
    vma = vma_lookup(mm, vaddr);

    ... cpu goes sleep for a *long time* ...

    					__vma_exit_locked();
					vma_area_free()
					..
					vma = vma_area_alloc();
					vma_mark_attached();

    ... comes back once vma is re-used ...

    vma_start_read()
      vm_refcount_inc(); // success!!

At which point we need to validate vma is for mm and covers vaddr, which
is what patch 15 does, no?



Also, I seem to have forgotten some braces back in 2008 :-)

---
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 10a971c2bde3..c1356b52f8ea 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -115,9 +115,10 @@ enum _slab_flag_bits {
  *   rcu_read_lock();
  *   obj = lockless_lookup(key);
  *   if (obj) {
- *     if (!try_get_ref(obj)) // might fail for free objects
+ *     if (!try_get_ref(obj)) { // might fail for free objects
  *       rcu_read_unlock();
  *       goto begin;
+ *     }
  *
  *     if (obj->key != key) { // not the object we expected
  *       put_ref(obj);
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Peter Zijlstra 1 year ago
On Wed, Jan 15, 2025 at 04:35:07PM +0100, Peter Zijlstra wrote:

> Consider:
> 
>     CPU0				CPU1
> 
>     rcu_read_lock();
>     vma = vma_lookup(mm, vaddr);
> 
>     ... cpu goes sleep for a *long time* ...
> 
>     					__vma_exit_locked();
> 					vma_area_free()
> 					..
> 					vma = vma_area_alloc();
> 					vma_mark_attached();
> 
>     ... comes back once vma is re-used ...
> 
>     vma_start_read()
>       vm_refcount_inc(); // success!!
> 
> At which point we need to validate vma is for mm and covers vaddr, which
> is what patch 15 does, no?

Also, critically, we want these reads to happen *after* the refcount
increment.
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Suren Baghdasaryan 1 year ago
On Wed, Jan 15, 2025 at 7:38 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Jan 15, 2025 at 04:35:07PM +0100, Peter Zijlstra wrote:
>
> > Consider:
> >
> >     CPU0                              CPU1
> >
> >     rcu_read_lock();
> >     vma = vma_lookup(mm, vaddr);
> >
> >     ... cpu goes sleep for a *long time* ...
> >
> >                                       __vma_exit_locked();
> >                                       vma_area_free()
> >                                       ..
> >                                       vma = vma_area_alloc();
> >                                       vma_mark_attached();
> >
> >     ... comes back once vma is re-used ...
> >
> >     vma_start_read()
> >       vm_refcount_inc(); // success!!
> >
> > At which point we need to validate vma is for mm and covers vaddr, which
> > is what patch 15 does, no?

Correct. Sorry, I thought by "secondary validation" you only meant the
vm_lock_seq check in vma_start_read(). Now I understand your point.
Yes, if the vma we found gets reused before we read-lock it, then the
checks for the mm and the address range should catch a possibly
incorrect vma. If these checks fail, we retry. If they succeed, we have
the correct vma even if it was recycled after we found it.
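
For reference, the fault path then does roughly the following
(simplified from lock_vma_under_rcu() in this series, error handling
trimmed; "mas" is the maple tree iterator over mm->mm_mt):

	rcu_read_lock();
	vma = mas_walk(&mas);
	if (!vma || !vma_start_read(vma))
		goto inval;

	/*
	 * The vma may have been freed and reused (SLAB_TYPESAFE_BY_RCU)
	 * between the lookup and the read-lock, so re-check that it still
	 * belongs to this mm and still covers the faulting address.
	 */
	if (unlikely(vma->vm_mm != mm ||
		     address < vma->vm_start || address >= vma->vm_end)) {
		vma_end_read(vma);
		goto inval;
	}
	rcu_read_unlock();
	return vma;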

>
> Also, critically, we want these reads to happen *after* the refcount
> increment.

Yes, and I think the acquire fence in the
refcount_add_not_zero_limited() replacement should guarantee that
ordering.

>
[PATCH] refcount: Strengthen inc_not_zero()
Posted by Peter Zijlstra 1 year ago
On Wed, Jan 15, 2025 at 12:13:34PM +0100, Peter Zijlstra wrote:

> Notably, it means refcount_t is entirely unsuitable for anything
> SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
> conditions after the refcount succeeds.
> 
> And this is probably fine, but let me ponder this all a little more.

Even though SLAB_TYPESAFE_BY_RCU is relatively rare, I think we'd better
fix this; these things are hard enough as they are.

Will, others, wdyt?

---
Subject: refcount: Strengthen inc_not_zero()

For speculative lookups where a successful inc_not_zero() pins the
object, but where we still need to double check that the object acquired
is indeed the one we set out to acquire, this validation needs to happen
*after* the increment.

Notably, SLAB_TYPESAFE_BY_RCU is one such example.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/refcount.h | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/linux/refcount.h b/include/linux/refcount.h
index 35f039ecb272..340e7ffa445e 100644
--- a/include/linux/refcount.h
+++ b/include/linux/refcount.h
@@ -69,9 +69,10 @@
  * its the lock acquire, for RCU/lockless data structures its the dependent
  * load.
  *
- * Do note that inc_not_zero() provides a control dependency which will order
- * future stores against the inc, this ensures we'll never modify the object
- * if we did not in fact acquire a reference.
+ * Do note that inc_not_zero() does provide acquire order, which will order
+ * future load and stores against the inc, this ensures all subsequent accesses
+ * are from this object and not anything previously occupying this memory --
+ * consider SLAB_TYPESAFE_BY_RCU.
  *
  * The decrements will provide release order, such that all the prior loads and
  * stores will be issued before, it also provides a control dependency, which
@@ -144,7 +145,7 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
 	do {
 		if (!old)
 			break;
-	} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
+	} while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + i));
 
 	if (oldp)
 		*oldp = old;
@@ -225,9 +226,9 @@ static inline __must_check bool __refcount_inc_not_zero(refcount_t *r, int *oldp
  * Similar to atomic_inc_not_zero(), but will saturate at REFCOUNT_SATURATED
  * and WARN.
  *
- * Provides no memory ordering, it is assumed the caller has guaranteed the
- * object memory to be stable (RCU, etc.). It does provide a control dependency
- * and thereby orders future stores. See the comment on top.
+ * Provides acquire ordering, such that subsequent accesses are after the
+ * increment. This is important for the cases where secondary validation is
+ * required, eg. SLAB_TYPESAFE_BY_RCU.
  *
  * Return: true if the increment was successful, false otherwise
  */
Re: [PATCH] refcount: Strengthen inc_not_zero()
Posted by Matthew Wilcox 1 year ago
On Wed, Jan 15, 2025 at 05:00:11PM +0100, Peter Zijlstra wrote:
> Subject: refcount: Strengthen inc_not_zero()
> 
> For speculative lookups where a successful inc_not_zero() pins the
> object, but where we still need to double check if the object acquired
> is indeed the one we set out to aquire, needs this validation to happen
> *after* the increment.
> 
> Notably SLAB_TYPESAFE_BY_RCU is one such an example.

While you're looking at inc_not_zero(), have you already thought about
doing something like this?  ie failing rather than saturating since
all users of this already have to check for failure.  It looks like
two comparisons per call rather than three.

diff --git a/include/linux/refcount.h b/include/linux/refcount.h
index 35f039ecb272..3ef7d316e870 100644
--- a/include/linux/refcount.h
+++ b/include/linux/refcount.h
@@ -142,16 +142,13 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
 	int old = refcount_read(r);
 
 	do {
-		if (!old)
+		if (old <= 0 || old + i < 0)
 			break;
 	} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
 
 	if (oldp)
 		*oldp = old;
 
-	if (unlikely(old < 0 || old + i < 0))
-		refcount_warn_saturate(r, REFCOUNT_ADD_NOT_ZERO_OVF);
-
 	return old;
 }
 

$ ./scripts/bloat-o-meter before.o after.o 
add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-91 (-91)
Function                                     old     new   delta
io_wq_for_each_worker.isra                   162     158      -4
io_worker_handle_work                       1403    1387     -16
io_wq_activate_free_worker                   187     158     -29
io_queue_worker_create                       367     325     -42
Total: Before=10068, After=9977, chg -0.90%

(that's io_uring/io-wq.o as an example user of refcount_inc_not_zero())
Re: [PATCH] refcount: Strengthen inc_not_zero()
Posted by Will Deacon 1 year ago
On Wed, Jan 15, 2025 at 05:00:11PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 15, 2025 at 12:13:34PM +0100, Peter Zijlstra wrote:
> 
> > Notably, it means refcount_t is entirely unsuitable for anything
> > SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
> > conditions after the refcount succeeds.
> > 
> > And this is probably fine, but let me ponder this all a little more.
> 
> Even though SLAB_TYPESAFE_BY_RCU is relatively rare, I think we'd better
> fix this, these things are hard enough as they are.
> 
> Will, others, wdyt?

We should also update the Documentation (atomic_t.txt and
refcount-vs-atomic.rst) if we strengthen this.

> ---
> Subject: refcount: Strengthen inc_not_zero()
> 
> For speculative lookups where a successful inc_not_zero() pins the
> object, but where we still need to double check if the object acquired
> is indeed the one we set out to aquire, needs this validation to happen
> *after* the increment.
> 
> Notably SLAB_TYPESAFE_BY_RCU is one such an example.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/refcount.h | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/refcount.h b/include/linux/refcount.h
> index 35f039ecb272..340e7ffa445e 100644
> --- a/include/linux/refcount.h
> +++ b/include/linux/refcount.h
> @@ -69,9 +69,10 @@
>   * its the lock acquire, for RCU/lockless data structures its the dependent
>   * load.
>   *
> - * Do note that inc_not_zero() provides a control dependency which will order
> - * future stores against the inc, this ensures we'll never modify the object
> - * if we did not in fact acquire a reference.
> + * Do note that inc_not_zero() does provide acquire order, which will order
> + * future load and stores against the inc, this ensures all subsequent accesses
> + * are from this object and not anything previously occupying this memory --
> + * consider SLAB_TYPESAFE_BY_RCU.
>   *
>   * The decrements will provide release order, such that all the prior loads and
>   * stores will be issued before, it also provides a control dependency, which
> @@ -144,7 +145,7 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
>  	do {
>  		if (!old)
>  			break;
> -	} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> +	} while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + i));

Hmm, do the later memory accesses need to be ordered against the store
part of the increment or just the read? If it's the former, then I don't
think that _acquire is sufficient -- accesses can still get in-between
the read and write parts of the RmW.

Will
Re: [PATCH] refcount: Strengthen inc_not_zero()
Posted by Will Deacon 1 year ago
On Fri, Jan 17, 2025 at 03:41:36PM +0000, Will Deacon wrote:
> On Wed, Jan 15, 2025 at 05:00:11PM +0100, Peter Zijlstra wrote:
> > On Wed, Jan 15, 2025 at 12:13:34PM +0100, Peter Zijlstra wrote:
> > 
> > > Notably, it means refcount_t is entirely unsuitable for anything
> > > SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
> > > conditions after the refcount succeeds.
> > > 
> > > And this is probably fine, but let me ponder this all a little more.
> > 
> > Even though SLAB_TYPESAFE_BY_RCU is relatively rare, I think we'd better
> > fix this, these things are hard enough as they are.
> > 
> > Will, others, wdyt?
> 
> We should also update the Documentation (atomic_t.txt and
> refcount-vs-atomic.rst) if we strengthen this.
> 
> > ---
> > Subject: refcount: Strengthen inc_not_zero()
> > 
> > For speculative lookups where a successful inc_not_zero() pins the
> > object, but where we still need to double check if the object acquired
> > is indeed the one we set out to aquire, needs this validation to happen
> > *after* the increment.
> > 
> > Notably SLAB_TYPESAFE_BY_RCU is one such an example.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  include/linux/refcount.h | 15 ++++++++-------
> >  1 file changed, 8 insertions(+), 7 deletions(-)
> > 
> > diff --git a/include/linux/refcount.h b/include/linux/refcount.h
> > index 35f039ecb272..340e7ffa445e 100644
> > --- a/include/linux/refcount.h
> > +++ b/include/linux/refcount.h
> > @@ -69,9 +69,10 @@
> >   * its the lock acquire, for RCU/lockless data structures its the dependent
> >   * load.
> >   *
> > - * Do note that inc_not_zero() provides a control dependency which will order
> > - * future stores against the inc, this ensures we'll never modify the object
> > - * if we did not in fact acquire a reference.
> > + * Do note that inc_not_zero() does provide acquire order, which will order
> > + * future load and stores against the inc, this ensures all subsequent accesses
> > + * are from this object and not anything previously occupying this memory --
> > + * consider SLAB_TYPESAFE_BY_RCU.
> >   *
> >   * The decrements will provide release order, such that all the prior loads and
> >   * stores will be issued before, it also provides a control dependency, which
> > @@ -144,7 +145,7 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> >  	do {
> >  		if (!old)
> >  			break;
> > -	} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> > +	} while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + i));
> 
> Hmm, do the later memory accesses need to be ordered against the store
> part of the increment or just the read? If it's the former, then I don't
> think that _acquire is sufficient -- accesses can still get in-between
> the read and write parts of the RmW.

I dug some more into this at the end of last week. For the
SLAB_TYPESAFE_BY_RCU case where we're racing inc_not_zero() with
dec_and_test(), I think using _acquire() above is correct, as the
later references can only move up into the critical section in the case
that we successfully obtained a reference.

However, if we're going to make the barriers implicit in the refcount
operations here then I think we also need to do something on the producer
side for when the object is re-initialised after being destroyed and
allocated again. I think that would necessitate release ordering for
refcount_set() so that whatever allows the consumer to validate the
object (e.g. sequence number) is published *before* the refcount.

Will
Re: [PATCH] refcount: Strengthen inc_not_zero()
Posted by Suren Baghdasaryan 1 year ago
On Mon, Jan 27, 2025 at 6:09 AM Will Deacon <will@kernel.org> wrote:
>
> On Fri, Jan 17, 2025 at 03:41:36PM +0000, Will Deacon wrote:
> > On Wed, Jan 15, 2025 at 05:00:11PM +0100, Peter Zijlstra wrote:
> > > On Wed, Jan 15, 2025 at 12:13:34PM +0100, Peter Zijlstra wrote:
> > >
> > > > Notably, it means refcount_t is entirely unsuitable for anything
> > > > SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
> > > > conditions after the refcount succeeds.
> > > >
> > > > And this is probably fine, but let me ponder this all a little more.
> > >
> > > Even though SLAB_TYPESAFE_BY_RCU is relatively rare, I think we'd better
> > > fix this, these things are hard enough as they are.
> > >
> > > Will, others, wdyt?
> >
> > We should also update the Documentation (atomic_t.txt and
> > refcount-vs-atomic.rst) if we strengthen this.
> >
> > > ---
> > > Subject: refcount: Strengthen inc_not_zero()
> > >
> > > For speculative lookups where a successful inc_not_zero() pins the
> > > object, but where we still need to double check if the object acquired
> > > is indeed the one we set out to aquire, needs this validation to happen
> > > *after* the increment.
> > >
> > > Notably SLAB_TYPESAFE_BY_RCU is one such an example.
> > >
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > >  include/linux/refcount.h | 15 ++++++++-------
> > >  1 file changed, 8 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/include/linux/refcount.h b/include/linux/refcount.h
> > > index 35f039ecb272..340e7ffa445e 100644
> > > --- a/include/linux/refcount.h
> > > +++ b/include/linux/refcount.h
> > > @@ -69,9 +69,10 @@
> > >   * its the lock acquire, for RCU/lockless data structures its the dependent
> > >   * load.
> > >   *
> > > - * Do note that inc_not_zero() provides a control dependency which will order
> > > - * future stores against the inc, this ensures we'll never modify the object
> > > - * if we did not in fact acquire a reference.
> > > + * Do note that inc_not_zero() does provide acquire order, which will order
> > > + * future load and stores against the inc, this ensures all subsequent accesses
> > > + * are from this object and not anything previously occupying this memory --
> > > + * consider SLAB_TYPESAFE_BY_RCU.
> > >   *
> > >   * The decrements will provide release order, such that all the prior loads and
> > >   * stores will be issued before, it also provides a control dependency, which
> > > @@ -144,7 +145,7 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > >     do {
> > >             if (!old)
> > >                     break;
> > > -   } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> > > +   } while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + i));
> >
> > Hmm, do the later memory accesses need to be ordered against the store
> > part of the increment or just the read? If it's the former, then I don't
> > think that _acquire is sufficient -- accesses can still get in-between
> > the read and write parts of the RmW.
>
> I dug some more into this at the end of last week. For the
> SLAB_TYPESAFE_BY_RCU where we're racing inc_not_zero() with
> dec_and_test(), then I think using _acquire() above is correct as the
> later references can only move up into the critical section in the case
> that we successfully obtained a reference.
>
> However, if we're going to make the barriers implicit in the refcount
> operations here then I think we also need to do something on the producer
> side for when the object is re-initialised after being destroyed and
> allocated again. I think that would necessitate release ordering for
> refcount_set() so that whatever allows the consumer to validate the
> object (e.g. sequence number) is published *before* the refcount.

Thanks Will!
I would like to expand on your answer and provide an example of the
race that would happen without release ordering in the producer. To
save the reader's time, I provide a simplified flow and reasoning first.
More detailed code of what I consider a typical SLAB_TYPESAFE_BY_RCU
refcounting example is added at the end of my reply (Addendum).
The simplified flow looks like this:

consumer:
    obj = lookup(collection, key);
    if (!refcount_inc_not_zero(&obj->ref))
        return;
    smp_rmb(); /* Peter's new acquire fence */
    if (READ_ONCE(obj->key) != key) {
        put_ref(obj);
        return;
    }
    use(obj->value);

producer:
    old_key = obj->key;
    remove(collection, old_key);
    if (!refcount_dec_and_test(&obj->ref))
        return;
    obj->key = KEY_INVALID;
    free(obj);
    ...
    obj = malloc(); /* obj is reused */
    obj->key = new_key;
    obj->value = new_value;
    smp_wmb(); /* Will's proposed release fence */
    refcount_set(&obj->ref, 1);
    insert(collection, new_key, obj);

Let's consider the case when new_key == old_key; we'll call both of
them "key". Without Will's proposed fence, the following reordering is
possible:

consumer:
    obj = lookup(collection, key);

                 producer:
                     key = obj->key
                     remove(collection, key);
                     if (!refcount_dec_and_test(&obj->ref))
                         return;
                     obj->key = KEY_INVALID;
                     free(obj);
                     obj = malloc(); /* obj is reused */
                     refcount_set(&obj->ref, 1);
                     obj->key = key; /* same key */

    if (!refcount_inc_not_zero(&obj->ref))
        return;
    smp_rmb();
    if (READ_ONCE(obj->key) != key) {
        put_ref(obj);
        return;
    }
    use(obj->value);

                     obj->value = new_value; /* reordered store */
                      insert(collection, key, obj);

So the consumer finds the old object, successfully takes a refcount
and validates the key. The validation succeeds because the object is
allocated and has the same key, which is fine. However, it then
proceeds to use a stale obj->value. Will's proposed release ordering
would prevent that.

The example at https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/slab.h#L102
omits all these ordering issues for SLAB_TYPESAFE_BY_RCU.
I think it would be better to introduce two new functions,
refcount_add_not_zero_acquire() and refcount_set_release(), and clearly
document that they should be used when a freed object can be recycled
and reused, as in the SLAB_TYPESAFE_BY_RCU case. The documentation for
refcount_set_release() should also clarify that once it's called, the
object can be accessed by consumers even if it has not yet been added
to the collection used for object lookup (as in the example above). The
SLAB_TYPESAFE_BY_RCU comment at
https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/slab.h#L102
can then explicitly use these new functions in the example code,
further clarifying their purpose and proper use.
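
As a strawman, the two helpers could be as simple as this (sketch only,
modelled on the existing __refcount_add_not_zero() and refcount_set();
saturation/warning handling omitted):

static inline __must_check bool refcount_add_not_zero_acquire(int i, refcount_t *r)
{
	int old = refcount_read(r);

	do {
		if (!old)
			break;
	} while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + i));

	return old;
}

static inline void refcount_set_release(refcount_t *r, int n)
{
	/* Publish the object's fields before marking it live. */
	atomic_set_release(&r->refs, n);
}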
WDYT?

ADDENDUM.
Detailed code for typical use of refcounting with SLAB_TYPESAFE_BY_RCU:

struct object {
    refcount_t ref;
    u64 key;
    u64 value;
};

void init(struct object *obj, u64 key, u64 value)
{
    obj->key = key;
    obj->value = value;
    smp_wmb(); /* Will's proposed release fence */
    refcount_set(&obj->ref, 1);
}

bool get_ref(struct object *obj, u64 key)
{
    if (!refcount_inc_not_zero(&obj->ref))
        return false;
    smp_rmb(); /* Peter's new acquire fence */
    if (READ_ONCE(obj->key) != key) {
        put_ref(obj);
        return false;
    }
    return true;
}

void put_ref(struct object *obj)
{
    if (!refcount_dec_and_test(&obj->ref))
        return;
    obj->key = KEY_INVALID;
    free(obj);
}

consumer:
    obj = lookup(collection, key);
    if (!get_ref(obj, key))
        return;
    use(obj->value);

producer:
    remove(collection, old_obj->key);
    put_ref(old_obj);
    new_obj = malloc();
    init(new_obj, new_key, new_value);
    insert(collection, new_key, new_obj);

With SLAB_TYPESAFE_BY_RCU old_obj in the producer can be reused and be
equal to new_obj.


>
> Will
Re: [PATCH] refcount: Strengthen inc_not_zero()
Posted by Suren Baghdasaryan 1 year ago
On Mon, Jan 27, 2025 at 11:21 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Jan 27, 2025 at 6:09 AM Will Deacon <will@kernel.org> wrote:
> >
> > On Fri, Jan 17, 2025 at 03:41:36PM +0000, Will Deacon wrote:
> > > On Wed, Jan 15, 2025 at 05:00:11PM +0100, Peter Zijlstra wrote:
> > > > On Wed, Jan 15, 2025 at 12:13:34PM +0100, Peter Zijlstra wrote:
> > > >
> > > > > Notably, it means refcount_t is entirely unsuitable for anything
> > > > > SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
> > > > > conditions after the refcount succeeds.
> > > > >
> > > > > And this is probably fine, but let me ponder this all a little more.
> > > >
> > > > Even though SLAB_TYPESAFE_BY_RCU is relatively rare, I think we'd better
> > > > fix this, these things are hard enough as they are.
> > > >
> > > > Will, others, wdyt?
> > >
> > > We should also update the Documentation (atomic_t.txt and
> > > refcount-vs-atomic.rst) if we strengthen this.
> > >
> > > > ---
> > > > Subject: refcount: Strengthen inc_not_zero()
> > > >
> > > > For speculative lookups where a successful inc_not_zero() pins the
> > > > object, but where we still need to double check if the object acquired
> > > > is indeed the one we set out to aquire, needs this validation to happen
> > > > *after* the increment.
> > > >
> > > > Notably SLAB_TYPESAFE_BY_RCU is one such an example.
> > > >
> > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > > ---
> > > >  include/linux/refcount.h | 15 ++++++++-------
> > > >  1 file changed, 8 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/include/linux/refcount.h b/include/linux/refcount.h
> > > > index 35f039ecb272..340e7ffa445e 100644
> > > > --- a/include/linux/refcount.h
> > > > +++ b/include/linux/refcount.h
> > > > @@ -69,9 +69,10 @@
> > > >   * its the lock acquire, for RCU/lockless data structures its the dependent
> > > >   * load.
> > > >   *
> > > > - * Do note that inc_not_zero() provides a control dependency which will order
> > > > - * future stores against the inc, this ensures we'll never modify the object
> > > > - * if we did not in fact acquire a reference.
> > > > + * Do note that inc_not_zero() does provide acquire order, which will order
> > > > + * future load and stores against the inc, this ensures all subsequent accesses
> > > > + * are from this object and not anything previously occupying this memory --
> > > > + * consider SLAB_TYPESAFE_BY_RCU.
> > > >   *
> > > >   * The decrements will provide release order, such that all the prior loads and
> > > >   * stores will be issued before, it also provides a control dependency, which
> > > > @@ -144,7 +145,7 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > > >     do {
> > > >             if (!old)
> > > >                     break;
> > > > -   } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> > > > +   } while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + i));
> > >
> > > Hmm, do the later memory accesses need to be ordered against the store
> > > part of the increment or just the read? If it's the former, then I don't
> > > think that _acquire is sufficient -- accesses can still get in-between
> > > the read and write parts of the RmW.
> >
> > I dug some more into this at the end of last week. For the
> > SLAB_TYPESAFE_BY_RCU where we're racing inc_not_zero() with
> > dec_and_test(), then I think using _acquire() above is correct as the
> > later references can only move up into the critical section in the case
> > that we successfully obtained a reference.
> >
> > However, if we're going to make the barriers implicit in the refcount
> > operations here then I think we also need to do something on the producer
> > side for when the object is re-initialised after being destroyed and
> > allocated again. I think that would necessitate release ordering for
> > refcount_set() so that whatever allows the consumer to validate the
> > object (e.g. sequence number) is published *before* the refcount.
>
> Thanks Will!
> I would like to expand on your answer to provide an example of the
> race that would happen without release ordering in the producer. To
> save reader's time I provide a simplified flow and reasoning first.
> More detailed code of what I'm considering a typical
> SLAB_TYPESAFE_BY_RCU refcounting example is added at the end of my
> reply (Addendum).
> Simplified flow looks like this:
>
> consumer:
>     obj = lookup(collection, key);
>     if (!refcount_inc_not_zero(&obj->ref))
>         return;
>     smp_rmb(); /* Peter's new acquire fence */
>     if (READ_ONCE(obj->key) != key) {
>         put_ref(obj);
>         return;
>     }
>     use(obj->value);
>
> producer:
>     old_key = obj->key;
>     remove(collection, old_key);
>     if (!refcount_dec_and_test(&obj->ref))
>         return;
>     obj->key = KEY_INVALID;
>     free(objj);
>     ...
>     obj = malloc(); /* obj is reused */
>     obj->key = new_key;
>     obj->value = new_value;
>     smp_wmb(); /* Will's proposed release fence */
>     refcount_set(obj->ref, 1);
>     insert(collection, key, obj);
>
> Let's consider a case when new_key == old_key. Will call both of them
> "key". Without WIll's proposed fence the following reordering is
> possible:
>
> consumer:
>     obj = lookup(collection, key);
>
>                  producer:
>                      key = obj->key
>                      remove(collection, key);
>                      if (!refcount_dec_and_test(&obj->ref))
>                          return;
>                      obj->key = KEY_INVALID;
>                      free(objj);
>                      obj = malloc(); /* obj is reused */
>                      refcount_set(obj->ref, 1);
>                      obj->key = key; /* same key */
>
>     if (!refcount_inc_not_zero(&obj->ref))
>         return;
>     smp_rmb();
>     if (READ_ONCE(obj->key) != key) {
>         put_ref(obj);
>         return;
>     }
>     use(obj->value);
>
>                      obj->value = new_value; /* reordered store */
>                      add(collection, key, obj);
>
> So, the consumer finds the old object, successfully takes a refcount
> and validates the key. It succeeds because the object is allocated and
> has the same key, which is fine. However it proceeds to use stale
> obj->value. Will's proposed release ordering would prevent that.
>
> The example in https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/slab.h#L102
> omits all these ordering issues for SLAB_TYPESAFE_BY_RCU.
> I think it would be better to introduce two new functions:
> refcount_add_not_zero_acquire() and refcount_set_release(), clearly
> document that they should be used when a freed object can be recycled
> and reused, like in SLAB_TYPESAFE_BY_RCU case. refcount_set_release()
> should also clarify that once it's called, the object can be accessed
> by consumers even if it was not added yet into the collection used for
> object lookup (like in the example above). SLAB_TYPESAFE_BY_RCU
> comment at https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/slab.h#L102
> then can explicitly use these new functions in the example code,
> further clarifying their purpose and proper use.
> WDYT?

Hi Peter,
Should I take a stab at preparing a patch to add the two new
refcounting functions suggested above with updates to the
documentation and comments?
If you disagree with that or need more time to decide then I'll wait.
Please let me know.
Thanks,
Suren.


>
> ADDENDUM.
> Detailed code for typical use of refcounting with SLAB_TYPESAFE_BY_RCU:
>
> struct object {
>     refcount_t ref;
>     u64 key;
>     u64 value;
> };
>
> void init(struct object *obj, u64 key, u64 value)
> {
>     obj->key = key;
>     obj->value = value;
>     smp_wmb(); /* Will's proposed release fence */
>     refcount_set(obj->ref, 1);
> }
>
> bool get_ref(struct object *obj, u64 key)
> {
>     if (!refcount_inc_not_zero(&obj->ref))
>         return false;
>     smp_rmb(); /* Peter's new acquire fence */
>     if (READ_ONCE(obj->key) != key) {
>         put_ref(obj);
>         return false;
>     }
>     return true;
> }
>
> void put_ref(struct object *obj)
> {
>     if (!refcount_dec_and_test(&obj->ref))
>         return;
>     obj->key = KEY_INVALID;
>     free(obj);
> }
>
> consumer:
>     obj = lookup(collection, key);
>     if (!get_ref(obj, key)
>         return;
>     use(obj->value);
>
> producer:
>     remove(collection, old_obj->key);
>     put_ref(old_obj);
>     new_obj = malloc();
>     init(new_obj, new_key, new_value);
>     insert(collection, new_key, new_obj);
>
> With SLAB_TYPESAFE_BY_RCU old_obj in the producer can be reused and be
> equal to new_obj.
>
>
> >
> > Will
Re: [PATCH] refcount: Strengthen inc_not_zero()
Posted by Suren Baghdasaryan 1 year ago
On Tue, Jan 28, 2025 at 3:51 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Jan 27, 2025 at 11:21 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Mon, Jan 27, 2025 at 6:09 AM Will Deacon <will@kernel.org> wrote:
> > >
> > > On Fri, Jan 17, 2025 at 03:41:36PM +0000, Will Deacon wrote:
> > > > On Wed, Jan 15, 2025 at 05:00:11PM +0100, Peter Zijlstra wrote:
> > > > > On Wed, Jan 15, 2025 at 12:13:34PM +0100, Peter Zijlstra wrote:
> > > > >
> > > > > > Notably, it means refcount_t is entirely unsuitable for anything
> > > > > > SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
> > > > > > conditions after the refcount succeeds.
> > > > > >
> > > > > > And this is probably fine, but let me ponder this all a little more.
> > > > >
> > > > > Even though SLAB_TYPESAFE_BY_RCU is relatively rare, I think we'd better
> > > > > fix this, these things are hard enough as they are.
> > > > >
> > > > > Will, others, wdyt?
> > > >
> > > > We should also update the Documentation (atomic_t.txt and
> > > > refcount-vs-atomic.rst) if we strengthen this.
> > > >
> > > > > ---
> > > > > Subject: refcount: Strengthen inc_not_zero()
> > > > >
> > > > > For speculative lookups where a successful inc_not_zero() pins the
> > > > > object, but where we still need to double check if the object acquired
> > > > > is indeed the one we set out to aquire, needs this validation to happen
> > > > > *after* the increment.
> > > > >
> > > > > Notably SLAB_TYPESAFE_BY_RCU is one such an example.
> > > > >
> > > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > > > ---
> > > > >  include/linux/refcount.h | 15 ++++++++-------
> > > > >  1 file changed, 8 insertions(+), 7 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/refcount.h b/include/linux/refcount.h
> > > > > index 35f039ecb272..340e7ffa445e 100644
> > > > > --- a/include/linux/refcount.h
> > > > > +++ b/include/linux/refcount.h
> > > > > @@ -69,9 +69,10 @@
> > > > >   * its the lock acquire, for RCU/lockless data structures its the dependent
> > > > >   * load.
> > > > >   *
> > > > > - * Do note that inc_not_zero() provides a control dependency which will order
> > > > > - * future stores against the inc, this ensures we'll never modify the object
> > > > > - * if we did not in fact acquire a reference.
> > > > > + * Do note that inc_not_zero() does provide acquire order, which will order
> > > > > + * future load and stores against the inc, this ensures all subsequent accesses
> > > > > + * are from this object and not anything previously occupying this memory --
> > > > > + * consider SLAB_TYPESAFE_BY_RCU.
> > > > >   *
> > > > >   * The decrements will provide release order, such that all the prior loads and
> > > > >   * stores will be issued before, it also provides a control dependency, which
> > > > > @@ -144,7 +145,7 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > > > >     do {
> > > > >             if (!old)
> > > > >                     break;
> > > > > -   } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> > > > > +   } while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + i));
> > > >
> > > > Hmm, do the later memory accesses need to be ordered against the store
> > > > part of the increment or just the read? If it's the former, then I don't
> > > > think that _acquire is sufficient -- accesses can still get in-between
> > > > the read and write parts of the RmW.
> > >
> > > I dug some more into this at the end of last week. For the
> > > SLAB_TYPESAFE_BY_RCU where we're racing inc_not_zero() with
> > > dec_and_test(), then I think using _acquire() above is correct as the
> > > later references can only move up into the critical section in the case
> > > that we successfully obtained a reference.
> > >
> > > However, if we're going to make the barriers implicit in the refcount
> > > operations here then I think we also need to do something on the producer
> > > side for when the object is re-initialised after being destroyed and
> > > allocated again. I think that would necessitate release ordering for
> > > refcount_set() so that whatever allows the consumer to validate the
> > > object (e.g. sequence number) is published *before* the refcount.
> >
> > Thanks Will!
> > I would like to expand on your answer to provide an example of the
> > race that would happen without release ordering in the producer. To
> > save reader's time I provide a simplified flow and reasoning first.
> > More detailed code of what I'm considering a typical
> > SLAB_TYPESAFE_BY_RCU refcounting example is added at the end of my
> > reply (Addendum).
> > Simplified flow looks like this:
> >
> > consumer:
> >     obj = lookup(collection, key);
> >     if (!refcount_inc_not_zero(&obj->ref))
> >         return;
> >     smp_rmb(); /* Peter's new acquire fence */
> >     if (READ_ONCE(obj->key) != key) {
> >         put_ref(obj);
> >         return;
> >     }
> >     use(obj->value);
> >
> > producer:
> >     old_key = obj->key;
> >     remove(collection, old_key);
> >     if (!refcount_dec_and_test(&obj->ref))
> >         return;
> >     obj->key = KEY_INVALID;
> >     free(objj);
> >     ...
> >     obj = malloc(); /* obj is reused */
> >     obj->key = new_key;
> >     obj->value = new_value;
> >     smp_wmb(); /* Will's proposed release fence */
> >     refcount_set(obj->ref, 1);
> >     insert(collection, key, obj);
> >
> > Let's consider a case when new_key == old_key. Will call both of them
> > "key". Without WIll's proposed fence the following reordering is
> > possible:
> >
> > consumer:
> >     obj = lookup(collection, key);
> >
> >                  producer:
> >                      key = obj->key
> >                      remove(collection, key);
> >                      if (!refcount_dec_and_test(&obj->ref))
> >                          return;
> >                      obj->key = KEY_INVALID;
> >                      free(objj);
> >                      obj = malloc(); /* obj is reused */
> >                      refcount_set(obj->ref, 1);
> >                      obj->key = key; /* same key */
> >
> >     if (!refcount_inc_not_zero(&obj->ref))
> >         return;
> >     smp_rmb();
> >     if (READ_ONCE(obj->key) != key) {
> >         put_ref(obj);
> >         return;
> >     }
> >     use(obj->value);
> >
> >                      obj->value = new_value; /* reordered store */
> >                      add(collection, key, obj);
> >
> > So, the consumer finds the old object, successfully takes a refcount
> > and validates the key. It succeeds because the object is allocated and
> > has the same key, which is fine. However it proceeds to use stale
> > obj->value. Will's proposed release ordering would prevent that.
> >
> > The example in https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/slab.h#L102
> > omits all these ordering issues for SLAB_TYPESAFE_BY_RCU.
> > I think it would be better to introduce two new functions:
> > refcount_add_not_zero_acquire() and refcount_set_release(), clearly
> > document that they should be used when a freed object can be recycled
> > and reused, like in SLAB_TYPESAFE_BY_RCU case. refcount_set_release()
> > should also clarify that once it's called, the object can be accessed
> > by consumers even if it was not added yet into the collection used for
> > object lookup (like in the example above). SLAB_TYPESAFE_BY_RCU
> > comment at https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/slab.h#L102
> > then can explicitly use these new functions in the example code,
> > further clarifying their purpose and proper use.
> > WDYT?
>
> Hi Peter,
> Should I take a stab at preparing a patch to add the two new
> refcounting functions suggested above with updates to the
> documentation and comments?
> If you disagree with that or need more time to decide then I'll wait.
> Please let me know.

Not sure if "--in-reply-to" worked but I just posted a patch adding
new refcounting APIs for SLAB_TYPESAFE_BY_RCU here:
https://lore.kernel.org/all/20250206025201.979573-1-surenb@google.com/
Since Peter seems to be busy I discussed these ordering requirements
for SLAB_TYPESAFE_BY_RCU with Paul McKenney and he was leaning towards
having separate functions with the additional fences for this case.
That's what I provided in my patch.
Another possible option is to add acquire ordering in the
__refcount_add_not_zero() as Peter suggested and add
refcount_set_release() function.
Thanks,
Suren.


> Thanks,
> Suren.
>
>
> >
> > ADDENDUM.
> > Detailed code for typical use of refcounting with SLAB_TYPESAFE_BY_RCU:
> >
> > struct object {
> >     refcount_t ref;
> >     u64 key;
> >     u64 value;
> > };
> >
> > void init(struct object *obj, u64 key, u64 value)
> > {
> >     obj->key = key;
> >     obj->value = value;
> >     smp_wmb(); /* Will's proposed release fence */
> >     refcount_set(obj->ref, 1);
> > }
> >
> > bool get_ref(struct object *obj, u64 key)
> > {
> >     if (!refcount_inc_not_zero(&obj->ref))
> >         return false;
> >     smp_rmb(); /* Peter's new acquire fence */
> >     if (READ_ONCE(obj->key) != key) {
> >         put_ref(obj);
> >         return false;
> >     }
> >     return true;
> > }
> >
> > void put_ref(struct object *obj)
> > {
> >     if (!refcount_dec_and_test(&obj->ref))
> >         return;
> >     obj->key = KEY_INVALID;
> >     free(obj);
> > }
> >
> > consumer:
> >     obj = lookup(collection, key);
> >     if (!get_ref(obj, key)
> >         return;
> >     use(obj->value);
> >
> > producer:
> >     remove(collection, old_obj->key);
> >     put_ref(old_obj);
> >     new_obj = malloc();
> >     init(new_obj, new_key, new_value);
> >     insert(collection, new_key, new_obj);
> >
> > With SLAB_TYPESAFE_BY_RCU old_obj in the producer can be reused and be
> > equal to new_obj.
> >
> >
> > >
> > > Will
Re: [PATCH] refcount: Strengthen inc_not_zero()
Posted by Suren Baghdasaryan 11 months, 4 weeks ago
On Wed, Feb 5, 2025 at 7:03 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Jan 28, 2025 at 3:51 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Mon, Jan 27, 2025 at 11:21 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Mon, Jan 27, 2025 at 6:09 AM Will Deacon <will@kernel.org> wrote:
> > > >
> > > > On Fri, Jan 17, 2025 at 03:41:36PM +0000, Will Deacon wrote:
> > > > > On Wed, Jan 15, 2025 at 05:00:11PM +0100, Peter Zijlstra wrote:
> > > > > > On Wed, Jan 15, 2025 at 12:13:34PM +0100, Peter Zijlstra wrote:
> > > > > >
> > > > > > > Notably, it means refcount_t is entirely unsuitable for anything
> > > > > > > SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
> > > > > > > conditions after the refcount succeeds.
> > > > > > >
> > > > > > > And this is probably fine, but let me ponder this all a little more.
> > > > > >
> > > > > > Even though SLAB_TYPESAFE_BY_RCU is relatively rare, I think we'd better
> > > > > > fix this, these things are hard enough as they are.
> > > > > >
> > > > > > Will, others, wdyt?
> > > > >
> > > > > We should also update the Documentation (atomic_t.txt and
> > > > > refcount-vs-atomic.rst) if we strengthen this.
> > > > >
> > > > > > ---
> > > > > > Subject: refcount: Strengthen inc_not_zero()
> > > > > >
> > > > > > For speculative lookups where a successful inc_not_zero() pins the
> > > > > > object, but where we still need to double check if the object acquired
> > > > > > is indeed the one we set out to acquire, needs this validation to happen
> > > > > > *after* the increment.
> > > > > >
> > > > > > Notably SLAB_TYPESAFE_BY_RCU is one such an example.
> > > > > >
> > > > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > > > > ---
> > > > > >  include/linux/refcount.h | 15 ++++++++-------
> > > > > >  1 file changed, 8 insertions(+), 7 deletions(-)
> > > > > >
> > > > > > diff --git a/include/linux/refcount.h b/include/linux/refcount.h
> > > > > > index 35f039ecb272..340e7ffa445e 100644
> > > > > > --- a/include/linux/refcount.h
> > > > > > +++ b/include/linux/refcount.h
> > > > > > @@ -69,9 +69,10 @@
> > > > > >   * its the lock acquire, for RCU/lockless data structures its the dependent
> > > > > >   * load.
> > > > > >   *
> > > > > > - * Do note that inc_not_zero() provides a control dependency which will order
> > > > > > - * future stores against the inc, this ensures we'll never modify the object
> > > > > > - * if we did not in fact acquire a reference.
> > > > > > + * Do note that inc_not_zero() does provide acquire order, which will order
> > > > > > + * future load and stores against the inc, this ensures all subsequent accesses
> > > > > > + * are from this object and not anything previously occupying this memory --
> > > > > > + * consider SLAB_TYPESAFE_BY_RCU.
> > > > > >   *
> > > > > >   * The decrements will provide release order, such that all the prior loads and
> > > > > >   * stores will be issued before, it also provides a control dependency, which
> > > > > > @@ -144,7 +145,7 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > > > > >     do {
> > > > > >             if (!old)
> > > > > >                     break;
> > > > > > -   } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> > > > > > +   } while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + i));
> > > > >
> > > > > Hmm, do the later memory accesses need to be ordered against the store
> > > > > part of the increment or just the read? If it's the former, then I don't
> > > > > think that _acquire is sufficient -- accesses can still get in-between
> > > > > the read and write parts of the RmW.
> > > >
> > > > I dug some more into this at the end of last week. For the
> > > > SLAB_TYPESAFE_BY_RCU where we're racing inc_not_zero() with
> > > > dec_and_test(), then I think using _acquire() above is correct as the
> > > > later references can only move up into the critical section in the case
> > > > that we successfully obtained a reference.
> > > >
> > > > However, if we're going to make the barriers implicit in the refcount
> > > > operations here then I think we also need to do something on the producer
> > > > side for when the object is re-initialised after being destroyed and
> > > > allocated again. I think that would necessitate release ordering for
> > > > refcount_set() so that whatever allows the consumer to validate the
> > > > object (e.g. sequence number) is published *before* the refcount.
> > >
> > > Thanks Will!
> > > I would like to expand on your answer to provide an example of the
> > > race that would happen without release ordering in the producer. To
> > > save reader's time I provide a simplified flow and reasoning first.
> > > More detailed code of what I'm considering a typical
> > > SLAB_TYPESAFE_BY_RCU refcounting example is added at the end of my
> > > reply (Addendum).
> > > Simplified flow looks like this:
> > >
> > > consumer:
> > >     obj = lookup(collection, key);
> > >     if (!refcount_inc_not_zero(&obj->ref))
> > >         return;
> > >     smp_rmb(); /* Peter's new acquire fence */
> > >     if (READ_ONCE(obj->key) != key) {
> > >         put_ref(obj);
> > >         return;
> > >     }
> > >     use(obj->value);
> > >
> > > producer:
> > >     old_key = obj->key;
> > >     remove(collection, old_key);
> > >     if (!refcount_dec_and_test(&obj->ref))
> > >         return;
> > >     obj->key = KEY_INVALID;
> > >     free(obj);
> > >     ...
> > >     obj = malloc(); /* obj is reused */
> > >     obj->key = new_key;
> > >     obj->value = new_value;
> > >     smp_wmb(); /* Will's proposed release fence */
> > >     refcount_set(obj->ref, 1);
> > >     insert(collection, key, obj);
> > >
> > > Let's consider a case when new_key == old_key; we'll call both of them
> > > "key". Without Will's proposed fence the following reordering is
> > > possible:
> > >
> > > consumer:
> > >     obj = lookup(collection, key);
> > >
> > >                  producer:
> > >                      key = obj->key
> > >                      remove(collection, key);
> > >                      if (!refcount_dec_and_test(&obj->ref))
> > >                          return;
> > >                      obj->key = KEY_INVALID;
> > >                      free(obj);
> > >                      obj = malloc(); /* obj is reused */
> > >                      refcount_set(obj->ref, 1);
> > >                      obj->key = key; /* same key */
> > >
> > >     if (!refcount_inc_not_zero(&obj->ref))
> > >         return;
> > >     smp_rmb();
> > >     if (READ_ONCE(obj->key) != key) {
> > >         put_ref(obj);
> > >         return;
> > >     }
> > >     use(obj->value);
> > >
> > >                      obj->value = new_value; /* reordered store */
> > >                      add(collection, key, obj);
> > >
> > > So, the consumer finds the old object, successfully takes a refcount
> > > and validates the key. It succeeds because the object is allocated and
> > > has the same key, which is fine. However it proceeds to use stale
> > > obj->value. Will's proposed release ordering would prevent that.
> > >
> > > The example in https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/slab.h#L102
> > > omits all these ordering issues for SLAB_TYPESAFE_BY_RCU.
> > > I think it would be better to introduce two new functions:
> > > refcount_add_not_zero_acquire() and refcount_set_release(), clearly
> > > document that they should be used when a freed object can be recycled
> > > and reused, like in SLAB_TYPESAFE_BY_RCU case. refcount_set_release()
> > > should also clarify that once it's called, the object can be accessed
> > > by consumers even if it was not added yet into the collection used for
> > > object lookup (like in the example above). SLAB_TYPESAFE_BY_RCU
> > > comment at https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/slab.h#L102
> > > then can explicitly use these new functions in the example code,
> > > further clarifying their purpose and proper use.
> > > WDYT?
> >
> > Hi Peter,
> > Should I take a stab at preparing a patch to add the two new
> > refcounting functions suggested above with updates to the
> > documentation and comments?
> > If you disagree with that or need more time to decide then I'll wait.
> > Please let me know.
>
> Not sure if "--in-reply-to" worked but I just posted a patch adding
> new refcounting APIs for SLAB_TYPESAFE_BY_RCU here:
> https://lore.kernel.org/all/20250206025201.979573-1-surenb@google.com/

Since I did not get any replies other than Vlastimil's Ack on the
above patch, I went ahead and posted v10 of my patchset [1] and
included the patch above in it [2]. Feedback is highly appreciated!

[1] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@google.com/
[2] https://lore.kernel.org/all/20250213224655.1680278-11-surenb@google.com/


> Since Peter seems to be busy I discussed these ordering requirements
> for SLAB_TYPESAFE_BY_RCU with Paul McKenney and he was leaning towards
> having separate functions with the additional fences for this case.
> That's what I provided in my patch.
> Another possible option is to add acquire ordering in the
> __refcount_add_not_zero() as Peter suggested and add
> refcount_set_release() function.
> Thanks,
> Suren.
>
>
> > Thanks,
> > Suren.
> >
> >
> > >
> > > ADDENDUM.
> > > Detailed code for typical use of refcounting with SLAB_TYPESAFE_BY_RCU:
> > >
> > > struct object {
> > >     refcount_t ref;
> > >     u64 key;
> > >     u64 value;
> > > };
> > >
> > > void init(struct object *obj, u64 key, u64 value)
> > > {
> > >     obj->key = key;
> > >     obj->value = value;
> > >     smp_wmb(); /* Will's proposed release fence */
> > >     refcount_set(obj->ref, 1);
> > > }
> > >
> > > bool get_ref(struct object *obj, u64 key)
> > > {
> > >     if (!refcount_inc_not_zero(&obj->ref))
> > >         return false;
> > >     smp_rmb(); /* Peter's new acquire fence */
> > >     if (READ_ONCE(obj->key) != key) {
> > >         put_ref(obj);
> > >         return false;
> > >     }
> > >     return true;
> > > }
> > >
> > > void put_ref(struct object *obj)
> > > {
> > >     if (!refcount_dec_and_test(&obj->ref))
> > >         return;
> > >     obj->key = KEY_INVALID;
> > >     free(obj);
> > > }
> > >
> > > consumer:
> > >     obj = lookup(collection, key);
> > >     if (!get_ref(obj, key))
> > >         return;
> > >     use(obj->value);
> > >
> > > producer:
> > >     remove(collection, old_obj->key);
> > >     put_ref(old_obj);
> > >     new_obj = malloc();
> > >     init(new_obj, new_key, new_value);
> > >     insert(collection, new_key, new_obj);
> > >
> > > With SLAB_TYPESAFE_BY_RCU old_obj in the producer can be reused and be
> > > equal to new_obj.
> > >
> > >
> > > >
> > > > Will

[PATCH 1/1] refcount: provide ops for cases when object's memory can be reused
Posted by Suren Baghdasaryan 1 year ago
For speculative lookups, where a successful inc_not_zero() pins the
object but we still need to double-check that the object acquired is
indeed the one we set out to acquire (identity check), this validation
needs to happen *after* the increment.
Similarly, when a new object is initialized and its memory might have
been previously occupied by another object, all stores to initialize the
object should happen *before* refcount initialization.

Notably, SLAB_TYPESAFE_BY_RCU is one such example where this ordering is
required for reference counting.

Add refcount_{add|inc}_not_zero_acquire() to guarantee the proper ordering
between acquiring a reference count on an object and performing the
identity check for that object.
Add refcount_set_release() to guarantee proper ordering between stores
initializing object attributes and the store initializing the refcount.
refcount_set_release() should be done after all other object attributes
are initialized. Once refcount_set_release() is called, the object should
be considered visible to other tasks even if it was not yet added into an
object collection normally used to discover it. This is because other
tasks might have discovered the object previously occupying the same
memory, and after memory reuse they can succeed in taking a refcount on the
new object and start using it.

Object reuse example to consider:

consumer:
    obj = lookup(collection, key);
    if (!refcount_inc_not_zero_acquire(&obj->ref))
        return;
    if (READ_ONCE(obj->key) != key) { /* identity check */
        put_ref(obj);
        return;
    }
    use(obj->value);

                 producer:
                     remove(collection, obj->key);
                     if (!refcount_dec_and_test(&obj->ref))
                         return;
                     obj->key = KEY_INVALID;
                     free(obj);
                     obj = malloc(); /* obj is reused */
                     obj->key = new_key;
                     obj->value = new_value;
                     refcount_set_release(&obj->ref, 1);
                     add(collection, new_key, obj);

refcount_{add|inc}_not_zero_acquire() is required to prevent the following
reordering when refcount_inc_not_zero() is used instead:

consumer:
    obj = lookup(collection, key);
    if (READ_ONCE(obj->key) != key) { /* reordered identity check */
        put_ref(obj);
        return;
    }
                 producer:
                     remove(collection, obj->key);
                     if (!refcount_dec_and_test(&obj->ref))
                         return;
                     obj->key = KEY_INVALID;
                     free(obj);
                     obj = malloc(); /* obj is reused */
                     obj->key = new_key;
                     obj->value = new_value;
                     refcount_set_release(&obj->ref, 1);
                     add(collection, new_key, obj);

    if (!refcount_inc_not_zero(&obj->ref))
        return;
    use(obj->value); /* USING WRONG OBJECT */

refcount_set_release() is required to prevent the following reordering
when refcount_set() is used instead:

consumer:
    obj = lookup(collection, key);

                 producer:
                     remove(collection, obj->key);
                     if (!refcount_dec_and_test(&obj->ref))
                         return;
                     obj->key = KEY_INVALID;
                     free(obj);
                     obj = malloc(); /* obj is reused */
                     obj->key = new_key; /* new_key == old_key */
                     refcount_set(&obj->ref, 1);

    if (!refcount_inc_not_zero_acquire(&obj->ref))
        return;
    if (READ_ONCE(obj->key) != key) { /* pass since new_key == old_key */
        put_ref(obj);
        return;
    }
    use(obj->value); /* USING STALE obj->value */

                     obj->value = new_value; /* reordered store */
                     add(collection, key, obj);
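
For illustration, the same pattern written against a SLAB_TYPESAFE_BY_RCU
slab cache could look like the sketch below; lookup(), insert(), "collection"
and KEY_INVALID are assumed helpers/constants used only for this sketch:

struct object {
        refcount_t ref;
        u64 key;
        u64 value;
};

static struct kmem_cache *obj_cache; /* created with SLAB_TYPESAFE_BY_RCU */

static void obj_put(struct object *obj)
{
        if (refcount_dec_and_test(&obj->ref)) {
                obj->key = KEY_INVALID;
                kmem_cache_free(obj_cache, obj);
        }
}

static struct object *obj_get(u64 key)
{
        struct object *obj;

        rcu_read_lock();
        obj = lookup(collection, key);
        if (obj && !refcount_inc_not_zero_acquire(&obj->ref))
                obj = NULL;
        rcu_read_unlock();
        if (obj && READ_ONCE(obj->key) != key) { /* identity check */
                obj_put(obj);
                obj = NULL;
        }
        return obj;
}

static struct object *obj_create(u64 key, u64 value)
{
        struct object *obj = kmem_cache_alloc(obj_cache, GFP_KERNEL);

        if (!obj)
                return NULL;
        obj->key = key;
        obj->value = value;
        /* All initializing stores above must be visible before the refcount. */
        refcount_set_release(&obj->ref, 1);
        insert(collection, key, obj);
        return obj;
}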

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
---
 Documentation/RCU/whatisRCU.rst               |  10 ++
 Documentation/core-api/refcount-vs-atomic.rst |  37 +++++-
 include/linux/refcount.h                      | 106 ++++++++++++++++++
 include/linux/slab.h                          |   9 ++
 4 files changed, 156 insertions(+), 6 deletions(-)

diff --git a/Documentation/RCU/whatisRCU.rst b/Documentation/RCU/whatisRCU.rst
index 1ef5784c1b84..53faeed7c190 100644
--- a/Documentation/RCU/whatisRCU.rst
+++ b/Documentation/RCU/whatisRCU.rst
@@ -971,6 +971,16 @@ unfortunately any spinlock in a ``SLAB_TYPESAFE_BY_RCU`` object must be
 initialized after each and every call to kmem_cache_alloc(), which renders
 reference-free spinlock acquisition completely unsafe.  Therefore, when
 using ``SLAB_TYPESAFE_BY_RCU``, make proper use of a reference counter.
+If using refcount_t, the specialized refcount_{add|inc}_not_zero_acquire()
+and refcount_set_release() APIs should be used to ensure correct operation
+ordering when verifying object identity and when initializing newly
+allocated objects. The acquire fence in refcount_{add|inc}_not_zero_acquire()
+ensures that identity checks happen *after* the reference count is taken.
+refcount_set_release() should be called after a newly allocated object is
+fully initialized, and its release fence ensures that the new values are
+visible *before* the refcount can be successfully taken by other users. Once
+refcount_set_release() is called, the object should be considered visible
+to other tasks.
 (Those willing to initialize their locks in a kmem_cache constructor
 may also use locking, including cache-friendly sequence locking.)
 
diff --git a/Documentation/core-api/refcount-vs-atomic.rst b/Documentation/core-api/refcount-vs-atomic.rst
index 79a009ce11df..9551a7bbfd38 100644
--- a/Documentation/core-api/refcount-vs-atomic.rst
+++ b/Documentation/core-api/refcount-vs-atomic.rst
@@ -86,7 +86,19 @@ Memory ordering guarantee changes:
  * none (both fully unordered)
 
 
-case 2) - increment-based ops that return no value
+case 2) - non-"Read/Modify/Write" (RMW) ops with release ordering
+---------------------------------------------------------------------
+
+Function changes:
+
+ * atomic_set_release() --> refcount_set_release()
+
+Memory ordering guarantee changes:
+
+ * none (both provide RELEASE ordering)
+
+
+case 3) - increment-based ops that return no value
 --------------------------------------------------
 
 Function changes:
@@ -98,7 +110,7 @@ Memory ordering guarantee changes:
 
  * none (both fully unordered)
 
-case 3) - decrement-based RMW ops that return no value
+case 4) - decrement-based RMW ops that return no value
 ------------------------------------------------------
 
 Function changes:
@@ -110,7 +122,7 @@ Memory ordering guarantee changes:
  * fully unordered --> RELEASE ordering
 
 
-case 4) - increment-based RMW ops that return a value
+case 5) - increment-based RMW ops that return a value
 -----------------------------------------------------
 
 Function changes:
@@ -126,7 +138,20 @@ Memory ordering guarantees changes:
    result of obtaining pointer to the object!
 
 
-case 5) - generic dec/sub decrement-based RMW ops that return a value
+case 6) - increment-based RMW ops with acquire ordering that return a value
+-------------------------------------------------------------------------------
+
+Function changes:
+
+ * atomic_inc_not_zero() --> refcount_inc_not_zero_acquire()
+ * no atomic counterpart --> refcount_add_not_zero_acquire()
+
+Memory ordering guarantees changes:
+
+ * fully ordered --> ACQUIRE ordering on success
+
+
+case 7) - generic dec/sub decrement-based RMW ops that return a value
 ---------------------------------------------------------------------
 
 Function changes:
@@ -139,7 +164,7 @@ Memory ordering guarantees changes:
  * fully ordered --> RELEASE ordering + ACQUIRE ordering on success
 
 
-case 6) other decrement-based RMW ops that return a value
+case 8) other decrement-based RMW ops that return a value
 ---------------------------------------------------------
 
 Function changes:
@@ -154,7 +179,7 @@ Memory ordering guarantees changes:
 .. note:: atomic_add_unless() only provides full order on success.
 
 
-case 7) - lock-based RMW
+case 9) - lock-based RMW
 ------------------------
 
 Function changes:
diff --git a/include/linux/refcount.h b/include/linux/refcount.h
index 35f039ecb272..4589d2e7bfea 100644
--- a/include/linux/refcount.h
+++ b/include/linux/refcount.h
@@ -87,6 +87,15 @@
  * The decrements dec_and_test() and sub_and_test() also provide acquire
  * ordering on success.
  *
+ * refcount_{add|inc}_not_zero_acquire() and refcount_set_release() provide
+ * acquire and release ordering for cases when the memory occupied by the
+ * object might be reused to store another object. This is important for the
+ * cases where secondary validation is required to detect such reuse, e.g.
+ * SLAB_TYPESAFE_BY_RCU. The secondary validation checks have to happen after
+ * the refcount is taken, hence acquire order is necessary. Similarly, when the
+ * object is initialized, all stores to its attributes should be visible before
+ * the refcount is set, otherwise a stale attribute value might be used by
+ * another task which succeeds in taking a refcount to the new object.
  */
 
 #ifndef _LINUX_REFCOUNT_H
@@ -125,6 +134,31 @@ static inline void refcount_set(refcount_t *r, int n)
 	atomic_set(&r->refs, n);
 }
 
+/**
+ * refcount_set_release - set a refcount's value with release ordering
+ * @r: the refcount
+ * @n: value to which the refcount will be set
+ *
+ * This function should be used when memory occupied by the object might be
+ * reused to store another object -- consider SLAB_TYPESAFE_BY_RCU.
+ *
+ * Provides release memory ordering which will order previous memory operations
+ * against this store. This ensures all updates to this object are visible
+ * once the refcount is set and stale values from the object previously
+ * occupying this memory are overwritten with new ones.
+ *
+ * This function should be called only after new object is fully initialized.
+ * After this call the object should be considered visible to other tasks even
+ * if it was not yet added into an object collection normally used to discover
+ * it. This is because other tasks might have discovered the object previously
+ * occupying the same memory and after memory reuse they can succeed in taking
+ * refcount to the new object and start using it.
+ */
+static inline void refcount_set_release(refcount_t *r, int n)
+{
+	atomic_set_release(&r->refs, n);
+}
+
 /**
  * refcount_read - get a refcount's value
  * @r: the refcount
@@ -178,6 +212,52 @@ static inline __must_check bool refcount_add_not_zero(int i, refcount_t *r)
 	return __refcount_add_not_zero(i, r, NULL);
 }
 
+static inline __must_check __signed_wrap
+bool __refcount_add_not_zero_acquire(int i, refcount_t *r, int *oldp)
+{
+	int old = refcount_read(r);
+
+	do {
+		if (!old)
+			break;
+	} while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + i));
+
+	if (oldp)
+		*oldp = old;
+
+	if (unlikely(old < 0 || old + i < 0))
+		refcount_warn_saturate(r, REFCOUNT_ADD_NOT_ZERO_OVF);
+
+	return old;
+}
+
+/**
+ * refcount_add_not_zero_acquire - add a value to a refcount with acquire ordering unless it is 0
+ *
+ * @i: the value to add to the refcount
+ * @r: the refcount
+ *
+ * Will saturate at REFCOUNT_SATURATED and WARN.
+ *
+ * This function should be used when memory occupied by the object might be
+ * reused to store another object -- consider SLAB_TYPESAFE_BY_RCU.
+ *
+ * Provides acquire memory ordering on success, it is assumed the caller has
+ * guaranteed the object memory to be stable (RCU, etc.). It does provide a
+ * control dependency and thereby orders future stores. See the comment on top.
+ *
+ * Use of this function is not recommended for the normal reference counting
+ * use case in which references are taken and released one at a time.  In these
+ * cases, refcount_inc_not_zero_acquire() should instead be used to increment a
+ * reference count.
+ *
+ * Return: false if the passed refcount is 0, true otherwise
+ */
+static inline __must_check bool refcount_add_not_zero_acquire(int i, refcount_t *r)
+{
+	return __refcount_add_not_zero_acquire(i, r, NULL);
+}
+
 static inline __signed_wrap
 void __refcount_add(int i, refcount_t *r, int *oldp)
 {
@@ -236,6 +316,32 @@ static inline __must_check bool refcount_inc_not_zero(refcount_t *r)
 	return __refcount_inc_not_zero(r, NULL);
 }
 
+static inline __must_check bool __refcount_inc_not_zero_acquire(refcount_t *r, int *oldp)
+{
+	return __refcount_add_not_zero_acquire(1, r, oldp);
+}
+
+/**
+ * refcount_inc_not_zero_acquire - increment a refcount with acquire ordering unless it is 0
+ * @r: the refcount to increment
+ *
+ * Similar to refcount_inc_not_zero(), but provides acquire memory ordering on
+ * success.
+ *
+ * This function should be used when memory occupied by the object might be
+ * reused to store another object -- consider SLAB_TYPESAFE_BY_RCU.
+ *
+ * Provides acquire memory ordering on success, it is assumed the caller has
+ * guaranteed the object memory to be stable (RCU, etc.). It does provide a
+ * control dependency and thereby orders future stores. See the comment on top.
+ *
+ * Return: true if the increment was successful, false otherwise
+ */
+static inline __must_check bool refcount_inc_not_zero_acquire(refcount_t *r)
+{
+	return __refcount_inc_not_zero_acquire(r, NULL);
+}
+
 static inline void __refcount_inc(refcount_t *r, int *oldp)
 {
 	__refcount_add(1, r, oldp);
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 09eedaecf120..ad902a2d692b 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -136,6 +136,15 @@ enum _slab_flag_bits {
  * rcu_read_lock before reading the address, then rcu_read_unlock after
  * taking the spinlock within the structure expected at that address.
  *
+ * Note that the object identity check has to be done *after* acquiring a
+ * reference; therefore the user has to ensure proper ordering for loads.
+ * Similarly, when initializing objects allocated with SLAB_TYPESAFE_BY_RCU,
+ * the newly allocated object has to be fully initialized *before* its
+ * refcount gets initialized, and proper ordering for stores is required.
+ * refcount_{add|inc}_not_zero_acquire() and refcount_set_release() are
+ * designed with the proper fences required for reference counting objects
+ * allocated with SLAB_TYPESAFE_BY_RCU.
+ *
  * Note that it is not possible to acquire a lock within a structure
  * allocated with SLAB_TYPESAFE_BY_RCU without first acquiring a reference
  * as described above.  The reason is that SLAB_TYPESAFE_BY_RCU pages

base-commit: 92514ef226f511f2ca1fb1b8752966097518edc0
-- 
2.48.1.362.g079036d154-goog
Re: [PATCH 1/1] refcount: provide ops for cases when object's memory can be reused
Posted by Vlastimil Babka 1 year ago
On 2/6/25 03:52, Suren Baghdasaryan wrote:
> For speculative lookups where a successful inc_not_zero() pins the
> object, but where we still need to double check if the object acquired
> is indeed the one we set out to acquire (identity check), needs this
> validation to happen *after* the increment.
> Similarly, when a new object is initialized and its memory might have
> been previously occupied by another object, all stores to initialize the
> object should happen *before* refcount initialization.
> 
> Notably SLAB_TYPESAFE_BY_RCU is one such an example when this ordering
> is required for reference counting.
> 
> Add refcount_{add|inc}_not_zero_acquire() to guarantee the proper ordering
> between acquiring a reference count on an object and performing the
> identity check for that object.
> Add refcount_set_release() to guarantee proper ordering between stores
> initializing object attributes and the store initializing the refcount.
> refcount_set_release() should be done after all other object attributes
> are initialized. Once refcount_set_release() is called, the object should
> be considered visible to other tasks even if it was not yet added into an
> object collection normally used to discover it. This is because other
> tasks might have discovered the object previously occupying the same
> memory and after memory reuse they can succeed in taking refcount for the
> new object and start using it.
> 
> Object reuse example to consider:
> 
> consumer:
>     obj = lookup(collection, key);
>     if (!refcount_inc_not_zero_acquire(&obj->ref))
>         return;
>     if (READ_ONCE(obj->key) != key) { /* identity check */
>         put_ref(obj);
>         return;
>     }
>     use(obj->value);
> 
>                  producer:
>                      remove(collection, obj->key);
>                      if (!refcount_dec_and_test(&obj->ref))
>                          return;
>                      obj->key = KEY_INVALID;
>                      free(obj);
>                      obj = malloc(); /* obj is reused */
>                      obj->key = new_key;
>                      obj->value = new_value;
>                      refcount_set_release(obj->ref, 1);
>                      add(collection, new_key, obj);
> 
> refcount_{add|inc}_not_zero_acquire() is required to prevent the following
> reordering when refcount_inc_not_zero() is used instead:
> 
> consumer:
>     obj = lookup(collection, key);
>     if (READ_ONCE(obj->key) != key) { /* reordered identity check */
>         put_ref(obj);
>         return;
>     }
>                  producer:
>                      remove(collection, obj->key);
>                      if (!refcount_dec_and_test(&obj->ref))
>                          return;
>                      obj->key = KEY_INVALID;
>                      free(obj);
>                      obj = malloc(); /* obj is reused */
>                      obj->key = new_key;
>                      obj->value = new_value;
>                      refcount_set_release(obj->ref, 1);
>                      add(collection, new_key, obj);
> 
>     if (!refcount_inc_not_zero(&obj->ref))
>         return;
>     use(obj->value); /* USING WRONG OBJECT */
> 
> refcount_set_release() is required to prevent the following reordering
> when refcount_set() is used instead:
> 
> consumer:
>     obj = lookup(collection, key);
> 
>                  producer:
>                      remove(collection, obj->key);
>                      if (!refcount_dec_and_test(&obj->ref))
>                          return;
>                      obj->key = KEY_INVALID;
>                      free(obj);
>                      obj = malloc(); /* obj is reused */
>                      obj->key = new_key; /* new_key == old_key */
>                      refcount_set(obj->ref, 1);
> 
>     if (!refcount_inc_not_zero_acquire(&obj->ref))
>         return;
>     if (READ_ONCE(obj->key) != key) { /* pass since new_key == old_key */
>         put_ref(obj);
>         return;
>     }
>     use(obj->value); /* USING STALE obj->value */
> 
>                      obj->value = new_value; /* reordered store */
>                      add(collection, key, obj);
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Paul E. McKenney <paulmck@kernel.org>

Acked-by: Vlastimil Babka <vbabka@suse.cz> #slab
Re: [PATCH] refcount: Strengthen inc_not_zero()
Posted by Suren Baghdasaryan 1 year ago
On Wed, Jan 15, 2025 at 8:00 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Jan 15, 2025 at 12:13:34PM +0100, Peter Zijlstra wrote:
>
> > Notably, it means refcount_t is entirely unsuitable for anything
> > SLAB_TYPESAFE_BY_RCU, since they all will need secondary validation
> > conditions after the refcount succeeds.
> >
> > And this is probably fine, but let me ponder this all a little more.
>
> Even though SLAB_TYPESAFE_BY_RCU is relatively rare, I think we'd better
> fix this, these things are hard enough as they are.
>
> Will, others, wdyt?

I'll wait for the verdict on this patch before proceeding with my
series. It obviously simplifies my job. Thanks Peter!

>
> ---
> Subject: refcount: Strengthen inc_not_zero()
>
> For speculative lookups where a successful inc_not_zero() pins the
> object, but where we still need to double check if the object acquired
> is indeed the one we set out to acquire, needs this validation to happen
> *after* the increment.
>
> Notably SLAB_TYPESAFE_BY_RCU is one such an example.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/refcount.h | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/refcount.h b/include/linux/refcount.h
> index 35f039ecb272..340e7ffa445e 100644
> --- a/include/linux/refcount.h
> +++ b/include/linux/refcount.h
> @@ -69,9 +69,10 @@
>   * its the lock acquire, for RCU/lockless data structures its the dependent
>   * load.
>   *
> - * Do note that inc_not_zero() provides a control dependency which will order
> - * future stores against the inc, this ensures we'll never modify the object
> - * if we did not in fact acquire a reference.
> + * Do note that inc_not_zero() does provide acquire order, which will order
> + * future load and stores against the inc, this ensures all subsequent accesses
> + * are from this object and not anything previously occupying this memory --
> + * consider SLAB_TYPESAFE_BY_RCU.
>   *
>   * The decrements will provide release order, such that all the prior loads and
>   * stores will be issued before, it also provides a control dependency, which
> @@ -144,7 +145,7 @@ bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
>         do {
>                 if (!old)
>                         break;
> -       } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> +       } while (!atomic_try_cmpxchg_acquire(&r->refs, &old, old + i));
>
>         if (oldp)
>                 *oldp = old;
> @@ -225,9 +226,9 @@ static inline __must_check bool __refcount_inc_not_zero(refcount_t *r, int *oldp
>   * Similar to atomic_inc_not_zero(), but will saturate at REFCOUNT_SATURATED
>   * and WARN.
>   *
> - * Provides no memory ordering, it is assumed the caller has guaranteed the
> - * object memory to be stable (RCU, etc.). It does provide a control dependency
> - * and thereby orders future stores. See the comment on top.
> + * Provides acquire ordering, such that subsequent accesses are after the
> + * increment. This is important for the cases where secondary validation is
> + * required, eg. SLAB_TYPESAFE_BY_RCU.
>   *
>   * Return: true if the increment was successful, false otherwise
>   */
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Wei Yang 1 year ago
On Sat, Jan 11, 2025 at 12:14:47PM -0800, Suren Baghdasaryan wrote:
>On Sat, Jan 11, 2025 at 3:24 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
>>
>> So there were quite a few iterations of the patch and I have not been
>> reading majority of the feedback, so it may be I missed something,
>> apologies upfront. :)
>>

Hi, I am new to memory barriers. I hope I'm not bothering you.

>> >  /*
>> >   * Try to read-lock a vma. The function is allowed to occasionally yield false
>> >   * locked result to avoid performance overhead, in which case we fall back to
>> > @@ -710,6 +742,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
>> >   */
>> >  static inline bool vma_start_read(struct vm_area_struct *vma)
>> >  {
>> > +     int oldcnt;
>> > +
>> >       /*
>> >        * Check before locking. A race might cause false locked result.
>> >        * We can use READ_ONCE() for the mm_lock_seq here, and don't need
>> > @@ -720,13 +754,19 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>> >       if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
>> >               return false;
>> >
>> > -     if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
>> > +     /*
>> > +      * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited() will fail
>> > +      * because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
>> > +      */
>> > +     if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
>> > +                                                   VMA_REF_LIMIT)))
>> >               return false;
>> >
>>
>> Replacing down_read_trylock() with the new routine loses an acquire
>> fence. That alone is not a problem, but see below.
>
>Hmm. I think this acquire fence is actually necessary. We don't want
>the later vm_lock_seq check to be reordered and happen before we take
>the refcount. Otherwise this might happen:
>
>reader             writer
>if (vm_lock_seq == mm_lock_seq) // check got reordered
>        return false;
>                       vm_refcnt += VMA_LOCK_OFFSET
>                       vm_lock_seq == mm_lock_seq
>                       vm_refcnt -= VMA_LOCK_OFFSET
>if (!__refcount_inc_not_zero_limited())
>        return false;
>
>Both reader's checks will pass and the reader would read-lock a vma
>that was write-locked.
>

Here, what we plan to do is define __refcount_inc_not_zero_limited() with
an acquire fence, e.g. with atomic_try_cmpxchg_acquire(), right?

>>
>> > +     rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
>> >       /*
>> > -      * Overflow might produce false locked result.
>> > +      * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
>> >        * False unlocked result is impossible because we modify and check
>> > -      * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
>> > +      * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
>> >        * modification invalidates all existing locks.
>> >        *
>> >        * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
>> > @@ -735,9 +775,10 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>> >        * This pairs with RELEASE semantics in vma_end_write_all().
>> >        */
>> >       if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {

One question here: would the compiler optimize away the read of vm_lock_seq
here, since we have already read it at the beginning?

Or, with the acquire fence added above, the compiler won't optimize it away?
Or should we use READ_ONCE(vma->vm_lock_seq) here?

>>
>> The previous modification of this spot to raw_read_seqcount loses the
>> acquire fence, making the above comment not line up with the code.
>
>Is it? From reading the seqcount code
>(https://elixir.bootlin.com/linux/v6.13-rc3/source/include/linux/seqlock.h#L211):
>
>raw_read_seqcount()
>    seqprop_sequence()
>        __seqprop(s, sequence)
>            __seqprop_sequence()
>                smp_load_acquire()
>
>smp_load_acquire() still provides the acquire fence. Am I missing something?
>
>>
>> I don't know if the stock code (with down_read_trylock()) is correct as
>> is -- looks fine for cursory reading fwiw. However, if it indeed works,
>> the acquire fence stemming from the lock routine is a mandatory part of
>> it afaics.
>>
>> I think the best way forward is to add a new refcount routine which
>> ships with an acquire fence.
>
>I plan on replacing refcount_t usage here with an atomic since, as
>Hillf noted, refcount is not designed to be used for locking. And will
>make sure the down_read_trylock() replacement will provide an acquire
>fence.
>

Hmm... refcount_t is implemented on top of atomic_t, so I am lost as to why
replacing refcount_t with atomic_t would help.

>>
>> Otherwise I would suggest:
>> 1. a comment above __refcount_inc_not_zero_limited saying there is an
>>    acq fence issued later
>> 2. smp_rmb() slapped between that and seq accesses
>>
>> If the now removed fence is somehow not needed, I think a comment
>> explaining it is necessary.
>>
>> > @@ -813,36 +856,33 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
>> >
>> >  static inline void vma_assert_locked(struct vm_area_struct *vma)
>> >  {
>> > -     if (!rwsem_is_locked(&vma->vm_lock.lock))
>> > +     if (refcount_read(&vma->vm_refcnt) <= 1)
>> >               vma_assert_write_locked(vma);
>> >  }
>> >
>>
>> This now forces the compiler to emit a load from vm_refcnt even if
>> vma_assert_write_locked expands to nothing. iow this wants to hide
>> behind the same stuff as vma_assert_write_locked.
>
>True. I guess I'll have to avoid using vma_assert_write_locked() like this:
>
>static inline void vma_assert_locked(struct vm_area_struct *vma)
>{
>        unsigned int mm_lock_seq;
>
>        VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt) <= 1 &&
>                                          !__is_vma_write_locked(vma,
>&mm_lock_seq), vma);
>}
>
>Will make the change.
>
>Thanks for the feedback!

-- 
Wei Yang
Help you, Help me
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Suren Baghdasaryan 1 year ago
On Sun, Jan 12, 2025 at 5:47 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Sat, Jan 11, 2025 at 12:14:47PM -0800, Suren Baghdasaryan wrote:
> >On Sat, Jan 11, 2025 at 3:24 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >>
> >> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
> >>
> >> So there were quite a few iterations of the patch and I have not been
> >> reading majority of the feedback, so it may be I missed something,
> >> apologies upfront. :)
> >>
>
> Hi, I am new to memory barriers. Hope not bothering.
>
> >> >  /*
> >> >   * Try to read-lock a vma. The function is allowed to occasionally yield false
> >> >   * locked result to avoid performance overhead, in which case we fall back to
> >> > @@ -710,6 +742,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> >> >   */
> >> >  static inline bool vma_start_read(struct vm_area_struct *vma)
> >> >  {
> >> > +     int oldcnt;
> >> > +
> >> >       /*
> >> >        * Check before locking. A race might cause false locked result.
> >> >        * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> >> > @@ -720,13 +754,19 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >> >       if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> >> >               return false;
> >> >
> >> > -     if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> >> > +     /*
> >> > +      * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited() will fail
> >> > +      * because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
> >> > +      */
> >> > +     if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> >> > +                                                   VMA_REF_LIMIT)))
> >> >               return false;
> >> >
> >>
> >> Replacing down_read_trylock() with the new routine loses an acquire
> >> fence. That alone is not a problem, but see below.
> >
> >Hmm. I think this acquire fence is actually necessary. We don't want
> >the later vm_lock_seq check to be reordered and happen before we take
> >the refcount. Otherwise this might happen:
> >
> >reader             writer
> >if (vm_lock_seq == mm_lock_seq) // check got reordered
> >        return false;
> >                       vm_refcnt += VMA_LOCK_OFFSET
> >                       vm_lock_seq == mm_lock_seq
> >                       vm_refcnt -= VMA_LOCK_OFFSET
> >if (!__refcount_inc_not_zero_limited())
> >        return false;
> >
> >Both reader's checks will pass and the reader would read-lock a vma
> >that was write-locked.
> >
>
> Here what we plan to do is define __refcount_inc_not_zero_limited() with
> acquire fence, e.g. with atomic_try_cmpxchg_acquire(), right?

Correct. __refcount_inc_not_zero_limited() does not do that in this
version but I'll fix that.

>
> >>
> >> > +     rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
> >> >       /*
> >> > -      * Overflow might produce false locked result.
> >> > +      * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> >> >        * False unlocked result is impossible because we modify and check
> >> > -      * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> >> > +      * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
> >> >        * modification invalidates all existing locks.
> >> >        *
> >> >        * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
> >> > @@ -735,9 +775,10 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >> >        * This pairs with RELEASE semantics in vma_end_write_all().
> >> >        */
> >> >       if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
>
> One question here is would compiler optimize the read of vm_lock_seq here,
> since we have read it at the beginning?
>
> Or with the acquire fence added above, compiler won't optimize it.

Correct. See "ACQUIRE operations" section in
https://www.kernel.org/doc/Documentation/memory-barriers.txt,
specifically this: "It guarantees that all memory operations after the
ACQUIRE operation will appear to happen after the ACQUIRE operation
with respect to the other components of the system.".
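
Putting it together, the ordering that matters in the reader looks roughly
like the sketch below (a simplified version of the vma_start_read() flow from
the quoted hunk, assuming the increment gains acquire semantics as discussed;
vma_refcount_put() is an assumed unlock helper and the lockdep annotation is
omitted):

static inline bool vma_start_read(struct vm_area_struct *vma)
{
        int oldcnt;

        /* Early check; a false "locked" result here is tolerated. */
        if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
                return false;

        /* Acquire-ordered increment (e.g. via atomic_try_cmpxchg_acquire())... */
        if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
                                                      VMA_REF_LIMIT)))
                return false;

        /*
         * ...so this re-read cannot be hoisted above the refcount increment,
         * closing the race in the interleaving shown above.
         */
        if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
                vma_refcount_put(vma);  /* assumed unlock helper */
                return false;
        }

        return true;
}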

> Or we should use REACE_ONCE(vma->vm_lock_seq) here?
>
> >>
> >> The previous modification of this spot to raw_read_seqcount loses the
> >> acquire fence, making the above comment not line up with the code.
> >
> >Is it? From reading the seqcount code
> >(https://elixir.bootlin.com/linux/v6.13-rc3/source/include/linux/seqlock.h#L211):
> >
> >raw_read_seqcount()
> >    seqprop_sequence()
> >        __seqprop(s, sequence)
> >            __seqprop_sequence()
> >                smp_load_acquire()
> >
> >smp_load_acquire() still provides the acquire fence. Am I missing something?
> >
> >>
> >> I don't know if the stock code (with down_read_trylock()) is correct as
> >> is -- looks fine for cursory reading fwiw. However, if it indeed works,
> >> the acquire fence stemming from the lock routine is a mandatory part of
> >> it afaics.
> >>
> >> I think the best way forward is to add a new refcount routine which
> >> ships with an acquire fence.
> >
> >I plan on replacing refcount_t usage here with an atomic since, as
> >Hillf noted, refcount is not designed to be used for locking. And will
> >make sure the down_read_trylock() replacement will provide an acquire
> >fence.
> >
>
> Hmm.. refcount_t is defined with atomic_t. I am lost why replacing refcount_t
> with atomic_t would help.

My point is that refcount_t is not designed for locking, so changing
refcount-related functions and adding fences there would be wrong. So,
I'll move to using more generic atomic_t and will implement the
functionality I need without affecting refcounting functions.
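
Roughly along these lines (a direction sketch only: the helper name, the
exact limit semantics and switching vm_refcnt to a plain atomic_t are
assumptions at this point, not code from a posted revision):

static inline bool vma_refcount_tryget_acquire(atomic_t *cnt, int limit)
{
        int old = atomic_read(cnt);

        do {
                /* Fail on "detached" (0) and on writer-biased/overflowed counts. */
                if (!old || old > limit)
                        return false;
        } while (!atomic_try_cmpxchg_acquire(cnt, &old, old + 1));

        return true;
}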

>
> >>
> >> Otherwise I would suggest:
> >> 1. a comment above __refcount_inc_not_zero_limited saying there is an
> >>    acq fence issued later
> >> 2. smp_rmb() slapped between that and seq accesses
> >>
> >> If the now removed fence is somehow not needed, I think a comment
> >> explaining it is necessary.
> >>
> >> > @@ -813,36 +856,33 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> >> >
> >> >  static inline void vma_assert_locked(struct vm_area_struct *vma)
> >> >  {
> >> > -     if (!rwsem_is_locked(&vma->vm_lock.lock))
> >> > +     if (refcount_read(&vma->vm_refcnt) <= 1)
> >> >               vma_assert_write_locked(vma);
> >> >  }
> >> >
> >>
> >> This now forces the compiler to emit a load from vm_refcnt even if
> >> vma_assert_write_locked expands to nothing. iow this wants to hide
> >> behind the same stuff as vma_assert_write_locked.
> >
> >True. I guess I'll have to avoid using vma_assert_write_locked() like this:
> >
> >static inline void vma_assert_locked(struct vm_area_struct *vma)
> >{
> >        unsigned int mm_lock_seq;
> >
> >        VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt) <= 1 &&
> >                      !__is_vma_write_locked(vma, &mm_lock_seq), vma);
> >}
> >
> >Will make the change.
> >
> >Thanks for the feedback!
>
> --
> Wei Yang
> Help you, Help me
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Wei Yang 1 year ago
On Mon, Jan 13, 2025 at 01:47:29AM +0000, Wei Yang wrote:
>On Sat, Jan 11, 2025 at 12:14:47PM -0800, Suren Baghdasaryan wrote:
>>On Sat, Jan 11, 2025 at 3:24 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>>>
>>> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
>>>
>>> So there were quite a few iterations of the patch and I have not been
>>> reading the majority of the feedback, so it may be I missed something,
>>> apologies upfront. :)
>>>
>
>Hi, I am new to memory barriers. I hope I'm not bothering you.
>
>>> >  /*
>>> >   * Try to read-lock a vma. The function is allowed to occasionally yield false
>>> >   * locked result to avoid performance overhead, in which case we fall back to
>>> > @@ -710,6 +742,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
>>> >   */
>>> >  static inline bool vma_start_read(struct vm_area_struct *vma)
>>> >  {
>>> > +     int oldcnt;
>>> > +
>>> >       /*
>>> >        * Check before locking. A race might cause false locked result.
>>> >        * We can use READ_ONCE() for the mm_lock_seq here, and don't need
>>> > @@ -720,13 +754,19 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>>> >       if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
>>> >               return false;
>>> >
>>> > -     if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
>>> > +     /*
>>> > +      * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited() will fail
>>> > +      * because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
>>> > +      */
>>> > +     if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
>>> > +                                                   VMA_REF_LIMIT)))
>>> >               return false;
>>> >
>>>
>>> Replacing down_read_trylock() with the new routine loses an acquire
>>> fence. That alone is not a problem, but see below.
>>
>>Hmm. I think this acquire fence is actually necessary. We don't want
>>the later vm_lock_seq check to be reordered and happen before we take
>>the refcount. Otherwise this might happen:
>>
>>reader             writer
>>if (vm_lock_seq == mm_lock_seq) // check got reordered
>>        return false;
>>                       vm_refcnt += VMA_LOCK_OFFSET
>>                       vm_lock_seq == mm_lock_seq
>>                       vm_refcnt -= VMA_LOCK_OFFSET
>>if (!__refcount_inc_not_zero_limited())
>>        return false;
>>
>>Both reader's checks will pass and the reader would read-lock a vma
>>that was write-locked.
>>
>
>Here what we plan to do is define __refcount_inc_not_zero_limited() with
>acquire fence, e.g. with atomic_try_cmpxchg_acquire(), right?
>

BTW, usually we pair acquire with release.

__vma_start_write() provides a release fence when locking, so for this part
we are ok, right?


-- 
Wei Yang
Help you, Help me
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Suren Baghdasaryan 1 year ago
On Sun, Jan 12, 2025 at 6:25 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Mon, Jan 13, 2025 at 01:47:29AM +0000, Wei Yang wrote:
> >On Sat, Jan 11, 2025 at 12:14:47PM -0800, Suren Baghdasaryan wrote:
> >>On Sat, Jan 11, 2025 at 3:24 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >>>
> >>> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
> >>>
> >>> So there were quite a few iterations of the patch and I have not been
> >>> reading the majority of the feedback, so it may be I missed something,
> >>> apologies upfront. :)
> >>>
> >
> >Hi, I am new to memory barriers. I hope I'm not bothering you.
> >
> >>> >  /*
> >>> >   * Try to read-lock a vma. The function is allowed to occasionally yield false
> >>> >   * locked result to avoid performance overhead, in which case we fall back to
> >>> > @@ -710,6 +742,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> >>> >   */
> >>> >  static inline bool vma_start_read(struct vm_area_struct *vma)
> >>> >  {
> >>> > +     int oldcnt;
> >>> > +
> >>> >       /*
> >>> >        * Check before locking. A race might cause false locked result.
> >>> >        * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> >>> > @@ -720,13 +754,19 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >>> >       if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> >>> >               return false;
> >>> >
> >>> > -     if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> >>> > +     /*
> >>> > +      * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited() will fail
> >>> > +      * because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
> >>> > +      */
> >>> > +     if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> >>> > +                                                   VMA_REF_LIMIT)))
> >>> >               return false;
> >>> >
> >>>
> >>> Replacing down_read_trylock() with the new routine loses an acquire
> >>> fence. That alone is not a problem, but see below.
> >>
> >>Hmm. I think this acquire fence is actually necessary. We don't want
> >>the later vm_lock_seq check to be reordered and happen before we take
> >>the refcount. Otherwise this might happen:
> >>
> >>reader             writer
> >>if (vm_lock_seq == mm_lock_seq) // check got reordered
> >>        return false;
> >>                       vm_refcnt += VMA_LOCK_OFFSET
> >>                       vm_lock_seq == mm_lock_seq
> >>                       vm_refcnt -= VMA_LOCK_OFFSET
> >>if (!__refcount_inc_not_zero_limited())
> >>        return false;
> >>
> >>Both reader's checks will pass and the reader would read-lock a vma
> >>that was write-locked.
> >>
> >
> >Here what we plan to do is define __refcount_inc_not_zero_limited() with
> >acquire fence, e.g. with atomic_try_cmpxchg_acquire(), right?
> >
>
> BTW, usually we pair acquire with release.
>
> __vma_start_write() provides a release fence when locking, so for this part
> we are ok, right?

Yes: __vma_start_write() -> __vma_exit_locked() -> refcount_sub_and_test(),
and that function provides release memory ordering; see
https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/refcount.h#L289
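
IOW the pairing looks roughly like this (a sketch, with the surrounding
code abbreviated):

        /* writer: __vma_start_write() ... __vma_exit_locked() */
        vma->vm_lock_seq = mm_lock_seq;
        /* RELEASE: vm_lock_seq is published before the writer bit is dropped */
        refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt);

        /* reader: vma_start_read() */
        /* ACQUIRE (once the trylock replacement provides it) */
        __refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt, VMA_REF_LIMIT);
        /* ...so this check observes the writer's vm_lock_seq update */
        if (vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))
                ;       /* writer got there first: fall back to mmap_lock */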

>
>
> --
> Wei Yang
> Help you, Help me
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Vlastimil Babka 1 year ago
On 1/11/25 21:14, Suren Baghdasaryan wrote:
> I plan on replacing refcount_t usage here with an atomic since, as
> Hillf noted, refcount is not designed to be used for locking. And will
> make sure the down_read_trylock() replacement will provide an acquire
> fence.

Could Hillf stop reducing the Cc list on his replies? The whole subthread
went to only a few people :(
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Mateusz Guzik 1 year ago
On Sat, Jan 11, 2025 at 9:14 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Sat, Jan 11, 2025 at 3:24 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > The previous modification of this spot to raw_read_seqcount loses the
> > acquire fence, making the above comment not line up with the code.
>
> Is it? From reading the seqcount code
> (https://elixir.bootlin.com/linux/v6.13-rc3/source/include/linux/seqlock.h#L211):
>
> raw_read_seqcount()
>     seqprop_sequence()
>         __seqprop(s, sequence)
>             __seqprop_sequence()
>                 smp_load_acquire()
>
> smp_load_acquire() still provides the acquire fence. Am I missing something?
>

That's fine indeed.

In a different project there is an equivalent API which skips the
barriers, and a too-quick glance suggested that is what's going on here.
My bad, sorry for the false alarm on this front. :)

-- 
Mateusz Guzik <mjguzik gmail.com>
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Suren Baghdasaryan 1 year ago
On Sat, Jan 11, 2025 at 12:31 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Sat, Jan 11, 2025 at 9:14 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Sat, Jan 11, 2025 at 3:24 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > The previous modification of this spot to raw_read_seqcount loses the
> > > acquire fence, making the above comment not line up with the code.
> >
> > Is it? From reading the seqcount code
> > (https://elixir.bootlin.com/linux/v6.13-rc3/source/include/linux/seqlock.h#L211):
> >
> > raw_read_seqcount()
> >     seqprop_sequence()
> >         __seqprop(s, sequence)
> >             __seqprop_sequence()
> >                 smp_load_acquire()
> >
> > smp_load_acquire() still provides the acquire fence. Am I missing something?
> >
>
> That's fine indeed.
>
> In a different project there is an equivalent API which skips the
> barriers, and a too-quick glance suggested that is what's going on here.
> My bad, sorry for the false alarm on this front. :)

No worries. Better to double-check than to merge a bug.
Thanks,
Suren.

>
> --
> Mateusz Guzik <mjguzik gmail.com>
Re: [PATCH v9 11/17] mm: replace vm_lock and detached flag with a reference count
Posted by Suren Baghdasaryan 1 year ago
On Sat, Jan 11, 2025 at 12:14 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Sat, Jan 11, 2025 at 3:24 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
> >
> > So there were quite a few iterations of the patch and I have not been
> > reading the majority of the feedback, so it may be I missed something,
> > apologies upfront. :)
> >
> > >  /*
> > >   * Try to read-lock a vma. The function is allowed to occasionally yield false
> > >   * locked result to avoid performance overhead, in which case we fall back to
> > > @@ -710,6 +742,8 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> > >   */
> > >  static inline bool vma_start_read(struct vm_area_struct *vma)
> > >  {
> > > +     int oldcnt;
> > > +
> > >       /*
> > >        * Check before locking. A race might cause false locked result.
> > >        * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> > > @@ -720,13 +754,19 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > >       if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> > >               return false;
> > >
> > > -     if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> > > +     /*
> > > +      * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited() will fail
> > > +      * because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
> > > +      */
> > > +     if (unlikely(!__refcount_inc_not_zero_limited(&vma->vm_refcnt, &oldcnt,
> > > +                                                   VMA_REF_LIMIT)))
> > >               return false;
> > >
> >
> > Replacing down_read_trylock() with the new routine loses an acquire
> > fence. That alone is not a problem, but see below.
>
> Hmm. I think this acquire fence is actually necessary. We don't want
> the later vm_lock_seq check to be reordered and happen before we take
> the refcount. Otherwise this might happen:
>
> reader             writer
> if (vm_lock_seq == mm_lock_seq) // check got reordered
>         return false;
>                        vm_refcnt += VMA_LOCK_OFFSET
>                        vm_lock_seq == mm_lock_seq

s/vm_lock_seq == mm_lock_seq/vm_lock_seq = mm_lock_seq

>                        vm_refcnt -= VMA_LOCK_OFFSET
> if (!__refcount_inc_not_zero_limited())
>         return false;
>
> Both reader's checks will pass and the reader would read-lock a vma
> that was write-locked.
>
> >
> > > +     rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
> > >       /*
> > > -      * Overflow might produce false locked result.
> > > +      * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> > >        * False unlocked result is impossible because we modify and check
> > > -      * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > > +      * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
> > >        * modification invalidates all existing locks.
> > >        *
> > >        * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
> > > @@ -735,9 +775,10 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > >        * This pairs with RELEASE semantics in vma_end_write_all().
> > >        */
> > >       if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> >
> > The previous modification of this spot to raw_read_seqcount loses the
> > acquire fence, making the above comment not line up with the code.
>
> Is it? From reading the seqcount code
> (https://elixir.bootlin.com/linux/v6.13-rc3/source/include/linux/seqlock.h#L211):
>
> raw_read_seqcount()
>     seqprop_sequence()
>         __seqprop(s, sequence)
>             __seqprop_sequence()
>                 smp_load_acquire()
>
> smp_load_acquire() still provides the acquire fence. Am I missing something?
>
> >
> > I don't know if the stock code (with down_read_trylock()) is correct as
> > is -- looks fine for cursory reading fwiw. However, if it indeed works,
> > the acquire fence stemming from the lock routine is a mandatory part of
> > it afaics.
> >
> > I think the best way forward is to add a new refcount routine which
> > ships with an acquire fence.
>
> I plan on replacing refcount_t usage here with an atomic since, as
> Hillf noted, refcount is not designed to be used for locking. And will
> make sure the down_read_trylock() replacement will provide an acquire
> fence.
>
> >
> > Otherwise I would suggest:
> > 1. a comment above __refcount_inc_not_zero_limited saying there is an
> >    acq fence issued later
> > 2. smp_rmb() slapped between that and seq accesses
> >
> > If the now removed fence is somehow not needed, I think a comment
> > explaining it is necessary.
> >
> > > @@ -813,36 +856,33 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > >
> > >  static inline void vma_assert_locked(struct vm_area_struct *vma)
> > >  {
> > > -     if (!rwsem_is_locked(&vma->vm_lock.lock))
> > > +     if (refcount_read(&vma->vm_refcnt) <= 1)
> > >               vma_assert_write_locked(vma);
> > >  }
> > >
> >
> > This now forces the compiler to emit a load from vm_refcnt even if
> > vma_assert_write_locked expands to nothing. iow this wants to hide
> > behind the same stuff as vma_assert_write_locked.
>
> True. I guess I'll have to avoid using vma_assert_write_locked() like this:
>
> static inline void vma_assert_locked(struct vm_area_struct *vma)
> {
>         unsigned int mm_lock_seq;
>
>         VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt) <= 1 &&
>                       !__is_vma_write_locked(vma, &mm_lock_seq), vma);
> }
>
> Will make the change.
>
> Thanks for the feedback!