From: Suren Baghdasaryan <surenb@google.com>
Date: Mon, 11 Nov 2024 12:55:05 -0800
Subject: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
To: akpm@linux-foundation.org
Cc: willy@infradead.org, liam.howlett@oracle.com, lorenzo.stoakes@oracle.com,
	mhocko@suse.com, vbabka@suse.cz, hannes@cmpxchg.org, mjguzik@gmail.com,
	oliver.sang@intel.com, mgorman@techsingularity.net, david@redhat.com,
	peterx@redhat.com, oleg@redhat.com, dave@stgolabs.net, paulmck@kernel.org,
	brauner@kernel.org, dhowells@redhat.com, hdanton@sina.com, hughd@google.com,
	minchan@google.com, jannh@google.com, shakeel.butt@linux.dev,
	souravpanda@google.com, pasha.tatashin@soleen.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, kernel-team@android.com, surenb@google.com
Message-ID: <20241111205506.3404479-4-surenb@google.com>
In-Reply-To: <20241111205506.3404479-1-surenb@google.com>
References: <20241111205506.3404479-1-surenb@google.com>

rw_semaphore is a sizable structure of 40 bytes and consumes considerable
space for each vm_area_struct. However, vma_lock has two important
specifics which can be used to replace rw_semaphore with a simpler
structure:

1. Readers never wait. They try to take the vma_lock and fall back to
   mmap_lock if that fails.

2. Only one writer at a time will ever try to write-lock a vma_lock,
   because writers first take mmap_lock in write mode.

Because of these requirements, full rw_semaphore functionality is not
needed and we can replace rw_semaphore with an atomic variable. When a
reader takes the read lock, it increments the atomic, unless the top two
bits are set, indicating that a writer is present. When a writer takes the
write lock, it sets the VMA_LOCK_WR_LOCKED bit if there are no readers, or
the VMA_LOCK_WR_WAIT bit if readers are holding the lock, and puts itself
onto the newly introduced mm.vma_writer_wait waitqueue. Since all writers
take mmap_lock in write mode first, there can be only one writer at a
time. The last reader to release the lock will signal the writer to wake
up.

atomic_t might overflow if there are many competing readers, therefore
vma_start_read() implements an overflow check and, if overflow occurs,
fails to take the lock. vma_start_read_locked{_nested} may cause an
overflow as well, but it is handled later by __vma_end_read().
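To make the scheme concrete, here is a minimal userspace model of the
counter encoding described above. It is an illustrative sketch only,
written with C11 atomics rather than kernel primitives: the helper names
(read_trylock, read_unlock, write_lock) are invented for the example, and
the mm.vma_writer_wait waitqueue is approximated by a spin loop.

/*
 * Userspace model of the vma_lock counter -- NOT the kernel code.
 * Bit layout mirrors the patch: bit 31 = writer holds the lock,
 * bit 30 = writer is waiting, bits 0-29 = reader count.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define WR_LOCKED (1u << 31)	/* models VMA_LOCK_WR_LOCKED */
#define WR_WAIT   (1u << 30)	/* models VMA_LOCK_WR_WAIT */
#define RD_MASK   (WR_WAIT - 1)	/* models VMA_LOCK_RD_COUNT_MASK */

/* Readers never wait (specific #1): one attempt, then the caller is
 * expected to fall back to mmap_lock. */
static bool read_trylock(atomic_uint *count)
{
	unsigned int cur = atomic_load(count);

	for (;;) {
		/* A writer holds the lock or is waiting for it: fail. */
		if (cur & (WR_LOCKED | WR_WAIT))
			return false;
		/* Incrementing into the writer bits would be overflow. */
		if ((cur + 1) & (WR_LOCKED | WR_WAIT))
			return false;
		if (atomic_compare_exchange_weak(count, &cur, cur + 1))
			return true;
		/* cur was reloaded by the failed CAS; retry. */
	}
}

/* Drop a read lock. Returns true when this was the last reader and a
 * writer is waiting; the kernel would call wake_up() at that point. */
static bool read_unlock(atomic_uint *count)
{
	unsigned int now = atomic_fetch_sub(count, 1) - 1;

	return (now & RD_MASK) == 0 && (now & WR_WAIT);
}

/* Only one writer can ever be here (specific #2), because real writers
 * hold mmap_lock in write mode first. */
static void write_lock(atomic_uint *count)
{
	unsigned int cur = atomic_load(count);
	unsigned int new;

	do {
		/* Readers present: announce intent and wait for them. */
		new = cur | ((cur & RD_MASK) ? WR_WAIT : WR_LOCKED);
	} while (!atomic_compare_exchange_weak(count, &cur, new));

	/* Spin until the last reader leaves: WR_WAIT -> WR_LOCKED. */
	while (new & WR_WAIT) {
		unsigned int expect = WR_WAIT;

		if (atomic_compare_exchange_strong(count, &expect, WR_LOCKED))
			new = WR_LOCKED;
	}
}

int main(void)
{
	atomic_uint count = 0;	/* VMA_LOCK_UNLOCKED */

	printf("read: %d\n", read_trylock(&count));	/* 1: acquired */
	printf("wake: %d\n", read_unlock(&count));	/* 0: no writer */
	write_lock(&count);
	printf("read: %d\n", read_trylock(&count));	/* 0: write-locked */
	return 0;
}

The invariants match the description above: a reader backs off whenever
either writer bit is set, and the last reader to leave while the wait bit
is set is the one responsible for waking the writer.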
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h        | 142 ++++++++++++++++++++++++++++++++++----
 include/linux/mm_types.h  |  18 ++++-
 include/linux/mmap_lock.h |   3 +
 kernel/fork.c             |   2 +-
 mm/init-mm.c              |   2 +
 5 files changed, 151 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c1c2899464db..27c0e9ba81c4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -686,7 +686,41 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
 #ifdef CONFIG_PER_VMA_LOCK
 static inline void vma_lock_init(struct vma_lock *vm_lock)
 {
-	init_rwsem(&vm_lock->lock);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	static struct lock_class_key lockdep_key;
+
+	lockdep_init_map(&vm_lock->dep_map, "vm_lock", &lockdep_key, 0);
+#endif
+	atomic_set(&vm_lock->count, VMA_LOCK_UNLOCKED);
+}
+
+static inline unsigned int vma_lock_reader_count(unsigned int counter)
+{
+	return counter & VMA_LOCK_RD_COUNT_MASK;
+}
+
+static inline void __vma_end_read(struct mm_struct *mm, struct vm_area_struct *vma)
+{
+	unsigned int count, prev, new;
+
+	count = (unsigned int)atomic_read(&vma->vm_lock.count);
+	for (;;) {
+		if (unlikely(vma_lock_reader_count(count) == 0)) {
+			/*
+			 * Overflow was possible in vma_start_read_locked().
+			 * When detected, wrap around preserving writer bits.
+			 */
+			new = count | ~(VMA_LOCK_WR_LOCKED | VMA_LOCK_WR_WAIT);
+		} else
+			new = count - 1;
+		prev = atomic_cmpxchg(&vma->vm_lock.count, count, new);
+		if (prev == count)
+			break;
+		count = prev;
+	}
+	rwsem_release(&vma->vm_lock.dep_map, _RET_IP_);
+	if (vma_lock_reader_count(new) == 0 && (new & VMA_LOCK_WR_WAIT))
+		wake_up(&mm->vma_writer_wait);
 }
 
 /*
@@ -696,6 +730,9 @@ static inline void vma_lock_init(struct vma_lock *vm_lock)
  */
 static inline bool vma_start_read(struct vm_area_struct *vma)
 {
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned int count, prev, new;
+
 	/*
 	 * Check before locking. A race might cause false locked result.
 	 * We can use READ_ONCE() for the mm_lock_seq here, and don't need
@@ -703,11 +740,35 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * we don't rely on for anything - the mm_lock_seq read against which we
 	 * need ordering is below.
 	 */
-	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
+	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
 		return false;
 
-	if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
-		return false;
+	rwsem_acquire_read(&vma->vm_lock.dep_map, 0, 0, _RET_IP_);
+	count = (unsigned int)atomic_read(&vma->vm_lock.count);
+	for (;;) {
+		/* Is the VMA write-locked or is a writer waiting? */
+		if (count & (VMA_LOCK_WR_LOCKED | VMA_LOCK_WR_WAIT)) {
+			rwsem_release(&vma->vm_lock.dep_map, _RET_IP_);
+			return false;
+		}
+
+		new = count + 1;
+		/* If atomic_t overflows, fail to lock. */
+		if (new & (VMA_LOCK_WR_LOCKED | VMA_LOCK_WR_WAIT)) {
+			rwsem_release(&vma->vm_lock.dep_map, _RET_IP_);
+			return false;
+		}
+
+		/*
+		 * Atomic RMW will provide implicit mb on success to pair with
+		 * smp_wmb in vma_start_write; on failure we retry.
+		 */
+		prev = atomic_cmpxchg(&vma->vm_lock.count, count, new);
+		if (prev == count)
+			break;
+		count = prev;
+	}
+	lock_acquired(&vma->vm_lock.dep_map, _RET_IP_);
 
 	/*
 	 * Overflow might produce false locked result.
@@ -720,8 +781,8 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * after it has been unlocked.
 	 * This pairs with RELEASE semantics in vma_end_write_all().
 	 */
-	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
-		up_read(&vma->vm_lock.lock);
+	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
+		__vma_end_read(mm, vma);
 		return false;
 	}
 	return true;
@@ -733,8 +794,30 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
  */
 static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
 {
-	mmap_assert_locked(vma->vm_mm);
-	down_read_nested(&vma->vm_lock.lock, subclass);
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned int count, prev, new;
+
+	mmap_assert_locked(mm);
+
+	rwsem_acquire_read(&vma->vm_lock.dep_map, subclass, 0, _RET_IP_);
+	count = (unsigned int)atomic_read(&vma->vm_lock.count);
+	for (;;) {
+		/* We are holding mmap_lock, no active or waiting writers are possible. */
+		VM_BUG_ON_VMA(count & (VMA_LOCK_WR_LOCKED | VMA_LOCK_WR_WAIT), vma);
+		new = count + 1;
+		/* Unlikely, but if atomic_t overflows, wrap around to 0. */
+		if (WARN_ON(new & (VMA_LOCK_WR_LOCKED | VMA_LOCK_WR_WAIT)))
+			new = 0;
+		/*
+		 * Atomic RMW will provide implicit mb on success to pair with
+		 * smp_wmb in vma_start_write; on failure we retry.
+		 */
+		prev = atomic_cmpxchg(&vma->vm_lock.count, count, new);
+		if (prev == count)
+			break;
+		count = prev;
+	}
+	lock_acquired(&vma->vm_lock.dep_map, _RET_IP_);
 }
 
 /*
@@ -743,14 +826,15 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
  */
 static inline void vma_start_read_locked(struct vm_area_struct *vma)
 {
-	mmap_assert_locked(vma->vm_mm);
-	down_read(&vma->vm_lock.lock);
+	vma_start_read_locked_nested(vma, 0);
 }
 
 static inline void vma_end_read(struct vm_area_struct *vma)
 {
+	struct mm_struct *mm = vma->vm_mm;
+
 	rcu_read_lock(); /* keeps vma alive till the end of up_read */
-	up_read(&vma->vm_lock.lock);
+	__vma_end_read(mm, vma);
 	rcu_read_unlock();
 }
 
@@ -774,12 +858,34 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, unsigned int *mm_l
  */
 static inline void vma_start_write(struct vm_area_struct *vma)
 {
+	unsigned int count, prev, new;
 	unsigned int mm_lock_seq;
 
+	might_sleep();
 	if (__is_vma_write_locked(vma, &mm_lock_seq))
 		return;
 
-	down_write(&vma->vm_lock.lock);
+	rwsem_acquire(&vma->vm_lock.dep_map, 0, 0, _RET_IP_);
+	count = (unsigned int)atomic_read(&vma->vm_lock.count);
+	for (;;) {
+		if (vma_lock_reader_count(count) > 0)
+			new = count | VMA_LOCK_WR_WAIT;
+		else
+			new = count | VMA_LOCK_WR_LOCKED;
+		prev = atomic_cmpxchg(&vma->vm_lock.count, count, new);
+		if (prev == count)
+			break;
+		count = prev;
+	}
+	if (new & VMA_LOCK_WR_WAIT) {
+		lock_contended(&vma->vm_lock.dep_map, _RET_IP_);
+		wait_event(vma->vm_mm->vma_writer_wait,
+			   atomic_cmpxchg(&vma->vm_lock.count, VMA_LOCK_WR_WAIT,
+					  VMA_LOCK_WR_LOCKED) == VMA_LOCK_WR_WAIT);
+	}
+	lock_acquired(&vma->vm_lock.dep_map, _RET_IP_);
+
 	/*
 	 * We should use WRITE_ONCE() here because we can have concurrent reads
 	 * from the early lockless pessimistic check in vma_start_read().
@@ -787,7 +893,10 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
 	 */
 	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
-	up_write(&vma->vm_lock.lock);
+	/* Write barrier to ensure vm_lock_seq change is visible before count */
+	smp_wmb();
+	rwsem_release(&vma->vm_lock.dep_map, _RET_IP_);
+	atomic_set(&vma->vm_lock.count, VMA_LOCK_UNLOCKED);
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -797,9 +906,14 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 	VM_BUG_ON_VMA(!__is_vma_write_locked(vma, &mm_lock_seq), vma);
 }
 
+static inline bool is_vma_read_locked(struct vm_area_struct *vma)
+{
+	return vma_lock_reader_count((unsigned int)atomic_read(&vma->vm_lock.count)) > 0;
+}
+
 static inline void vma_assert_locked(struct vm_area_struct *vma)
 {
-	if (!rwsem_is_locked(&vma->vm_lock.lock))
+	if (!is_vma_read_locked(vma))
 		vma_assert_write_locked(vma);
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5c4bfdcfac72..789bccc05520 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -615,8 +615,23 @@ static inline struct anon_vma_name *anon_vma_name_alloc(const char *name)
 }
 #endif
 
+#define VMA_LOCK_UNLOCKED	0
+#define VMA_LOCK_WR_LOCKED	(1 << 31)
+#define VMA_LOCK_WR_WAIT	(1 << 30)
+
+#define VMA_LOCK_RD_COUNT_MASK	(VMA_LOCK_WR_WAIT - 1)
+
 struct vma_lock {
-	struct rw_semaphore lock;
+	/*
+	 * count & VMA_LOCK_RD_COUNT_MASK > 0	==> read-locked with 'count' number of readers
+	 * count & VMA_LOCK_WR_LOCKED != 0	==> write-locked
+	 * count & VMA_LOCK_WR_WAIT != 0	==> writer is waiting
+	 * count = 0				==> unlocked
+	 */
+	atomic_t count;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map dep_map;
+#endif
 };
 
 struct vma_numab_state {
@@ -883,6 +898,7 @@ struct mm_struct {
 					  * by mmlist_lock
 					  */
 #ifdef CONFIG_PER_VMA_LOCK
+		struct wait_queue_head vma_writer_wait;
 		/*
 		 * This field has lock-like semantics, meaning it is sometimes
 		 * accessed with ACQUIRE/RELEASE semantics.
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 58dde2e35f7e..769ab97fff3e 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -121,6 +121,9 @@ static inline void mmap_init_lock(struct mm_struct *mm)
 {
 	init_rwsem(&mm->mmap_lock);
 	mm_lock_seqcount_init(mm);
+#ifdef CONFIG_PER_VMA_LOCK
+	init_waitqueue_head(&mm->vma_writer_wait);
+#endif
 }
 
 static inline void mmap_write_lock(struct mm_struct *mm)
diff --git a/kernel/fork.c b/kernel/fork.c
index 9e504105f24f..726050c557e2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -486,7 +486,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
 						  vm_rcu);
 
 	/* The vma should not be locked while being destroyed. */
-	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
+	VM_BUG_ON_VMA(is_vma_read_locked(vma), vma);
 	__vm_area_free(vma);
 }
 #endif
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 6af3ad675930..db058873ba18 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -40,6 +40,8 @@ struct mm_struct init_mm = {
 	.arg_lock	= __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 #ifdef CONFIG_PER_VMA_LOCK
+	.vma_writer_wait =
+		__WAIT_QUEUE_HEAD_INITIALIZER(init_mm.vma_writer_wait),
 	.mm_lock_seq	= SEQCNT_ZERO(init_mm.mm_lock_seq),
 #endif
 	.user_ns	= &init_user_ns,
-- 
2.47.0.277.g8800431eea-goog
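
For quick reference, and assuming only the bit definitions introduced in
mm_types.h above, a few concrete vma_lock.count values decode as follows:

	0x00000000  unlocked (VMA_LOCK_UNLOCKED)
	0x00000003  read-locked by three readers
	0x80000000  write-locked (VMA_LOCK_WR_LOCKED set)
	0x40000002  two readers hold the lock and a writer is waiting
	            (VMA_LOCK_WR_WAIT set); the last reader to release
	            the lock wakes the writer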