From nobody Sat Nov 23 18:21:27 2024
Date: Mon, 11 Nov 2024 12:55:03 -0800
In-Reply-To: <20241111205506.3404479-1-surenb@google.com>
References: <20241111205506.3404479-1-surenb@google.com>
Message-ID: <20241111205506.3404479-2-surenb@google.com>
Subject: [PATCH 1/4] mm: introduce vma_start_read_locked{_nested} helpers
From: Suren Baghdasaryan
To: akpm@linux-foundation.org
Cc: willy@infradead.org, liam.howlett@oracle.com, lorenzo.stoakes@oracle.com,
 mhocko@suse.com, vbabka@suse.cz, hannes@cmpxchg.org, mjguzik@gmail.com,
 oliver.sang@intel.com, mgorman@techsingularity.net, david@redhat.com,
 peterx@redhat.com, oleg@redhat.com, dave@stgolabs.net, paulmck@kernel.org,
 brauner@kernel.org, dhowells@redhat.com, hdanton@sina.com, hughd@google.com,
 minchan@google.com, jannh@google.com, shakeel.butt@linux.dev,
 souravpanda@google.com, pasha.tatashin@soleen.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, kernel-team@android.com, surenb@google.com

Introduce helper functions which can be used to read-lock a VMA when
holding mmap_lock for read. Replace direct accesses to vma->vm_lock
with these new helpers.

Signed-off-by: Suren Baghdasaryan
---
 include/linux/mm.h | 20 ++++++++++++++++++++
 mm/userfaultfd.c   | 14 ++++++--------
 2 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fecd47239fa9..01ce619f3d17 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -722,6 +722,26 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	return true;
 }
 
+/*
+ * Use only while holding mmap_read_lock which guarantees that nobody can lock
+ * the vma for write (vma_start_write()) from under us.
+ */
+static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
+{
+	mmap_assert_locked(vma->vm_mm);
+	down_read_nested(&vma->vm_lock->lock, subclass);
+}
+
+/*
+ * Use only while holding mmap_read_lock which guarantees that nobody can lock
+ * the vma for write (vma_start_write()) from under us.
+ */
+static inline void vma_start_read_locked(struct vm_area_struct *vma)
+{
+	mmap_assert_locked(vma->vm_mm);
+	down_read(&vma->vm_lock->lock);
+}
+
 static inline void vma_end_read(struct vm_area_struct *vma)
 {
 	rcu_read_lock(); /* keeps vma alive till the end of up_read */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 60a0be33766f..55019c11b5a8 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -86,13 +86,11 @@ static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
 	vma = find_vma_and_prepare_anon(mm, address);
 	if (!IS_ERR(vma)) {
 		/*
+		 * While holding mmap_lock we can't fail
 		 * We cannot use vma_start_read() as it may fail due to
-		 * false locked (see comment in vma_start_read()). We
-		 * can avoid that by directly locking vm_lock under
-		 * mmap_lock, which guarantees that nobody can lock the
-		 * vma for write (vma_start_write()) under us.
+		 * false locked (see comment in vma_start_read()).
 		 */
-		down_read(&vma->vm_lock->lock);
+		vma_start_read_locked(vma);
 	}
 
 	mmap_read_unlock(mm);
@@ -1480,10 +1478,10 @@ static int uffd_move_lock(struct mm_struct *mm,
 		 * See comment in uffd_lock_vma() as to why not using
 		 * vma_start_read() here.
 		 */
-		down_read(&(*dst_vmap)->vm_lock->lock);
+		vma_start_read_locked(*dst_vmap);
 		if (*dst_vmap != *src_vmap)
-			down_read_nested(&(*src_vmap)->vm_lock->lock,
-					 SINGLE_DEPTH_NESTING);
+			vma_start_read_locked_nested(*src_vmap,
+						SINGLE_DEPTH_NESTING);
 	}
 	mmap_read_unlock(mm);
 	return err;
-- 
2.47.0.277.g8800431eea-goog

From nobody Sat Nov 23 18:21:27 2024
Date: Mon, 11 Nov 2024 12:55:04 -0800
In-Reply-To: <20241111205506.3404479-1-surenb@google.com>
References: <20241111205506.3404479-1-surenb@google.com>
Message-ID: <20241111205506.3404479-3-surenb@google.com>
Subject: [PATCH 2/4] mm: move per-vma lock into vm_area_struct
From: Suren Baghdasaryan
To: akpm@linux-foundation.org
Cc: willy@infradead.org, liam.howlett@oracle.com, lorenzo.stoakes@oracle.com,
 mhocko@suse.com, vbabka@suse.cz, hannes@cmpxchg.org, mjguzik@gmail.com,
 oliver.sang@intel.com, mgorman@techsingularity.net, david@redhat.com,
 peterx@redhat.com, oleg@redhat.com, dave@stgolabs.net, paulmck@kernel.org,
 brauner@kernel.org, dhowells@redhat.com, hdanton@sina.com, hughd@google.com,
 minchan@google.com, jannh@google.com, shakeel.butt@linux.dev,
 souravpanda@google.com, pasha.tatashin@soleen.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, kernel-team@android.com, surenb@google.com

Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of the performance regression caused by
false cacheline sharing. Recent investigation [2] revealed that the
regression is limited to a rather old Broadwell microarchitecture and
even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].
This patchset moves vm_lock back into vm_area_struct, aligning it at the
cacheline boundary and changing the cache to be cache-aligned as well.
This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
(vm_lock) bytes to 256 bytes:

    slabinfo before:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vma_lock         ...     40  102    1 : ...
     vm_area_struct   ...    160   51    2 : ...

    slabinfo after moving vm_lock:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vm_area_struct   ...    256   32    2 : ...

Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
which is 5.5MB per 100000 VMAs. This memory consumption growth will be
addressed in the patches that follow.

[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/T/#m861679f3fe0e22c945d6334b88dc996fef5ea6cc
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/

Signed-off-by: Suren Baghdasaryan
---
 include/linux/mm.h       | 27 ++++++++++++----------
 include/linux/mm_types.h |  6 +++--
 kernel/fork.c            | 50 +++++-----------------------------------
 3 files changed, 25 insertions(+), 58 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 01ce619f3d17..c1c2899464db 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -684,6 +684,11 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_lock_init(struct vma_lock *vm_lock)
+{
+	init_rwsem(&vm_lock->lock);
+}
+
 /*
  * Try to read-lock a vma. The function is allowed to occasionally yield false
  * locked result to avoid performance overhead, in which case we fall back to
@@ -701,7 +706,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
 		return false;
 
-	if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
+	if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
 		return false;
 
 	/*
@@ -716,7 +721,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * This pairs with RELEASE semantics in vma_end_write_all().
 	 */
 	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
-		up_read(&vma->vm_lock->lock);
+		up_read(&vma->vm_lock.lock);
 		return false;
 	}
 	return true;
@@ -729,7 +734,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
 {
 	mmap_assert_locked(vma->vm_mm);
-	down_read_nested(&vma->vm_lock->lock, subclass);
+	down_read_nested(&vma->vm_lock.lock, subclass);
 }
 
 /*
@@ -739,13 +744,13 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
 static inline void vma_start_read_locked(struct vm_area_struct *vma)
 {
 	mmap_assert_locked(vma->vm_mm);
-	down_read(&vma->vm_lock->lock);
+	down_read(&vma->vm_lock.lock);
 }
 
 static inline void vma_end_read(struct vm_area_struct *vma)
 {
 	rcu_read_lock(); /* keeps vma alive till the end of up_read */
-	up_read(&vma->vm_lock->lock);
+	up_read(&vma->vm_lock.lock);
 	rcu_read_unlock();
 }
 
@@ -774,7 +779,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 	if (__is_vma_write_locked(vma, &mm_lock_seq))
 		return;
 
-	down_write(&vma->vm_lock->lock);
+	down_write(&vma->vm_lock.lock);
 	/*
 	 * We should use WRITE_ONCE() here because we can have concurrent reads
 	 * from the early lockless pessimistic check in vma_start_read().
@@ -782,7 +787,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
 	 */
 	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
-	up_write(&vma->vm_lock->lock);
+	up_write(&vma->vm_lock.lock);
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -794,7 +799,7 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 
 static inline void vma_assert_locked(struct vm_area_struct *vma)
 {
-	if (!rwsem_is_locked(&vma->vm_lock->lock))
+	if (!rwsem_is_locked(&vma->vm_lock.lock))
 		vma_assert_write_locked(vma);
 }
 
@@ -861,10 +866,6 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
 
 extern const struct vm_operations_struct vma_dummy_vm_ops;
 
-/*
- * WARNING: vma_init does not initialize vma->vm_lock.
- * Use vm_area_alloc()/vm_area_free() if vma needs locking.
- */
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
 	memset(vma, 0, sizeof(*vma));
@@ -873,6 +874,8 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
 	vma_mark_detached(vma, false);
 	vma_numab_state_init(vma);
+	vma_lock_init(&vma->vm_lock);
+	vma->vm_lock_seq = UINT_MAX;
 }
 
 /* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 80fef38d9d64..5c4bfdcfac72 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -716,8 +716,6 @@ struct vm_area_struct {
 	 * slowpath.
 	 */
 	unsigned int vm_lock_seq;
-	/* Unstable RCU readers are allowed to read this. */
-	struct vma_lock *vm_lock;
 #endif
 
 	/*
@@ -770,6 +768,10 @@ struct vm_area_struct {
 	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_PER_VMA_LOCK
+	/* Unstable RCU readers are allowed to read this. */
+	struct vma_lock vm_lock ____cacheline_aligned_in_smp;
+#endif
 } __randomize_layout;
 
 #ifdef CONFIG_NUMA
diff --git a/kernel/fork.c b/kernel/fork.c
index 0061cf2450ef..9e504105f24f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -436,35 +436,6 @@ static struct kmem_cache *vm_area_cachep;
 /* SLAB cache for mm_struct structures (tsk->mm) */
 static struct kmem_cache *mm_cachep;
 
-#ifdef CONFIG_PER_VMA_LOCK
-
-/* SLAB cache for vm_area_struct.lock */
-static struct kmem_cache *vma_lock_cachep;
-
-static bool vma_lock_alloc(struct vm_area_struct *vma)
-{
-	vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
-	if (!vma->vm_lock)
-		return false;
-
-	init_rwsem(&vma->vm_lock->lock);
-	vma->vm_lock_seq = UINT_MAX;
-
-	return true;
-}
-
-static inline void vma_lock_free(struct vm_area_struct *vma)
-{
-	kmem_cache_free(vma_lock_cachep, vma->vm_lock);
-}
-
-#else /* CONFIG_PER_VMA_LOCK */
-
-static inline bool vma_lock_alloc(struct vm_area_struct *vma) { return true; }
-static inline void vma_lock_free(struct vm_area_struct *vma) {}
-
-#endif /* CONFIG_PER_VMA_LOCK */
-
 struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
@@ -474,10 +445,6 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 		return NULL;
 
 	vma_init(vma, mm);
-	if (!vma_lock_alloc(vma)) {
-		kmem_cache_free(vm_area_cachep, vma);
-		return NULL;
-	}
 
 	return vma;
 }
@@ -496,10 +463,8 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	 * will be reinitialized.
 	 */
 	data_race(memcpy(new, orig, sizeof(*new)));
-	if (!vma_lock_alloc(new)) {
-		kmem_cache_free(vm_area_cachep, new);
-		return NULL;
-	}
+	vma_lock_init(&new->vm_lock);
+	new->vm_lock_seq = UINT_MAX;
 	INIT_LIST_HEAD(&new->anon_vma_chain);
 	vma_numab_state_init(new);
 	dup_anon_vma_name(orig, new);
@@ -511,7 +476,6 @@ void __vm_area_free(struct vm_area_struct *vma)
 {
 	vma_numab_state_free(vma);
 	free_anon_vma_name(vma);
-	vma_lock_free(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
 
@@ -522,7 +486,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
 						  vm_rcu);
 
 	/* The vma should not be locked while being destroyed. */
-	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
+	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
 	__vm_area_free(vma);
 }
 #endif
@@ -3168,11 +3132,9 @@ void __init proc_caches_init(void)
 			sizeof(struct fs_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			NULL);
-
-	vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
-#ifdef CONFIG_PER_VMA_LOCK
-	vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
-#endif
+	vm_area_cachep = KMEM_CACHE(vm_area_struct,
+			SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
+			SLAB_ACCOUNT);
 	mmap_init();
 	nsproxy_cache_init();
 }
-- 
2.47.0.277.g8800431eea-goog

From nobody Sat Nov 23 18:21:27 2024
Date: Mon, 11 Nov 2024 12:55:05 -0800
In-Reply-To: <20241111205506.3404479-1-surenb@google.com>
References: <20241111205506.3404479-1-surenb@google.com>
Message-ID: <20241111205506.3404479-4-surenb@google.com>
Subject: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock
From: Suren Baghdasaryan
To: akpm@linux-foundation.org
Cc: willy@infradead.org, liam.howlett@oracle.com, lorenzo.stoakes@oracle.com,
 mhocko@suse.com, vbabka@suse.cz, hannes@cmpxchg.org, mjguzik@gmail.com,
 oliver.sang@intel.com, mgorman@techsingularity.net, david@redhat.com,
 peterx@redhat.com, oleg@redhat.com, dave@stgolabs.net, paulmck@kernel.org,
 brauner@kernel.org, dhowells@redhat.com, hdanton@sina.com, hughd@google.com,
 minchan@google.com, jannh@google.com, shakeel.butt@linux.dev,
 souravpanda@google.com, pasha.tatashin@soleen.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, kernel-team@android.com, surenb@google.com

rw_semaphore is a sizable structure of 40 bytes and consumes
considerable space for each vm_area_struct. However, vma_lock has
two important properties which can be used to replace rw_semaphore
with a simpler structure:
1. Readers never wait. They try to take the vma_lock and fall back to
mmap_lock if that fails.
2. Only one writer at a time will ever try to write-lock a vma_lock
because writers first take mmap_lock in write mode.
Because of these properties, full rw_semaphore functionality is not
needed and we can replace rw_semaphore with an atomic variable.
When a reader takes the read lock, it increments the atomic, unless
either of the top two bits is set, indicating that a writer is present.
When a writer takes the write lock, it sets the VMA_LOCK_WR_LOCKED bit if
there are no readers, or the VMA_LOCK_WR_WAIT bit if readers are holding
the lock, in which case it puts itself onto the newly introduced
mm.vma_writer_wait. Since all writers take mmap_lock in write mode first,
there can be only one writer at a time. The last reader to release the
lock will signal the writer to wake up.
atomic_t might overflow if there are many competing readers, therefore
vma_start_read() implements an overflow check and, if that occurs, it
exits with a failure to lock. vma_start_read_locked{_nested} may cause
an overflow, but it is later handled by __vma_end_read().
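
For readers who want to experiment outside the kernel, the counter protocol
described above can be modeled in user space. What follows is a minimal
sketch using C11 atomics, not the implementation from this patch: struct
model_lock and the model_* functions are hypothetical names, wait queues and
lockdep are left out (a function returns a flag where the kernel code would
sleep or call wake_up()), and the overflow recovery done by __vma_end_read()
is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Bit layout mirrors the patch; every name below is hypothetical. */
#define WR_LOCKED	(1u << 31)
#define WR_WAIT		(1u << 30)
#define RD_COUNT_MASK	(WR_WAIT - 1)

struct model_lock {
	atomic_uint count;
};

/* Reader: never waits; fails if a writer holds or is waiting for the lock. */
static bool model_read_trylock(struct model_lock *l)
{
	unsigned int count = atomic_load(&l->count);

	for (;;) {
		if (count & (WR_LOCKED | WR_WAIT))
			return false;
		if ((count + 1) & (WR_LOCKED | WR_WAIT))
			return false;	/* reader count would overflow */
		/* On failure, count is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&l->count, &count, count + 1))
			return true;
	}
}

/* Returns true if this was the last reader and a writer needs a wake-up. */
static bool model_read_unlock(struct model_lock *l)
{
	unsigned int prev = atomic_fetch_sub(&l->count, 1);
	unsigned int now = prev - 1;

	assert((prev & RD_COUNT_MASK) > 0);
	return (now & RD_COUNT_MASK) == 0 && (now & WR_WAIT) != 0;
}

/* Writer: returns true if locked at once, false if it must wait for readers. */
static bool model_write_begin(struct model_lock *l)
{
	unsigned int count = atomic_load(&l->count);
	unsigned int new;

	do {
		new = count | ((count & RD_COUNT_MASK) ? WR_WAIT : WR_LOCKED);
	} while (!atomic_compare_exchange_weak(&l->count, &count, new));

	return (new & WR_LOCKED) != 0;
}

/* Waiting writer: succeeds only once no readers remain (WR_WAIT -> WR_LOCKED). */
static bool model_write_finish_wait(struct model_lock *l)
{
	unsigned int expected = WR_WAIT;

	return atomic_compare_exchange_strong(&l->count, &expected, WR_LOCKED);
}

static void model_write_unlock(struct model_lock *l)
{
	atomic_store(&l->count, 0);
}
```

In this model a waiting writer would call model_write_finish_wait() again
each time model_read_unlock() reports true, which stands in for the
wait_event()/wake_up() pairing on mm.vma_writer_wait.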

Signed-off-by: Suren Baghdasaryan
---
 include/linux/mm.h        | 142 ++++++++++++++++++++++++++++++++++----
 include/linux/mm_types.h  |  18 ++++-
 include/linux/mmap_lock.h |   3 +
 kernel/fork.c             |   2 +-
 mm/init-mm.c              |   2 +
 5 files changed, 151 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c1c2899464db..27c0e9ba81c4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -686,7 +686,41 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
 #ifdef CONFIG_PER_VMA_LOCK
 static inline void vma_lock_init(struct vma_lock *vm_lock)
 {
-	init_rwsem(&vm_lock->lock);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	static struct lock_class_key lockdep_key;
+
+	lockdep_init_map(&vm_lock->dep_map, "vm_lock", &lockdep_key, 0);
+#endif
+	atomic_set(&vm_lock->count, VMA_LOCK_UNLOCKED);
+}
+
+static inline unsigned int vma_lock_reader_count(unsigned int counter)
+{
+	return counter & VMA_LOCK_RD_COUNT_MASK;
+}
+
+static inline void __vma_end_read(struct mm_struct *mm, struct vm_area_struct *vma)
+{
+	unsigned int count, prev, new;
+
+	count = (unsigned int)atomic_read(&vma->vm_lock.count);
+	for (;;) {
+		if (unlikely(vma_lock_reader_count(count) == 0)) {
+			/*
+			 * Overflow was possible in vma_start_read_locked().
+			 * When detected, wrap around preserving writer bits.
+			 */
+			new = count | ~(VMA_LOCK_WR_LOCKED | VMA_LOCK_WR_WAIT);
+		} else
+			new = count - 1;
+		prev = atomic_cmpxchg(&vma->vm_lock.count, count, new);
+		if (prev == count)
+			break;
+		count = prev;
+	}
+	rwsem_release(&vma->vm_lock.dep_map, _RET_IP_);
+	if (vma_lock_reader_count(new) == 0 && (new & VMA_LOCK_WR_WAIT))
+		wake_up(&mm->vma_writer_wait);
 }
 
 /*
@@ -696,6 +730,9 @@ static inline void vma_lock_init(struct vma_lock *vm_lock)
  */
 static inline bool vma_start_read(struct vm_area_struct *vma)
 {
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned int count, prev, new;
+
 	/*
 	 * Check before locking. A race might cause false locked result.
 	 * We can use READ_ONCE() for the mm_lock_seq here, and don't need
@@ -703,11 +740,35 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * we don't rely on for anything - the mm_lock_seq read against which we
 	 * need ordering is below.
 	 */
-	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
+	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
 		return false;
 
-	if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
-		return false;
+	rwsem_acquire_read(&vma->vm_lock.dep_map, 0, 0, _RET_IP_);
+	count = (unsigned int)atomic_read(&vma->vm_lock.count);
+	for (;;) {
+		/* Is the VMA write-locked or is a writer waiting? */
+		if (count & (VMA_LOCK_WR_LOCKED | VMA_LOCK_WR_WAIT)) {
+			rwsem_release(&vma->vm_lock.dep_map, _RET_IP_);
+			return false;
+		}
+
+		new = count + 1;
+		/* If atomic_t overflows, fail to lock. */
+		if (new & (VMA_LOCK_WR_LOCKED | VMA_LOCK_WR_WAIT)) {
+			rwsem_release(&vma->vm_lock.dep_map, _RET_IP_);
+			return false;
+		}
+
+		/*
+		 * Atomic RMW will provide implicit mb on success to pair with smp_wmb in
+		 * vma_start_write, on failure we retry.
+		 */
+		prev = atomic_cmpxchg(&vma->vm_lock.count, count, new);
+		if (prev == count)
+			break;
+		count = prev;
+	}
+	lock_acquired(&vma->vm_lock.dep_map, _RET_IP_);
 
 	/*
 	 * Overflow might produce false locked result.
@@ -720,8 +781,8 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * after it has been unlocked.
 	 * This pairs with RELEASE semantics in vma_end_write_all().
 	 */
-	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
-		up_read(&vma->vm_lock.lock);
+	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
+		__vma_end_read(mm, vma);
 		return false;
 	}
 	return true;
@@ -733,8 +794,30 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
  */
 static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
 {
-	mmap_assert_locked(vma->vm_mm);
-	down_read_nested(&vma->vm_lock.lock, subclass);
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned int count, prev, new;
+
+	mmap_assert_locked(mm);
+
+	rwsem_acquire_read(&vma->vm_lock.dep_map, subclass, 0, _RET_IP_);
+	count = (unsigned int)atomic_read(&vma->vm_lock.count);
+	for (;;) {
+		/* We are holding mmap_lock, no active or waiting writers are possible. */
+		VM_BUG_ON_VMA(count & (VMA_LOCK_WR_LOCKED | VMA_LOCK_WR_WAIT), vma);
+		new = count + 1;
+		/* Unlikely, but if atomic_t overflows, wrap around to 0. */
+		if (WARN_ON(new & (VMA_LOCK_WR_LOCKED | VMA_LOCK_WR_WAIT)))
+			new = 0;
+		/*
+		 * Atomic RMW will provide implicit mb on success to pair with smp_wmb in
+		 * vma_start_write, on failure we retry.
+ */ + prev =3D atomic_cmpxchg(&vma->vm_lock.count, count, new); + if (prev =3D=3D count) + break; + count =3D prev; + } + lock_acquired(&vma->vm_lock.dep_map, _RET_IP_); } =20 /* @@ -743,14 +826,15 @@ static inline void vma_start_read_locked_nested(struc= t vm_area_struct *vma, int */ static inline void vma_start_read_locked(struct vm_area_struct *vma) { - mmap_assert_locked(vma->vm_mm); - down_read(&vma->vm_lock.lock); + vma_start_read_locked_nested(vma, 0); } =20 static inline void vma_end_read(struct vm_area_struct *vma) { + struct mm_struct *mm =3D vma->vm_mm; + rcu_read_lock(); /* keeps vma alive till the end of up_read */ - up_read(&vma->vm_lock.lock); + __vma_end_read(mm, vma); rcu_read_unlock(); } =20 @@ -774,12 +858,34 @@ static bool __is_vma_write_locked(struct vm_area_stru= ct *vma, unsigned int *mm_l */ static inline void vma_start_write(struct vm_area_struct *vma) { + unsigned int count, prev, new; unsigned int mm_lock_seq; =20 + might_sleep(); if (__is_vma_write_locked(vma, &mm_lock_seq)) return; =20 - down_write(&vma->vm_lock.lock); + rwsem_acquire(&vma->vm_lock.dep_map, 0, 0, _RET_IP_); + count =3D (unsigned int)atomic_read(&vma->vm_lock.count); + for (;;) { + if (vma_lock_reader_count(count) > 0) + new =3D count | VMA_LOCK_WR_WAIT; + else + new =3D count | VMA_LOCK_WR_LOCKED; + prev =3D atomic_cmpxchg(&vma->vm_lock.count, count, new); + if (prev =3D=3D count) + break; + count =3D prev; + } + if (new & VMA_LOCK_WR_WAIT) { + lock_contended(&vma->vm_lock.dep_map, _RET_IP_); + wait_event(vma->vm_mm->vma_writer_wait, + atomic_cmpxchg(&vma->vm_lock.count, VMA_LOCK_WR_WAIT, + VMA_LOCK_WR_LOCKED) =3D=3D VMA_LOCK_WR_WAIT); + + } + lock_acquired(&vma->vm_lock.dep_map, _RET_IP_); + /* * We should use WRITE_ONCE() here because we can have concurrent reads * from the early lockless pessimistic check in vma_start_read(). 
@@ -787,7 +893,10 @@ static inline void vma_start_write(struct vm_area_stru= ct *vma) * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy. */ WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq); - up_write(&vma->vm_lock.lock); + /* Write barrier to ensure vm_lock_seq change is visible before count */ + smp_wmb(); + rwsem_release(&vma->vm_lock.dep_map, _RET_IP_); + atomic_set(&vma->vm_lock.count, VMA_LOCK_UNLOCKED); } =20 static inline void vma_assert_write_locked(struct vm_area_struct *vma) @@ -797,9 +906,14 @@ static inline void vma_assert_write_locked(struct vm_a= rea_struct *vma) VM_BUG_ON_VMA(!__is_vma_write_locked(vma, &mm_lock_seq), vma); } =20 +static inline bool is_vma_read_locked(struct vm_area_struct *vma) +{ + return vma_lock_reader_count((unsigned int)atomic_read(&vma->vm_lock.coun= t)) > 0; +} + static inline void vma_assert_locked(struct vm_area_struct *vma) { - if (!rwsem_is_locked(&vma->vm_lock.lock)) + if (!is_vma_read_locked(vma)) vma_assert_write_locked(vma); } =20 diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 5c4bfdcfac72..789bccc05520 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -615,8 +615,23 @@ static inline struct anon_vma_name *anon_vma_name_allo= c(const char *name) } #endif =20 +#define VMA_LOCK_UNLOCKED 0 +#define VMA_LOCK_WR_LOCKED (1 << 31) +#define VMA_LOCK_WR_WAIT (1 << 30) + +#define VMA_LOCK_RD_COUNT_MASK (VMA_LOCK_WR_WAIT - 1) + struct vma_lock { - struct rw_semaphore lock; + /* + * count & VMA_LOCK_RD_COUNT_MASK > 0 =3D=3D> read-locked with 'count' nu= mber of readers + * count & VMA_LOCK_WR_LOCKED !=3D 0 =3D=3D> write-locked + * count & VMA_LOCK_WR_WAIT !=3D 0 =3D=3D> writer is waiting + * count =3D 0 =3D=3D> unlocked + */ + atomic_t count; +#ifdef CONFIG_DEBUG_LOCK_ALLOC + struct lockdep_map dep_map; +#endif }; =20 struct vma_numab_state { @@ -883,6 +898,7 @@ struct mm_struct { * by mmlist_lock */ #ifdef CONFIG_PER_VMA_LOCK + struct wait_queue_head vma_writer_wait; /* * 
This field has lock-like semantics, meaning it is sometimes * accessed with ACQUIRE/RELEASE semantics. diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h index 58dde2e35f7e..769ab97fff3e 100644 --- a/include/linux/mmap_lock.h +++ b/include/linux/mmap_lock.h @@ -121,6 +121,9 @@ static inline void mmap_init_lock(struct mm_struct *mm) { init_rwsem(&mm->mmap_lock); mm_lock_seqcount_init(mm); +#ifdef CONFIG_PER_VMA_LOCK + init_waitqueue_head(&mm->vma_writer_wait); +#endif } =20 static inline void mmap_write_lock(struct mm_struct *mm) diff --git a/kernel/fork.c b/kernel/fork.c index 9e504105f24f..726050c557e2 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -486,7 +486,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head) vm_rcu); =20 /* The vma should not be locked while being destroyed. */ - VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma); + VM_BUG_ON_VMA(is_vma_read_locked(vma), vma); __vm_area_free(vma); } #endif diff --git a/mm/init-mm.c b/mm/init-mm.c index 6af3ad675930..db058873ba18 100644 --- a/mm/init-mm.c +++ b/mm/init-mm.c @@ -40,6 +40,8 @@ struct mm_struct init_mm =3D { .arg_lock =3D __SPIN_LOCK_UNLOCKED(init_mm.arg_lock), .mmlist =3D LIST_HEAD_INIT(init_mm.mmlist), #ifdef CONFIG_PER_VMA_LOCK + .vma_writer_wait =3D + __WAIT_QUEUE_HEAD_INITIALIZER(init_mm.vma_writer_wait), .mm_lock_seq =3D SEQCNT_ZERO(init_mm.mm_lock_seq), #endif .user_ns =3D &init_user_ns, --=20 2.47.0.277.g8800431eea-goog From nobody Sat Nov 23 18:21:27 2024 Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1C0E31F6692 for ; Mon, 11 Nov 2024 20:55:19 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1731358521; cv=none; 
b=oQYUSpo4ObqfWEA7+0KzuXaitD28UU2St01xZuwX6NsoS2UNLl2HFVt4O59BDcqOAI/IuotCXDJWpfIMTi0mc/2K0SgjrMsE5QJnj2TL9iAcrAWYLt7kWbvXL5Fkc9MbQBuDbcNSHI11ZzD2dZEeKE2JWBeb1oH2ZaAjlt5E0ec= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1731358521; c=relaxed/simple; bh=uX6t1+V+VWD9TPr8MaOqU9mwvP8DvU6pbc0DsbnErqI=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=XmWpD4Txr8xqkHxx6MoRH3uBuSs/9n6+5TJk2MrykX7TeoiBY6/9cgzqoEZAA7dDit4q8lmniL/ryarR0Rf5X/ZiqsBT17aYpuS6TGM+HAQCyV0z+opLiI8/9ClfBOZ1yI1gT7OoladgdLpnmIhTIU/hgN6d3blcFYwA9WK5CFI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--surenb.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=0Ep/fl7n; arc=none smtp.client-ip=209.85.219.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--surenb.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="0Ep/fl7n" Received: by mail-yb1-f202.google.com with SMTP id 3f1490d57ef6-e33152c8225so10007122276.0 for ; Mon, 11 Nov 2024 12:55:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1731358519; x=1731963319; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=akNUNHRHwFgLbLvzRySURA4LLcUajIyMqiu+hcl7K7Q=; b=0Ep/fl7n+5y5UYg5kNDRcQfH7L0ChD/RLAJMwgzG8n09KSJim+JgsGNR4cHFyaO6AX f8x7w0yHc0Hx5z5Qh4ybw4/i2bRK/ZVhFkZlqCYrI8EEiEC1x/BZhBKF82d04XdPc0SK rDwilq0tQinXEqi2dxySBQtUWiHN+5FEbRNuQx+jLgt+lUMQPIuPirzy0hwlhiRXcEGE VHkGGpO5jmjZHuTVArFdTHljY5KpRMqY87VK+2V8HoDwE6/jHQcwvlIV4BfPjXdzleyz 
dgEe4+sXrDMQP250AafQJQjZombVbA4JmsPtQgxGV972JXFMDDMQJQemjNGer3vc2Ybv iwqg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1731358519; x=1731963319; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=akNUNHRHwFgLbLvzRySURA4LLcUajIyMqiu+hcl7K7Q=; b=mNWG/BJJZkyn9As7NAL5bFLUs8Q6HR2s6+e1+C8m/K5xV9pkuw/+Sd2jwh1deu3863 2Y0BfBUH585uSD4SRQ+Hxk5Q0fk/8otvIQyBTEyduxLh73E+UmkF3KTEPLTyDgX/zi4T tYHR5u5wVDMTqbrUNF2o5fuQvE2UfrP9saRkyB4iN0gwFafu7jMfictqbSPqvdAOBW4P 2m4SN13zsUo06/drg0I3MATrzVXA25SY851oNYtDt9yPIEtQmEsIoRfhxUtjkdHoq4ow Ks73H4KzWohTMRpnQ+2uMqN7VBjxO7CCOEMm3+d5/UUY400Av2pYxrMmpvoI9iDAfZNE wx7g== X-Forwarded-Encrypted: i=1; AJvYcCU5wiu/OYEekiOBNWXKdyzfQPdch5KJeUDg3EcgdT2hgbVyHjuN8LBqzQthqSC2NiC3tSM6/CqNFC8GqJw=@vger.kernel.org X-Gm-Message-State: AOJu0YwVyPyvpaHhI230+ygWG0deL/LL965pnEjJHamOquc7Jm3DNj7M MELiUeBxbpuoivSK0iSP8amhLDX7NVNwX9KnkK2ttLlNSXh/EHNIrWHNnzytd4QNhiSPE4b5LnV B2Q== X-Google-Smtp-Source: AGHT+IHpj1lpqX7UFCEsgQvhYlvsKEu/LqdCTRcWojtEX3DucdiIWzMOj6Fj6hWiR/dHK1I4p9BRUvnrioQ= X-Received: from surenb-desktop.mtv.corp.google.com ([2a00:79e0:2e3f:8:53af:d9fa:522d:99b1]) (user=surenb job=sendgmr) by 2002:a25:bcc2:0:b0:e2b:e955:d58a with SMTP id 3f1490d57ef6-e337f8fa240mr13216276.7.1731358518983; Mon, 11 Nov 2024 12:55:18 -0800 (PST) Date: Mon, 11 Nov 2024 12:55:06 -0800 In-Reply-To: <20241111205506.3404479-1-surenb@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20241111205506.3404479-1-surenb@google.com> X-Mailer: git-send-email 2.47.0.277.g8800431eea-goog Message-ID: <20241111205506.3404479-5-surenb@google.com> Subject: [PATCH 4/4] mm: move lesser used vma_area_struct members into the last cacheline From: Suren Baghdasaryan To: akpm@linux-foundation.org Cc: willy@infradead.org, liam.howlett@oracle.com, lorenzo.stoakes@oracle.com, 
 mhocko@suse.com, vbabka@suse.cz, hannes@cmpxchg.org, mjguzik@gmail.com,
 oliver.sang@intel.com, mgorman@techsingularity.net, david@redhat.com,
 peterx@redhat.com, oleg@redhat.com, dave@stgolabs.net, paulmck@kernel.org,
 brauner@kernel.org, dhowells@redhat.com, hdanton@sina.com, hughd@google.com,
 minchan@google.com, jannh@google.com, shakeel.butt@linux.dev,
 souravpanda@google.com, pasha.tatashin@soleen.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, kernel-team@android.com, surenb@google.com
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Move several vm_area_struct members which are rarely or never used
during page fault handling into the last cacheline to better pack
vm_area_struct. As a result vm_area_struct will fit into 3 cachelines
as opposed to 4 cachelines before this change. New vm_area_struct layout:

struct vm_area_struct {
	union {
		struct {
			long unsigned int vm_start;              /*     0     8 */
			long unsigned int vm_end;                /*     8     8 */
		};                                               /*     0    16 */
		struct callback_head vm_rcu ;                    /*     0    16 */
	} __attribute__((__aligned__(8)));                       /*     0    16 */
	struct mm_struct * vm_mm;                                /*    16     8 */
	pgprot_t vm_page_prot;                                   /*    24     8 */
	union {
		const vm_flags_t vm_flags;                       /*    32     8 */
		vm_flags_t __vm_flags;                           /*    32     8 */
	};                                                       /*    32     8 */
	bool detached;                                           /*    40     1 */

	/* XXX 3 bytes hole, try to pack */

	unsigned int vm_lock_seq;                                /*    44     4 */
	struct list_head anon_vma_chain;                         /*    48    16 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct anon_vma * anon_vma;                              /*    64     8 */
	const struct vm_operations_struct * vm_ops;              /*    72     8 */
	long unsigned int vm_pgoff;                              /*    80     8 */
	struct file * vm_file;                                   /*    88     8 */
	void * vm_private_data;                                  /*    96     8 */
	atomic_long_t swap_readahead_info;                       /*   104     8 */
	struct mempolicy * vm_policy;                            /*   112     8 */

	/* XXX 8 bytes hole, try to pack */

	/* --- cacheline 2 boundary (128 bytes) --- */
	struct vma_lock vm_lock (__aligned__(64));               /*   128     4 */

	/* XXX 4 bytes hole, try to pack */

	struct {
		struct rb_node rb (__aligned__(8));              /*   136    24 */
		long unsigned int rb_subtree_last;               /*   160     8 */
	} __attribute__((__aligned__(8))) shared;                /*   136    32 */
	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;            /*   168     0 */

	/* size: 192, cachelines: 3, members: 17 */
	/* sum members: 153, holes: 3, sum holes: 15 */
	/* padding: 24 */
	/* forced alignments: 3, forced holes: 2, sum forced holes: 12 */
} __attribute__((__aligned__(64)));

Memory consumption per 1000 VMAs becomes 48 pages:

slabinfo after vm_area_struct changes:
 ... : ...
 vm_area_struct ... 192 42 2 : ...

Signed-off-by: Suren Baghdasaryan
---
 include/linux/mm_types.h | 37 ++++++++++++++++++-------------------
 1 file changed, 18 insertions(+), 19 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 789bccc05520..c3755b680911 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -733,16 +733,6 @@ struct vm_area_struct {
 	unsigned int vm_lock_seq;
 #endif
 
-	/*
-	 * For areas with an address space and backing store,
-	 * linkage into the address_space->i_mmap interval tree.
-	 *
-	 */
-	struct {
-		struct rb_node rb;
-		unsigned long rb_subtree_last;
-	} shared;
-
 	/*
 	 * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
 	 * list, after a COW of one of the file pages. A MAP_SHARED vma
@@ -762,14 +752,6 @@ struct vm_area_struct {
 	struct file * vm_file; /* File we map to (can be NULL). */
 	void * vm_private_data; /* was vm_pte (shared mem) */
 
-#ifdef CONFIG_ANON_VMA_NAME
-	/*
-	 * For private and shared anonymous mappings, a pointer to a null
-	 * terminated string containing the name given to the vma, or NULL if
-	 * unnamed. Serialized by mmap_lock. Use anon_vma_name to access.
-	 */
-	struct anon_vma_name *anon_name;
-#endif
 #ifdef CONFIG_SWAP
 	atomic_long_t swap_readahead_info;
 #endif
@@ -782,11 +764,28 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA_BALANCING
 	struct vma_numab_state *numab_state; /* NUMA Balancing state */
 #endif
-	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 #ifdef CONFIG_PER_VMA_LOCK
 	/* Unstable RCU readers are allowed to read this. */
 	struct vma_lock vm_lock ____cacheline_aligned_in_smp;
 #endif
+	/*
+	 * For areas with an address space and backing store,
+	 * linkage into the address_space->i_mmap interval tree.
+	 *
+	 */
+	struct {
+		struct rb_node rb;
+		unsigned long rb_subtree_last;
+	} shared;
+#ifdef CONFIG_ANON_VMA_NAME
+	/*
+	 * For private and shared anonymous mappings, a pointer to a null
+	 * terminated string containing the name given to the vma, or NULL if
+	 * unnamed. Serialized by mmap_lock. Use anon_vma_name to access.
+	 */
+	struct anon_vma_name *anon_name;
+#endif
+	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 } __randomize_layout;
 
 #ifdef CONFIG_NUMA
-- 
2.47.0.277.g8800431eea-goog
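
The "48 pages per 1000 VMAs" figure in the commit message can be cross-checked from the slabinfo numbers. What follows is only a rough sketch, not part of the patch. It assumes that the three values on the vm_area_struct line are object size in bytes, objects per slab and pages per slab, that pages are 4 KiB, and that per-slab metadata can be ignored. The helper names objs_per_slab() and pages_for_objs() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: how many objects of obj_size bytes fit into one
 * slab spanning pages_per_slab pages (slab metadata is ignored here).
 */
static size_t objs_per_slab(size_t obj_size, size_t page_size,
			    size_t pages_per_slab)
{
	assert(obj_size > 0);
	return (page_size * pages_per_slab) / obj_size;
}

/*
 * Hypothetical helper: pages consumed by nr_objs objects when memory is
 * handed out in whole slabs only.
 */
static size_t pages_for_objs(size_t nr_objs, size_t obj_size,
			     size_t page_size, size_t pages_per_slab)
{
	size_t per_slab = objs_per_slab(obj_size, page_size, pages_per_slab);
	size_t slabs;

	assert(per_slab > 0);
	/* Round up: a partially filled slab still costs all of its pages. */
	slabs = (nr_objs + per_slab - 1) / per_slab;
	return slabs * pages_per_slab;
}
```

Under those assumptions, objs_per_slab(192, 4096, 2) gives 42, which matches the slabinfo line, and pages_for_objs(1000, 192, 4096, 2) gives 24 slabs of 2 pages each, i.e. the 48 pages quoted above.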