Date: Fri, 18 Apr 2025 10:49:58 -0700
In-Reply-To: <20250418174959.1431962-1-surenb@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20250418174959.1431962-1-surenb@google.com>
X-Mailer: git-send-email 2.49.0.805.g082f7c87e0-goog
Message-ID: <20250418174959.1431962-8-surenb@google.com>
Subject: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
From: Suren Baghdasaryan
To: akpm@linux-foundation.org
Cc: Liam.Howlett@oracle.com, lorenzo.stoakes@oracle.com, david@redhat.com,
 vbabka@suse.cz, peterx@redhat.com, jannh@google.com, hannes@cmpxchg.org,
 mhocko@kernel.org, paulmck@kernel.org, shuah@kernel.org,
 adobriyan@gmail.com, brauner@kernel.org, josef@toxicpanda.com,
 yebin10@huawei.com, linux@weissschuh.net, willy@infradead.org,
 osalvador@suse.de, andrii@kernel.org, ryan.roberts@arm.com,
 christophe.leroy@csgroup.eu, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
 linux-kselftest@vger.kernel.org, surenb@google.com
Content-Type: text/plain; charset="utf-8"

With maple_tree supporting vma tree traversal under RCU and vma and
its important members being RCU-safe, /proc/pid/maps can be read under
RCU and without the need to read-lock mmap_lock. However vma content
can change from under us, therefore we make a copy of the vma and we
pin pointer fields used when generating the output (currently only
vm_file and anon_name). Afterwards we check for concurrent address
space modifications, wait for them to end and retry. While we take
the mmap_lock for reading during such contention, we do that momentarily
only to record new mm_wr_seq counter. This change is designed to reduce
mmap_lock contention and prevent a process reading /proc/pid/maps files
(often a low priority task, such as monitoring/data collection services)
from blocking address space updates.
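
For illustration only, the copy-then-validate scheme described above can
be sketched in userspace C. This is not the kernel code: struct region,
struct space and every function below are hypothetical stand-ins, loosely
modelled on mmap_lock_speculate_try_begin() and
mmap_lock_speculate_retry(), and the sketch assumes it is exercised from
a single thread.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the vma fields that end up in the output. */
struct region {
	unsigned long start;
	unsigned long end;
};

/* Hypothetical stand-in for mm_struct: wr_seq is odd while a writer runs. */
struct space {
	atomic_uint wr_seq;
	struct region r;
};

/* Record the current sequence; fails if a writer is in progress. */
static bool speculate_try_begin(struct space *s, unsigned int *seq)
{
	*seq = atomic_load_explicit(&s->wr_seq, memory_order_acquire);
	return (*seq & 1) == 0;
}

/* Returns true if the address space changed since seq was recorded. */
static bool speculate_retry(struct space *s, unsigned int seq)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&s->wr_seq, memory_order_relaxed) != seq;
}

/* Writers bump the counter before and after modifying the region. */
static void write_begin(struct space *s) { atomic_fetch_add(&s->wr_seq, 1); }
static void write_end(struct space *s) { atomic_fetch_add(&s->wr_seq, 1); }

/*
 * Copy first, validate afterwards. A false return means the copy may be
 * stale and the caller should record a fresh sequence and try again.
 * Assumption: a real concurrent writer would need more care than this
 * plain memcpy() provides.
 */
static bool snapshot_region(struct space *s, unsigned int seq,
			    struct region *copy)
{
	memcpy(copy, &s->r, sizeof(*copy));
	return !speculate_retry(s, seq);
}
```

A caller would record a sequence with speculate_try_begin(), call
snapshot_region(), and on failure record a new sequence and repeat, which
mirrors the retry loop this patch adds around the vma copy.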
Note that this change has a userspace visible disadvantage: it allows
for sub-page data tearing as opposed to the previous mechanism where
data tearing could happen only between pages of generated output data.
Since current userspace considers data tearing between pages to be
acceptable, we assume it will be able to handle sub-page data tearing
as well.

Signed-off-by: Suren Baghdasaryan
---
 fs/proc/internal.h        |   6 ++
 fs/proc/task_mmu.c        | 170 ++++++++++++++++++++++++++++++++++----
 include/linux/mm_inline.h |  18 ++++
 3 files changed, 177 insertions(+), 17 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 96122e91c645..6e1169c1f4df 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -379,6 +379,12 @@ struct proc_maps_private {
 	struct task_struct *task;
 	struct mm_struct *mm;
 	struct vma_iterator iter;
+	bool mmap_locked;
+	loff_t last_pos;
+#ifdef CONFIG_PER_VMA_LOCK
+	unsigned int mm_wr_seq;
+	struct vm_area_struct vma_copy;
+#endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *task_mempolicy;
 #endif
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b9e4fbbdf6e6..f9d50a61167c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -127,13 +127,130 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
 }
 #endif
 
-static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
-						loff_t *ppos)
+#ifdef CONFIG_PER_VMA_LOCK
+
+static const struct seq_operations proc_pid_maps_op;
+
+/*
+ * Take VMA snapshot and pin vm_file and anon_name as they are used by
+ * show_map_vma.
+ */
+static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
+{
+	struct vm_area_struct *copy = &priv->vma_copy;
+	int ret = -EAGAIN;
+
+	memcpy(copy, vma, sizeof(*vma));
+	if (copy->vm_file && !get_file_rcu(&copy->vm_file))
+		goto out;
+
+	if (!anon_vma_name_get_if_valid(copy))
+		goto put_file;
+
+	if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq))
+		return 0;
+
+	/* Address space got modified, vma might be stale. Re-lock and retry. */
+	rcu_read_unlock();
+	ret = mmap_read_lock_killable(priv->mm);
+	if (!ret) {
+		/* mmap_lock_speculate_try_begin() succeeds when holding mmap_read_lock */
+		mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
+		mmap_read_unlock(priv->mm);
+		ret = -EAGAIN;
+	}
+
+	rcu_read_lock();
+
+	anon_vma_name_put_if_valid(copy);
+put_file:
+	if (copy->vm_file)
+		fput(copy->vm_file);
+out:
+	return ret;
+}
+
+static void put_vma_snapshot(struct proc_maps_private *priv)
+{
+	struct vm_area_struct *vma = &priv->vma_copy;
+
+	anon_vma_name_put_if_valid(vma);
+	if (vma->vm_file)
+		fput(vma->vm_file);
+}
+
+static inline bool drop_mmap_lock(struct seq_file *m, struct proc_maps_private *priv)
+{
+	/*
+	 * smaps and numa_maps perform page table walk, therefore require
+	 * mmap_lock but maps can be read under RCU.
+	 */
+	if (m->op != &proc_pid_maps_op)
+		return false;
+
+	/* mmap_lock_speculate_try_begin() succeeds when holding mmap_read_lock */
+	mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
+	mmap_read_unlock(priv->mm);
+	rcu_read_lock();
+	memset(&priv->vma_copy, 0, sizeof(priv->vma_copy));
+
+	return true;
+}
+
+static struct vm_area_struct *get_stable_vma(struct vm_area_struct *vma,
+					     struct proc_maps_private *priv,
+					     loff_t last_pos)
+{
+	int ret;
+
+	put_vma_snapshot(priv);
+	while ((ret = get_vma_snapshot(priv, vma)) == -EAGAIN) {
+		/* lookup the vma at the last position again */
+		vma_iter_init(&priv->iter, priv->mm, last_pos);
+		vma = vma_next(&priv->iter);
+	}
+
+	return ret ? ERR_PTR(ret) : &priv->vma_copy;
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+/* Without per-vma locks VMA access is not RCU-safe */
+static inline bool drop_mmap_lock(struct seq_file *m,
+				  struct proc_maps_private *priv)
+{
+	return false;
+}
+
+static struct vm_area_struct *get_stable_vma(struct vm_area_struct *vma,
+					     struct proc_maps_private *priv,
+					     loff_t last_pos)
+{
+	return vma;
+}
+
+#endif /* CONFIG_PER_VMA_LOCK */
+
+static struct vm_area_struct *proc_get_vma(struct seq_file *m, loff_t *ppos)
 {
+	struct proc_maps_private *priv = m->private;
 	struct vm_area_struct *vma = vma_next(&priv->iter);
 
+	if (vma && !priv->mmap_locked)
+		vma = get_stable_vma(vma, priv, *ppos);
+
+	if (IS_ERR(vma))
+		return vma;
+
 	if (vma) {
-		*ppos = vma->vm_start;
+		/* Store previous position to be able to restart if needed */
+		priv->last_pos = *ppos;
+		/*
+		 * Track the end of the reported vma to ensure position changes
+		 * even if previous vma was merged with the next vma and we
+		 * found the extended vma with the same vm_start.
+		 */
+		*ppos = vma->vm_end;
 	} else {
 		*ppos = -2UL;
 		vma = get_gate_vma(priv->mm);
@@ -148,6 +265,7 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
 	unsigned long last_addr = *ppos;
 	struct mm_struct *mm;
 
+	priv->mmap_locked = true;
 	/* See m_next(). Zero at the start or after lseek. */
 	if (last_addr == -1UL)
 		return NULL;
@@ -170,12 +288,18 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
 		return ERR_PTR(-EINTR);
 	}
 
+	/* Drop mmap_lock if possible */
+	if (drop_mmap_lock(m, priv))
+		priv->mmap_locked = false;
+
+	if (last_addr > 0)
+		*ppos = last_addr = priv->last_pos;
 	vma_iter_init(&priv->iter, mm, last_addr);
 	hold_task_mempolicy(priv);
 	if (last_addr == -2UL)
 		return get_gate_vma(mm);
 
-	return proc_get_vma(priv, ppos);
+	return proc_get_vma(m, ppos);
 }
 
 static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
@@ -184,7 +308,7 @@ static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
 		*ppos = -1UL;
 		return NULL;
 	}
-	return proc_get_vma(m->private, ppos);
+	return proc_get_vma(m, ppos);
 }
 
 static void m_stop(struct seq_file *m, void *v)
@@ -196,7 +320,10 @@ static void m_stop(struct seq_file *m, void *v)
 		return;
 
 	release_task_mempolicy(priv);
-	mmap_read_unlock(mm);
+	if (priv->mmap_locked)
+		mmap_read_unlock(mm);
+	else
+		rcu_read_unlock();
 	mmput(mm);
 	put_task_struct(priv->task);
 	priv->task = NULL;
@@ -243,14 +370,20 @@ static int do_maps_open(struct inode *inode, struct file *file,
 static void get_vma_name(struct vm_area_struct *vma,
 			 const struct path **path,
 			 const char **name,
-			 const char **name_fmt)
+			 const char **name_fmt, bool mmap_locked)
 {
-	struct anon_vma_name *anon_name = vma->vm_mm ? anon_vma_name(vma) : NULL;
+	struct anon_vma_name *anon_name;
 
 	*name = NULL;
 	*path = NULL;
 	*name_fmt = NULL;
 
+	if (vma->vm_mm)
+		anon_name = mmap_locked ? anon_vma_name(vma) :
+					  anon_vma_name_get_rcu(vma);
+	else
+		anon_name = NULL;
+
 	/*
 	 * Print the dentry name for named mappings, and a
 	 * special [heap] marker for the heap:
@@ -266,39 +399,41 @@ static void get_vma_name(struct vm_area_struct *vma,
 		} else {
 			*path = file_user_path(vma->vm_file);
 		}
-		return;
+		goto out;
 	}
 
 	if (vma->vm_ops && vma->vm_ops->name) {
 		*name = vma->vm_ops->name(vma);
 		if (*name)
-			return;
+			goto out;
 	}
 
 	*name = arch_vma_name(vma);
 	if (*name)
-		return;
+		goto out;
 
 	if (!vma->vm_mm) {
 		*name = "[vdso]";
-		return;
+		goto out;
 	}
 
 	if (vma_is_initial_heap(vma)) {
 		*name = "[heap]";
-		return;
+		goto out;
 	}
 
 	if (vma_is_initial_stack(vma)) {
 		*name = "[stack]";
-		return;
+		goto out;
 	}
 
 	if (anon_name) {
 		*name_fmt = "[anon:%s]";
 		*name = anon_name->name;
-		return;
 	}
+out:
+	if (anon_name && !mmap_locked)
+		anon_vma_name_put(anon_name);
 }
 
 static void show_vma_header_prefix(struct seq_file *m,
@@ -324,6 +459,7 @@ static void show_vma_header_prefix(struct seq_file *m,
 static void
 show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 {
+	struct proc_maps_private *priv = m->private;
 	const struct path *path;
 	const char *name_fmt, *name;
 	vm_flags_t flags = vma->vm_flags;
@@ -344,7 +480,7 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 	end = vma->vm_end;
 	show_vma_header_prefix(m, start, end, flags, pgoff, dev, ino);
 
-	get_vma_name(vma, &path, &name, &name_fmt);
+	get_vma_name(vma, &path, &name, &name_fmt, priv->mmap_locked);
 	if (path) {
 		seq_pad(m, ' ');
 		seq_path(m, path, "\n");
@@ -549,7 +685,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
 		const char *name_fmt;
 		size_t name_sz = 0;
 
-		get_vma_name(vma, &path, &name, &name_fmt);
+		get_vma_name(vma, &path, &name, &name_fmt, true);
 
 		if (path || name_fmt || name) {
 			name_buf = kmalloc(name_buf_sz, GFP_KERNEL);
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 9ac2d92d7ede..436512f1e759 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -434,6 +434,21 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
 
 struct anon_vma_name *anon_vma_name_get_rcu(struct vm_area_struct *vma);
 
+/*
+ * Takes a reference if anon_vma is valid and stable (has references).
+ * Fails only if anon_vma is valid but we failed to get a reference.
+ */
+static inline bool anon_vma_name_get_if_valid(struct vm_area_struct *vma)
+{
+	return !vma->anon_name || anon_vma_name_get_rcu(vma);
+}
+
+static inline void anon_vma_name_put_if_valid(struct vm_area_struct *vma)
+{
+	if (vma->anon_name)
+		anon_vma_name_put(vma->anon_name);
+}
+
 #else /* CONFIG_ANON_VMA_NAME */
 static inline void anon_vma_name_get(struct anon_vma_name *anon_name) {}
 static inline void anon_vma_name_put(struct anon_vma_name *anon_name) {}
@@ -453,6 +468,9 @@ struct anon_vma_name *anon_vma_name_get_rcu(struct vm_area_struct *vma)
 	return NULL;
 }
 
+static inline bool anon_vma_name_get_if_valid(struct vm_area_struct *vma) { return true; }
+static inline void anon_vma_name_put_if_valid(struct vm_area_struct *vma) {}
+
 #endif /* CONFIG_ANON_VMA_NAME */
 
 static inline void init_tlb_flush_pending(struct mm_struct *mm)
-- 
2.49.0.805.g082f7c87e0-goog