From nobody Tue Dec 16 16:55:47 2025
Received: from mail-yb1-f201.google.com (mail-yb1-f201.google.com [209.85.219.201])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id D923918E1A
	for ; Mon, 15 Jan 2024 18:38:43 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--surenb.bounces.google.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="p74byPgz"
Received: by mail-yb1-f201.google.com with SMTP id 3f1490d57ef6-dbeba57a668so12636044276.3
	for ; Mon, 15 Jan 2024 10:38:43 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=google.com; s=20230601; t=1705343923; x=1705948723; darn=vger.kernel.org;
	h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
	 :date:from:to:cc:subject:date:message-id:reply-to;
	bh=oLlMsVPwk9YFRfJ1xkZa3gURgHKOqqqaOLfJjGTa1Bw=;
	b=p74byPgz+dtfmyc2Mci897DKkR+m9CFVuXWEUMpI/9zbFHW+i3VJk1TmMonhR67NQP
	 EA3jqiVrcAUnodwE/e/sqHUNWr+8WZsjqqeZOfBLV9leh4Aoh5Vp65L5mRtPZkCnd2Vr
	 qQ/pRb9jy+L2tq5pwDaluDl4vsiGbEENgKDSH4n7nb4mOwgFnXU94kqMCKxOz1rZTqcI
	 X6FhfuCQtAGWTby0uPZuFGwf89jUk45W7ycUVKanPQPlKF94By7TIN3VUaNqk6i02VRa
	 M/kILqSzbLDbqWYkd//QJ+WFzzQzGuwqpoCQDaN/OtlvO3FaTf6sRR+8yRceHP4tEAH/
	 FWOQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20230601; t=1705343923; x=1705948723;
	h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
	 :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
	bh=oLlMsVPwk9YFRfJ1xkZa3gURgHKOqqqaOLfJjGTa1Bw=;
	b=LEPoTjuyY8o1LSlpxprMT+praWACv8/MiLDWFUQqNTmr3NdynYbQZHmJfzo/W0h5GW
	 ETzF8NXN9FI0+erBUIgEeUW9BoJo2hyNPrX7wgbN9UIvJM1t+agiyWEIKborHxsSv+5H
	 nshYY8S/wqEMNalS2kNdUZeFNNcYSle9jsgHJ2ByRoA13hjr9SFlAnaJ0Fr0sXQviyIk
	 HLrcVJlwZVtmHNnODTEk9lrC4S5Ci8UICWzWdQ8AlbW6tVAp8aFDEvddxPPWcBz9WXsr
	 RZo9TwxFPDkjI8M+qfWG/SfnCPcns9cUHCU/T0ZHhkYggABr4ghLNmE+ItC9kof6iwgc
	 51vw==
X-Gm-Message-State: AOJu0YwY6ksmRUdB9Y1koPu/X0MOegZ3oLQTazGahwsfNJ7cP2EoJSaW
	e0XKrw1wpSatmVpP5m2Ebj9bgiGFfVUyziN8SQ==
X-Google-Smtp-Source: AGHT+IETYM+vP+GcA1bz5ffGAn7xpFLG8rQcpZyAKLVas4y2HqU4DVH4yHlOoP/EqERN0tEXwr857q4NqX0=
X-Received: from surenb-desktop.mtv.corp.google.com ([2620:15c:211:201:3af2:e48e:2785:270])
	(user=surenb job=sendgmr) by 2002:a25:a292:0:b0:dc1:f71f:a0ad with SMTP id
	c18-20020a25a292000000b00dc1f71fa0admr1244231ybi.13.1705343922947; Mon, 15 Jan 2024 10:38:42 -0800 (PST)
Date: Mon, 15 Jan 2024 10:38:34 -0800
In-Reply-To: <20240115183837.205694-1-surenb@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
Mime-Version: 1.0
References: <20240115183837.205694-1-surenb@google.com>
X-Mailer: git-send-email 2.43.0.381.gb435a96ce8-goog
Message-ID: <20240115183837.205694-2-surenb@google.com>
Subject: [RFC 1/3] mm: make vm_area_struct anon_name field RCU-safe
From: Suren Baghdasaryan
To: akpm@linux-foundation.org
Cc: viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, dchinner@redhat.com,
	casey@schaufler-ca.com, ben.wolsieffer@hefring.com, paulmck@kernel.org,
	david@redhat.com, avagin@google.com, usama.anjum@collabora.com,
	peterx@redhat.com, hughd@google.com, ryan.roberts@arm.com,
	wangkefeng.wang@huawei.com, Liam.Howlett@Oracle.com, yuzhao@google.com,
	axelrasmussen@google.com, lstoakes@gmail.com, talumbau@google.com,
	willy@infradead.org, vbabka@suse.cz, mgorman@techsingularity.net,
	jhubbard@nvidia.com, vishal.moola@gmail.com, mathieu.desnoyers@efficios.com,
	dhowells@redhat.com, jgg@ziepe.ca, sidhartha.kumar@oracle.com,
	andriy.shevchenko@linux.intel.com, yangxingui@huawei.com,
	keescook@chromium.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	kernel-team@android.com, surenb@google.com
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

For lockless /proc/pid/maps reading we have to ensure all the fields
used when generating the output are RCU-safe. The only pointer fields
in vm_area_struct which are used to generate that file's output are
vm_file and anon_name. vm_file is RCU-safe but anon_name is not. Make
anon_name RCU-safe as well.

Signed-off-by: Suren Baghdasaryan
---
 include/linux/mm_inline.h | 10 +++++++++-
 include/linux/mm_types.h  |  3 ++-
 mm/madvise.c              | 30 ++++++++++++++++++++++++++----
 3 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index f4fe593c1400..bbdb0ca857f1 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -389,7 +389,7 @@ static inline void dup_anon_vma_name(struct vm_area_struct *orig_vma,
 	struct anon_vma_name *anon_name = anon_vma_name(orig_vma);
 
 	if (anon_name)
-		new_vma->anon_name = anon_vma_name_reuse(anon_name);
+		rcu_assign_pointer(new_vma->anon_name, anon_vma_name_reuse(anon_name));
 }
 
 static inline void free_anon_vma_name(struct vm_area_struct *vma)
@@ -411,6 +411,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
 		!strcmp(anon_name1->name, anon_name2->name);
 }
 
+struct anon_vma_name *anon_vma_name_get_rcu(struct vm_area_struct *vma);
+
 #else /* CONFIG_ANON_VMA_NAME */
 static inline void anon_vma_name_get(struct anon_vma_name *anon_name) {}
 static inline void anon_vma_name_put(struct anon_vma_name *anon_name) {}
@@ -424,6 +426,12 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
 	return true;
 }
 
+static inline
+struct anon_vma_name *anon_vma_name_get_rcu(struct vm_area_struct *vma)
+{
+	return NULL;
+}
+
 #endif /* CONFIG_ANON_VMA_NAME */
 
 static inline void init_tlb_flush_pending(struct mm_struct *mm)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b2d3a88a34d1..1f0a30c00795 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -545,6 +545,7 @@ struct vm_userfaultfd_ctx {};
 
 struct anon_vma_name {
 	struct kref kref;
+	struct rcu_head rcu;
 	/* The name needs to be at the end because it is dynamically sized. */
 	char name[];
 };
@@ -699,7 +700,7 @@ struct vm_area_struct {
 	 * terminated string containing the name given to the vma, or NULL if
 	 * unnamed. Serialized by mmap_lock. Use anon_vma_name to access.
 	 */
-	struct anon_vma_name *anon_name;
+	struct anon_vma_name __rcu *anon_name;
 #endif
 #ifdef CONFIG_SWAP
 	atomic_long_t swap_readahead_info;
diff --git a/mm/madvise.c b/mm/madvise.c
index 912155a94ed5..0f222d464254 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -88,14 +88,15 @@ void anon_vma_name_free(struct kref *kref)
 {
 	struct anon_vma_name *anon_name =
 			container_of(kref, struct anon_vma_name, kref);
-	kfree(anon_name);
+	kfree_rcu(anon_name, rcu);
 }
 
 struct anon_vma_name *anon_vma_name(struct vm_area_struct *vma)
 {
 	mmap_assert_locked(vma->vm_mm);
 
-	return vma->anon_name;
+	return rcu_dereference_protected(vma->anon_name,
+		rwsem_is_locked(&vma->vm_mm->mmap_lock));
 }
 
 /* mmap_lock should be write-locked */
@@ -105,7 +106,7 @@ static int replace_anon_vma_name(struct vm_area_struct *vma,
 	struct anon_vma_name *orig_name = anon_vma_name(vma);
 
 	if (!anon_name) {
-		vma->anon_name = NULL;
+		rcu_assign_pointer(vma->anon_name, NULL);
 		anon_vma_name_put(orig_name);
 		return 0;
 	}
@@ -113,11 +114,32 @@ static int replace_anon_vma_name(struct vm_area_struct *vma,
 	if (anon_vma_name_eq(orig_name, anon_name))
 		return 0;
 
-	vma->anon_name = anon_vma_name_reuse(anon_name);
+	rcu_assign_pointer(vma->anon_name, anon_vma_name_reuse(anon_name));
 	anon_vma_name_put(orig_name);
 
 	return 0;
 }
+
+/*
+ * Returned anon_vma_name is stable due to elevated refcount but not guaranteed
+ * to be assigned to the original VMA after the call.
+ */
+struct anon_vma_name *anon_vma_name_get_rcu(struct vm_area_struct *vma)
+{
+	struct anon_vma_name __rcu *anon_name;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	anon_name = rcu_dereference(vma->anon_name);
+	if (!anon_name)
+		return NULL;
+
+	if (unlikely(!kref_get_unless_zero(&anon_name->kref)))
+		return NULL;
+
+	return anon_name;
+}
+
 #else /* CONFIG_ANON_VMA_NAME */
 static int replace_anon_vma_name(struct vm_area_struct *vma,
 				 struct anon_vma_name *anon_name)
-- 
2.43.0.381.gb435a96ce8-goog

From nobody Tue Dec 16 16:55:47 2025
Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2628A18E33
	for ; Mon, 15 Jan 2024 18:38:45 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--surenb.bounces.google.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="racXoa4T"
Received: by mail-yb1-f202.google.com with SMTP id 3f1490d57ef6-dbe9dacc912so10598302276.2
	for ; Mon, 15 Jan 2024 10:38:45 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=google.com; s=20230601; t=1705343925; x=1705948725; darn=vger.kernel.org;
	h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
	 :date:from:to:cc:subject:date:message-id:reply-to;
	bh=ch6T6Lsnf4fqmqV9A6xFsxD4eG+jCeNRaJgu+LlgQig=;
	b=racXoa4TnsLPy0OFLS+xzftOB6e28+Lm+G2K0Tue45QfDYhuBJN8zPqvCSUGQLIRGB
	 Ex7Z6px6V8c6e9Z+zu1ZevP638iygpxyIbCeKxw+/zA1/m3DWs3ASyzMRqrxxKdRy5d2
	 wDUSFiLSWkykFnGsS1vGxi/zl/FfuDuWQ9fqAKrxd/AzlFxQhDZmsmgQtFs+EA3fu0AC
	 UhGvm4g0XgBnaKtwq1QV/3UFtrDZHaafByjh3e66qJ3jjE7xnQcwWxNoEKsCr8pl5Mbz
	 boLb+9tYXQq2eQtoiY8sAg8FH99i29PfnKLuRpSLH30hqUjWZm6ROwwLw6d7ZAfmClIa
	 zW0Q==
X-Google-DKIM-Signature:
	v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20230601; t=1705343925; x=1705948725;
	h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
	 :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
	bh=ch6T6Lsnf4fqmqV9A6xFsxD4eG+jCeNRaJgu+LlgQig=;
	b=EeQ8lqTg772sP3R347QcFA8htAQMFEDazVdjNl1Sg5nZW2iMH5A2uDcZV3W9FA3v1C
	 VvUEJBYEsXgSad/lJ6dXZaZSo5nj/b7Muk81vpKa7bSwOp8WhJ0AHiFx3qO0VKozEgiJ
	 pK7Kq9lIvq8VI61EBOo8VTkmp2LxGMnuLtFtQjsODYVRUynzGBXGI/gIWM3RCaoKifRn
	 DHWAwUR6UmpgAkI5L6HNseMyRonR6jGbJiGPT6Kyk7k/s8hbyTUDUx1U921rWnVFqXlW
	 awIEktXYA/NSuPJu/soToy2qh8FpTd1U/y80KSk73Gw0nHURPL7uivB+QSUco1vyrABB
	 qWow==
X-Gm-Message-State: AOJu0YxwmOJP+pQc4jGN9nWaImhKuPnCpEX1pQSlZQaFZIUpxyeaXyGA
	U/JKvTz0qXTebkPRAzVCuK48A29s3VC4igLptw==
X-Google-Smtp-Source: AGHT+IGN2VvUUxNx3HDpn+LUP1maSJ793VbScOs1XRHuqJsGhFfM1RNtInfsiW1wKVri/L8s95oaaZe3g/g=
X-Received: from surenb-desktop.mtv.corp.google.com ([2620:15c:211:201:3af2:e48e:2785:270])
	(user=surenb job=sendgmr) by 2002:a05:6902:1364:b0:dbd:7149:a389 with SMTP id
	bt4-20020a056902136400b00dbd7149a389mr269675ybb.11.1705343925137; Mon, 15 Jan 2024 10:38:45 -0800 (PST)
Date: Mon, 15 Jan 2024 10:38:35 -0800
In-Reply-To: <20240115183837.205694-1-surenb@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
Mime-Version: 1.0
References: <20240115183837.205694-1-surenb@google.com>
X-Mailer: git-send-email 2.43.0.381.gb435a96ce8-goog
Message-ID: <20240115183837.205694-3-surenb@google.com>
Subject: [RFC 2/3] seq_file: add validate() operation to seq_operations
From: Suren Baghdasaryan
To: akpm@linux-foundation.org
Cc: viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, dchinner@redhat.com,
	casey@schaufler-ca.com, ben.wolsieffer@hefring.com, paulmck@kernel.org,
	david@redhat.com, avagin@google.com, usama.anjum@collabora.com,
	peterx@redhat.com, hughd@google.com, ryan.roberts@arm.com,
	wangkefeng.wang@huawei.com, Liam.Howlett@Oracle.com, yuzhao@google.com,
	axelrasmussen@google.com, lstoakes@gmail.com, talumbau@google.com,
	willy@infradead.org, vbabka@suse.cz, mgorman@techsingularity.net,
	jhubbard@nvidia.com, vishal.moola@gmail.com, mathieu.desnoyers@efficios.com,
	dhowells@redhat.com, jgg@ziepe.ca, sidhartha.kumar@oracle.com,
	andriy.shevchenko@linux.intel.com, yangxingui@huawei.com,
	keescook@chromium.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	kernel-team@android.com, surenb@google.com
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

seq_file outputs data in chunks, using seq_file.buf as intermediate
storage before outputting the generated data for the current chunk. It
is possible for already buffered data to become stale before it gets
reported. In certain situations it is desirable to regenerate that data
instead of reporting stale data. Provide a validate() operation, called
before outputting the buffered data, to allow users to validate it.
To indicate valid data, the user's validate callback should return 0; to
request regeneration of stale data it should return -EAGAIN. Any other
error is considered fatal and the read operation is aborted.

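A minimal userspace sketch of the return-value contract described above.
It is not the kernel implementation: chunk_ops, read_chunk() and the
demo_* helpers are hypothetical names invented for illustration. The
sketch fills a private buffer, asks validate() whether the data is still
good, regenerates on -EAGAIN and aborts on any other error. As a
simplification, it does not call validate() after a failed fill.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace stand-in for seq_operations; not the kernel API. */
struct chunk_ops {
	/* Generate one chunk into buf; returns bytes written or -errno. */
	int (*fill)(void *ctx, char *buf, size_t size);
	/* 0: data is valid, -EAGAIN: stale, regenerate, other < 0: fatal. */
	int (*validate)(void *ctx);
};

/*
 * Fill a private buffer, then publish it to 'out' only if validate()
 * accepts it. Returns the number of bytes published or a negative errno.
 */
static int read_chunk(const struct chunk_ops *ops, void *ctx,
		      char *out, size_t size)
{
	char buf[128];
	int len, err;

	assert(ops && ops->fill && out);
	if (size > sizeof(buf))
		size = sizeof(buf);

	for (;;) {
		len = ops->fill(ctx, buf, size);
		if (len < 0)
			return len;
		if ((size_t)len > size)
			return -EINVAL;	/* fill() broke its own contract */

		err = ops->validate ? ops->validate(ctx) : 0;
		if (err == 0)
			break;
		if (err != -EAGAIN)
			return err;	/* fatal: abort the read */
		/* stale: drop the buffered data and regenerate it */
	}

	memcpy(out, buf, (size_t)len);
	return len;
}

/* Demo source whose generation changes once while it is being read. */
struct demo_src {
	int generation;	/* bumped by a simulated concurrent writer */
	int seen;	/* generation sampled when the chunk was filled */
	int fills;	/* how many times demo_fill() ran */
};

static int demo_fill(void *ctx, char *buf, size_t size)
{
	struct demo_src *s = ctx;
	int len;

	s->seen = s->generation;
	len = snprintf(buf, size, "gen %d\n", s->seen);
	if (len < 0 || (size_t)len >= size)
		return -ENOSPC;
	if (++s->fills == 1)
		s->generation++;	/* simulated writer racing with the first fill */
	return len;
}

static int demo_validate(void *ctx)
{
	struct demo_src *s = ctx;

	return s->seen == s->generation ? 0 : -EAGAIN;
}
```

With a zero-initialized struct demo_src and ops set to demo_fill and
demo_validate, read_chunk() discards the first fill, because the
generation moved underneath it, and publishes the second.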
Signed-off-by: Suren Baghdasaryan
---
 fs/seq_file.c            | 24 +++++++++++++++++++++++-
 include/linux/seq_file.h |  1 +
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/fs/seq_file.c b/fs/seq_file.c
index f5fdaf3b1572..77833bbe5909 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -172,6 +172,8 @@ ssize_t seq_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct seq_file *m = iocb->ki_filp->private_data;
 	size_t copied = 0;
+	loff_t orig_index;
+	size_t orig_count;
 	size_t n;
 	void *p;
 	int err = 0;
@@ -220,6 +222,10 @@ ssize_t seq_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		if (m->count)	// hadn't managed to copy everything
 			goto Done;
 	}
+
+	orig_index = m->index;
+	orig_count = m->count;
+Again:
 	// get a non-empty record in the buffer
 	m->from = 0;
 	p = m->op->start(m, &m->index);
@@ -278,6 +284,22 @@ ssize_t seq_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		}
 	}
 	m->op->stop(m, p);
+	/* Note: we validate even if err<0 to prevent publishing copied data */
+	if (m->op->validate) {
+		int val_err = m->op->validate(m, p);
+
+		if (val_err) {
+			if (val_err == -EAGAIN) {
+				m->index = orig_index;
+				m->count = orig_count;
+				// data is stale, retry
+				goto Again;
+			}
+			// data is invalid, return the last error
+			err = val_err;
+			goto Done;
+		}
+	}
 	n = copy_to_iter(m->buf, m->count, iter);
 	copied += n;
 	m->count -= n;
@@ -572,7 +594,7 @@ static void single_stop(struct seq_file *p, void *v)
 int single_open(struct file *file, int (*show)(struct seq_file *, void *),
 		void *data)
 {
-	struct seq_operations *op = kmalloc(sizeof(*op), GFP_KERNEL_ACCOUNT);
+	struct seq_operations *op = kzalloc(sizeof(*op), GFP_KERNEL_ACCOUNT);
 	int res = -ENOMEM;
 
 	if (op) {
diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h
index 234bcdb1fba4..d0fefac2990f 100644
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -34,6 +34,7 @@ struct seq_operations {
 	void (*stop) (struct seq_file *m, void *v);
 	void * (*next) (struct seq_file *m, void *v, loff_t *pos);
 	int (*show) (struct seq_file *m, void *v);
+	int (*validate)(struct seq_file *m, void *v);
 };
 
 #define SEQ_SKIP 1
-- 
2.43.0.381.gb435a96ce8-goog

From nobody Tue Dec 16 16:55:47 2025
Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1997118EC8
	for ; Mon, 15 Jan 2024 18:38:47 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--surenb.bounces.google.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="UrYLCeRE"
Received: by mail-yw1-f202.google.com with SMTP id 00721157ae682-5f6c12872fbso143492737b3.1
	for ; Mon, 15 Jan 2024 10:38:47 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=google.com; s=20230601; t=1705343927; x=1705948727; darn=vger.kernel.org;
	h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
	 :date:from:to:cc:subject:date:message-id:reply-to;
	bh=21zH497B8+dg0O/C7EwYJfD0SYQoqr7syeq2JJc4d60=;
	b=UrYLCeREsFnwAxoJC+umUFKYt2DuQ3CNnswQf7APYA2WgQWK5V8XPnb95quE5NoFwE
	 83zrG5dhCYFJxsVLPJCIdm493/BXW7HXchO7TjTw9BCTqQBHsefT6v2e8qVHzj4efQmH
	 1HQOGvPo37nTrw4veIPGkRNtVEWJjYGNwqGA7jIpv/YiZjI2SwCApeEQX9z1V+tJ28j5
	 5nL+WaXh/9aBG3pq2ECFqqQxosXJFi8mzNiJ6AzT9jAlNSngk0Fa+kgsq5GB1TiBBvF7
	 cGFQoFgJmVvr3mtRpSxipoIfXN5Ao6V+4Rbqbgr8Gu6SgNKdb7u1MwQXHwUG1osMDJzG
	 Ol4w==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20230601; t=1705343927; x=1705948727;
	h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
	 :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
	bh=21zH497B8+dg0O/C7EwYJfD0SYQoqr7syeq2JJc4d60=;
	b=p2O8f67DNqJ6dg55+/dVnV8JY1LomGnMevCB3rWq9eGQqGqjeTyA+U2UgAUr0bWoTA
	 Y5vVvecdMIQSXA+w0BI+zgC0gHrqxiaKEWp444ozEnaQD9kwRv+A7h/6pzNXqwbLW11r
	 bsxluiIYKETfI4LD77mh5fvW5QJZDln7IpGDycYHcVA6/531PVc/46mcSgUd9rKo16Db
	 y4HS5YCuv0KKPOBzD7XwWz6fuEJu4/6vthl7gFSPVfyEB8uiHMD/jcMQ4LF9/MHDQjnG
	 Q9mXyeJh1AMFOOt0fEkYJNF0vFQeJQZ5ZISNb05gU3jvwsc8lrKobxYRvV8W766ML8u9
	 fX+w==
X-Gm-Message-State: AOJu0Yw4zETgfTKblwa4JPGR7ltgWHvDfYaETaV1e0U+t/SRKDpin7QQ
	ULZ9wnKDXt51dAI4+5ekqHNsoflz4UX/lEeTMQ==
X-Google-Smtp-Source: AGHT+IFB8/hnPbdwsoiv7tQuwUtxLgPQDLNW02byS+mXpmjbpqM8r5SRgtD+OQwpIXqVBxjlqXtVMNNKQrk=
X-Received: from surenb-desktop.mtv.corp.google.com ([2620:15c:211:201:3af2:e48e:2785:270])
	(user=surenb job=sendgmr) by 2002:a05:690c:805:b0:5fc:4ef9:9d6b with SMTP id
	bx5-20020a05690c080500b005fc4ef99d6bmr2038449ywb.9.1705343927162; Mon, 15 Jan 2024 10:38:47 -0800 (PST)
Date: Mon, 15 Jan 2024 10:38:36 -0800
In-Reply-To: <20240115183837.205694-1-surenb@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
Mime-Version: 1.0
References: <20240115183837.205694-1-surenb@google.com>
X-Mailer: git-send-email 2.43.0.381.gb435a96ce8-goog
Message-ID: <20240115183837.205694-4-surenb@google.com>
Subject: [RFC 3/3] mm/maps: read proc/pid/maps under RCU
From: Suren Baghdasaryan
To: akpm@linux-foundation.org
Cc: viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, dchinner@redhat.com,
	casey@schaufler-ca.com, ben.wolsieffer@hefring.com, paulmck@kernel.org,
	david@redhat.com, avagin@google.com, usama.anjum@collabora.com,
	peterx@redhat.com, hughd@google.com, ryan.roberts@arm.com,
	wangkefeng.wang@huawei.com, Liam.Howlett@Oracle.com, yuzhao@google.com,
	axelrasmussen@google.com, lstoakes@gmail.com, talumbau@google.com,
	willy@infradead.org, vbabka@suse.cz, mgorman@techsingularity.net,
	jhubbard@nvidia.com, vishal.moola@gmail.com, mathieu.desnoyers@efficios.com,
	dhowells@redhat.com, jgg@ziepe.ca, sidhartha.kumar@oracle.com,
	andriy.shevchenko@linux.intel.com, yangxingui@huawei.com,
	keescook@chromium.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	kernel-team@android.com, surenb@google.com
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

With maple_tree supporting vma tree traversal under RCU and per-vma
locks making vma access RCU-safe, /proc/pid/maps can be read under RCU
without the need to read-lock mmap_lock. However, vma content can change
from under us, so we need to pin the pointer fields used when generating
the output (currently only vm_file and anon_name).

In addition, we validate data before publishing it to the user, using
the new seq_file validate interface. This keeps the mechanism consistent
with the previous behavior, where data tearing is possible only at page
boundaries.

This change is designed to reduce mmap_lock contention and to prevent a
process reading /proc/pid/maps files (often a low priority task, such as
a monitoring or data collection service) from blocking address space
updates.

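The snapshot-and-recheck idea described above can be sketched in
userspace C11 as follows. This is an illustration under stated
assumptions, not the kernel code: struct addr_space, struct reader and
the helper names are hypothetical, and lock_seq merely stands in for
mm_lock_seq. The reader samples the counter with acquire ordering before
generating output and compares it again before publishing. Such a check
only notices writers that have already advanced the counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an address space guarded by a sequence counter. */
struct addr_space {
	atomic_int lock_seq;	/* bumped each time a writer finishes an update */
	atomic_int nr_mappings;	/* the data a reader wants to report */
};

/* Per-read state, analogous in spirit to priv->mm_lock_seq in the patch. */
struct reader {
	int seq_snapshot;
};

/* Sample the counter before generating any output. */
static void reader_begin(struct reader *r, struct addr_space *as)
{
	assert(r && as);
	r->seq_snapshot = atomic_load_explicit(&as->lock_seq,
					       memory_order_acquire);
}

/* True if the counter is unchanged, i.e. buffered output may be published. */
static bool reader_validate(const struct reader *r, struct addr_space *as)
{
	return atomic_load_explicit(&as->lock_seq, memory_order_acquire) ==
	       r->seq_snapshot;
}

/* Writer side: change the data, then advance the counter with release order. */
static void writer_update(struct addr_space *as, int nr)
{
	atomic_store_explicit(&as->nr_mappings, nr, memory_order_relaxed);
	atomic_fetch_add_explicit(&as->lock_seq, 1, memory_order_release);
}
```

A caller would pair reader_begin() with reader_validate() around output
generation and start over whenever validation fails, which is the role
-EAGAIN plays in the seq_file validate interface.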
Signed-off-by: Suren Baghdasaryan
---
 fs/proc/internal.h |   3 ++
 fs/proc/task_mmu.c | 130 ++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 120 insertions(+), 13 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index a71ac5379584..47233408550b 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -290,6 +290,9 @@ struct proc_maps_private {
 	struct task_struct *task;
 	struct mm_struct *mm;
 	struct vma_iterator iter;
+	int mm_lock_seq;
+	struct anon_vma_name *anon_name;
+	struct file *vm_file;
 #ifdef CONFIG_NUMA
 	struct mempolicy *task_mempolicy;
 #endif
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 62b16f42d5d2..d4305cfdca58 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -141,6 +141,22 @@ static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
 	return vma;
 }
 
+static const struct seq_operations proc_pid_maps_op;
+
+static inline bool needs_mmap_lock(struct seq_file *m)
+{
+#ifdef CONFIG_PER_VMA_LOCK
+	/*
+	 * smaps and numa_maps perform page table walk, therefore require
+	 * mmap_lock but maps can be read under RCU.
+	 */
+	return m->op != &proc_pid_maps_op;
+#else
+	/* Without per-vma locks VMA access is not RCU-safe */
+	return true;
+#endif
+}
+
 static void *m_start(struct seq_file *m, loff_t *ppos)
 {
 	struct proc_maps_private *priv = m->private;
@@ -162,11 +178,17 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
 		return NULL;
 	}
 
-	if (mmap_read_lock_killable(mm)) {
-		mmput(mm);
-		put_task_struct(priv->task);
-		priv->task = NULL;
-		return ERR_PTR(-EINTR);
+	if (needs_mmap_lock(m)) {
+		if (mmap_read_lock_killable(mm)) {
+			mmput(mm);
+			put_task_struct(priv->task);
+			priv->task = NULL;
+			return ERR_PTR(-EINTR);
+		}
+	} else {
+		/* For memory barrier see the comment for mm_lock_seq in mm_struct */
+		priv->mm_lock_seq = smp_load_acquire(&priv->mm->mm_lock_seq);
+		rcu_read_lock();
 	}
 
 	vma_iter_init(&priv->iter, mm, last_addr);
@@ -195,7 +217,10 @@ static void m_stop(struct seq_file *m, void *v)
 		return;
 
 	release_task_mempolicy(priv);
-	mmap_read_unlock(mm);
+	if (needs_mmap_lock(m))
+		mmap_read_unlock(mm);
+	else
+		rcu_read_unlock();
 	mmput(mm);
 	put_task_struct(priv->task);
 	priv->task = NULL;
@@ -283,8 +308,10 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 	start = vma->vm_start;
 	end = vma->vm_end;
 	show_vma_header_prefix(m, start, end, flags, pgoff, dev, ino);
-	if (mm)
-		anon_name = anon_vma_name(vma);
+	if (mm) {
+		anon_name = needs_mmap_lock(m) ? anon_vma_name(vma) :
+				anon_vma_name_get_rcu(vma);
+	}
 
 	/*
 	 * Print the dentry name for named mappings, and a
@@ -338,19 +365,96 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 		seq_puts(m, name);
 	}
 	seq_putc(m, '\n');
+	if (anon_name && !needs_mmap_lock(m))
+		anon_vma_name_put(anon_name);
+}
+
+/*
+ * Pin vm_area_struct fields used by show_map_vma. We also copy pinned fields
+ * into proc_maps_private because by the time put_vma_fields() is called, VMA
+ * might have changed and these fields might be pointing to different objects.
+ */
+static bool get_vma_fields(struct vm_area_struct *vma, struct proc_maps_private *priv)
+{
+	if (vma->vm_file) {
+		priv->vm_file = get_file_rcu(&vma->vm_file);
+		if (!priv->vm_file)
+			return false;
+
+	} else
+		priv->vm_file = NULL;
+
+	if (vma->anon_name) {
+		priv->anon_name = anon_vma_name_get_rcu(vma);
+		if (!priv->anon_name) {
+			if (priv->vm_file)
+				fput(priv->vm_file);
+
+			return false;
+		}
+	} else
+		priv->anon_name = NULL;
+
+	return true;
+}
+
+static void put_vma_fields(struct proc_maps_private *priv)
+{
+	if (priv->anon_name)
+		anon_vma_name_put(priv->anon_name);
+	if (priv->vm_file)
+		fput(priv->vm_file);
 }
 
 static int show_map(struct seq_file *m, void *v)
 {
-	show_map_vma(m, v);
+	struct proc_maps_private *priv = m->private;
+
+	if (needs_mmap_lock(m))
+		show_map_vma(m, v);
+	else {
+		/*
+		 * Stop immediately if the VMA changed from under us.
+		 * Validation step will prevent publishing already cached data.
+		 */
+		if (!get_vma_fields(v, priv))
+			return -EAGAIN;
+
+		show_map_vma(m, v);
+		put_vma_fields(priv);
+	}
+
 	return 0;
 }
 
+static int validate_map(struct seq_file *m, void *v)
+{
+	if (!needs_mmap_lock(m)) {
+		struct proc_maps_private *priv = m->private;
+		int mm_lock_seq;
+
+		/* For memory barrier see the comment for mm_lock_seq in mm_struct */
+		mm_lock_seq = smp_load_acquire(&priv->mm->mm_lock_seq);
+		if (mm_lock_seq != priv->mm_lock_seq) {
+			/*
+			 * mmap_lock contention is detected. Wait for mmap_lock
+			 * write to be released, discard stale data and retry.
+			 */
+			mmap_read_lock(priv->mm);
+			mmap_read_unlock(priv->mm);
+			return -EAGAIN;
+		}
+	}
+	return 0;
+
+}
+
 static const struct seq_operations proc_pid_maps_op = {
-	.start	= m_start,
-	.next	= m_next,
-	.stop	= m_stop,
-	.show	= show_map
+	.start		= m_start,
+	.next		= m_next,
+	.stop		= m_stop,
+	.show		= show_map,
+	.validate	= validate_map,
 };
 
 static int pid_maps_open(struct inode *inode, struct file *file)
-- 
2.43.0.381.gb435a96ce8-goog
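
The pinning step that get_vma_fields() and anon_vma_name_get_rcu() rely
on, taking a reference only while the object is still live, can be
sketched in userspace C11 as below. This is a rough analogue and not the
kernel's kref or RCU API: struct named_obj, obj_get_unless_zero(),
obj_put() and slot_pin() are hypothetical names. The sketch also assumes
the object's memory stays valid while a reader inspects it, which in the
kernel is what the RCU read-side section plus kfree_rcu() provide.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted object; stands in for struct anon_vma_name. */
struct named_obj {
	atomic_uint refcount;
	const char *name;
};

/* Take a reference only if the object is still live (count > 0). */
static bool obj_get_unless_zero(struct named_obj *obj)
{
	unsigned int old = atomic_load_explicit(&obj->refcount,
						memory_order_relaxed);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current count. */
		if (atomic_compare_exchange_weak_explicit(&obj->refcount,
				&old, old + 1,
				memory_order_acquire, memory_order_relaxed))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the caller dropped the last one. */
static bool obj_put(struct named_obj *obj)
{
	unsigned int old = atomic_fetch_sub_explicit(&obj->refcount, 1,
						     memory_order_acq_rel);

	assert(old > 0);	/* underflow means an unbalanced put */
	return old == 1;
}

/* Pin whatever the slot points to; NULL if it is empty or already dying. */
static struct named_obj *slot_pin(struct named_obj *_Atomic *slot)
{
	struct named_obj *obj = atomic_load_explicit(slot, memory_order_acquire);

	if (!obj || !obj_get_unless_zero(obj))
		return NULL;
	return obj;
}
```

A pinned pointer returned by slot_pin() stays usable until the matching
obj_put(), even if the slot is repointed in the meantime, which mirrors
why the patch copies the pinned fields into proc_maps_private.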