From nobody Sat Feb 7 13:03:43 2026
From: Shakeel Butt
To: Tejun Heo, Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
 Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
 Michal Koutný, Harry Yoo, Yosry Ahmed, bpf@vger.kernel.org,
 linux-mm@kvack.org, cgroups@vger.kernel.org,
 linux-kernel@vger.kernel.org, Meta kernel team
Subject: [PATCH 1/3] cgroup: support to enable nmi-safe css_rstat_updated
Date: Mon, 9 Jun 2025 15:56:09 -0700
Message-ID: <20250609225611.3967338-2-shakeel.butt@linux.dev>
In-Reply-To: <20250609225611.3967338-1-shakeel.butt@linux.dev>
References: <20250609225611.3967338-1-shakeel.butt@linux.dev>

Add necessary infrastructure to enable the nmi-safe execution of
css_rstat_updated(). Currently css_rstat_updated() takes a per-cpu
per-css raw spinlock to add the given css in the per-cpu per-css update
tree. However the kernel can not spin in nmi context, so we need to
remove the spinning on the raw spinlock in css_rstat_updated().

To support lockless css_rstat_updated(), let's add necessary data
structures in the css and ss structures.

Signed-off-by: Shakeel Butt
---
 include/linux/cgroup-defs.h |  4 ++++
 kernel/cgroup/rstat.c       | 23 +++++++++++++++++++++--
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index e61687d5e496..45860fe5dd0c 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -384,6 +384,9 @@ struct css_rstat_cpu {
 	 */
 	struct cgroup_subsys_state *updated_children;
 	struct cgroup_subsys_state *updated_next;	/* NULL if not on the list */
+
+	struct llist_node lnode;		/* lockless list for update */
+	struct cgroup_subsys_state *owner;	/* back pointer */
 };
 
 /*
@@ -822,6 +825,7 @@ struct cgroup_subsys {
 
 	spinlock_t rstat_ss_lock;
 	raw_spinlock_t __percpu *rstat_ss_cpu_lock;
+	struct llist_head __percpu *lhead; /* lockless update list head */
 };
 
 extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index cbeaa499a96a..a5608ae2be27 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -11,6 +11,7 @@
 
 static DEFINE_SPINLOCK(rstat_base_lock);
 static DEFINE_PER_CPU(raw_spinlock_t, rstat_base_cpu_lock);
+static DEFINE_PER_CPU(struct llist_head, rstat_backlog_list);
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
 
@@ -45,6 +46,13 @@ static spinlock_t *ss_rstat_lock(struct cgroup_subsys *ss)
 	return &rstat_base_lock;
 }
 
+static inline struct llist_head *ss_lhead_cpu(struct cgroup_subsys *ss, int cpu)
+{
+	if (ss)
+		return per_cpu_ptr(ss->lhead, cpu);
+	return per_cpu_ptr(&rstat_backlog_list, cpu);
+}
+
 static raw_spinlock_t *ss_rstat_cpu_lock(struct cgroup_subsys *ss, int cpu)
 {
 	if (ss) {
@@ -468,7 +476,8 @@ int css_rstat_init(struct cgroup_subsys_state *css)
 	for_each_possible_cpu(cpu) {
 		struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
 
-		rstatc->updated_children = css;
+		rstatc->owner = rstatc->updated_children = css;
+		init_llist_node(&rstatc->lnode);
 
 		if (is_self) {
 			struct cgroup_rstat_base_cpu *rstatbc;
@@ -532,9 +541,19 @@ int __init ss_rstat_init(struct cgroup_subsys *ss)
 			return -ENOMEM;
 	}
 
+	if (ss) {
+		ss->lhead = alloc_percpu(struct llist_head);
+		if (!ss->lhead) {
+			free_percpu(ss->rstat_ss_cpu_lock);
+			return -ENOMEM;
+		}
+	}
+
 	spin_lock_init(ss_rstat_lock(ss));
-	for_each_possible_cpu(cpu)
+	for_each_possible_cpu(cpu) {
 		raw_spin_lock_init(ss_rstat_cpu_lock(ss, cpu));
+		init_llist_head(ss_lhead_cpu(ss, cpu));
+	}
 
 	return 0;
 }
-- 
2.47.1

From nobody Sat Feb 7 13:03:43 2026
From: Shakeel Butt
To: Tejun Heo, Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
 Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
 Michal Koutný, Harry Yoo, Yosry Ahmed, bpf@vger.kernel.org,
 linux-mm@kvack.org, cgroups@vger.kernel.org,
 linux-kernel@vger.kernel.org, Meta kernel team
Subject: [PATCH 2/3] cgroup: make css_rstat_updated nmi safe
Date: Mon, 9 Jun 2025 15:56:10 -0700
Message-ID: <20250609225611.3967338-3-shakeel.butt@linux.dev>
In-Reply-To: <20250609225611.3967338-1-shakeel.butt@linux.dev>
References: <20250609225611.3967338-1-shakeel.butt@linux.dev>

To make css_rstat_updated() able to safely run in nmi context, let's
move the rstat update tree creation to the flush side and use per-cpu
lockless lists in struct cgroup_subsys to track the css whose stats are
updated on that cpu.

The struct cgroup_subsys_state now has a per-cpu lnode which needs to be
inserted into the corresponding per-cpu lhead of struct cgroup_subsys.
Since we want the insertion to be nmi safe, there can be multiple
inserters on the same cpu for the same lnode. The current llist does not
provide a function to protect against the scenario where multiple
inserters can use the same lnode. So, using llist_add() out of the box
is not safe for this scenario.

However we can protect against multiple inserters using the same lnode
by using the fact that an llist node points to itself when not on the
llist: atomically reset it and select the winner as the single inserter.

Signed-off-by: Shakeel Butt
---
 kernel/cgroup/rstat.c | 57 ++++++++++++++++++++++++++++++++++---------
 1 file changed, 45 insertions(+), 12 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index a5608ae2be27..4fabd7973067 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -138,13 +138,15 @@ void _css_rstat_cpu_unlock(struct cgroup_subsys_state *css, int cpu,
  * @css: target cgroup subsystem state
  * @cpu: cpu on which rstat_cpu was updated
  *
- * @css's rstat_cpu on @cpu was updated. Put it on the parent's matching
- * rstat_cpu->updated_children list. See the comment on top of
- * css_rstat_cpu definition for details.
+ * Atomically inserts the css in the ss's llist for the given cpu. This is nmi
+ * safe. The ss's llist will be processed at the flush time to create the update
+ * tree.
  */
 __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 {
-	unsigned long flags;
+	struct llist_head *lhead = ss_lhead_cpu(css->ss, cpu);
+	struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
+	struct llist_node *self;
 
 	/*
 	 * Since bpf programs can call this function, prevent access to
@@ -153,19 +155,37 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 	if (!css_uses_rstat(css))
 		return;
 
+	lockdep_assert_preemption_disabled();
+
+	/*
+	 * For arch that does not support nmi safe cmpxchg, we ignore the
+	 * requests from nmi context for rstat update llist additions.
+	 */
+	if (!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) && in_nmi())
+		return;
+
+	/* If already on list return. */
+	if (llist_on_list(&rstatc->lnode))
+		return;
+
 	/*
-	 * Speculative already-on-list test. This may race leading to
-	 * temporary inaccuracies, which is fine.
+	 * Make sure only one insert request can proceed on this cpu for this
+	 * specific lnode and thus this needs to be safe against irqs and nmis.
 	 *
-	 * Because @parent's updated_children is terminated with @parent
-	 * instead of NULL, we can tell whether @css is on the list by
-	 * testing the next pointer for NULL.
+	 * Please note that llist_add() does not protect against multiple
+	 * inserters for the same lnode. We use the fact that lnode points to
+	 * itself when not on a list and then atomically set it to NULL to
+	 * select the single inserter.
 	 */
-	if (data_race(css_rstat_cpu(css, cpu)->updated_next))
+	self = &rstatc->lnode;
+	if (!try_cmpxchg(&(rstatc->lnode.next), &self, NULL))
 		return;
 
-	flags = _css_rstat_cpu_lock(css, cpu, true);
+	llist_add(&rstatc->lnode, lhead);
+}
 
+static void __css_process_update_tree(struct cgroup_subsys_state *css, int cpu)
+{
 	/* put @css and all ancestors on the corresponding updated lists */
 	while (true) {
 		struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
@@ -191,8 +211,19 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 
 		css = parent;
 	}
+}
+
+static void css_process_update_tree(struct cgroup_subsys *ss, int cpu)
+{
+	struct llist_head *lhead = ss_lhead_cpu(ss, cpu);
+	struct llist_node *lnode;
+
+	while ((lnode = llist_del_first_init(lhead))) {
+		struct css_rstat_cpu *rstatc;
 
-	_css_rstat_cpu_unlock(css, cpu, flags, true);
+		rstatc = container_of(lnode, struct css_rstat_cpu, lnode);
+		__css_process_update_tree(rstatc->owner, cpu);
+	}
 }
 
 /**
@@ -300,6 +331,8 @@ static struct cgroup_subsys_state *css_rstat_updated_list(
 
 	flags = _css_rstat_cpu_lock(root, cpu, false);
 
+	css_process_update_tree(root->ss, cpu);
+
 	/* Return NULL if this subtree is not on-list */
 	if (!rstatc->updated_next)
 		goto unlock_ret;
-- 
2.47.1

From nobody Sat Feb 7 13:03:43 2026
From: Shakeel Butt
To: Tejun Heo, Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
 Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
 Michal Koutný, Harry Yoo, Yosry Ahmed, bpf@vger.kernel.org,
 linux-mm@kvack.org, cgroups@vger.kernel.org,
 linux-kernel@vger.kernel.org, Meta kernel team
Subject: [PATCH 3/3] memcg: cgroup: call memcg_rstat_updated irrespective of in_nmi()
Date: Mon, 9 Jun 2025 15:56:11 -0700
Message-ID: <20250609225611.3967338-4-shakeel.butt@linux.dev>
In-Reply-To: <20250609225611.3967338-1-shakeel.butt@linux.dev>
References: <20250609225611.3967338-1-shakeel.butt@linux.dev>

css_rstat_updated() is nmi safe, so there is no need to avoid it in
in_nmi(), so remove the check.

Signed-off-by: Shakeel Butt
---
 mm/memcontrol.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 902da8a9c643..d122bfe33e98 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -573,9 +573,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
 	if (!val)
 		return;
 
-	/* TODO: add to cgroup update tree once it is nmi-safe. */
-	if (!in_nmi())
-		css_rstat_updated(&memcg->css, cpu);
+	css_rstat_updated(&memcg->css, cpu);
 	statc_pcpu = memcg->vmstats_percpu;
 	for (; statc_pcpu; statc_pcpu = statc->parent_pcpu) {
 		statc = this_cpu_ptr(statc_pcpu);
@@ -2530,7 +2528,8 @@ static inline void account_slab_nmi_safe(struct mem_cgroup *memcg,
 	} else {
 		struct mem_cgroup_per_node *pn = memcg->nodeinfo[pgdat->node_id];
 
-		/* TODO: add to cgroup update tree once it is nmi-safe. */
+		/* preemption is disabled in_nmi(). */
+		css_rstat_updated(&memcg->css, smp_processor_id());
 		if (idx == NR_SLAB_RECLAIMABLE_B)
 			atomic_add(nr, &pn->slab_reclaimable);
 		else
@@ -2753,7 +2752,8 @@ static inline void account_kmem_nmi_safe(struct mem_cgroup *memcg, int val)
 	if (likely(!in_nmi())) {
 		mod_memcg_state(memcg, MEMCG_KMEM, val);
 	} else {
-		/* TODO: add to cgroup update tree once it is nmi-safe. */
+		/* preemption is disabled in_nmi(). */
+		css_rstat_updated(&memcg->css, smp_processor_id());
 		atomic_add(val, &memcg->kmem_stat);
 	}
 }
-- 
2.47.1
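
For readers following the series, the single-inserter selection described in patch 2/3 can be sketched in userspace with C11 atomics. This is only an illustration under stated assumptions, not the kernel implementation: `struct node`, `struct lhead`, `list_init()`, `node_init()`, `mark_updated()` and `pop_first_init()` are hypothetical stand-ins for `llist_node`, `llist_head`, `init_llist_node()`, the body of `css_rstat_updated()` and `llist_del_first_init()`, and `atomic_compare_exchange_strong()` is swapped in for the kernel's `try_cmpxchg()`. The flush side is assumed to have a single consumer at a time, as it does under the per-cpu lock in the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for llist_node plus the owner back pointer. */
struct node {
	_Atomic(struct node *) next;	/* points to itself while off-list */
	int owner_id;			/* stand-in for the css back pointer */
};

/* Hypothetical userspace stand-in for llist_head. */
struct lhead {
	_Atomic(struct node *) first;
};

static void list_init(struct lhead *h)
{
	atomic_store(&h->first, (struct node *)NULL);
}

static void node_init(struct node *n, int owner_id)
{
	atomic_store(&n->next, n);	/* self-pointing means "not on a list" */
	n->owner_id = owner_id;
}

/* Update side: returns true only for the caller that queued the node. */
static bool mark_updated(struct lhead *h, struct node *n)
{
	struct node *self = n;
	struct node *first;

	/* Cheap check: already queued, or another context is queueing it. */
	if (atomic_load(&n->next) != n)
		return false;

	/* Claim the node: self -> NULL. Exactly one contender can win. */
	if (!atomic_compare_exchange_strong(&n->next, &self, (struct node *)NULL))
		return false;

	/* Only the winner gets here, so nobody else touches n->next. */
	assert(atomic_load(&n->next) == NULL);

	/* Lock-free push onto the head of the list. */
	first = atomic_load(&h->first);
	do {
		atomic_store(&n->next, first);
	} while (!atomic_compare_exchange_weak(&h->first, &first, n));

	return true;
}

/* Flush side (single consumer assumed): pop one node and re-arm it. */
static struct node *pop_first_init(struct lhead *h)
{
	struct node *n = atomic_load(&h->first);
	struct node *next;

	do {
		if (!n)
			return NULL;
		next = atomic_load(&n->next);
	} while (!atomic_compare_exchange_weak(&h->first, &n, next));

	atomic_store(&n->next, n);	/* back to self-pointing: insertable again */
	return n;
}
```

The point of the claim step is that a second call for the same node, for example from an interrupt that lands between the claim and the push, sees `next != n` or loses the compare-and-exchange and returns without touching the list. The node becomes insertable again only after the flush side pops it and restores the self pointer.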