From nobody Wed Dec 17 10:56:03 2025
Date: Thu, 31 Aug 2023 16:56:08 +0000
In-Reply-To: <20230831165611.2610118-1-yosryahmed@google.com>
Message-ID: <20230831165611.2610118-2-yosryahmed@google.com>
Subject: [PATCH v4 1/4] mm: memcg: properly name and document unified stats flushing
From: Yosry Ahmed
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
 Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long, linux-mm@kvack.org,
 cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Yosry Ahmed
X-Mailing-List: linux-kernel@vger.kernel.org

Most contexts that flush memcg stats use "unified" flushing, where
basically all flushers attempt to flush the entire hierarchy, but only
one flusher is allowed at a time, others skip flushing.
This is needed because we need to flush the stats from paths such as
reclaim or refaults, which may have high concurrency, especially on
large systems. Serializing such performance-sensitive paths can
introduce regressions, hence, unified flushing offers a tradeoff between
stats staleness and the performance impact of flushing stats.

Document this properly and explicitly by renaming the common flushing
helper from do_flush_stats() to do_unified_stats_flush(), and adding
documentation to describe unified flushing. Additionally, rename
flushing APIs to add "try" in the name, which implies that flushing will
not always happen. Also add proper documentation.

No functional change intended.

Signed-off-by: Yosry Ahmed
Acked-by: Michal Hocko
Acked-by: Waiman Long
---
 include/linux/memcontrol.h |  8 ++---
 mm/memcontrol.c            | 61 +++++++++++++++++++++++++--------------
 mm/vmscan.c                |  2 +-
 mm/workingset.c            |  4 +--
 4 files changed, 47 insertions(+), 28 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 11810a2cfd2d..d517b0cc5221 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1034,8 +1034,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return x;
 }
 
-void mem_cgroup_flush_stats(void);
-void mem_cgroup_flush_stats_ratelimited(void);
+void mem_cgroup_try_flush_stats(void);
+void mem_cgroup_try_flush_stats_ratelimited(void);
 
 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			      int val);
@@ -1519,11 +1519,11 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return node_page_state(lruvec_pgdat(lruvec), idx);
 }
 
-static inline void mem_cgroup_flush_stats(void)
+static inline void mem_cgroup_try_flush_stats(void)
 {
 }
 
-static inline void mem_cgroup_flush_stats_ratelimited(void)
+static inline void mem_cgroup_try_flush_stats_ratelimited(void)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cf57fe9318d5..2d0ec828a1c4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -588,7 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
 static DEFINE_PER_CPU(unsigned int, stats_updates);
-static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
+static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
 static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
 static u64 flush_next_time;
 
@@ -630,7 +630,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 		/*
 		 * If stats_flush_threshold exceeds the threshold
 		 * (>num_online_cpus()), cgroup stats update will be triggered
-		 * in __mem_cgroup_flush_stats(). Increasing this var further
+		 * in mem_cgroup_try_flush_stats(). Increasing this var further
 		 * is redundant and simply adds overhead in atomic update.
 		 */
 		if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
@@ -639,15 +639,19 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	}
 }
 
-static void do_flush_stats(void)
+/*
+ * do_unified_stats_flush - do a unified flush of memory cgroup statistics
+ *
+ * A unified flush tries to flush the entire hierarchy, but skips if there is
+ * another ongoing flush. This is meant for flushers that may have a lot of
+ * concurrency (e.g. reclaim, refault, etc), and should not be serialized to
+ * avoid slowing down performance-sensitive paths. A unified flush may skip, and
+ * hence may yield stale stats.
+ */
+static void do_unified_stats_flush(void)
 {
-	/*
-	 * We always flush the entire tree, so concurrent flushers can just
-	 * skip. This avoids a thundering herd problem on the rstat global lock
-	 * from memcg flushers (e.g. reclaim, refault, etc).
-	 */
-	if (atomic_read(&stats_flush_ongoing) ||
-	    atomic_xchg(&stats_flush_ongoing, 1))
+	if (atomic_read(&stats_unified_flush_ongoing) ||
+	    atomic_xchg(&stats_unified_flush_ongoing, 1))
 		return;
 
 	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
@@ -655,19 +659,34 @@ static void do_flush_stats(void)
 	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
 
 	atomic_set(&stats_flush_threshold, 0);
-	atomic_set(&stats_flush_ongoing, 0);
+	atomic_set(&stats_unified_flush_ongoing, 0);
 }
 
-void mem_cgroup_flush_stats(void)
+/*
+ * mem_cgroup_try_flush_stats - try to flush memory cgroup statistics
+ *
+ * Try to flush the stats of all memcgs that have stat updates since the last
+ * flush. We do not flush the stats if:
+ * - The magnitude of the pending updates is below a certain threshold.
+ * - There is another ongoing unified flush (see do_unified_stats_flush()).
+ *
+ * Hence, the stats may be stale, but ideally by less than FLUSH_TIME due to
+ * periodic flushing.
+ */
+void mem_cgroup_try_flush_stats(void)
 {
 	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
-		do_flush_stats();
+		do_unified_stats_flush();
 }
 
-void mem_cgroup_flush_stats_ratelimited(void)
+/*
+ * Like mem_cgroup_try_flush_stats(), but only flushes if the periodic flusher
+ * is late.
+ */
+void mem_cgroup_try_flush_stats_ratelimited(void)
 {
 	if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
-		mem_cgroup_flush_stats();
+		mem_cgroup_try_flush_stats();
 }
 
 static void flush_memcg_stats_dwork(struct work_struct *w)
@@ -676,7 +695,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
 	 * Always flush here so that flushing in latency-sensitive paths is
 	 * as cheap as possible.
 	 */
-	do_flush_stats();
+	do_unified_stats_flush();
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }
 
@@ -1576,7 +1595,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	 *
 	 * Current memory state:
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -4018,7 +4037,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
@@ -4093,7 +4112,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
@@ -4595,7 +4614,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
@@ -6610,7 +6629,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
 	int i;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		int nid;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c7c149cb8d66..457a18921fda 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2923,7 +2923,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
 	 * Flush the memory cgroup stats, so that we read accurate per-memcg
 	 * lruvec stats for heuristics.
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	/*
 	 * Determine the scan balance between anon and file LRUs.
diff --git a/mm/workingset.c b/mm/workingset.c
index da58a26d0d4d..affb8699e58d 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -520,7 +520,7 @@ void workingset_refault(struct folio *folio, void *shadow)
 	}
 
 	/* Flush stats (and potentially sleep) before holding RCU read lock */
-	mem_cgroup_flush_stats_ratelimited();
+	mem_cgroup_try_flush_stats_ratelimited();
 
 	rcu_read_lock();
 
@@ -664,7 +664,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 		struct lruvec *lruvec;
 		int i;
 
-		mem_cgroup_flush_stats();
+		mem_cgroup_try_flush_stats();
 		lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
 		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 			pages += lruvec_page_state_local(lruvec,
-- 
2.42.0.rc2.253.gd59a3bf2b4-goog

From nobody Wed Dec 17 10:56:03 2025
Date: Thu, 31 Aug 2023 16:56:09 +0000
In-Reply-To: <20230831165611.2610118-1-yosryahmed@google.com>
Message-ID: <20230831165611.2610118-3-yosryahmed@google.com>
Subject: [PATCH v4 2/4] mm: memcg: add a helper for non-unified stats flushing
From: Yosry Ahmed
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
 Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long, linux-mm@kvack.org,
 cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Yosry Ahmed

Some contexts flush memcg stats outside of unified flushing, directly
using cgroup_rstat_flush(). Add a helper for non-unified flushing, a
counterpart for do_unified_stats_flush(), and use it in those contexts,
as well as in do_unified_stats_flush() itself.

This abstracts the rstat API and makes it easy to introduce
modifications to either unified or non-unified flushing functions
without changing callers.

No functional change intended.

Signed-off-by: Yosry Ahmed
Acked-by: Michal Hocko
Acked-by: Waiman Long
---
 mm/memcontrol.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d0ec828a1c4..8c046feeaae7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -639,6 +639,17 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	}
 }
 
+/*
+ * do_stats_flush - do a flush of the memory cgroup statistics
+ * @memcg: memory cgroup to flush
+ *
+ * Only flushes the subtree of @memcg, does not skip under any conditions.
+ */
+static void do_stats_flush(struct mem_cgroup *memcg)
+{
+	cgroup_rstat_flush(memcg->css.cgroup);
+}
+
 /*
  * do_unified_stats_flush - do a unified flush of memory cgroup statistics
  *
@@ -656,7 +667,7 @@ static void do_unified_stats_flush(void)
 
 	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
 
-	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
+	do_stats_flush(root_mem_cgroup);
 
 	atomic_set(&stats_flush_threshold, 0);
 	atomic_set(&stats_unified_flush_ongoing, 0);
@@ -7790,7 +7801,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
 			break;
 		}
 
-		cgroup_rstat_flush(memcg->css.cgroup);
+		do_stats_flush(memcg);
 		pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
 		if (pages < max)
 			continue;
@@ -7855,8 +7866,10 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
 static u64 zswap_current_read(struct cgroup_subsys_state *css,
 			      struct cftype *cft)
 {
-	cgroup_rstat_flush(css->cgroup);
-	return memcg_page_state(mem_cgroup_from_css(css), MEMCG_ZSWAP_B);
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	do_stats_flush(memcg);
+	return memcg_page_state(memcg, MEMCG_ZSWAP_B);
 }
 
 static int zswap_max_show(struct seq_file *m, void *v)
-- 
2.42.0.rc2.253.gd59a3bf2b4-goog

From nobody Wed Dec 17 10:56:03 2025
Date: Thu, 31 Aug 2023 16:56:10 +0000
In-Reply-To: <20230831165611.2610118-1-yosryahmed@google.com>
Message-ID: <20230831165611.2610118-4-yosryahmed@google.com>
Subject: [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes
From: Yosry Ahmed
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
 Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long, linux-mm@kvack.org,
 cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Yosry Ahmed

Unified flushing of memcg stats keeps track of the magnitude of pending
updates, and only allows a flush if that magnitude exceeds a threshold.
It also keeps track of the time at which ratelimited flushing should be
allowed as flush_next_time.

A non-unified flush on the root memcg has the same effect as a unified
flush, so let it help unified flushing by resetting pending updates and
kicking flush_next_time forward. Move the logic into the common
do_stats_flush() helper, and do it for all root flushes, unified or
not.

There is a subtle change here: we reset stats_flush_threshold before a
flush rather than after a flush. This is probably okay because:

(a) For flushers: only unified flushers check stats_flush_threshold, and
those flushers skip anyway if there is another unified flush ongoing.
Having them also skip if there is an ongoing non-unified root flush is
actually more consistent.

(b) For updaters: Resetting stats_flush_threshold early may lead to more
atomic updates of stats_flush_threshold, as we start updating it
earlier. This should not be significant in practice because we stop
updating stats_flush_threshold when it reaches the threshold anyway. If
we start early and stop early, the number of atomic updates remains the
same. The only difference is the scenario where we reset
stats_flush_threshold early, start doing atomic updates early, and then
the periodic flusher kicks in before we reach the threshold. In this
case, we will have done more atomic updates. However, since the
threshold wasn't reached, we did not do a lot of updates anyway.

Suggested-by: Michal Koutný
Signed-off-by: Yosry Ahmed
Acked-by: Waiman Long
---
 mm/memcontrol.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8c046feeaae7..94d5a6751a9e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -647,6 +647,11 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
  */
 static void do_stats_flush(struct mem_cgroup *memcg)
 {
+	/* for unified flushing, root non-unified flushing can help as well */
+	if (mem_cgroup_is_root(memcg)) {
+		WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
+		atomic_set(&stats_flush_threshold, 0);
+	}
 	cgroup_rstat_flush(memcg->css.cgroup);
 }
 
@@ -665,11 +670,8 @@ static void do_unified_stats_flush(void)
 	    atomic_xchg(&stats_unified_flush_ongoing, 1))
 		return;
 
-	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
-
 	do_stats_flush(root_mem_cgroup);
 
-	atomic_set(&stats_flush_threshold, 0);
 	atomic_set(&stats_unified_flush_ongoing, 0);
 }
 
-- 
2.42.0.rc2.253.gd59a3bf2b4-goog

From nobody Wed Dec 17 10:56:03 2025
Date: Thu, 31 Aug 2023 16:56:11 +0000
In-Reply-To: <20230831165611.2610118-1-yosryahmed@google.com>
Message-ID: <20230831165611.2610118-5-yosryahmed@google.com>
Subject: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
From: Yosry Ahmed
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
 Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long, linux-mm@kvack.org,
 cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Yosry Ahmed

Unified flushing allows for great concurrency for paths that attempt to
flush the stats, at the expense of potential staleness and a single
flusher paying the extra cost of flushing the full tree. This tradeoff
makes sense for in-kernel flushers that may observe high concurrency
(e.g. reclaim, refault). For userspace readers, stale stats may be
unexpected and problematic, especially when such stats are used for
critical paths such as userspace OOM handling. Additionally, a userspace
reader will occasionally pay the cost of flushing the entire hierarchy,
which also causes problems in some cases [1].

Opt userspace reads out of unified flushing. This makes the cost of
reading the stats more predictable (proportional to the size of the
subtree), as well as the freshness of the stats. Userspace readers are
not expected to have similar concurrency to in-kernel flushers, so
serializing them among themselves and among in-kernel flushers should be
okay. Nonetheless, for extra safety, introduce a mutex when flushing for
userspace readers to make sure only a single userspace reader can
compete with in-kernel flushers at a time. This takes away userspace's
ability to directly influence or hurt in-kernel lock contention.

An alternative is to remove flushing from the stats reading path
completely, and rely on the periodic flusher. This should be accompanied
by making the periodic flushing period tunable, and providing an
interface for userspace to force a flush, following a similar model to
/proc/vmstat. However, such a change will be hard to reverse if the
implementation needs to be changed because:
- The cost of reading stats will be very cheap and we won't be able to
  take that back easily.
- There are user-visible interfaces involved.

Hence, let's go with the change that's most reversible first and revisit
as needed.

This was tested on a machine with 256 cpus by running a synthetic test
script [2] that creates 50 top-level cgroups, each with 5 children (250
leaf cgroups). Each leaf cgroup has 10 processes running that allocate
memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
unified flusher). Concurrently, one thread is spawned per-cgroup to read
the stats every second (including root, top-level, and leaf cgroups --
so total 251 threads).

No significant regressions were observed in the total run time, which
means that userspace readers are not significantly affecting in-kernel
flushers:

Base (mm-unstable):

real	0m22.500s
user	0m9.399s
sys	73m41.381s

real	0m22.749s
user	0m15.648s
sys	73m13.113s

real	0m22.466s
user	0m10.000s
sys	73m11.933s

With this patch:

real	0m23.092s
user	0m10.110s
sys	75m42.774s

real	0m22.277s
user	0m10.443s
sys	72m7.182s

real	0m24.127s
user	0m12.617s
sys	78m52.765s

[1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
[2]https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/

Signed-off-by: Yosry Ahmed
Acked-by: Michal Hocko
Acked-by: Waiman Long
---
 mm/memcontrol.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 94d5a6751a9e..46a7abf71c73 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -588,6 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
 static DEFINE_PER_CPU(unsigned int, stats_updates);
+static DEFINE_MUTEX(stats_user_flush_mutex);
 static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
 static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
 static u64 flush_next_time;
@@ -655,6 +656,21 @@ static void do_stats_flush(struct mem_cgroup *memcg)
 	cgroup_rstat_flush(memcg->css.cgroup);
 }
 
+/*
+ * mem_cgroup_user_flush_stats - do a stats flush for a user read
+ * @memcg: memory cgroup to flush
+ *
+ * Flush the subtree of @memcg. A mutex is used for userspace readers to gate
+ * the global rstat spinlock. This protects in-kernel flushers from userspace
+ * readers hogging the lock.
+ */
+static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
+{
+	mutex_lock(&stats_user_flush_mutex);
+	do_stats_flush(memcg);
+	mutex_unlock(&stats_user_flush_mutex);
+}
+
 /*
  * do_unified_stats_flush - do a unified flush of memory cgroup statistics
  *
@@ -1608,7 +1624,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	 *
 	 * Current memory state:
 	 */
-	mem_cgroup_try_flush_stats();
+	mem_cgroup_user_flush_stats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -4050,7 +4066,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_try_flush_stats();
+	mem_cgroup_user_flush_stats(memcg);
 
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
@@ -4125,7 +4141,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
-	mem_cgroup_try_flush_stats();
+	mem_cgroup_user_flush_stats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
@@ -6642,7 +6658,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
 	int i;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_try_flush_stats();
+	mem_cgroup_user_flush_stats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		int nid;
-- 
2.42.0.rc2.253.gd59a3bf2b4-goog