From nobody Wed Dec 17 10:57:23 2025
From: Yosry Ahmed
Date: Mon, 21 Aug 2023 20:54:56 +0000
Subject: [PATCH 1/3] mm: memcg: properly name and document unified stats flushing
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Yosry Ahmed
Message-ID: <20230821205458.1764662-2-yosryahmed@google.com>
In-Reply-To: <20230821205458.1764662-1-yosryahmed@google.com>
References: <20230821205458.1764662-1-yosryahmed@google.com>

Most contexts that flush memcg stats use "unified" flushing, where
basically all flushers attempt to flush the entire hierarchy, but only
one flusher is allowed at a time, others skip flushing.

This is needed because we need to flush the stats from paths such as
reclaim or refaults, which may have high concurrency, especially on
large systems.
Serializing such performance-sensitive paths can introduce regressions,
hence, unified flushing offers a tradeoff between stats staleness and
the performance impact of flushing stats.

Document this properly and explicitly by renaming the common flushing
helper from do_flush_stats() to do_unified_stats_flush(), and adding
documentation to describe unified flushing. Additionally, rename
flushing APIs to add "try" in the name, which implies that flushing will
not always happen. Also add proper documentation.

No functional change intended.

Signed-off-by: Yosry Ahmed
---
 include/linux/memcontrol.h |  8 +++---
 mm/memcontrol.c            | 53 ++++++++++++++++++++++++++------------
 mm/vmscan.c                |  2 +-
 mm/workingset.c            |  4 +--
 4 files changed, 43 insertions(+), 24 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 11810a2cfd2d..d517b0cc5221 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1034,8 +1034,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return x;
 }
 
-void mem_cgroup_flush_stats(void);
-void mem_cgroup_flush_stats_ratelimited(void);
+void mem_cgroup_try_flush_stats(void);
+void mem_cgroup_try_flush_stats_ratelimited(void);
 
 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			      int val);
@@ -1519,11 +1519,11 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return node_page_state(lruvec_pgdat(lruvec), idx);
 }
 
-static inline void mem_cgroup_flush_stats(void)
+static inline void mem_cgroup_try_flush_stats(void)
 {
 }
 
-static inline void mem_cgroup_flush_stats_ratelimited(void)
+static inline void mem_cgroup_try_flush_stats_ratelimited(void)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cf57fe9318d5..c6150ea54d48 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -630,7 +630,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 		/*
 		 * If stats_flush_threshold exceeds the threshold
 		 * (>num_online_cpus()), cgroup stats update will be triggered
-		 * in __mem_cgroup_flush_stats(). Increasing this var further
+		 * in mem_cgroup_try_flush_stats(). Increasing this var further
 		 * is redundant and simply adds overhead in atomic update.
 		 */
 		if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
@@ -639,13 +639,17 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	}
 }
 
-static void do_flush_stats(void)
+/*
+ * do_unified_stats_flush - do a unified flush of memory cgroup statistics
+ *
+ * A unified flush tries to flush the entire hierarchy, but skips if there is
+ * another ongoing flush. This is meant for flushers that may have a lot of
+ * concurrency (e.g. reclaim, refault, etc), and should not be serialized to
+ * avoid slowing down performance-sensitive paths. A unified flush may skip, and
+ * hence may yield stale stats.
+ */
+static void do_unified_stats_flush(void)
 {
-	/*
-	 * We always flush the entire tree, so concurrent flushers can just
-	 * skip. This avoids a thundering herd problem on the rstat global lock
-	 * from memcg flushers (e.g. reclaim, refault, etc).
-	 */
 	if (atomic_read(&stats_flush_ongoing) ||
 	    atomic_xchg(&stats_flush_ongoing, 1))
 		return;
@@ -658,16 +662,31 @@ static void do_flush_stats(void)
 	atomic_set(&stats_flush_ongoing, 0);
 }
 
-void mem_cgroup_flush_stats(void)
+/*
+ * mem_cgroup_try_flush_stats - try to flush memory cgroup statistics
+ *
+ * Try to flush the stats of all memcgs that have stat updates since the last
+ * flush. We do not flush the stats if:
+ * - The magnitude of the pending updates is below a certain threshold.
+ * - There is another ongoing unified flush (see do_unified_stats_flush()).
+ *
+ * Hence, the stats may be stale, but ideally by less than FLUSH_TIME due to
+ * periodic flushing.
+ */
+void mem_cgroup_try_flush_stats(void)
 {
 	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
-		do_flush_stats();
+		do_unified_stats_flush();
 }
 
-void mem_cgroup_flush_stats_ratelimited(void)
+/*
+ * Like mem_cgroup_try_flush_stats(), but only flushes if the periodic flusher
+ * is late.
+ */
+void mem_cgroup_try_flush_stats_ratelimited(void)
 {
 	if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
-		mem_cgroup_flush_stats();
+		mem_cgroup_try_flush_stats();
 }
 
 static void flush_memcg_stats_dwork(struct work_struct *w)
@@ -676,7 +695,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
 	 * Always flush here so that flushing in latency-sensitive paths is
 	 * as cheap as possible.
 	 */
-	do_flush_stats();
+	do_unified_stats_flush();
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }
 
@@ -1576,7 +1595,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	 *
 	 * Current memory state:
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -4018,7 +4037,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
@@ -4093,7 +4112,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
@@ -4595,7 +4614,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
@@ -6610,7 +6629,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
 	int i;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		int nid;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c7c149cb8d66..457a18921fda 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2923,7 +2923,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
 	 * Flush the memory cgroup stats, so that we read accurate per-memcg
 	 * lruvec stats for heuristics.
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	/*
 	 * Determine the scan balance between anon and file LRUs.
diff --git a/mm/workingset.c b/mm/workingset.c
index da58a26d0d4d..affb8699e58d 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -520,7 +520,7 @@ void workingset_refault(struct folio *folio, void *shadow)
 	}
 
 	/* Flush stats (and potentially sleep) before holding RCU read lock */
-	mem_cgroup_flush_stats_ratelimited();
+	mem_cgroup_try_flush_stats_ratelimited();
 
 	rcu_read_lock();
 
@@ -664,7 +664,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 	struct lruvec *lruvec;
 	int i;
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 	lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
 	for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 		pages += lruvec_page_state_local(lruvec,
-- 
2.42.0.rc1.204.g551eb34607-goog

From nobody Wed Dec 17 10:57:23 2025
From: Yosry Ahmed
Date: Mon, 21 Aug 2023 20:54:57 +0000
Subject: [PATCH 2/3] mm: memcg: add a helper for non-unified stats flushing
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Yosry Ahmed
Message-ID: <20230821205458.1764662-3-yosryahmed@google.com>
In-Reply-To: <20230821205458.1764662-1-yosryahmed@google.com>
References: <20230821205458.1764662-1-yosryahmed@google.com>

Some contexts flush memcg stats outside of unified flushing, directly
using cgroup_rstat_flush(). Add a helper for non-unified flushing, a
counterpart for do_unified_stats_flush(), and use it in those contexts,
as well as in do_unified_stats_flush() itself.

This abstracts the rstat API and makes it easy to introduce
modifications to either unified or non-unified flushing functions
without changing callers.

No functional change intended.

Signed-off-by: Yosry Ahmed
---
 mm/memcontrol.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c6150ea54d48..90f08b35fa77 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -639,6 +639,17 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	}
 }
 
+/*
+ * do_stats_flush - do a flush of the memory cgroup statistics
+ * @memcg: memory cgroup to flush
+ *
+ * Only flushes the subtree of @memcg, does not skip under any conditions.
+ */
+static void do_stats_flush(struct mem_cgroup *memcg)
+{
+	cgroup_rstat_flush(memcg->css.cgroup);
+}
+
 /*
  * do_unified_stats_flush - do a unified flush of memory cgroup statistics
  *
@@ -656,7 +667,7 @@ static void do_unified_stats_flush(void)
 
 	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
 
-	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
+	do_stats_flush(root_mem_cgroup);
 
 	atomic_set(&stats_flush_threshold, 0);
 	atomic_set(&stats_flush_ongoing, 0);
@@ -7790,7 +7801,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
 			break;
 		}
 
-		cgroup_rstat_flush(memcg->css.cgroup);
+		do_stats_flush(memcg);
 		pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
 		if (pages < max)
 			continue;
@@ -7855,8 +7866,10 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
 static u64 zswap_current_read(struct cgroup_subsys_state *css,
 			      struct cftype *cft)
 {
-	cgroup_rstat_flush(css->cgroup);
-	return memcg_page_state(mem_cgroup_from_css(css), MEMCG_ZSWAP_B);
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	do_stats_flush(memcg);
+	return memcg_page_state(memcg, MEMCG_ZSWAP_B);
 }
 
 static int zswap_max_show(struct seq_file *m, void *v)
-- 
2.42.0.rc1.204.g551eb34607-goog

From nobody Wed Dec 17 10:57:23 2025
From: Yosry Ahmed
Date: Mon, 21 Aug 2023 20:54:58 +0000
Subject: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Yosry Ahmed
Message-ID: <20230821205458.1764662-4-yosryahmed@google.com>
In-Reply-To: <20230821205458.1764662-1-yosryahmed@google.com>
References: <20230821205458.1764662-1-yosryahmed@google.com>

Unified flushing allows for great concurrency for paths that attempt to
flush the stats, at the expense of potential staleness and a single
flusher paying the extra cost of flushing the full tree.

This tradeoff makes sense for in-kernel flushers that may observe high
concurrency (e.g. reclaim, refault). For userspace readers, stale stats
may be unexpected and problematic, especially when such stats are used
for critical paths such as userspace OOM handling. Additionally, a
userspace reader will occasionally pay the cost of flushing the entire
hierarchy, which also causes problems in some cases [1].

Opt userspace reads out of unified flushing.
This makes the cost of reading the stats more predictable (proportional
to the size of the subtree), as well as the freshness of the stats.
Since userspace readers are not expected to have similar concurrency to
in-kernel flushers, serializing them among themselves and among
in-kernel flushers should be okay.

This was tested on a machine with 256 cpus by running a synthetic test
script that creates 50 top-level cgroups, each with 5 children (250
leaf cgroups). Each leaf cgroup has 10 processes running that allocate
memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
unified flusher). Concurrently, one thread is spawned per-cgroup to read
the stats every second (including root, top-level, and leaf cgroups --
so total 251 threads). No regressions were observed in the total running
time, which means that non-unified userspace readers are not slowing
down in-kernel unified flushers:

Base (mm-unstable):

real	0m18.228s
user	0m9.463s
sys	60m15.879s

real	0m20.828s
user	0m8.535s
sys	70m12.364s

real	0m19.789s
user	0m9.177s
sys	66m10.798s

With this patch:

real	0m19.632s
user	0m8.608s
sys	64m23.483s

real	0m18.463s
user	0m7.465s
sys	60m34.089s

real	0m20.309s
user	0m7.754s
sys	68m2.392s

Additionally, the average latency for reading stats went from roughly
40 ms to 5 ms, because we mostly read the stats of leaf cgroups in this
script, so we only have to flush one cgroup, instead of *sometimes*
flushing the entire tree with unified flushing.

[1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/

Signed-off-by: Yosry Ahmed
---
 mm/memcontrol.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 90f08b35fa77..d3b13a06224c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1606,7 +1606,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	 *
 	 * Current memory state:
 	 */
-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -4048,7 +4048,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);
 
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
@@ -4123,7 +4123,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
@@ -4625,7 +4625,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);
 
 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
@@ -6640,7 +6640,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
 	int i;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		int nid;
-- 
2.42.0.rc1.204.g551eb34607-goog
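
For readers following the series outside the kernel tree, the difference
between unified and non-unified flushing can be sketched in userspace C.
This is only an illustration under stated assumptions: C11 atomics stand
in for the kernel's atomic_t helpers, FLUSH_THRESHOLD stands in for
num_online_cpus(), the flush itself is reduced to bumping a counter, and
every sketch_* name is hypothetical and does not exist in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* All names below are hypothetical stand-ins, not kernel symbols. */
#define FLUSH_THRESHOLD 4		/* stand-in for num_online_cpus() */

static atomic_int flush_ongoing;	/* 1 while a unified flush runs */
static atomic_int pending_updates;	/* magnitude of unflushed updates */
static atomic_int flush_count;		/* number of flushes actually done */

/* Non-unified flush: always does the work, never skips. */
static void sketch_stats_flush(void)
{
	atomic_fetch_add(&flush_count, 1);
}

/*
 * Unified flush: only one flusher at a time; concurrent callers skip
 * instead of waiting. Returns false when the flush was skipped.
 */
static bool sketch_unified_stats_flush(void)
{
	if (atomic_load(&flush_ongoing) ||
	    atomic_exchange(&flush_ongoing, 1))
		return false;

	sketch_stats_flush();

	atomic_store(&pending_updates, 0);
	atomic_store(&flush_ongoing, 0);
	return true;
}

/* "Try" API: flushes only when enough updates are pending. */
static bool sketch_try_flush_stats(void)
{
	if (atomic_load(&pending_updates) > FLUSH_THRESHOLD)
		return sketch_unified_stats_flush();
	return false;
}

/* Walks through the three cases; returns how many flushes ran. */
static int sketch_demo(void)
{
	bool ran;

	/* Below the threshold: the "try" API does nothing. */
	ran = sketch_try_flush_stats();
	assert(!ran);

	/* Enough pending updates and no other flusher: the flush runs. */
	atomic_store(&pending_updates, FLUSH_THRESHOLD + 1);
	ran = sketch_try_flush_stats();
	assert(ran);

	/* Pretend another flusher is in progress: the unified flush skips. */
	atomic_store(&flush_ongoing, 1);
	ran = sketch_unified_stats_flush();
	assert(!ran);
	atomic_store(&flush_ongoing, 0);

	/* The non-unified flush never skips. */
	sketch_stats_flush();

	return atomic_load(&flush_count);
}
```

Calling sketch_demo() from a main() exercises all three paths. The point
of the sketch is the return value of the "try" variants: a caller of the
unified path cannot assume the stats were flushed, while a caller of the
non-unified path can, at the cost of always doing the work.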