From: Joshua Hahn
To: Joshua Hahn
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, "Liam R. Howlett",
    Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Johannes Weiner,
    Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Tejun Heo,
    Michal Koutny, Axel Rasmussen, Yuanchu Xie, Wei Xu, Qi Zheng,
    linux-mm@kvack.org, cgroups@vger.kernel.org,
    linux-kernel@vger.kernel.org, kernel-team@meta.com
Subject: [RFC PATCH 5/6] mm/memcontrol, page_counter: Make memory.low tier-aware
Date: Mon, 23 Feb 2026 14:38:28 -0800
Message-ID: <20260223223830.586018-6-joshua.hahnjy@gmail.com>
X-Mailer: git-send-email 2.47.3
In-Reply-To: <20260223223830.586018-1-joshua.hahnjy@gmail.com>
References: <20260223223830.586018-1-joshua.hahnjy@gmail.com>

On machines serving multiple workloads whose memory is isolated via the
memory cgroup controller, it is currently impossible to enforce a fair
distribution of toptier memory among the workloads: the only enforceable
limits govern total memory footprint, not where that memory resides.
This makes it difficult to ensure consistent baseline performance, as
each workload's performance is heavily affected by workload-external
factors such as which other workloads are co-located on the same host,
and the order in which the workloads are started.
Extend the existing memory.low protection to be tier-aware in charging,
enforcement, and the protection calculation, to provide a best-effort
attempt at protecting a fair proportion of toptier memory. Protection
and charging updates are performed in the same paths as their standard
memcontrol equivalents. Enforcement of tier-aware memcg limits,
however, is gated behind the tier_aware_memcg sysctl, so that enabling
tier-aware limits at runtime can account for memory already present in
the system.

Signed-off-by: Joshua Hahn
---
 include/linux/memcontrol.h   | 15 +++++++++++----
 include/linux/page_counter.h |  7 ++++---
 kernel/cgroup/dmem.c         |  2 +-
 mm/memcontrol.c              | 14 ++++++++++++--
 mm/page_counter.c            | 35 ++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                  | 13 +++++++++----
 6 files changed, 71 insertions(+), 15 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 900a36112b62..a998a1e3b8b0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -606,7 +606,9 @@ static inline void mem_cgroup_protection(struct mem_cgroup *root,
 }
 
 void mem_cgroup_calculate_protection(struct mem_cgroup *root,
-				     struct mem_cgroup *memcg);
+				     struct mem_cgroup *memcg, bool toptier);
+
+unsigned long mem_cgroup_toptier_usage(struct mem_cgroup *memcg);
 
 void update_memcg_toptier_capacity(void);
 
@@ -623,11 +625,15 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
 }
 
 static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
-					struct mem_cgroup *memcg)
+					struct mem_cgroup *memcg, bool toptier)
 {
 	if (mem_cgroup_unprotected(target, memcg))
 		return false;
 
+	if (toptier)
+		return READ_ONCE(memcg->memory.etoptier_low) >=
+			mem_cgroup_toptier_usage(memcg);
+
 	return READ_ONCE(memcg->memory.elow) >=
 		page_counter_read(&memcg->memory);
 }
@@ -1114,7 +1120,8 @@ static inline void mem_cgroup_protection(struct mem_cgroup *root,
 }
 
 static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
-						   struct mem_cgroup *memcg)
+						   struct mem_cgroup *memcg,
+						   bool toptier)
 {
 }
 
@@ -1128,7 +1135,7 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
 	return true;
 }
 static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
-					struct mem_cgroup *memcg)
+					struct mem_cgroup *memcg, bool toptier)
 {
 	return false;
 }
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index ada5f1dd75d4..6635ee7b9575 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -120,15 +120,16 @@ static inline void page_counter_reset_watermark(struct page_counter *counter)
 #if IS_ENABLED(CONFIG_MEMCG) || IS_ENABLED(CONFIG_CGROUP_DMEM)
 void page_counter_calculate_protection(struct page_counter *root,
 				       struct page_counter *counter,
-				       bool recursive_protection);
+				       bool recursive_protection, bool toptier);
 void page_counter_update_toptier_capacity(struct page_counter *counter,
 					  const nodemask_t *allowed);
 unsigned long page_counter_toptier_high(struct page_counter *counter);
 unsigned long page_counter_toptier_low(struct page_counter *counter);
 #else
 static inline void page_counter_calculate_protection(struct page_counter *root,
-						     struct page_counter *counter,
-						     bool recursive_protection) {}
+						     struct page_counter *counter,
+						     bool recursive_protection,
+						     bool toptier) {}
 #endif
 
 #endif /* _LINUX_PAGE_COUNTER_H */
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 1ea6afffa985..536d43c42de8 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -277,7 +277,7 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
 			continue;
 
 		page_counter_calculate_protection(
-			climit, &found_pool->cnt, true);
+			climit, &found_pool->cnt, true, false);
 
 		if (found_pool == test_pool)
 			break;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 07464f02c529..8aa7ae361a73 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4806,12 +4806,13 @@ struct cgroup_subsys
memory_cgrp_subsys = {
  * mem_cgroup_calculate_protection - check if memory consumption is in the normal range
  * @root: the top ancestor of the sub-tree being checked
  * @memcg: the memory cgroup to check
+ * @toptier: whether the caller is in a toptier node
  *
  * WARNING: This function is not stateless! It can only be used as part
  * of a top-down tree iteration, not for isolated queries.
  */
 void mem_cgroup_calculate_protection(struct mem_cgroup *root,
-				     struct mem_cgroup *memcg)
+				     struct mem_cgroup *memcg, bool toptier)
 {
 	bool recursive_protection = cgrp_dfl_root.flags &
 				    CGRP_ROOT_MEMORY_RECURSIVE_PROT;
@@ -4822,7 +4823,16 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 	if (!root)
 		root = root_mem_cgroup;
 
-	page_counter_calculate_protection(&root->memory, &memcg->memory, recursive_protection);
+	page_counter_calculate_protection(&root->memory, &memcg->memory,
+					  recursive_protection, toptier);
+}
+
+unsigned long mem_cgroup_toptier_usage(struct mem_cgroup *memcg)
+{
+	if (mem_cgroup_disabled() || !memcg)
+		return 0;
+
+	return atomic_long_read(&memcg->memory.toptier_usage);
 }
 
 void update_memcg_toptier_capacity(void)
diff --git a/mm/page_counter.c b/mm/page_counter.c
index cf21c72bfd4e..79d46a1c4c0c 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -410,12 +410,39 @@ static unsigned long effective_protection(unsigned long usage,
 	return ep;
 }
 
+static void calculate_protection_toptier(struct page_counter *counter,
+					 bool recursive_protection)
+{
+	struct page_counter *parent = counter->parent;
+	unsigned long toptier_low;
+	unsigned long toptier_usage, parent_toptier_usage;
+	unsigned long toptier_protected, old_toptier_protected;
+	long delta;
+
+	toptier_low = page_counter_toptier_low(counter);
+	toptier_usage = atomic_long_read(&counter->toptier_usage);
+	parent_toptier_usage = atomic_long_read(&parent->toptier_usage);
+
+	/* Propagate toptier low usage to parent for sibling distribution */
+	toptier_protected = min(toptier_usage, toptier_low);
+	old_toptier_protected = atomic_long_xchg(&counter->toptier_low_usage,
+						 toptier_protected);
+	delta = toptier_protected - old_toptier_protected;
+	atomic_long_add(delta, &parent->children_toptier_low_usage);
+
+	WRITE_ONCE(counter->etoptier_low,
+		   effective_protection(toptier_usage, parent_toptier_usage,
+			toptier_low, READ_ONCE(parent->etoptier_low),
+			atomic_long_read(&parent->children_toptier_low_usage),
+			recursive_protection));
+}
 
 /**
  * page_counter_calculate_protection - check if memory consumption is in the normal range
  * @root: the top ancestor of the sub-tree being checked
  * @counter: the page_counter the counter to update
  * @recursive_protection: Whether to use memory_recursiveprot behavior.
+ * @toptier: Whether to calculate toptier-proportional protection
  *
  * Calculates elow/emin thresholds for given page_counter.
  *
@@ -424,7 +451,7 @@ static unsigned long effective_protection(unsigned long usage,
  */
 void page_counter_calculate_protection(struct page_counter *root,
 				       struct page_counter *counter,
-				       bool recursive_protection)
+				       bool recursive_protection, bool toptier)
 {
 	unsigned long usage, parent_usage;
 	struct page_counter *parent = counter->parent;
@@ -446,6 +473,9 @@ void page_counter_calculate_protection(struct page_counter *root,
 	if (parent == root) {
 		counter->emin = READ_ONCE(counter->min);
 		counter->elow = READ_ONCE(counter->low);
+		if (toptier)
+			WRITE_ONCE(counter->etoptier_low,
+				   page_counter_toptier_low(counter));
 		return;
 	}
 
@@ -462,6 +492,9 @@ void page_counter_calculate_protection(struct page_counter *root,
 			READ_ONCE(parent->elow),
 			atomic_long_read(&parent->children_low_usage),
 			recursive_protection));
+
+	if (toptier)
+		calculate_protection_toptier(counter, recursive_protection);
 }
 
 void page_counter_update_toptier_capacity(struct page_counter *counter,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6a87ac7be43c..5b4cb030a477 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@
-4144,6 +4144,7 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 	struct mem_cgroup *memcg;
 	unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
 	bool reclaimable = !min_ttl;
+	bool toptier = node_is_toptier(pgdat->node_id);
 
 	VM_WARN_ON_ONCE(!current_is_kswapd());
 
@@ -4153,7 +4154,7 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 	do {
 		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
-		mem_cgroup_calculate_protection(NULL, memcg);
+		mem_cgroup_calculate_protection(NULL, memcg, toptier);
 
 		if (!reclaimable)
 			reclaimable = lruvec_is_reclaimable(lruvec, sc, min_ttl);
@@ -4905,12 +4906,14 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 	unsigned long reclaimed = sc->nr_reclaimed;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+	bool toptier = tier_aware_memcg_limits &&
+		       node_is_toptier(pgdat->node_id);
 
 	/* lru_gen_age_node() called mem_cgroup_calculate_protection() */
 	if (mem_cgroup_below_min(NULL, memcg))
 		return MEMCG_LRU_YOUNG;
 
-	if (mem_cgroup_below_low(NULL, memcg)) {
+	if (mem_cgroup_below_low(NULL, memcg, toptier)) {
 		/* see the comment on MEMCG_NR_GENS */
 		if (READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL)
 			return MEMCG_LRU_TAIL;
@@ -5960,6 +5963,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 	};
 	struct mem_cgroup_reclaim_cookie *partial = &reclaim;
 	struct mem_cgroup *memcg;
+	bool toptier = node_is_toptier(pgdat->node_id);
 
 	/*
 	 * In most cases, direct reclaimers can do partial walks
@@ -5987,7 +5991,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		 */
 		cond_resched();
 
-		mem_cgroup_calculate_protection(target_memcg, memcg);
+		mem_cgroup_calculate_protection(target_memcg, memcg, toptier);
 
 		if (mem_cgroup_below_min(target_memcg, memcg)) {
 			/*
@@ -5995,7 +5999,8 @@ static void
shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 			 * If there is no reclaimable memory, OOM.
 			 */
 			continue;
-		} else if (mem_cgroup_below_low(target_memcg, memcg)) {
+		} else if (mem_cgroup_below_low(target_memcg, memcg,
+					       tier_aware_memcg_limits && toptier)) {
 			/*
 			 * Soft protection.
 			 * Respect the protection only as long as
-- 
2.47.3