From nobody Sun Apr 5 19:49:50 2026
From: Joshua Hahn
To: Joshua Hahn
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, "Liam R. Howlett",
 Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@meta.com
Subject: [RFC PATCH 1/6] mm/memory-tiers: Introduce tier-aware memcg limit sysfs
Date: Mon, 23 Feb 2026 14:38:24 -0800
Message-ID: <20260223223830.586018-2-joshua.hahnjy@gmail.com>
In-Reply-To: <20260223223830.586018-1-joshua.hahnjy@gmail.com>
References: <20260223223830.586018-1-joshua.hahnjy@gmail.com>

Introduce a sysfs entry, /sys/kernel/mm/numa/tier_aware_memcg, that lets
users toggle memcg limits that are proportional to the system's
toptier:total capacity ratio.
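The new store handler (in the diff below) parses input with kstrtobool(), so the knob accepts the usual kernel boolean spellings. A small Python model of that parsing, sketching kstrtobool()'s documented behavior rather than quoting kernel code:

```python
def kstrtobool(s: str) -> bool:
    # Models the kernel's kstrtobool(): the decision is made on the
    # first character(s); anything else is an error, which the sysfs
    # store handler surfaces to userspace as -EINVAL.
    s = s.strip()  # echo(1) appends a newline; the kernel tolerates it
    if not s:
        raise ValueError("empty input")
    if s[0] in "yY1":
        return True
    if s[0] in "nN0":
        return False
    if s[0] in "oO" and len(s) > 1:  # "on" / "off"
        if s[1] in "nN":
            return True
        if s[1] in "fF":
            return False
    raise ValueError(f"unrecognized boolean: {s!r}")

# e.g. `echo 1 > /sys/kernel/mm/numa/tier_aware_memcg`
tier_aware_memcg_limits = kstrtobool("1\n")
print(tier_aware_memcg_limits)  # → True
```

Anything kstrtobool() rejects ("maybe", "2", an empty write) leaves the knob unchanged and the write fails with EINVAL.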
Signed-off-by: Joshua Hahn
---
 include/linux/memory-tiers.h |  1 +
 mm/memory-tiers.c            | 22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 96987d9d95a8..85440473effb 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -37,6 +37,7 @@ struct access_coordinate;

 #ifdef CONFIG_NUMA
 extern bool numa_demotion_enabled;
+extern bool tier_aware_memcg_limits;
 extern struct memory_dev_type *default_dram_type;
 extern nodemask_t default_dram_nodes;
 struct memory_dev_type *alloc_memory_type(int adistance);
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 545e34626df7..a88256381519 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -939,6 +939,8 @@ subsys_initcall(memory_tier_init);

 bool numa_demotion_enabled = false;

+bool tier_aware_memcg_limits;
+
 #ifdef CONFIG_MIGRATION
 #ifdef CONFIG_SYSFS
 static ssize_t demotion_enabled_show(struct kobject *kobj,
@@ -975,8 +977,28 @@ static ssize_t demotion_enabled_store(struct kobject *kobj,
 static struct kobj_attribute numa_demotion_enabled_attr =
 	__ATTR_RW(demotion_enabled);

+static ssize_t tier_aware_memcg_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%s\n", str_true_false(tier_aware_memcg_limits));
+}
+
+static ssize_t tier_aware_memcg_store(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      const char *buf, size_t count)
+{
+	if (kstrtobool(buf, &tier_aware_memcg_limits))
+		return -EINVAL;
+
+	return count;
+}
+
+static struct kobj_attribute numa_tier_aware_memcg_attr =
+	__ATTR_RW(tier_aware_memcg);
+
 static struct attribute *numa_attrs[] = {
 	&numa_demotion_enabled_attr.attr,
+	&numa_tier_aware_memcg_attr.attr,
 	NULL,
 };

-- 
2.47.3
From nobody Sun Apr 5 19:49:50 2026
From: Joshua Hahn
To: Joshua Hahn
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, "Liam R. Howlett",
 Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Johannes Weiner,
 Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
 linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
 kernel-team@meta.com
Subject: [RFC PATCH 2/6] mm/page_counter: Introduce tiered memory awareness to page_counter
Date: Mon, 23 Feb 2026 14:38:25 -0800
Message-ID: <20260223223830.586018-3-joshua.hahnjy@gmail.com>
In-Reply-To: <20260223223830.586018-1-joshua.hahnjy@gmail.com>
References: <20260223223830.586018-1-joshua.hahnjy@gmail.com>

On systems with tiered memory, there is currently no tracking of memory
at the tier-memcg granularity. While per-memcg lruvec stats operate at a
finer granularity and could be accumulated to give the desired
per-tier-memcg accounting, relying on them for limit checking would
touch too many hot paths too frequently and could add latency for other
memcg users.

Instead, add a new cacheline in struct page_counter to track toptier
memcg limits and usage, as well as cached capacity values. This
cacheline is only used by the mem_cgroup->memory page_counter. Also,
introduce helpers that use these new fields to calculate proportional
toptier high and low values, based on the system's toptier:total
capacity ratio.
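The proportional helpers scale memory.high and memory.low by the cached toptier:total capacity ratio using mult_frac(). A minimal Python model of that arithmetic; names mirror the patch, and the PAGE_COUNTER_MAX value is only an illustrative stand-in:

```python
PAGE_COUNTER_MAX = (1 << 63) - 1  # illustrative stand-in for the kernel constant

def mult_frac(x: int, num: int, den: int) -> int:
    # Models the kernel's mult_frac(): x * num / den. The kernel macro
    # exists to avoid intermediate overflow; Python's big ints make
    # that moot, so plain integer arithmetic suffices here.
    return x * num // den

def toptier_high(high: int, toptier_cap: int, total_cap: int) -> int:
    # Mirrors page_counter_toptier_high(): an unset limit stays unset,
    # and a zero total capacity means "no scaling information yet".
    if high == PAGE_COUNTER_MAX or total_cap == 0:
        return PAGE_COUNTER_MAX
    return mult_frac(high, toptier_cap, total_cap)

def toptier_low(low: int, toptier_cap: int, total_cap: int) -> int:
    # Mirrors page_counter_toptier_low(): no protection stays zero.
    if low == 0 or total_cap == 0:
        return 0
    return mult_frac(low, toptier_cap, total_cap)

# A host with 64 GiB total, 16 GiB of it toptier (4 KiB pages):
# a 4 GiB memory.high shrinks to a 1 GiB toptier high.
print(toptier_high(1048576, 4194304, 16777216))  # → 262144
```

So a cgroup whose memory.high allows a quarter of the machine gets a toptier high that is a quarter of the toptier capacity, preserving the same proportion within the fast tier.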
Signed-off-by: Joshua Hahn
---
 include/linux/page_counter.h | 22 +++++++++++++++++++++-
 mm/page_counter.c            | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index d649b6bbbc87..128c1272c88c 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -5,6 +5,7 @@
 #include
 #include
 #include
+#include
 #include

 struct page_counter {
@@ -31,9 +32,23 @@ struct page_counter {
 	/* Latest cg2 reset watermark */
 	unsigned long local_watermark;

-	/* Keep all the read most fields in a separete cacheline. */
+	/* Keep all the tiered memory fields in a separate cacheline. */
 	CACHELINE_PADDING(_pad2_);

+	atomic_long_t toptier_usage;
+
+	/* effective toptier-proportional low protection */
+	unsigned long etoptier_low;
+	atomic_long_t toptier_low_usage;
+	atomic_long_t children_toptier_low_usage;
+
+	/* Cached toptier capacity for proportional limit calculations */
+	unsigned long toptier_capacity;
+	unsigned long total_capacity;
+
+	/* Keep all the read most fields in a separate cacheline. */
+	CACHELINE_PADDING(_pad3_);
+
 	bool protection_support;
 	bool track_failcnt;
 	unsigned long min;
@@ -61,6 +76,9 @@ static inline void page_counter_init(struct page_counter *counter,
 	counter->parent = parent;
 	counter->protection_support = protection_support;
 	counter->track_failcnt = false;
+	counter->toptier_usage = (atomic_long_t)ATOMIC_LONG_INIT(0);
+	counter->toptier_capacity = 0;
+	counter->total_capacity = 0;
 }

 static inline unsigned long page_counter_read(struct page_counter *counter)
@@ -103,6 +121,8 @@ static inline void page_counter_reset_watermark(struct page_counter *counter)
 void page_counter_calculate_protection(struct page_counter *root,
 				       struct page_counter *counter,
 				       bool recursive_protection);
+unsigned long page_counter_toptier_high(struct page_counter *counter);
+unsigned long page_counter_toptier_low(struct page_counter *counter);
 #else
 static inline void page_counter_calculate_protection(struct page_counter *root,
 						     struct page_counter *counter,
diff --git a/mm/page_counter.c b/mm/page_counter.c
index 661e0f2a5127..5ec97811c418 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -462,4 +462,38 @@ void page_counter_calculate_protection(struct page_counter *root,
 			atomic_long_read(&parent->children_low_usage),
 			recursive_protection));
 }
+
+unsigned long page_counter_toptier_high(struct page_counter *counter)
+{
+	unsigned long high = READ_ONCE(counter->high);
+	unsigned long toptier_cap, total_cap;
+
+	if (high == PAGE_COUNTER_MAX)
+		return PAGE_COUNTER_MAX;
+
+	toptier_cap = counter->toptier_capacity;
+	total_cap = counter->total_capacity;
+
+	if (!total_cap)
+		return PAGE_COUNTER_MAX;
+
+	return mult_frac(high, toptier_cap, total_cap);
+}
+
+unsigned long page_counter_toptier_low(struct page_counter *counter)
+{
+	unsigned long low = READ_ONCE(counter->low);
+	unsigned long toptier_cap, total_cap;
+
+	if (!low)
+		return 0;
+
+	toptier_cap = counter->toptier_capacity;
+	total_cap = counter->total_capacity;
+
+	if (!total_cap)
+		return 0;
+
+	return mult_frac(low, toptier_cap, total_cap);
+}
 #endif /* CONFIG_MEMCG || CONFIG_CGROUP_DMEM */
-- 
2.47.3
From nobody Sun Apr 5 19:49:50 2026
From: Joshua Hahn
To: Joshua Hahn
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, "Liam R. Howlett",
 Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Johannes Weiner,
 Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Waiman Long,
 Chen Ridong, Tejun Heo, Michal Koutny, linux-mm@kvack.org,
 cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@meta.com
Subject: [RFC PATCH 3/6] mm/memory-tiers, memcontrol: Introduce toptier capacity updates
Date: Mon, 23 Feb 2026 14:38:26 -0800
Message-ID: <20260223223830.586018-4-joshua.hahnjy@gmail.com>
In-Reply-To: <20260223223830.586018-1-joshua.hahnjy@gmail.com>
References: <20260223223830.586018-1-joshua.hahnjy@gmail.com>

What a memcg considers to be a valid toptier node is defined by three
criteria: (1) the node has CPUs, (2) the node has online memory, and
(3) the node is within the cgroup's cpuset.mems. Of the three, only the
second and third can change dynamically at runtime, via memory hotplug
events and cpuset.mems changes, respectively.

Introduce functions to calculate and update toptier capacity, and call
them during cpuset.mems changes and memory hotplug events.
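The capacity bookkeeping can be pictured with a toy model. The node table and helpers below are hypothetical, and "has CPUs" stands in for the kernel's real toptier test (which goes through the memory-tiers abstract distance); the three criteria are the ones from this commit message:

```python
def toptier_capacity(nodes, allowed=None):
    # Toy analogue of mt_get_toptier_capacity(): sum present pages over
    # nodes that pass all three criteria (CPUs, online memory, and
    # membership in the allowed mask, i.e. cpuset.mems).
    return sum(n["present_pages"] for nid, n in nodes.items()
               if (allowed is None or nid in allowed)
               and n["has_cpu"] and n["present_pages"] > 0)

def total_capacity(nodes, allowed=None):
    # Toy analogue of mt_get_total_capacity(): every allowed node with
    # memory counts, toptier or not.
    return sum(n["present_pages"] for nid, n in nodes.items()
               if (allowed is None or nid in allowed)
               and n["present_pages"] > 0)

# Hypothetical topology: node 0 is DRAM with CPUs, node 1 is CPU-less
# CXL memory (page counts in 4 KiB pages).
nodes = {
    0: {"has_cpu": True,  "present_pages": 4 << 20},
    1: {"has_cpu": False, "present_pages": 16 << 20},
}
print(toptier_capacity(nodes), total_capacity(nodes))  # → 4194304 20971520
print(total_capacity(nodes, allowed={0}))              # cpuset.mems = node 0
```

Hotplugging node 1 out, or restricting cpuset.mems to node 0, changes these sums, which is why the patch recomputes the cached capacities from both the hotplug notifier and the cpuset.mems update path.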
Signed-off-by: Joshua Hahn --- include/linux/memcontrol.h | 6 ++++++ include/linux/memory-tiers.h | 29 +++++++++++++++++++++++++ include/linux/page_counter.h | 2 ++ kernel/cgroup/cpuset.c | 2 +- mm/memcontrol.c | 17 +++++++++++++++ mm/memory-tiers.c | 41 ++++++++++++++++++++++++++++++++++++ mm/page_counter.c | 8 +++++++ 7 files changed, 104 insertions(+), 1 deletion(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 5173a9f16721..900a36112b62 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -608,6 +608,8 @@ static inline void mem_cgroup_protection(struct mem_cgr= oup *root, void mem_cgroup_calculate_protection(struct mem_cgroup *root, struct mem_cgroup *memcg); =20 +void update_memcg_toptier_capacity(void); + static inline bool mem_cgroup_unprotected(struct mem_cgroup *target, struct mem_cgroup *memcg) { @@ -1116,6 +1118,10 @@ static inline void mem_cgroup_calculate_protection(s= truct mem_cgroup *root, { } =20 +static inline void update_memcg_toptier_capacity(void) +{ +} + static inline bool mem_cgroup_unprotected(struct mem_cgroup *target, struct mem_cgroup *memcg) { diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 85440473effb..cf616885e0db 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -53,6 +53,9 @@ int mt_perf_to_adistance(struct access_coordinate *perf, = int *adist); struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head *memory_types); void mt_put_memory_types(struct list_head *memory_types); +void mt_get_toptier_nodemask(nodemask_t *mask, const nodemask_t *allowed); +unsigned long mt_get_toptier_capacity(const nodemask_t *allowed); +unsigned long mt_get_total_capacity(const nodemask_t *allowed); #ifdef CONFIG_MIGRATION int next_demotion_node(int node, const nodemask_t *allowed_mask); void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets); @@ -152,5 +155,31 @@ static inline struct memory_dev_type 
*mt_find_alloc_me= mory_type(int adist, static inline void mt_put_memory_types(struct list_head *memory_types) { } + +static inline void mt_get_toptier_nodemask(nodemask_t *mask, + const nodemask_t *allowed) +{ + *mask =3D node_states[N_MEMORY]; + if (allowed) + nodes_and(*mask, *mask, *allowed); +} + +static inline unsigned long mt_get_toptier_capacity(const nodemask_t *allo= wed) +{ + int nid; + unsigned long capacity =3D 0; + + for_each_node_state(nid, N_MEMORY) { + if (allowed && !node_isset(nid, *allowed)) + continue; + capacity +=3D NODE_DATA(nid)->node_present_pages; + } + return capacity; +} + +static inline unsigned long mt_get_total_capacity(const nodemask_t *allowe= d) +{ + return mt_get_toptier_capacity(allowed); +} #endif /* CONFIG_NUMA */ #endif /* _LINUX_MEMORY_TIERS_H */ diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h index 128c1272c88c..ada5f1dd75d4 100644 --- a/include/linux/page_counter.h +++ b/include/linux/page_counter.h @@ -121,6 +121,8 @@ static inline void page_counter_reset_watermark(struct = page_counter *counter) void page_counter_calculate_protection(struct page_counter *root, struct page_counter *counter, bool recursive_protection); +void page_counter_update_toptier_capacity(struct page_counter *counter, + const nodemask_t *allowed); unsigned long page_counter_toptier_high(struct page_counter *counter); unsigned long page_counter_toptier_low(struct page_counter *counter); #else diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 7607dfe516e6..e5641dc1af88 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2620,7 +2620,6 @@ static void update_nodemasks_hier(struct cpuset *cs, = nodemask_t *new_mems) rcu_read_lock(); cpuset_for_each_descendant_pre(cp, pos_css, cs) { struct cpuset *parent =3D parent_cs(cp); - bool has_mems =3D nodes_and(*new_mems, cp->mems_allowed, parent->effecti= ve_mems); =20 /* @@ -2701,6 +2700,7 @@ static int update_nodemask(struct cpuset *cs, struct = cpuset 
*trialcs, =20 /* use trialcs->mems_allowed as a temp variable */ update_nodemasks_hier(cs, &trialcs->mems_allowed); + update_memcg_toptier_capacity(); return 0; } =20 diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0be1e823d813..f3e4a6ce7181 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -54,6 +54,7 @@ #include #include #include +#include #include #include #include @@ -3906,6 +3907,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *pare= nt_css) =20 page_counter_init(&memcg->memory, &parent->memory, memcg_on_dfl); page_counter_init(&memcg->swap, &parent->swap, false); + page_counter_update_toptier_capacity(&memcg->memory, NULL); #ifdef CONFIG_MEMCG_V1 memcg->memory.track_failcnt =3D !memcg_on_dfl; WRITE_ONCE(memcg->oom_kill_disable, READ_ONCE(parent->oom_kill_disable)); @@ -3917,6 +3919,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *pare= nt_css) init_memcg_events(); page_counter_init(&memcg->memory, NULL, true); page_counter_init(&memcg->swap, NULL, false); + page_counter_update_toptier_capacity(&memcg->memory, NULL); #ifdef CONFIG_MEMCG_V1 page_counter_init(&memcg->kmem, NULL, false); page_counter_init(&memcg->tcpmem, NULL, false); @@ -4804,6 +4807,20 @@ void mem_cgroup_calculate_protection(struct mem_cgro= up *root, page_counter_calculate_protection(&root->memory, &memcg->memory, recursiv= e_protection); } =20 +void update_memcg_toptier_capacity(void) +{ + struct mem_cgroup *memcg; + nodemask_t allowed; + + for_each_mem_cgroup(memcg) { + if (memcg =3D=3D root_mem_cgroup) + continue; + + cpuset_nodes_allowed(memcg->css.cgroup, &allowed); + page_counter_update_toptier_capacity(&memcg->memory, &allowed); + } +} + static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg, gfp_t gfp) { diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index a88256381519..259caaf4be8f 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -889,6 +889,7 @@ static int __meminit memtier_hotplug_callback(struct no= tifier_block *self, 
mutex_lock(&memory_tier_lock); if (clear_node_memory_tier(nn->nid)) establish_demotion_targets(); + update_memcg_toptier_capacity(); mutex_unlock(&memory_tier_lock); break; case NODE_ADDED_FIRST_MEMORY: @@ -896,6 +897,7 @@ static int __meminit memtier_hotplug_callback(struct no= tifier_block *self, memtier =3D set_node_memory_tier(nn->nid); if (!IS_ERR(memtier)) establish_demotion_targets(); + update_memcg_toptier_capacity(); mutex_unlock(&memory_tier_lock); break; } @@ -941,6 +943,45 @@ bool numa_demotion_enabled =3D false; =20 bool tier_aware_memcg_limits; =20 +void mt_get_toptier_nodemask(nodemask_t *mask, const nodemask_t *allowed) +{ + int nid; + + *mask =3D NODE_MASK_NONE; + for_each_node_state(nid, N_MEMORY) { + if (node_is_toptier(nid)) + node_set(nid, *mask); + } + if (allowed) + nodes_and(*mask, *mask, *allowed); +} + +unsigned long mt_get_toptier_capacity(const nodemask_t *allowed) +{ + int nid; + unsigned long capacity =3D 0; + nodemask_t mask; + + mt_get_toptier_nodemask(&mask, allowed); + for_each_node_mask(nid, mask) + capacity +=3D NODE_DATA(nid)->node_present_pages; + + return capacity; +} + +unsigned long mt_get_total_capacity(const nodemask_t *allowed) +{ + int nid; + unsigned long capacity =3D 0; + + for_each_node_state(nid, N_MEMORY) { + if (allowed && !node_isset(nid, *allowed)) + continue; + capacity +=3D NODE_DATA(nid)->node_present_pages; + } + return capacity; +} + #ifdef CONFIG_MIGRATION #ifdef CONFIG_SYSFS static ssize_t demotion_enabled_show(struct kobject *kobj, diff --git a/mm/page_counter.c b/mm/page_counter.c index 5ec97811c418..cf21c72bfd4e 100644 --- a/mm/page_counter.c +++ b/mm/page_counter.c @@ -11,6 +11,7 @@ #include #include #include +#include #include =20 static bool track_protection(struct page_counter *c) @@ -463,6 +464,13 @@ void page_counter_calculate_protection(struct page_cou= nter *root, recursive_protection)); } =20 +void page_counter_update_toptier_capacity(struct page_counter *counter, + const nodemask_t *allowed) 
+{ + counter->toptier_capacity =3D mt_get_toptier_capacity(allowed); + counter->total_capacity =3D mt_get_total_capacity(allowed); +} + unsigned long page_counter_toptier_high(struct page_counter *counter) { unsigned long high =3D READ_ONCE(counter->high); --=20 2.47.3 From nobody Sun Apr 5 19:49:50 2026 Received: from mail-oi1-f179.google.com (mail-oi1-f179.google.com [209.85.167.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 65F7537A49D for ; Mon, 23 Feb 2026 22:38:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.167.179 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1771886322; cv=none; b=Xf7tZcJLIsXrVcMeb4fm3CRAm0+t5z4qe4ursG8D2SDLfbIqJ3gCE8YT1EjaQUDerTup6Qy3NZc3AuVy//Kxco/Cm3CAeSVhdXvto2UCAJOdE1u+oWImN4w+FMYXZosnLc0+Ehtjz+yeKEMLPwJCcQSTLLmc0Qfyn1jRGQrnGTw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1771886322; c=relaxed/simple; bh=J+BT9JHVO5MqWgVBXYMCvRaAsNpI/lKiHSmM/7rTRko=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=CuCWFD8b7diHx7ixImfBgN7f03LA1/40IdjPnPik2M7uYch5iCEjsL2CT7Klken6g2SsRldN1hb7na+aMtKio/g9k4DcZJ6GSO0HbwvObqY2dv8ZaXuZcMQKJImFzNFI4uNJvN39NJDljYjj7voVQQP80AndGCC7iy5YGTnc+z0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=ZLsqMBTk; arc=none smtp.client-ip=209.85.167.179 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="ZLsqMBTk" Received: by mail-oi1-f179.google.com with 
From: Joshua Hahn
To: Joshua Hahn
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@meta.com
Subject: [RFC PATCH 4/6] mm/memcontrol: Charge and uncharge from toptier
Date: Mon, 23 Feb 2026 14:38:27 -0800
Message-ID: <20260223223830.586018-5-joshua.hahnjy@gmail.com>
In-Reply-To: <20260223223830.586018-1-joshua.hahnjy@gmail.com>
References: <20260223223830.586018-1-joshua.hahnjy@gmail.com>

Modify memcg charging and uncharging sites to also update toptier
statistics.

Unfortunately, try_charge_memcg is unaware of the physical folio being
charged; it only deals with nr_pages. Instead of modifying
try_charge_memcg, adjust the toptier fields once try_charge_memcg
succeeds, inside charge_memcg.
Signed-off-by: Joshua Hahn
---
 mm/memcontrol.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f3e4a6ce7181..07464f02c529 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1948,6 +1948,24 @@ static void memcg_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages)
 		page_counter_uncharge(&memcg->memsw, nr_pages);
 }
 
+static void memcg_charge_toptier(struct mem_cgroup *memcg,
+				 unsigned long nr_pages)
+{
+	struct page_counter *c;
+
+	for (c = &memcg->memory; c; c = c->parent)
+		atomic_long_add(nr_pages, &c->toptier_usage);
+}
+
+static void memcg_uncharge_toptier(struct mem_cgroup *memcg,
+				   unsigned long nr_pages)
+{
+	struct page_counter *c;
+
+	for (c = &memcg->memory; c; c = c->parent)
+		atomic_long_sub(nr_pages, &c->toptier_usage);
+}
+
 /*
  * Returns stocks cached in percpu and reset cached information.
  */
@@ -4830,6 +4848,9 @@ static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
 	if (ret)
 		goto out;
 
+	if (node_is_toptier(folio_nid(folio)))
+		memcg_charge_toptier(memcg, folio_nr_pages(folio));
+
 	css_get(&memcg->css);
 	commit_charge(folio, memcg);
 	memcg1_commit_charge(folio, memcg);
@@ -4921,6 +4942,7 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 struct uncharge_gather {
 	struct mem_cgroup *memcg;
 	unsigned long nr_memory;
+	unsigned long nr_toptier;
 	unsigned long pgpgout;
 	unsigned long nr_kmem;
 	int nid;
@@ -4941,6 +4963,8 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 		}
 		memcg1_oom_recover(ug->memcg);
 	}
+	if (ug->nr_toptier)
+		memcg_uncharge_toptier(ug->memcg, ug->nr_toptier);
 
 	memcg1_uncharge_batch(ug->memcg, ug->pgpgout, ug->nr_memory, ug->nid);
 
@@ -4989,6 +5013,9 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 
 	nr_pages = folio_nr_pages(folio);
 
+	if (node_is_toptier(folio_nid(folio)))
+		ug->nr_toptier += nr_pages;
+
 	if (folio_memcg_kmem(folio)) {
 		ug->nr_memory += nr_pages;
 		ug->nr_kmem += nr_pages;
@@ -5072,6 +5099,10 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
 		page_counter_charge(&memcg->memsw, nr_pages);
 	}
 
+	/* The old folio's toptier_usage will be decremented when it is freed */
+	if (node_is_toptier(folio_nid(new)))
+		memcg_charge_toptier(memcg, nr_pages);
+
 	css_get(&memcg->css);
 	commit_charge(new, memcg);
 	memcg1_commit_charge(new, memcg);
@@ -5091,6 +5122,7 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
 void mem_cgroup_migrate(struct folio *old, struct folio *new)
 {
 	struct mem_cgroup *memcg;
+	int old_toptier, new_toptier;
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
 	VM_BUG_ON_FOLIO(!folio_test_locked(new), new);
@@ -5111,6 +5143,13 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
 	if (!memcg)
 		return;
 
+	old_toptier = node_is_toptier(folio_nid(old));
+	new_toptier = node_is_toptier(folio_nid(new));
+	if (old_toptier && !new_toptier)
+		memcg_uncharge_toptier(memcg, folio_nr_pages(old));
+	else if (!old_toptier && new_toptier)
+		memcg_charge_toptier(memcg, folio_nr_pages(old));
+
 	/* Transfer the charge and the css ref */
 	commit_charge(new, memcg);
 
-- 
2.47.3
From: Joshua Hahn
To: Joshua Hahn
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, "Liam R. Howlett", Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Tejun Heo, Michal Koutny, Axel Rasmussen, Yuanchu Xie, Wei Xu, Qi Zheng, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@meta.com
Subject: [RFC PATCH 5/6] mm/memcontrol, page_counter: Make memory.low tier-aware
Date: Mon, 23 Feb 2026 14:38:28 -0800
Message-ID: <20260223223830.586018-6-joshua.hahnjy@gmail.com>
In-Reply-To: <20260223223830.586018-1-joshua.hahnjy@gmail.com>
References: <20260223223830.586018-1-joshua.hahnjy@gmail.com>

On machines serving multiple workloads whose memory is isolated via the
memory cgroup controller, it is currently impossible to enforce a fair
distribution of toptier memory among the workloads, as the only
enforceable limits concern total memory footprint, not where that
memory resides. This makes ensuring consistent baseline performance
difficult, as each workload's performance is heavily impacted by
workload-external factors such as which other workloads are co-located
on the same host, and the order in which different workloads are
started.

Extend the existing memory.low protection to be tier-aware in the
charging, enforcement, and protection calculation, to make a
best-effort attempt at protecting a fair proportion of toptier memory.
Updates to protection and charging are performed in the same paths as
the standard memcontrol equivalents. Enforcement of tier-aware memcg
limits, however, is gated behind the sysctl tier_aware_memcg, so that
runtime-enabling of tier-aware limits can account for memory already
present in the system.
Signed-off-by: Joshua Hahn
---
 include/linux/memcontrol.h   | 15 +++++++++++----
 include/linux/page_counter.h |  7 ++++---
 kernel/cgroup/dmem.c         |  2 +-
 mm/memcontrol.c              | 14 ++++++++++++--
 mm/page_counter.c            | 35 ++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                  | 13 +++++++++----
 6 files changed, 71 insertions(+), 15 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 900a36112b62..a998a1e3b8b0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -606,7 +606,9 @@ static inline void mem_cgroup_protection(struct mem_cgroup *root,
 }
 
 void mem_cgroup_calculate_protection(struct mem_cgroup *root,
-				     struct mem_cgroup *memcg);
+				     struct mem_cgroup *memcg, bool toptier);
+
+unsigned long mem_cgroup_toptier_usage(struct mem_cgroup *memcg);
 
 void update_memcg_toptier_capacity(void);
 
@@ -623,11 +625,15 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
 }
 
 static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
-					struct mem_cgroup *memcg)
+					struct mem_cgroup *memcg, bool toptier)
 {
 	if (mem_cgroup_unprotected(target, memcg))
 		return false;
 
+	if (toptier)
+		return READ_ONCE(memcg->memory.etoptier_low) >=
+			mem_cgroup_toptier_usage(memcg);
+
 	return READ_ONCE(memcg->memory.elow) >=
 		page_counter_read(&memcg->memory);
 }
@@ -1114,7 +1120,8 @@ static inline void mem_cgroup_protection(struct mem_cgroup *root,
 }
 
 static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
-						   struct mem_cgroup *memcg)
+						   struct mem_cgroup *memcg,
+						   bool toptier)
 {
 }
 
@@ -1128,7 +1135,7 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
 	return true;
 }
 static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
-					struct mem_cgroup *memcg)
+					struct mem_cgroup *memcg, bool toptier)
 {
 	return false;
 }
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index ada5f1dd75d4..6635ee7b9575 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -120,15 +120,16 @@ static inline void page_counter_reset_watermark(struct page_counter *counter)
 #if IS_ENABLED(CONFIG_MEMCG) || IS_ENABLED(CONFIG_CGROUP_DMEM)
 void page_counter_calculate_protection(struct page_counter *root,
 				       struct page_counter *counter,
-				       bool recursive_protection);
+				       bool recursive_protection, bool toptier);
 void page_counter_update_toptier_capacity(struct page_counter *counter,
 					  const nodemask_t *allowed);
 unsigned long page_counter_toptier_high(struct page_counter *counter);
 unsigned long page_counter_toptier_low(struct page_counter *counter);
 #else
 static inline void page_counter_calculate_protection(struct page_counter *root,
-						     struct page_counter *counter,
-						     bool recursive_protection) {}
+						     struct page_counter *counter,
+						     bool recursive_protection,
+						     bool toptier) {}
 #endif
 
 #endif /* _LINUX_PAGE_COUNTER_H */
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 1ea6afffa985..536d43c42de8 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -277,7 +277,7 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
 			continue;
 
 		page_counter_calculate_protection(
-			climit, &found_pool->cnt, true);
+			climit, &found_pool->cnt, true, false);
 
 		if (found_pool == test_pool)
 			break;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 07464f02c529..8aa7ae361a73 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4806,12 +4806,13 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * mem_cgroup_calculate_protection - check if memory consumption is in the normal range
  * @root: the top ancestor of the sub-tree being checked
  * @memcg: the memory cgroup to check
+ * @toptier: whether the caller is in a toptier node
  *
  * WARNING: This function is not stateless! It can only be used as part
  * of a top-down tree iteration, not for isolated queries.
  */
 void mem_cgroup_calculate_protection(struct mem_cgroup *root,
-				     struct mem_cgroup *memcg)
+				     struct mem_cgroup *memcg, bool toptier)
 {
 	bool recursive_protection =
 		cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT;
@@ -4822,7 +4823,16 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 	if (!root)
 		root = root_mem_cgroup;
 
-	page_counter_calculate_protection(&root->memory, &memcg->memory, recursive_protection);
+	page_counter_calculate_protection(&root->memory, &memcg->memory,
+					  recursive_protection, toptier);
+}
+
+unsigned long mem_cgroup_toptier_usage(struct mem_cgroup *memcg)
+{
+	if (mem_cgroup_disabled() || !memcg)
+		return 0;
+
+	return atomic_long_read(&memcg->memory.toptier_usage);
 }
 
 void update_memcg_toptier_capacity(void)
diff --git a/mm/page_counter.c b/mm/page_counter.c
index cf21c72bfd4e..79d46a1c4c0c 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -410,12 +410,39 @@ static unsigned long effective_protection(unsigned long usage,
 	return ep;
 }
 
+static void calculate_protection_toptier(struct page_counter *counter,
+					 bool recursive_protection)
+{
+	struct page_counter *parent = counter->parent;
+	unsigned long toptier_low;
+	unsigned long toptier_usage, parent_toptier_usage;
+	unsigned long toptier_protected, old_toptier_protected;
+	long delta;
+
+	toptier_low = page_counter_toptier_low(counter);
+	toptier_usage = atomic_long_read(&counter->toptier_usage);
+	parent_toptier_usage = atomic_long_read(&parent->toptier_usage);
+
+	/* Propagate toptier low usage to parent for sibling distribution */
+	toptier_protected = min(toptier_usage, toptier_low);
+	old_toptier_protected = atomic_long_xchg(&counter->toptier_low_usage,
+						 toptier_protected);
+	delta = toptier_protected - old_toptier_protected;
+	atomic_long_add(delta, &parent->children_toptier_low_usage);
+
+	WRITE_ONCE(counter->etoptier_low,
+		   effective_protection(toptier_usage, parent_toptier_usage,
+			toptier_low, READ_ONCE(parent->etoptier_low),
+			atomic_long_read(&parent->children_toptier_low_usage),
+			recursive_protection));
+}
+
 /**
  * page_counter_calculate_protection - check if memory consumption is in the normal range
  * @root: the top ancestor of the sub-tree being checked
  * @counter: the page_counter the counter to update
  * @recursive_protection: Whether to use memory_recursiveprot behavior.
+ * @toptier: Whether to calculate toptier-proportional protection
  *
  * Calculates elow/emin thresholds for given page_counter.
  *
@@ -424,7 +451,7 @@ static unsigned long effective_protection(unsigned long usage,
  */
 void page_counter_calculate_protection(struct page_counter *root,
 				       struct page_counter *counter,
-				       bool recursive_protection)
+				       bool recursive_protection, bool toptier)
 {
 	unsigned long usage, parent_usage;
 	struct page_counter *parent = counter->parent;
@@ -446,6 +473,9 @@ void page_counter_calculate_protection(struct page_counter *root,
 	if (parent == root) {
 		counter->emin = READ_ONCE(counter->min);
 		counter->elow = READ_ONCE(counter->low);
+		if (toptier)
+			WRITE_ONCE(counter->etoptier_low,
+				   page_counter_toptier_low(counter));
 		return;
 	}
 
@@ -462,6 +492,9 @@ void page_counter_calculate_protection(struct page_counter *root,
 			READ_ONCE(parent->elow),
 			atomic_long_read(&parent->children_low_usage),
 			recursive_protection));
+
+	if (toptier)
+		calculate_protection_toptier(counter, recursive_protection);
 }
 
 void page_counter_update_toptier_capacity(struct page_counter *counter,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6a87ac7be43c..5b4cb030a477 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4144,6 +4144,7 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 	struct mem_cgroup *memcg;
 	unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
 	bool reclaimable = !min_ttl;
+	bool toptier = node_is_toptier(pgdat->node_id);
 
 	VM_WARN_ON_ONCE(!current_is_kswapd());
 
@@ -4153,7 +4154,7 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 	do {
 		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
-		mem_cgroup_calculate_protection(NULL, memcg);
+		mem_cgroup_calculate_protection(NULL, memcg, toptier);
 
 		if (!reclaimable)
 			reclaimable = lruvec_is_reclaimable(lruvec, sc, min_ttl);
@@ -4905,12 +4906,14 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 	unsigned long reclaimed = sc->nr_reclaimed;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+	bool toptier = tier_aware_memcg_limits &&
+		       node_is_toptier(pgdat->node_id);
 
 	/* lru_gen_age_node() called mem_cgroup_calculate_protection() */
 	if (mem_cgroup_below_min(NULL, memcg))
 		return MEMCG_LRU_YOUNG;
 
-	if (mem_cgroup_below_low(NULL, memcg)) {
+	if (mem_cgroup_below_low(NULL, memcg, toptier)) {
 		/* see the comment on MEMCG_NR_GENS */
 		if (READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL)
 			return MEMCG_LRU_TAIL;
@@ -5960,6 +5963,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 	};
 	struct mem_cgroup_reclaim_cookie *partial = &reclaim;
 	struct mem_cgroup *memcg;
+	bool toptier = node_is_toptier(pgdat->node_id);
 
 	/*
 	 * In most cases, direct reclaimers can do partial walks
@@ -5987,7 +5991,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		 */
 		cond_resched();
 
-		mem_cgroup_calculate_protection(target_memcg, memcg);
+		mem_cgroup_calculate_protection(target_memcg, memcg, toptier);
 
 		if (mem_cgroup_below_min(target_memcg, memcg)) {
 			/*
@@ -5995,7 +5999,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 			 * If there is no reclaimable memory, OOM.
 			 */
 			continue;
-		} else if (mem_cgroup_below_low(target_memcg, memcg)) {
+		} else if (mem_cgroup_below_low(target_memcg, memcg,
+					tier_aware_memcg_limits && toptier)) {
 			/*
 			 * Soft protection.
 			 * Respect the protection only as long as
-- 
2.47.3
From: Joshua Hahn
To: Joshua Hahn
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@meta.com
Subject: [RFC PATCH 6/6] mm/memcontrol: Make memory.high tier-aware
Date: Mon, 23 Feb 2026 14:38:29 -0800
Message-ID: <20260223223830.586018-7-joshua.hahnjy@gmail.com>
In-Reply-To: <20260223223830.586018-1-joshua.hahnjy@gmail.com>
References: <20260223223830.586018-1-joshua.hahnjy@gmail.com>

On machines serving multiple workloads whose memory is isolated via the
memory cgroup controller, it is currently impossible to enforce a fair
distribution of toptier memory among the workloads, as the only
enforceable limits concern total memory footprint, not where that
memory resides. This makes ensuring consistent baseline performance
difficult, as each workload's performance is heavily impacted by
workload-external factors such as which other workloads are co-located
on the same host, and the order in which different workloads are
started.

Extend the existing memory.high protection to be tier-aware in the
charging and enforcement, to limit toptier hogging by workloads.
Also, add a new nodemask parameter to try_to_free_mem_cgroup_pages,
which can be used to selectively reclaim memory at the intersection of
a memcg and a memory tier.

Signed-off-by: Joshua Hahn
---
 include/linux/swap.h |  3 +-
 mm/memcontrol-v1.c   |  6 ++--
 mm/memcontrol.c      | 85 +++++++++++++++++++++++++++++++++++++-------
 mm/vmscan.c          | 11 +++---
 4 files changed, 84 insertions(+), 21 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0effe3cc50f5..c6037ac7bf6e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -368,7 +368,8 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
 						  unsigned int reclaim_options,
-						  int *swappiness);
+						  int *swappiness,
+						  nodemask_t *allowed);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						pg_data_t *pgdat,
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 0b39ba608109..29630c7f3567 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -1497,7 +1497,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 		}
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
+					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
+					NULL, NULL)) {
 			ret = -EBUSY;
 			break;
 		}
@@ -1529,7 +1530,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 			return -EINTR;
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-						  MEMCG_RECLAIM_MAY_SWAP, NULL))
+						  MEMCG_RECLAIM_MAY_SWAP,
+						  NULL, NULL))
 			nr_retries--;
 	}
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8aa7ae361a73..ebd4a1b73c51 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2184,18 +2184,30 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
 
 	do {
 		unsigned long pflags;
-
-		if (page_counter_read(&memcg->memory) <=
-		    READ_ONCE(memcg->memory.high))
+		nodemask_t toptier_nodes, *reclaim_nodes;
+		bool mem_high_ok, toptier_high_ok;
+
+		mt_get_toptier_nodemask(&toptier_nodes, NULL);
+		mem_high_ok = page_counter_read(&memcg->memory) <=
+			      READ_ONCE(memcg->memory.high);
+		toptier_high_ok = !(tier_aware_memcg_limits &&
+				    mem_cgroup_toptier_usage(memcg) >
+				    page_counter_toptier_high(&memcg->memory));
+		if (mem_high_ok && toptier_high_ok)
 			continue;
 
+		if (mem_high_ok && !toptier_high_ok)
+			reclaim_nodes = &toptier_nodes;
+		else
+			reclaim_nodes = NULL;
+
 		memcg_memory_event(memcg, MEMCG_HIGH);
 
 		psi_memstall_enter(&pflags);
 		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
 							gfp_mask, MEMCG_RECLAIM_MAY_SWAP,
-							NULL);
+							NULL, reclaim_nodes);
 		psi_memstall_leave(&pflags);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
@@ -2296,6 +2308,24 @@ static u64 mem_find_max_overage(struct mem_cgroup *memcg)
 	return max_overage;
 }
 
+static u64 toptier_find_max_overage(struct mem_cgroup *memcg)
+{
+	u64 overage, max_overage = 0;
+
+	if (!tier_aware_memcg_limits)
+		return 0;
+
+	do {
+		unsigned long usage = mem_cgroup_toptier_usage(memcg);
+		unsigned long high = page_counter_toptier_high(&memcg->memory);
+
+		overage = calculate_overage(usage, high);
+		max_overage = max(overage, max_overage);
+	} while ((memcg = parent_mem_cgroup(memcg)) &&
+		 !mem_cgroup_is_root(memcg));
+
+	return max_overage;
+}
 static u64 swap_find_max_overage(struct mem_cgroup *memcg)
 {
 	u64 overage, max_overage = 0;
@@ -2401,6 +2431,14 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	penalty_jiffies += calculate_high_delay(memcg, nr_pages,
 						swap_find_max_overage(memcg));
 
+	/*
+	 * Don't double-penalize for toptier high overage if system-wide
+	 * memory.high has already been breached.
+	 */
+	if (!penalty_jiffies)
+		penalty_jiffies += calculate_high_delay(memcg, nr_pages,
+					toptier_find_max_overage(memcg));
+
 	/*
 	 * Clamp the max delay per usermode return so as to still keep the
 	 * application moving forwards and also permit diagnostics, albeit
@@ -2503,7 +2541,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	psi_memstall_enter(&pflags);
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
-						    gfp_mask, reclaim_options, NULL);
+						    gfp_mask, reclaim_options,
+						    NULL, NULL);
 	psi_memstall_leave(&pflags);
 
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -2592,23 +2631,26 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * reclaim, the cost of mismatch is negligible.
 	 */
 	do {
-		bool mem_high, swap_high;
+		bool mem_high, swap_high, toptier_high = false;
 
 		mem_high = page_counter_read(&memcg->memory) >
 			READ_ONCE(memcg->memory.high);
 		swap_high = page_counter_read(&memcg->swap) >
 			READ_ONCE(memcg->swap.high);
+		toptier_high = tier_aware_memcg_limits &&
+			(mem_cgroup_toptier_usage(memcg) >
+			 page_counter_toptier_high(&memcg->memory));
 
 		/* Don't bother a random interrupted task */
 		if (!in_task()) {
-			if (mem_high) {
+			if (mem_high || toptier_high) {
 				schedule_work(&memcg->high_work);
 				break;
 			}
 			continue;
 		}
 
-		if (mem_high || swap_high) {
+		if (mem_high || swap_high || toptier_high) {
 			/*
 			 * The allocating tasks in this cgroup will need to do
 			 * reclaim or be throttled to prevent further growth
@@ -4476,7 +4518,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	bool drained = false;
-	unsigned long high;
+	unsigned long high, toptier_high;
 	int err;
 
 	buf = strstrip(buf);
@@ -4485,15 +4527,22 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 		return err;
 
 	page_counter_set_high(&memcg->memory, high);
+	toptier_high = page_counter_toptier_high(&memcg->memory);
 
 	if (of->file->f_flags & O_NONBLOCK)
 		goto out;
 
	for (;;) {
 		unsigned long nr_pages = page_counter_read(&memcg->memory);
+		unsigned long toptier_pages = mem_cgroup_toptier_usage(memcg);
 		unsigned long reclaimed;
+		unsigned long to_free;
+		nodemask_t toptier_nodes, *reclaim_nodes;
+		bool mem_high_ok = nr_pages <= high;
+		bool toptier_high_ok = !(tier_aware_memcg_limits &&
+					 toptier_pages > toptier_high);
 
-		if (nr_pages <= high)
+		if (mem_high_ok && toptier_high_ok)
 			break;
 
 		if (signal_pending(current))
@@ -4505,8 +4554,17 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 			continue;
 		}
 
-		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-							 GFP_KERNEL,
MEMCG_RECLAIM_MAY_SWAP, NULL); + mt_get_toptier_nodemask(&toptier_nodes, NULL); + if (mem_high_ok && !toptier_high_ok) { + reclaim_nodes =3D &toptier_nodes; + to_free =3D toptier_pages - toptier_high; + } else { + reclaim_nodes =3D NULL; + to_free =3D nr_pages - high; + } + reclaimed =3D try_to_free_mem_cgroup_pages(memcg, to_free, + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, + NULL, reclaim_nodes); =20 if (!reclaimed && !nr_retries--) break; @@ -4558,7 +4616,8 @@ static ssize_t memory_max_write(struct kernfs_open_fi= le *of, =20 if (nr_reclaims) { if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max, - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL)) + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, + NULL, NULL)) nr_reclaims--; continue; } diff --git a/mm/vmscan.c b/mm/vmscan.c index 5b4cb030a477..94498734b4f5 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -6652,7 +6652,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem= _cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, unsigned int reclaim_options, - int *swappiness) + int *swappiness, nodemask_t *allowed) { unsigned long nr_reclaimed; unsigned int noreclaim_flag; @@ -6668,6 +6668,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem= _cgroup *memcg, .may_unmap =3D 1, .may_swap =3D !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP), .proactive =3D !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE), + .nodemask =3D allowed, }; /* * Traverse the ZONELIST_FALLBACK zonelist of the current node to put @@ -6693,7 +6694,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem= _cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, unsigned int reclaim_options, - int *swappiness) + int *swappiness, nodemask_t *allowed) { return 0; } @@ -7806,9 +7807,9 @@ int user_proactive_reclaim(char *buf, reclaim_options =3D MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE; reclaimed =3D try_to_free_mem_cgroup_pages(memcg, - batch_size, gfp_mask, - reclaim_options, - swappiness =3D=3D -1 ? 
NULL : &swappiness); + batch_size, gfp_mask, reclaim_options, + swappiness =3D=3D -1 ? NULL : &swappiness, + NULL); } else { struct scan_control sc =3D { .gfp_mask =3D current_gfp_context(gfp_mask), --=20 2.47.3