On machines serving multiple workloads whose memory is isolated via the
memory cgroup controller, it is currently impossible to enforce a fair
distribution of toptier memory among the workloads, as the only
enforceable limits have to do with total memory footprint, but not where
that memory resides.
This makes ensuring a consistent and baseline performance difficult, as
each workload's performance is heavily impacted by workload-external
factors such as which other workloads are co-located in the same host,
and the order in which different workloads are started.
Extend the existing memory.high protection to be tier-aware in the
charging and enforcement to limit toptier-hogging for workloads.
Also, add a new nodemask parameter to try_to_free_mem_cgroup_pages,
which can be used to selectively reclaim from memory at the
memcg-tier intersection of a cgroup.
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
include/linux/swap.h | 3 +-
mm/memcontrol-v1.c | 6 ++--
mm/memcontrol.c | 85 +++++++++++++++++++++++++++++++++++++-------
mm/vmscan.c | 11 +++---
4 files changed, 84 insertions(+), 21 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0effe3cc50f5..c6037ac7bf6e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -368,7 +368,8 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
unsigned int reclaim_options,
- int *swappiness);
+ int *swappiness,
+ nodemask_t *allowed);
extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
pg_data_t *pgdat,
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 0b39ba608109..29630c7f3567 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -1497,7 +1497,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
}
if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
- memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
+ memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
+ NULL, NULL)) {
ret = -EBUSY;
break;
}
@@ -1529,7 +1530,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
return -EINTR;
if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
- MEMCG_RECLAIM_MAY_SWAP, NULL))
+ MEMCG_RECLAIM_MAY_SWAP,
+ NULL, NULL))
nr_retries--;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8aa7ae361a73..ebd4a1b73c51 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2184,18 +2184,30 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
do {
unsigned long pflags;
-
- if (page_counter_read(&memcg->memory) <=
- READ_ONCE(memcg->memory.high))
+ nodemask_t toptier_nodes, *reclaim_nodes;
+ bool mem_high_ok, toptier_high_ok;
+
+ mt_get_toptier_nodemask(&toptier_nodes, NULL);
+ mem_high_ok = page_counter_read(&memcg->memory) <=
+ READ_ONCE(memcg->memory.high);
+ toptier_high_ok = !(tier_aware_memcg_limits &&
+ mem_cgroup_toptier_usage(memcg) >
+ page_counter_toptier_high(&memcg->memory));
+ if (mem_high_ok && toptier_high_ok)
continue;
+ if (mem_high_ok && !toptier_high_ok)
+ reclaim_nodes = &toptier_nodes;
+ else
+ reclaim_nodes = NULL;
+
memcg_memory_event(memcg, MEMCG_HIGH);
psi_memstall_enter(&pflags);
nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
gfp_mask,
MEMCG_RECLAIM_MAY_SWAP,
- NULL);
+ NULL, reclaim_nodes);
psi_memstall_leave(&pflags);
} while ((memcg = parent_mem_cgroup(memcg)) &&
!mem_cgroup_is_root(memcg));
@@ -2296,6 +2308,24 @@ static u64 mem_find_max_overage(struct mem_cgroup *memcg)
return max_overage;
}
+static u64 toptier_find_max_overage(struct mem_cgroup *memcg)
+{
+ u64 overage, max_overage = 0;
+
+ if (!tier_aware_memcg_limits)
+ return 0;
+
+ do {
+ unsigned long usage = mem_cgroup_toptier_usage(memcg);
+ unsigned long high = page_counter_toptier_high(&memcg->memory);
+
+ overage = calculate_overage(usage, high);
+ max_overage = max(overage, max_overage);
+ } while ((memcg = parent_mem_cgroup(memcg)) &&
+ !mem_cgroup_is_root(memcg));
+
+ return max_overage;
+}
static u64 swap_find_max_overage(struct mem_cgroup *memcg)
{
u64 overage, max_overage = 0;
@@ -2401,6 +2431,14 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
penalty_jiffies += calculate_high_delay(memcg, nr_pages,
swap_find_max_overage(memcg));
+ /*
+ * Don't double-penalize for toptier high overage if system-wide
+ * memory.high has already been breached.
+ */
+ if (!penalty_jiffies)
+ penalty_jiffies += calculate_high_delay(memcg, nr_pages,
+ toptier_find_max_overage(memcg));
+
/*
* Clamp the max delay per usermode return so as to still keep the
* application moving forwards and also permit diagnostics, albeit
@@ -2503,7 +2541,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
psi_memstall_enter(&pflags);
nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
- gfp_mask, reclaim_options, NULL);
+ gfp_mask, reclaim_options,
+ NULL, NULL);
psi_memstall_leave(&pflags);
if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -2592,23 +2631,26 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
* reclaim, the cost of mismatch is negligible.
*/
do {
- bool mem_high, swap_high;
+ bool mem_high, swap_high, toptier_high = false;
mem_high = page_counter_read(&memcg->memory) >
READ_ONCE(memcg->memory.high);
swap_high = page_counter_read(&memcg->swap) >
READ_ONCE(memcg->swap.high);
+ toptier_high = tier_aware_memcg_limits &&
+ (mem_cgroup_toptier_usage(memcg) >
+ page_counter_toptier_high(&memcg->memory));
/* Don't bother a random interrupted task */
if (!in_task()) {
- if (mem_high) {
+ if (mem_high || toptier_high) {
schedule_work(&memcg->high_work);
break;
}
continue;
}
- if (mem_high || swap_high) {
+ if (mem_high || swap_high || toptier_high) {
/*
* The allocating tasks in this cgroup will need to do
* reclaim or be throttled to prevent further growth
@@ -4476,7 +4518,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
unsigned int nr_retries = MAX_RECLAIM_RETRIES;
bool drained = false;
- unsigned long high;
+ unsigned long high, toptier_high;
int err;
buf = strstrip(buf);
@@ -4485,15 +4527,22 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
return err;
page_counter_set_high(&memcg->memory, high);
+ toptier_high = page_counter_toptier_high(&memcg->memory);
if (of->file->f_flags & O_NONBLOCK)
goto out;
for (;;) {
unsigned long nr_pages = page_counter_read(&memcg->memory);
+ unsigned long toptier_pages = mem_cgroup_toptier_usage(memcg);
unsigned long reclaimed;
+ unsigned long to_free;
+ nodemask_t toptier_nodes, *reclaim_nodes;
+ bool mem_high_ok = nr_pages <= high;
+ bool toptier_high_ok = !(tier_aware_memcg_limits &&
+ toptier_pages > toptier_high);
- if (nr_pages <= high)
+ if (mem_high_ok && toptier_high_ok)
break;
if (signal_pending(current))
@@ -4505,8 +4554,17 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
continue;
}
- reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
- GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL);
+ mt_get_toptier_nodemask(&toptier_nodes, NULL);
+ if (mem_high_ok && !toptier_high_ok) {
+ reclaim_nodes = &toptier_nodes;
+ to_free = toptier_pages - toptier_high;
+ } else {
+ reclaim_nodes = NULL;
+ to_free = nr_pages - high;
+ }
+ reclaimed = try_to_free_mem_cgroup_pages(memcg, to_free,
+ GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+ NULL, reclaim_nodes);
if (!reclaimed && !nr_retries--)
break;
@@ -4558,7 +4616,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
if (nr_reclaims) {
if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
- GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL))
+ GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+ NULL, NULL))
nr_reclaims--;
continue;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5b4cb030a477..94498734b4f5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6652,7 +6652,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
unsigned int reclaim_options,
- int *swappiness)
+ int *swappiness, nodemask_t *allowed)
{
unsigned long nr_reclaimed;
unsigned int noreclaim_flag;
@@ -6668,6 +6668,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
.may_unmap = 1,
.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
+ .nodemask = allowed,
};
/*
* Traverse the ZONELIST_FALLBACK zonelist of the current node to put
@@ -6693,7 +6694,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
unsigned int reclaim_options,
- int *swappiness)
+ int *swappiness, nodemask_t *allowed)
{
return 0;
}
@@ -7806,9 +7807,9 @@ int user_proactive_reclaim(char *buf,
reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
MEMCG_RECLAIM_PROACTIVE;
reclaimed = try_to_free_mem_cgroup_pages(memcg,
- batch_size, gfp_mask,
- reclaim_options,
- swappiness == -1 ? NULL : &swappiness);
+ batch_size, gfp_mask, reclaim_options,
+ swappiness == -1 ? NULL : &swappiness,
+ NULL);
} else {
struct scan_control sc = {
.gfp_mask = current_gfp_context(gfp_mask),
--
2.47.3
On 2/24/26 4:08 AM, Joshua Hahn wrote:
> On machines serving multiple workloads whose memory is isolated via the
> memory cgroup controller, it is currently impossible to enforce a fair
> distribution of toptier memory among the workloads, as the only
> enforcable limits have to do with total memory footprint, but not where
> that memory resides.
>
> This makes ensuring a consistent and baseline performance difficult, as
> each workload's performance is heavily impacted by workload-external
> factors wuch as which other workloads are co-located in the same host,
> and the order at which different workloads are started.
>
> Extend the existing memory.high protection to be tier-aware in the
> charging and enforcement to limit toptier-hogging for workloads.
>
> Also, add a new nodemask parameter to try_to_free_mem_cgroup_pages,
> which can be used to selectively reclaim from memory at the
> memcg-tier interection of a cgroup.
>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> ---
> include/linux/swap.h | 3 +-
> mm/memcontrol-v1.c | 6 ++--
> mm/memcontrol.c | 85 +++++++++++++++++++++++++++++++++++++-------
> mm/vmscan.c | 11 +++---
> 4 files changed, 84 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 0effe3cc50f5..c6037ac7bf6e 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -368,7 +368,8 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> unsigned long nr_pages,
> gfp_t gfp_mask,
> unsigned int reclaim_options,
> - int *swappiness);
> + int *swappiness,
> + nodemask_t *allowed);
> extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
> gfp_t gfp_mask, bool noswap,
> pg_data_t *pgdat,
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index 0b39ba608109..29630c7f3567 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -1497,7 +1497,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
> }
>
> if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> - memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
> + memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
> + NULL, NULL)) {
> ret = -EBUSY;
> break;
> }
> @@ -1529,7 +1530,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
> return -EINTR;
>
> if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> - MEMCG_RECLAIM_MAY_SWAP, NULL))
> + MEMCG_RECLAIM_MAY_SWAP,
> + NULL, NULL))
> nr_retries--;
> }
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8aa7ae361a73..ebd4a1b73c51 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2184,18 +2184,30 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
>
> do {
> unsigned long pflags;
> -
> - if (page_counter_read(&memcg->memory) <=
> - READ_ONCE(memcg->memory.high))
> + nodemask_t toptier_nodes, *reclaim_nodes;
> + bool mem_high_ok, toptier_high_ok;
> +
> + mt_get_toptier_nodemask(&toptier_nodes, NULL);
> + mem_high_ok = page_counter_read(&memcg->memory) <=
> + READ_ONCE(memcg->memory.high);
> + toptier_high_ok = !(tier_aware_memcg_limits &&
> + mem_cgroup_toptier_usage(memcg) >
> + page_counter_toptier_high(&memcg->memory));
> + if (mem_high_ok && toptier_high_ok)
> continue;
>
> + if (mem_high_ok && !toptier_high_ok)
> + reclaim_nodes = &toptier_nodes;
> + else
> + reclaim_nodes = NULL;
IIUC The intent of this patch is to partition cgroup memory such that
0 → toptier_high is backed by higher-tier memory, and
toptier_high → max is backed by lower-tier memory.
Based on this:
1. If top-tier usage exceeds toptier_high, pages should be
demoted to the lower tier.
2. If lower-tier usage exceeds (max - toptier_high), pages
should be swapped out.
3. If total memory usage exceeds max, demotion should be
avoided and reclaim should directly swap out pages.
I think we are only handling case (1) in this patch. When
mem_high_ok && !toptier_high_ok, we are reclaiming pages (demotion first)
However, if !mem_high_ok, the memcg reclaim path works as if
there is no memory tiering in cgroup. This can lead to more demotion
and may eventually result in OOM.
Should we also handle cases (2) and (3) in this patch?
> +
> memcg_memory_event(memcg, MEMCG_HIGH);
>
> psi_memstall_enter(&pflags);
> nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
> gfp_mask,
> MEMCG_RECLAIM_MAY_SWAP,
> - NULL);
> + NULL, reclaim_nodes);
> psi_memstall_leave(&pflags);
> } while ((memcg = parent_mem_cgroup(memcg)) &&
> !mem_cgroup_is_root(memcg));
> @@ -2296,6 +2308,24 @@ static u64 mem_find_max_overage(struct mem_cgroup *memcg)
> return max_overage;
> }
>
> +static u64 toptier_find_max_overage(struct mem_cgroup *memcg)
> +{
> + u64 overage, max_overage = 0;
> +
> + if (!tier_aware_memcg_limits)
> + return 0;
> +
> + do {
> + unsigned long usage = mem_cgroup_toptier_usage(memcg);
> + unsigned long high = page_counter_toptier_high(&memcg->memory);
> +
> + overage = calculate_overage(usage, high);
> + max_overage = max(overage, max_overage);
> + } while ((memcg = parent_mem_cgroup(memcg)) &&
> + !mem_cgroup_is_root(memcg));
> +
> + return max_overage;
> +}
> static u64 swap_find_max_overage(struct mem_cgroup *memcg)
> {
> u64 overage, max_overage = 0;
> @@ -2401,6 +2431,14 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
> penalty_jiffies += calculate_high_delay(memcg, nr_pages,
> swap_find_max_overage(memcg));
>
> + /*
> + * Don't double-penalize for toptier high overage if system-wide
> + * memory.high has already been breached.
> + */
> + if (!penalty_jiffies)
> + penalty_jiffies += calculate_high_delay(memcg, nr_pages,
> + toptier_find_max_overage(memcg));
> +
> /*
> * Clamp the max delay per usermode return so as to still keep the
> * application moving forwards and also permit diagnostics, albeit
> @@ -2503,7 +2541,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>
> psi_memstall_enter(&pflags);
> nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
> - gfp_mask, reclaim_options, NULL);
> + gfp_mask, reclaim_options,
> + NULL, NULL);
> psi_memstall_leave(&pflags);
>
> if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> @@ -2592,23 +2631,26 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> * reclaim, the cost of mismatch is negligible.
> */
> do {
> - bool mem_high, swap_high;
> + bool mem_high, swap_high, toptier_high = false;
>
> mem_high = page_counter_read(&memcg->memory) >
> READ_ONCE(memcg->memory.high);
> swap_high = page_counter_read(&memcg->swap) >
> READ_ONCE(memcg->swap.high);
> + toptier_high = tier_aware_memcg_limits &&
> + (mem_cgroup_toptier_usage(memcg) >
> + page_counter_toptier_high(&memcg->memory));
>
> /* Don't bother a random interrupted task */
> if (!in_task()) {
> - if (mem_high) {
> + if (mem_high || toptier_high) {
> schedule_work(&memcg->high_work);
> break;
> }
> continue;
> }
>
> - if (mem_high || swap_high) {
> + if (mem_high || swap_high || toptier_high) {
> /*
> * The allocating tasks in this cgroup will need to do
> * reclaim or be throttled to prevent further growth
> @@ -4476,7 +4518,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> bool drained = false;
> - unsigned long high;
> + unsigned long high, toptier_high;
> int err;
>
> buf = strstrip(buf);
> @@ -4485,15 +4527,22 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> return err;
>
> page_counter_set_high(&memcg->memory, high);
> + toptier_high = page_counter_toptier_high(&memcg->memory);
>
> if (of->file->f_flags & O_NONBLOCK)
> goto out;
>
> for (;;) {
> unsigned long nr_pages = page_counter_read(&memcg->memory);
> + unsigned long toptier_pages = mem_cgroup_toptier_usage(memcg);
> unsigned long reclaimed;
> + unsigned long to_free;
> + nodemask_t toptier_nodes, *reclaim_nodes;
> + bool mem_high_ok = nr_pages <= high;
> + bool toptier_high_ok = !(tier_aware_memcg_limits &&
> + toptier_pages > toptier_high);
>
> - if (nr_pages <= high)
> + if (mem_high_ok && toptier_high_ok)
> break;
>
> if (signal_pending(current))
> @@ -4505,8 +4554,17 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> continue;
> }
>
> - reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL);
> + mt_get_toptier_nodemask(&toptier_nodes, NULL);
> + if (mem_high_ok && !toptier_high_ok) {
> + reclaim_nodes = &toptier_nodes;
> + to_free = toptier_pages - toptier_high;
> + } else {
> + reclaim_nodes = NULL;
> + to_free = nr_pages - high;
> + }
> + reclaimed = try_to_free_mem_cgroup_pages(memcg, to_free,
> + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
> + NULL, reclaim_nodes);
>
> if (!reclaimed && !nr_retries--)
> break;
> @@ -4558,7 +4616,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>
> if (nr_reclaims) {
> if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
> - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL))
> + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
> + NULL, NULL))
> nr_reclaims--;
> continue;
> }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5b4cb030a477..94498734b4f5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6652,7 +6652,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> unsigned long nr_pages,
> gfp_t gfp_mask,
> unsigned int reclaim_options,
> - int *swappiness)
> + int *swappiness, nodemask_t *allowed)
> {
> unsigned long nr_reclaimed;
> unsigned int noreclaim_flag;
> @@ -6668,6 +6668,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> .may_unmap = 1,
> .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
> .proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
> + .nodemask = allowed,
> };
> /*
> * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
> @@ -6693,7 +6694,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> unsigned long nr_pages,
> gfp_t gfp_mask,
> unsigned int reclaim_options,
> - int *swappiness)
> + int *swappiness, nodemask_t *allowed)
> {
> return 0;
> }
> @@ -7806,9 +7807,9 @@ int user_proactive_reclaim(char *buf,
> reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
> MEMCG_RECLAIM_PROACTIVE;
> reclaimed = try_to_free_mem_cgroup_pages(memcg,
> - batch_size, gfp_mask,
> - reclaim_options,
> - swappiness == -1 ? NULL : &swappiness);
> + batch_size, gfp_mask, reclaim_options,
> + swappiness == -1 ? NULL : &swappiness,
> + NULL);
> } else {
> struct scan_control sc = {
> .gfp_mask = current_gfp_context(gfp_mask),
On Tue, 24 Mar 2026 16:21:06 +0530 Donet Tom <donettom@linux.ibm.com> wrote:
>
> On 2/24/26 4:08 AM, Joshua Hahn wrote:
> > On machines serving multiple workloads whose memory is isolated via the
> > memory cgroup controller, it is currently impossible to enforce a fair
> > distribution of toptier memory among the workloads, as the only
> > enforcable limits have to do with total memory footprint, but not where
> > that memory resides.
> >
> > This makes ensuring a consistent and baseline performance difficult, as
> > each workload's performance is heavily impacted by workload-external
> > factors wuch as which other workloads are co-located in the same host,
> > and the order at which different workloads are started.
> >
> > Extend the existing memory.high protection to be tier-aware in the
> > charging and enforcement to limit toptier-hogging for workloads.
> >
> > Also, add a new nodemask parameter to try_to_free_mem_cgroup_pages,
> > which can be used to selectively reclaim from memory at the
> > memcg-tier interection of a cgroup.
> >
> > Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> > ---
> > include/linux/swap.h | 3 +-
> > mm/memcontrol-v1.c | 6 ++--
> > mm/memcontrol.c | 85 +++++++++++++++++++++++++++++++++++++-------
> > mm/vmscan.c | 11 +++---
> > 4 files changed, 84 insertions(+), 21 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 0effe3cc50f5..c6037ac7bf6e 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -368,7 +368,8 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> > unsigned long nr_pages,
> > gfp_t gfp_mask,
> > unsigned int reclaim_options,
> > - int *swappiness);
> > + int *swappiness,
> > + nodemask_t *allowed);
> > extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
> > gfp_t gfp_mask, bool noswap,
> > pg_data_t *pgdat,
> > diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> > index 0b39ba608109..29630c7f3567 100644
> > --- a/mm/memcontrol-v1.c
> > +++ b/mm/memcontrol-v1.c
> > @@ -1497,7 +1497,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
> > }
> >
> > if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> > - memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
> > + memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
> > + NULL, NULL)) {
> > ret = -EBUSY;
> > break;
> > }
> > @@ -1529,7 +1530,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
> > return -EINTR;
> >
> > if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> > - MEMCG_RECLAIM_MAY_SWAP, NULL))
> > + MEMCG_RECLAIM_MAY_SWAP,
> > + NULL, NULL))
> > nr_retries--;
> > }
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 8aa7ae361a73..ebd4a1b73c51 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2184,18 +2184,30 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
> >
> > do {
> > unsigned long pflags;
> > -
> > - if (page_counter_read(&memcg->memory) <=
> > - READ_ONCE(memcg->memory.high))
> > + nodemask_t toptier_nodes, *reclaim_nodes;
> > + bool mem_high_ok, toptier_high_ok;
> > +
> > + mt_get_toptier_nodemask(&toptier_nodes, NULL);
> > + mem_high_ok = page_counter_read(&memcg->memory) <=
> > + READ_ONCE(memcg->memory.high);
> > + toptier_high_ok = !(tier_aware_memcg_limits &&
> > + mem_cgroup_toptier_usage(memcg) >
> > + page_counter_toptier_high(&memcg->memory));
> > + if (mem_high_ok && toptier_high_ok)
> > continue;
> >
> > + if (mem_high_ok && !toptier_high_ok)
> > + reclaim_nodes = &toptier_nodes;
> > + else
> > + reclaim_nodes = NULL;
>
>
> IIUC The intent of this patch is to partition cgroup memory such that
> 0 → toptier_high is backed by higher-tier memory, and
> toptier_high → max is backed by lower-tier memory.
>
> Based on this:
>
> 1.If top-tier usage exceeds toptier_high, pages should be
> demoted to the lower tier.
>
> 2. If lower-tier usage exceeds (max - toptier_high), pages
> should be swapped out.
>
> 3. If total memory usage exceeds max, demotion should be
> avoided and reclaim should directly swap out pages.
>
> I think we are only handling case (1) in this patch. When
> mem_high_ok && !toptier_high_ok, we are reclaiming pages (demotion first)
>
> However, if !mem_high_ok, the memcg reclaim path works as if
> there is no memory tiering in cgroup. This can lead to more demotion
> and may eventually result in OOM.
>
> Should we also handle cases (2) and (3) in this patch?
Hello Donet! I hope you are doing well.
For the second condition, should pages be swapped out? If a workload
is using 0 toptier memory (extreme case, let's say they haven't set
memory.low) then lower-tier should be able to use all the way up to
max memory.
Maybe you mean if lowtier_usage exceeds (max - toptier_usage) pages
should be swapped out? But if we rearrange this
lowtier_usage >= max - toptier_usage
lowtier_usage + toptier_usage >= max
total_usage >= max
And this is just the memory.max check and is already handled by
existing reclaim semantics : -)
I think case 3 is a bit more nuanced. If we directly swap out from
high tier and skip demotions, this is introducing a priority inversion
since memory in toptier should be hotter than memory in lowtier, so
we should prefer to swap out the colder memory in lowtier before
swapping out memory in toptier.
The idea was discussed at length at [1]. It also feels like an orthogonal
discussion since the behavior isn't related to toptier high or low
behaviors.
Please let me know what you think. Thank you, I hope you have a great day!
Joshua
[1] https://lore.kernel.org/linux-mm/20260317230720.990329-3-bingjiao@google.com/
On 3/24/26 9:14 PM, Joshua Hahn wrote:
> On Tue, 24 Mar 2026 16:21:06 +0530 Donet Tom <donettom@linux.ibm.com> wrote:
>
>> On 2/24/26 4:08 AM, Joshua Hahn wrote:
>>> On machines serving multiple workloads whose memory is isolated via the
>>> memory cgroup controller, it is currently impossible to enforce a fair
>>> distribution of toptier memory among the workloads, as the only
>>> enforcable limits have to do with total memory footprint, but not where
>>> that memory resides.
>>>
>>> This makes ensuring a consistent and baseline performance difficult, as
>>> each workload's performance is heavily impacted by workload-external
>>> factors wuch as which other workloads are co-located in the same host,
>>> and the order at which different workloads are started.
>>>
>>> Extend the existing memory.high protection to be tier-aware in the
>>> charging and enforcement to limit toptier-hogging for workloads.
>>>
>>> Also, add a new nodemask parameter to try_to_free_mem_cgroup_pages,
>>> which can be used to selectively reclaim from memory at the
>>> memcg-tier interection of a cgroup.
>>>
>>> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
>>> ---
>>> include/linux/swap.h | 3 +-
>>> mm/memcontrol-v1.c | 6 ++--
>>> mm/memcontrol.c | 85 +++++++++++++++++++++++++++++++++++++-------
>>> mm/vmscan.c | 11 +++---
>>> 4 files changed, 84 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 0effe3cc50f5..c6037ac7bf6e 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -368,7 +368,8 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>>> unsigned long nr_pages,
>>> gfp_t gfp_mask,
>>> unsigned int reclaim_options,
>>> - int *swappiness);
>>> + int *swappiness,
>>> + nodemask_t *allowed);
>>> extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
>>> gfp_t gfp_mask, bool noswap,
>>> pg_data_t *pgdat,
>>> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
>>> index 0b39ba608109..29630c7f3567 100644
>>> --- a/mm/memcontrol-v1.c
>>> +++ b/mm/memcontrol-v1.c
>>> @@ -1497,7 +1497,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
>>> }
>>>
>>> if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
>>> - memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
>>> + memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
>>> + NULL, NULL)) {
>>> ret = -EBUSY;
>>> break;
>>> }
>>> @@ -1529,7 +1530,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
>>> return -EINTR;
>>>
>>> if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
>>> - MEMCG_RECLAIM_MAY_SWAP, NULL))
>>> + MEMCG_RECLAIM_MAY_SWAP,
>>> + NULL, NULL))
>>> nr_retries--;
>>> }
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 8aa7ae361a73..ebd4a1b73c51 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -2184,18 +2184,30 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
>>>
>>> do {
>>> unsigned long pflags;
>>> -
>>> - if (page_counter_read(&memcg->memory) <=
>>> - READ_ONCE(memcg->memory.high))
>>> + nodemask_t toptier_nodes, *reclaim_nodes;
>>> + bool mem_high_ok, toptier_high_ok;
>>> +
>>> + mt_get_toptier_nodemask(&toptier_nodes, NULL);
>>> + mem_high_ok = page_counter_read(&memcg->memory) <=
>>> + READ_ONCE(memcg->memory.high);
>>> + toptier_high_ok = !(tier_aware_memcg_limits &&
>>> + mem_cgroup_toptier_usage(memcg) >
>>> + page_counter_toptier_high(&memcg->memory));
>>> + if (mem_high_ok && toptier_high_ok)
>>> continue;
>>>
>>> + if (mem_high_ok && !toptier_high_ok)
>>> + reclaim_nodes = &toptier_nodes;
>>> + else
>>> + reclaim_nodes = NULL;
>>
>> IIUC The intent of this patch is to partition cgroup memory such that
>> 0 → toptier_high is backed by higher-tier memory, and
>> toptier_high → max is backed by lower-tier memory.
>>
>> Based on this:
>>
>> 1.If top-tier usage exceeds toptier_high, pages should be
>> demoted to the lower tier.
>>
>> 2. If lower-tier usage exceeds (max - toptier_high), pages
>> should be swapped out.
>>
>> 3. If total memory usage exceeds max, demotion should be
>> avoided and reclaim should directly swap out pages.
>>
>> I think we are only handling case (1) in this patch. When
>> mem_high_ok && !toptier_high_ok, we are reclaiming pages (demotion first)
>>
>> However, if !mem_high_ok, the memcg reclaim path works as if
>> there is no memory tiering in cgroup. This can lead to more demotion
>> and may eventually result in OOM.
>>
>> Should we also handle cases (2) and (3) in this patch?
> Hello Donet! I hope you are doing well.
>
> For the second condition, should pages be swapped out? If a workload
> is using 0 toptier memory (extreme case, let's say they haven't set
> memory.low) then lower-tier should be able to use all the way up to
> max memory.
>
> Maybe you mean if lowtier_usage exceeds (max - toptier_usage) pages
> should be swapped out? But if we rearrange this
>
> lowtier_usage >= max - toptier_usage
> lowtier_usage + toptier_usage >= max
> total_usage >= max
>
> And this is just the memory.max check and is already handled by
> existing reclaim semantics : -)
>
> I think case 3 is a bit more nuanced. If we directly swap out from
> high tier and skip demotions, this is introducing a priority inversion
> since memory in toptier should be hotter than memory in lowtier, so
> we should prefer to swap out the colder memory in lowtier before
> swapping out memory in toptier.
>
> The idea was discussed at length at [1]. It also feels like an orthogonal
> discussion since the behavior isn't related to toptier high or low
> behaviors.
>
> Please let me know what you think. Thank you, I hope you have a great day!
Thanks, Joshua, for your clarification.
[1] disabled demotion from memcg. With memcg limits now being
tier-aware, I was thinking about how to handle the demotion
issue. You are right that this is a separate topic not related to this.
[1]
https://lore.kernel.org/linux-mm/20260317230720.990329-3-bingjiao@google.com/
> Joshua
>
> [1] https://lore.kernel.org/linux-mm/20260317230720.990329-3-bingjiao@google.com/
>
On Tue, Mar 24, 2026 at 04:21:06PM +0530, Donet Tom wrote: > > IIUC The intent of this patch is to partition cgroup memory such that > 0 → toptier_high is backed by higher-tier memory, and > toptier_high → max is backed by lower-tier memory. > > Based on this: > > 1.If top-tier usage exceeds toptier_high, pages should be > demoted to the lower tier. > > 2. If lower-tier usage exceeds (max - toptier_high), pages > should be swapped out. > This is not accurate and an incorrect heuristic. Transiently, lower-tier usage may exceed (max - toptier_high) for any number of reasons which should not be used as signal for pushing swap. driving swap usage is a function of (usage > memory.high) without regard for toptier / lowtier. > 3. If total memory usage exceeds max, demotion should be > avoided and reclaim should directly swap out pages. > This is also incorrect, as it would drive aging inversions. Demotion is a natural extension of the LRU infrastructure: toptier active -> toptier inactive -> lowtier inactive -> swap if you do (toptier inactive -> swap) you have inverted the LRU. As far as I know, from testing, we retain all the existing behavior - we are just managing a limited resource (top tier memory) to manage the noisy-neighbor issue. So... > Should we also handle cases (2) and (3) in this patch? No, I don't think we should ~Gregory
On 3/24/26 8:53 PM, Gregory Price wrote: > On Tue, Mar 24, 2026 at 04:21:06PM +0530, Donet Tom wrote: >> IIUC The intent of this patch is to partition cgroup memory such that >> 0 → toptier_high is backed by higher-tier memory, and >> toptier_high → max is backed by lower-tier memory. >> >> Based on this: >> >> 1.If top-tier usage exceeds toptier_high, pages should be >> demoted to the lower tier. >> >> 2. If lower-tier usage exceeds (max - toptier_high), pages >> should be swapped out. >> > This is not accurate and an incorrect heuristic. > > Transiently, lower-tier usage may exceed (max - toptier_high) for any > number of reasons which should not be used as signal for pushing swap. > > driving swap usage is a function of (usage > memory.high) without regard > for toptier / lowtier. > >> 3. If total memory usage exceeds max, demotion should be >> avoided and reclaim should directly swap out pages. >> > This is also incorrect, as it would drive agingin inversions. > Demotion is a natural extension of the LRU infrastructure: > > toptier active -> toptier inactive -> lowtier inactive -> swap > > if you do (toptier inactive -> swap) you have inverted the LRU. Thanks, Gregory, for the clarification. One remaining concern is that under cgroup memory pressure, demotion to the lower tier can still happen. Since demotion does not uncharge the memcg, this could still trigger OOM. Is this an issue we should address? > > As far as I know, from testing, we retain all the existing behavior - we > are just managing a limited resource (top tier memory) to manage the > noisy-neighbor issue. So... > > >> Should we also handle cases (2) and (3) in this patch? > No, I don't think we should > > ~Gregory >
On Mon, Feb 23, 2026 at 02:38:29PM -0800, Joshua Hahn wrote:
> @@ -4485,15 +4527,22 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> return err;
>
> page_counter_set_high(&memcg->memory, high);
> + toptier_high = page_counter_toptier_high(&memcg->memory);
>
> if (of->file->f_flags & O_NONBLOCK)
> goto out;
>
> for (;;) {
> unsigned long nr_pages = page_counter_read(&memcg->memory);
> + unsigned long toptier_pages = mem_cgroup_toptier_usage(memcg);
> unsigned long reclaimed;
> + unsigned long to_free;
> + nodemask_t toptier_nodes, *reclaim_nodes;
> + bool mem_high_ok = nr_pages <= high;
> + bool toptier_high_ok = !(tier_aware_memcg_limits &&
> + toptier_pages > toptier_high);
>
> - if (nr_pages <= high)
> + if (mem_high_ok && toptier_high_ok)
> break;
>
> if (signal_pending(current))
> @@ -4505,8 +4554,17 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> continue;
> }
>
> - reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL);
> + mt_get_toptier_nodemask(&toptier_nodes, NULL);
> + if (mem_high_ok && !toptier_high_ok) {
> + reclaim_nodes = &toptier_nodes;
> + to_free = toptier_pages - toptier_high;
> + } else {
> + reclaim_nodes = NULL;
> + to_free = nr_pages - high;
> + }
> + reclaimed = try_to_free_mem_cgroup_pages(memcg, to_free,
> + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
> + NULL, reclaim_nodes);
>
> if (!reclaimed && !nr_retries--)
> break;
Hi Joshua, thanks for the patch.
I have a concern regarding the system behavior when both the total
memory.high limit and the new toptier_high limit are breached.
If both mem_high_ok and toptier_high_ok are false, memory_high_write()
invokes try_to_free_mem_cgroup_pages() with reclaim_nodes set to NULL
to target all nodes. Under these conditions, the reclaimer might attempt
to satisfy the target bytes by demoting pages from the top-tier to lower
tiers. While this fulfills the toptier_high requirement, it fails to
reduce the total memory charge for the cgroup because the counter tracks
the sum across all tiers. Consequently, since the total memory usage
remains unchanged, the reclaimer will likely become trapped in the loop
until it reaches MAX_RECLAIM_RETRIES and other situations (e.g.,
both !reclaimed && !nr_retries--), leading to excessive CPU consumption
without successfully bringing the cgroup below its total memory limit,
or causing all top-tier pages demoted to far-tier, or causing premature
OOM kills.
Given your tier-aware memcg limits, I think it is better to reclaim from
lower tiers to swap to satisfy mem_high_ok by setting the allowed nodemask
to far-tier nodes. Then demote pages from top tiers to ensure
toptier_high is okay. This also prevents reclaiming pages directly from
top tiers to swap and ensures that demotion actually contributes to
reaching the targeted memory state without unnecessary performance
penalties.
To address the issue where a memcg exceeds its total limit and demotion
cannot help to relieve the memcg memory pressure, I am considering
introducing a reclaim_options setting that prevents page demotion by
setting sc.no_demote = 1. I have a local patch for this and am preparing
it for submission.
Please let me know if I have misunderstood any part of your
implementation or if you see any issues with this proposed adjustment.
Best,
Bing
On Wed, 11 Mar 2026 22:05:16 +0000 Bing Jiao <bingjiao@google.com> wrote:
> On Mon, Feb 23, 2026 at 02:38:29PM -0800, Joshua Hahn wrote:
> > @@ -4485,15 +4527,22 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> > return err;
> >
> > page_counter_set_high(&memcg->memory, high);
> > + toptier_high = page_counter_toptier_high(&memcg->memory);
> >
> > if (of->file->f_flags & O_NONBLOCK)
> > goto out;
> >
> > for (;;) {
> > unsigned long nr_pages = page_counter_read(&memcg->memory);
> > + unsigned long toptier_pages = mem_cgroup_toptier_usage(memcg);
> > unsigned long reclaimed;
> > + unsigned long to_free;
> > + nodemask_t toptier_nodes, *reclaim_nodes;
> > + bool mem_high_ok = nr_pages <= high;
> > + bool toptier_high_ok = !(tier_aware_memcg_limits &&
> > + toptier_pages > toptier_high);
> >
> > - if (nr_pages <= high)
> > + if (mem_high_ok && toptier_high_ok)
> > break;
> >
> > if (signal_pending(current))
> > @@ -4505,8 +4554,17 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> > continue;
> > }
> >
> > - reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> > - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL);
> > + mt_get_toptier_nodemask(&toptier_nodes, NULL);
> > + if (mem_high_ok && !toptier_high_ok) {
> > + reclaim_nodes = &toptier_nodes;
> > + to_free = toptier_pages - toptier_high;
> > + } else {
> > + reclaim_nodes = NULL;
> > + to_free = nr_pages - high;
> > + }
> > + reclaimed = try_to_free_mem_cgroup_pages(memcg, to_free,
> > + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
> > + NULL, reclaim_nodes);
> >
> > if (!reclaimed && !nr_retries--)
> > break;
>
> Hi Joshua, thanks for the patch.
Hello Bing!
I hope you are doing well, thank you for reviewing my patch :-)
> I have a concern regarding the system behavior when both the total
> memory.high limit and the new toptier_high limit are breached.
>
> If both mem_high_ok and toptier_high are false, memory_high_write()
> invokes try_to_free_mem_cgroup_pages() with reclaim_nodes set to NULL
> to target all nodes. Under these conditions, the reclaimer might attempt
> to satisfy the target bytes by demoting pages from the top-tier to lower
> tiers. While this fulfills the toptier_high requirement, it fails to
> reduce the total memory charge for the cgroup because the counter tracks
> the sum across all tiers. Consequently, since the total memory usage
> remains unchanged, the reclaimer will likely become trapped in the loop
> until it reaches MAX_RECLAIM_RETRIES and other situations (e.g.,
> both !reclaimed && !nr_retries–), leading to excessive CPU consumption
> without successfully bringing the cgroup below its total memory limit,
> or causing all top-tier pages demoted to far-tier, or causing premature
> OOM kills.
I agree with everything you mentioned above. However, I would like to note
that my series preserves the default behavior for when memory.high
is breached (since toptier_high is always <= memory.high), so
memory_high_write() would previously have this behavior as well where
shrink_folio_list would prefer to demote as opposed to swapping and
lead to the infinite loop.
In that sense I think that it might make sense to introduce a fix for this
that is orthogonal to this series. AFAICT I don't think this is introducing
any new harmful behaviors.
> Given your tier-aware memcg limits, I think it is better to reclaim from
> lower tiers to swap to satisfy mem_high_ok by setting the allowed nodemask
> to far-tier nodes. Then demote pages from top tiers to ensure
> toptier_high is okay. This also prevents reclaiming pages directly from
> top tiers to swap and ensures that demotion actually contributes to
> reaching the targeted memory state without unnecessary performance
> penalties.
If I understand this correctly, this would mean that each loop would:
1. swap out low tier
2. demote top tier
And repeat this cycle until we meet the memory.high limit?
I think this makes sense. I will note that once again I think that this
change is orthogonal to this series, as it deals with the memory.high
violation case and not the toptier violation case. Note that if only
toptier limit is violated, demotion from the toptier does make sense,
since in this case it will shrink the metric we care about.
> To address the issue where a memcg exceeds its total limit and demotion
> cannot help to relief the memory memcg pressure, I am considering to
> introduce a reclaim_options setting that prevents page demotion by
> setting sc.no_demote = 1. I have a local patch for this and am preparing
> it for submission.
I think this makes sense. Please do CC me in the patch if/when you do
send it upstream!
> Please let me know if I have misunderstood any part of your
> implementation or if you see any issues with this proposed adjustment.
I think you understood my patch completely as I intended :-)
From my POV though, I just felt that the issues you mentioned actually have
to do with the standard memory reclaim infrastructure, and not necessarily
with the toptier high semantics.
And please let me know if you feel that I have not represented your
perspective as well! I hope you have a great day!!
Joshua
© 2016 - 2026 Red Hat, Inc.