[PATCH v1 1/2] mm/vmscan: balance demotion allocation in alloc_demote_folio()

Bing Jiao posted 2 patches 1 month ago
[PATCH v1 1/2] mm/vmscan: balance demotion allocation in alloc_demote_folio()
Posted by Bing Jiao 1 month ago
When the preferred demotion node does not have enough free space,
alloc_demote_folio() attempts to allocate from fallback nodes.
Currently, it lacks a mechanism to distribute these fallback allocations,
which can lead to unbalanced memory pressure across fallback nodes.

Balance the allocation by randomly selecting a new preferred node from
the fallback nodes if the initial allocation from the old preferred
node fails.

Signed-off-by: Bing Jiao <bingjiao@google.com>
---
 mm/vmscan.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 81828fa625ed..db2413c4bd26 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1009,6 +1009,11 @@ static struct folio *alloc_demote_folio(struct folio *src,
 	if (dst)
 		return dst;

+	/* Randomly select a node from fallback nodes for balanced allocation */
+	if (allowed_mask) {
+		mtc->nid = node_random(allowed_mask);
+		node_clear(mtc->nid, *allowed_mask);
+	}
 	mtc->gfp_mask &= ~__GFP_THISNODE;
 	mtc->nmask = allowed_mask;

--
2.52.0.358.g0dd7633a29-goog
Re: [PATCH v1 1/2] mm/vmscan: balance demotion allocation in alloc_demote_folio()
Posted by Donet Tom 1 month ago
On 1/7/26 12:58 PM, Bing Jiao wrote:
> When the preferred demotion node does not have enough free space,
> alloc_demote_folio() attempts to allocate from fallback nodes.
> Currently, it lacks a mechanism to distribute these fallback allocations,
> which can lead to unbalanced memory pressure across fallback nodes.
>
> Balance the allocation by randomly selecting a new preferred node from
> the fallback nodes if the initial allocation from the old preferred
> node fails.
>
> Signed-off-by: Bing Jiao <bingjiao@google.com>
> ---
>   mm/vmscan.c | 5 +++++
>   1 file changed, 5 insertions(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 81828fa625ed..db2413c4bd26 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1009,6 +1009,11 @@ static struct folio *alloc_demote_folio(struct folio *src,
>   	if (dst)
>   		return dst;
>
> +	/* Randomly select a node from fallback nodes for balanced allocation */
> +	if (allowed_mask) {
> +		mtc->nid = node_random(allowed_mask);


This random selection can cause allocations to fall back to distant 
memory even when the nearer demotion target has sufficient free memory, 
correct? Could this also lead to increased promotion latency?


> +		node_clear(mtc->nid, *allowed_mask);
> +	}
>   	mtc->gfp_mask &= ~__GFP_THISNODE;
>   	mtc->nmask = allowed_mask;
>
> --
> 2.52.0.358.g0dd7633a29-goog
>
>
Re: [PATCH v1 1/2] mm/vmscan: balance demotion allocation in alloc_demote_folio()
Posted by Bing Jiao 1 month ago
On Thu, Jan 08, 2026 at 06:14:02PM +0530, Donet Tom wrote:
>
> On 1/7/26 12:58 PM, Bing Jiao wrote:
> > +	/* Randomly select a node from fallback nodes for balanced allocation */
> > +	if (allowed_mask) {
> > +		mtc->nid = node_random(allowed_mask);
>
>
> This random selection can cause allocations to fall back to distant memory
> even when the nearer demotion target has sufficient free memory, correct?
> Could this also lead to increased promotion latency?

Hi Donet,

Thanks for your questions.

Yes, the random selection could select a distant node and lead to
increased promotion latency.

I just realized that the fallback allocation should not be weighted by
a single metric, such as node distance, capacity, or free space.
We need a thorough study before changing alloc_demote_folio().
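
To make that a bit more concrete, a combined score over several metrics
could look roughly like the sketch below. This is only an illustration,
not kernel code: the node_metrics structure, the pick_fallback_node()
helper and the weights are all made up.

#include <limits.h>

/*
 * Illustrative sketch only.  The weights and inputs are placeholders;
 * a real policy would need the study mentioned above.
 */
struct node_metrics {
	int nid;
	int distance;		/* e.g. node_distance(src_nid, nid) */
	unsigned long free_mb;	/* free memory on the node, in MB */
};

#define W_FREE	1	/* prefer nodes with more free memory */
#define W_DIST	4	/* prefer nearer nodes */

static int pick_fallback_node(const struct node_metrics *nodes, int nr)
{
	long best_score = LONG_MIN;
	int best_nid = -1;
	int i;

	for (i = 0; i < nr; i++) {
		/* More free memory raises the score, larger distance lowers it. */
		long score = W_FREE * (long)nodes[i].free_mb -
			     W_DIST * (long)nodes[i].distance;

		if (score > best_score) {
			best_score = score;
			best_nid = nodes[i].nid;
		}
	}
	return best_nid;
}

The point is only that distance, capacity and free space could all feed
into the choice; how to weight them against each other is exactly what
needs studying.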

Best,
Bing
Re: [PATCH v1 1/2] mm/vmscan: balance demotion allocation in alloc_demote_folio()
Posted by Joshua Hahn 1 month ago
On Fri, 9 Jan 2026 23:45:57 +0000 Bing Jiao <bingjiao@google.com> wrote:

> On Thu, Jan 08, 2026 at 06:14:02PM +0530, Donet Tom wrote:
> >
> > On 1/7/26 12:58 PM, Bing Jiao wrote:
> > > +	/* Randomly select a node from fallback nodes for balanced allocation */
> > > +	if (allowed_mask) {
> > > +		mtc->nid = node_random(allowed_mask);
> >
> >
> > This random selection can cause allocations to fall back to distant memory
> > even when the nearer demotion target has sufficient free memory, correct?
> > Could this also lead to increased promotion latency?
> 
> Hi Donet,
> 
> Thanks for your questions.
> 
> Yes, the random selection could select a distant node and lead to
> increased promotion latency.
> 
> I just realized that the fallback allocation should not be weighted by
> a single metric, such as node distance, capacity, or free space.

Hello Bing, I hope you are doing well!

Yes -- this is also what I believe, and I think this question of "how should
we select demotion / allocation targets" is a difficult problem
(and one that may not have a single solution that "just works").

It's also a question that I have been thinking about, and one that was
discussed in part at LSFMMBPF last year. At the time, I made some
auto-tuning weights [1]
for weighted interleave based on bandwidth capacity, since the main benefit of
weighted interleave is to distribute memory accesses across multiple nodes
to maximize how much bandwidth the system can use at once. A follow-up was to
think about how these weights could change over time, and what heuristics
should be used to determine how the weights are selected.
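
Just to sketch the general shape of that idea (this is not the code from
[1]; gcd_u() and bandwidth_to_weights() are made-up names for
illustration), deriving small interleave weights from per-node bandwidth
could look like:

#include <stddef.h>

/* Illustration only: reduce per-node bandwidth figures to small weights. */
static unsigned int gcd_u(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * bandwidth[] holds per-node bandwidth rounded to some coarse unit
 * (e.g. GB/s); weight[] receives the derived interleave weights.
 */
static void bandwidth_to_weights(const unsigned int *bandwidth,
				 unsigned int *weight, size_t nr_nodes)
{
	unsigned int g = 0;
	size_t i;

	for (i = 0; i < nr_nodes; i++)
		g = gcd_u(g, bandwidth[i]);

	for (i = 0; i < nr_nodes; i++)
		weight[i] = g ? bandwidth[i] / g : 1;
}

With bandwidths of, say, 30 GB/s and 10 GB/s this gives 3:1 weights, which
is the ratio the interleaving would then follow.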

Ultimately, we agreed that the heuristics should probably be delegated to
userspace, since there are just so many scenarios that could change what
metrics should take priority. (Jonathan Corbet wrote a great summary of the
discussion in an LWN article [2])

Coming back to this patchset, I think that all of the ideas above apply
nicely here as well. What nodes should be selected for demotion and how they
should be weighted is a difficult question, and one that is probably best
answered by userspace and what workload they expect to use on their specific
system.

What I do believe though, is that an unweighted random selection / round-robin
approach to selecting demotion targets might lead to some unexpected
performance implications.

> We need a thorough study before changing alloc_demote_folio().

So I think this is the way to go :-)
Although I'm not actively exploring this at the moment ;)

Please let me know what you think, I hope you have a great day!
Joshua

[1] https://lore.kernel.org/all/20250109185048.28587-1-joshua.hahnjy@gmail.com/
[2] https://lwn.net/Articles/1016842/
Re: [PATCH v1 1/2] mm/vmscan: balance demotion allocation in alloc_demote_folio()
Posted by Bing Jiao 3 weeks, 6 days ago
On Fri, Jan 09, 2026 at 04:52:28PM -0800, Joshua Hahn wrote:
> On Fri, 9 Jan 2026 23:45:57 +0000 Bing Jiao <bingjiao@google.com> wrote:
>
> > On Thu, Jan 08, 2026 at 06:14:02PM +0530, Donet Tom wrote:
> > >
> > > On 1/7/26 12:58 PM, Bing Jiao wrote:
> > > > +	/* Randomly select a node from fallback nodes for balanced allocation */
> > > > +	if (allowed_mask) {
> > > > +		mtc->nid = node_random(allowed_mask);
> > >
> > >
> > > This random selection can cause allocations to fall back to distant memory
> > > even when the nearer demotion target has sufficient free memory, correct?
> > > Could this also lead to increased promotion latency?
> >
> > Hi Donet,
> >
> > Thanks for your questions.
> >
> > Yes, the random selection could select a distant node and lead to
> > increased promotion latency.
> >
> > I just realized that the fallback allocation should not be weighted by
> > a single metric, such as node distance, capacity, or free space.
>
> Hello Bing, I hope you are doing well!
>
> Yes -- this is also what I believe, and I think this question of "how should
> we select demotion / allocation targets" is a difficult problem
> (and one that may not have a single solution that "just works").
>
> It's also a question that I have been thinking about, and one that was
> discussed in part at LSFMMBPF last year. At the time, I made some
> auto-tuning weights [1]
> for weighted interleave based on bandwidth capacity, since the main benefit of
> weighted interleave is to distribute memory accesses across multiple nodes
> to maximize how much bandwidth the system can use at once. A follow-up was to
> think about how these weights could change over time, and what heuristics
> should be used to determine how the weights are selected.
>
> Ultimately, we agreed that the heuristics should probably be delegated to
> userspace, since there are just so many scenarios that could change what
> metrics should take priority. (Jonathan Corbet wrote a great summary of the
> discussion in an LWN article [2])
>
> Coming back to this patchset, I think that all of the ideas above apply
> nicely here as well. What nodes should be selected for demotion and how they
> should be weighted is a difficult question, and one that is probably best
> answered by userspace and what workload they expect to use on their specific
> system.
>
> What I do believe though, is that an unweighted random selection / round-robin
> approach to selecting demotion targets might lead to some unexpected
> performance implications.
>
> > We need a thorough study before changing alloc_demote_folio().
>
> So I think this is the way to go :-)
> Although I'm not actively exploring this at the moment ;)
>
> Please let me know what you think, I hope you have a great day!
> Joshua
>
> [1] https://lore.kernel.org/all/20250109185048.28587-1-joshua.hahnjy@gmail.com/
> [2] https://lwn.net/Articles/1016842/

Hi Joshua, hope you had a great weekend!

I appreciate you sharing that information. I really enjoyed reading these
articles and discussions.

It makes sense to assume users understand their requirements, but I think
the kernel still needs internal heuristics for weight adjustment. Users
often lack the comprehensive and immediate information necessary to
update their configuration in a timely manner, unless the system has an
omniscient administrator who can oversee and (pre)allocate resources
for all tasks running on that system. Therefore, I think the kernel
should still be involved in weight adjustment.
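
For example (purely hypothetical; none of these structures or helpers
exist today), a hybrid scheme might let the kernel refresh a node's weight
from information it can see immediately, while a userspace override always
wins:

/* Hypothetical sketch of a hybrid kernel/userspace weighting scheme. */
struct demotion_weight {
	unsigned int auto_weight;	/* refreshed periodically by the kernel */
	unsigned int user_weight;	/* 0 means no userspace override */
};

static unsigned int effective_weight(const struct demotion_weight *w)
{
	return w->user_weight ? w->user_weight : w->auto_weight;
}

/* Called periodically; free/total pages would come from per-node statistics. */
static void refresh_auto_weight(struct demotion_weight *w,
				unsigned long free_pages,
				unsigned long total_pages)
{
	/* Map the free ratio to a 1..16 weight: emptier nodes attract more demotions. */
	if (total_pages)
		w->auto_weight = 1 + (unsigned int)((free_pages * 15UL) / total_pages);
	else
		w->auto_weight = 1;
}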

I will think more about this and explore it further, whether in userspace,
in kernel space, or with a hybrid approach.

Thank you again for sharing!

Best,
Bing