The page allocator groups requests by migratetype to stave off
fragmentation. However, in practice this is routinely defeated by the
fact that it gives up *before* invoking reclaim and compaction - which
may well produce suitable pages. As a result, fragmentation of
physical memory is a common ongoing process in many load scenarios.
Fragmentation deteriorates compaction's ability to produce huge
pages. Depending on the lifetime of the fragmenting allocations, those
effects can be long-lasting or even permanent, requiring drastic
measures like forcible idle states or even reboots as the only
reliable ways to recover the address space for THP production.
In a kernel build test with supplemental THP pressure, the THP
allocation rate steadily declines over 15 runs:
thp_fault_alloc
61988
56474
57258
50187
52388
55409
52925
47648
43669
40621
36077
41721
36685
34641
33215
This is a hurdle in adopting THP in any environment where hosts are
shared between multiple overlapping workloads (cloud environments),
and rarely experience true idle periods. To make THP a reliable and
predictable optimization, there needs to be a stronger guarantee to
avoid such fragmentation.
Introduce defrag_mode. When enabled, reclaim/compaction is invoked to
its full extent *before* falling back. Specifically, ALLOC_NOFRAGMENT
is enforced on the allocator fastpath and the reclaiming slowpath.
For now, fallbacks are permitted to avert OOMs. There is a plan to add
defrag_mode=2 to prefer OOMs over fragmentation, but this requires
additional prep work in compaction and the reserve management to make
it ready for all possible allocation contexts.
The following test results are from a kernel build with periodic
bursts of THP allocations, over 15 runs:
vanilla defrag_mode=1
@claimer[unmovable]: 189 103
@claimer[movable]: 92 103
@claimer[reclaimable]: 207 61
@pollute[unmovable from movable]: 25 0
@pollute[unmovable from reclaimable]: 28 0
@pollute[movable from unmovable]: 38835 0
@pollute[movable from reclaimable]: 147136 0
@pollute[reclaimable from unmovable]: 178 0
@pollute[reclaimable from movable]: 33 0
@steal[unmovable from movable]: 11 0
@steal[unmovable from reclaimable]: 5 0
@steal[reclaimable from unmovable]: 107 0
@steal[reclaimable from movable]: 90 0
@steal[movable from reclaimable]: 354 0
@steal[movable from unmovable]: 130 0
Both types of polluting fallbacks are eliminated in this workload.
Interestingly, whole block conversions are reduced as well. This is
because once a block is claimed for a type, its empty space remains
available for future allocations, instead of being padded with
fallbacks; this allows the native type to group up instead of
spreading out to new blocks. The assumption in the allocator has been
that pollution from movable allocations is less harmful than from
other types, since they can be reclaimed or migrated out should the
space be needed. However, since fallbacks occur *before*
reclaim/compaction is invoked, movable pollution will still cause
non-movable allocations to spread out and claim more blocks.
Without fragmentation, THP rates hold steady with defrag_mode=1:
thp_fault_alloc
32478
20725
45045
32130
14018
21711
40791
29134
34458
45381
28305
17265
22584
28454
30850
While the downward trend is eliminated, the keen reader will of course
notice that the baseline rate is much smaller than the vanilla
kernel's to begin with. This is due to deficiencies in how reclaim and
compaction are currently driven: ALLOC_NOFRAGMENT increases the extent
to which smaller allocations are competing with THPs for pageblocks,
while making no effort themselves to reclaim or compact beyond their
own request size. This effect already exists with the current usage of
ALLOC_NOFRAGMENT, but is amplified by defrag_mode insisting on whole
block stealing much more strongly.
Subsequent patches will address defrag_mode reclaim strategy to raise
the THP success baseline above the vanilla kernel.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
Documentation/admin-guide/sysctl/vm.rst | 9 +++++++++
mm/page_alloc.c | 27 +++++++++++++++++++++++--
2 files changed, 34 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index ec6343ee4248..e169dbf48180 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -29,6 +29,7 @@ files can be found in mm/swap.c.
- compaction_proactiveness
- compaction_proactiveness_leeway
- compact_unevictable_allowed
+- defrag_mode
- dirty_background_bytes
- dirty_background_ratio
- dirty_bytes
@@ -162,6 +163,14 @@ On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
to compaction, which would block the task from becoming active until the fault
is resolved.
+defrag_mode
+===========
+
+When set to 1, the page allocator tries harder to avoid fragmentation
+and maintain the ability to produce huge pages / higher-order pages.
+
+It is recommended to enable this right after boot, as fragmentation,
+once it has occurred, can be long-lasting or even permanent.
dirty_background_bytes
======================
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6f0404941886..9a02772c2461 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -273,6 +273,7 @@ int min_free_kbytes = 1024;
int user_min_free_kbytes = -1;
static int watermark_boost_factor __read_mostly = 15000;
static int watermark_scale_factor = 10;
+static int defrag_mode;
/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
int movable_zone;
@@ -3389,6 +3390,11 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
*/
alloc_flags = (__force int) (gfp_mask & __GFP_KSWAPD_RECLAIM);
+ if (defrag_mode) {
+ alloc_flags |= ALLOC_NOFRAGMENT;
+ return alloc_flags;
+ }
+
#ifdef CONFIG_ZONE_DMA32
if (!zone)
return alloc_flags;
@@ -3480,7 +3486,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
continue;
}
- if (no_fallback && nr_online_nodes > 1 &&
+ if (no_fallback && !defrag_mode && nr_online_nodes > 1 &&
zone != zonelist_zone(ac->preferred_zoneref)) {
int local_nid;
@@ -3591,7 +3597,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
* It's possible on a UMA machine to get through all zones that are
* fragmented. If avoiding fragmentation, reset and try again.
*/
- if (no_fallback) {
+ if (no_fallback && !defrag_mode) {
alloc_flags &= ~ALLOC_NOFRAGMENT;
goto retry;
}
@@ -4128,6 +4134,9 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, alloc_flags);
+ if (defrag_mode)
+ alloc_flags |= ALLOC_NOFRAGMENT;
+
return alloc_flags;
}
@@ -4510,6 +4519,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
&compaction_retries))
goto retry;
+ /* Reclaim/compaction failed to prevent the fallback */
+ if (defrag_mode) {
+ alloc_flags &= ALLOC_NOFRAGMENT;
+ goto retry;
+ }
/*
* Deal with possible cpuset update races or zonelist updates to avoid
@@ -6286,6 +6300,15 @@ static const struct ctl_table page_alloc_sysctl_table[] = {
.extra1 = SYSCTL_ONE,
.extra2 = SYSCTL_THREE_THOUSAND,
},
+ {
+ .procname = "defrag_mode",
+ .data = &defrag_mode,
+ .maxlen = sizeof(defrag_mode),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
{
.procname = "percpu_pagelist_high_fraction",
.data = &percpu_pagelist_high_fraction,
--
2.48.1
On Thu Mar 13, 2025 at 10:05 PM CET, Johannes Weiner wrote:
> + /* Reclaim/compaction failed to prevent the fallback */
> + if (defrag_mode) {
> + alloc_flags &= ALLOC_NOFRAGMENT;
> + goto retry;
> + }
I can't see where ALLOC_NOFRAGMENT gets cleared, is it supposed to be
here (i.e. should this be ~ALLOC_NOFRAGMENT)?
On Sat, Mar 22, 2025 at 04:05:52PM +0100, Brendan Jackman wrote:
> On Thu Mar 13, 2025 at 10:05 PM CET, Johannes Weiner wrote:
> > + /* Reclaim/compaction failed to prevent the fallback */
> > + if (defrag_mode) {
> > + alloc_flags &= ALLOC_NOFRAGMENT;
> > + goto retry;
> > + }
>
> I can't see where ALLOC_NOFRAGMENT gets cleared, is it supposed to be
> here (i.e. should this be ~ALLOC_NOFRAGMENT)?
Yes, it should be. Thanks for catching that.
Note that this happens right before OOM, and __alloc_pages_may_oom()
does another allocation attempt without the flag set. In fact, I was
briefly debating whether I need the explicit retry here at all, but
then decided it's clearer and more future proof than quietly relying
on that OOM attempt, which is really only there to check for racing
frees. But this is most likely what hid this during testing.
What might be more of an issue is retrying without ALLOC_CPUSET and
then potentially violating cgroup placement rules too readily -
e.g. OOM only does that for __GFP_NOFAIL.
---
From e81c2086ee8e4b9f2750b821e104d3b5174b81f2 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Sat, 22 Mar 2025 19:21:45 -0400
Subject: [PATCH] mm: page_alloc: fix defrag_mode's last allocation before OOM
Brendan points out that defrag_mode doesn't properly clear
ALLOC_NOFRAGMENT on its last-ditch attempt to allocate.
This is not necessarily a practical issue because it's followed by
__alloc_pages_may_oom(), which does its own attempt at the freelist
without ALLOC_NOFRAGMENT set. However, this is restricted to the high
watermark instead of the usual min mark (since it's merely to check
for racing frees). While this usually works - we just ran a full set
of reclaim/compaction, after all, and likely failed due to a lack of
pageblocks rather than watermarks - it's not as reliable as intended.
A more practical implication is that the retry clears all the other
allocation flags, including ALLOC_CPUSET, which can violate placement
rules defined by cgroup policy - OOM usually only does this for
__GFP_NOFAIL.
Reported-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0c01998cb3a0..b9ee0c00eea5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4544,7 +4544,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
/* Reclaim/compaction failed to prevent the fallback */
if (defrag_mode) {
- alloc_flags &= ALLOC_NOFRAGMENT;
+ alloc_flags &= ~ALLOC_NOFRAGMENT;
goto retry;
}
--
2.49.0
On Sat, Mar 22, 2025 at 08:58:27PM -0400, Johannes Weiner wrote:
> On Sat, Mar 22, 2025 at 04:05:52PM +0100, Brendan Jackman wrote:
> > On Thu Mar 13, 2025 at 10:05 PM CET, Johannes Weiner wrote:
> > > + /* Reclaim/compaction failed to prevent the fallback */
> > > + if (defrag_mode) {
> > > + alloc_flags &= ALLOC_NOFRAGMENT;
> > > + goto retry;
> > > + }
> >
> > I can't see where ALLOC_NOFRAGMENT gets cleared, is it supposed to be
> > here (i.e. should this be ~ALLOC_NOFRAGMENT)?
Please ignore my previous email, this is actually a much more severe
issue than I thought at first. The screwed up clearing is bad, but
this will also not check the flag before retrying, which means the
thread will retry reclaim/compaction and never reach OOM.
This code has weeks of load testing, with workloads fine-tuned to
*avoid* OOM. A blatant OOM test shows this problem immediately.
A simple fix, but I'll put it through the wringer before sending it.
On Sat, Mar 22, 2025 at 09:34:09PM -0400, Johannes Weiner wrote:
> On Sat, Mar 22, 2025 at 08:58:27PM -0400, Johannes Weiner wrote:
> > On Sat, Mar 22, 2025 at 04:05:52PM +0100, Brendan Jackman wrote:
> > > On Thu Mar 13, 2025 at 10:05 PM CET, Johannes Weiner wrote:
> > > > + /* Reclaim/compaction failed to prevent the fallback */
> > > > + if (defrag_mode) {
> > > > + alloc_flags &= ALLOC_NOFRAGMENT;
> > > > + goto retry;
> > > > + }
> > >
> > > I can't see where ALLOC_NOFRAGMENT gets cleared, is it supposed to be
> > > here (i.e. should this be ~ALLOC_NOFRAGMENT)?
>
> Please ignore my previous email, this is actually a much more severe
> issue than I thought at first. The screwed up clearing is bad, but
> this will also not check the flag before retrying, which means the
> thread will retry reclaim/compaction and never reach OOM.
>
> This code has weeks of load testing, with workloads fine-tuned to
> *avoid* OOM. A blatant OOM test shows this problem immediately.
>
> A simple fix, but I'll put it through the wringer before sending it.
Ok, here is the patch. I verified this with intentional OOMing 100
times in a loop; this would previously lock up on first try in
defrag_mode, but kills and recovers reliably with this applied.
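For reference, one generic way to provoke such an intentional OOM - a
sketch only, not the actual test used here - is to keep allocating and
touching anonymous memory until the kernel has to intervene:

/* Sketch of an OOM trigger: allocate and touch memory until killed. */
#include <stdlib.h>
#include <string.h>

int main(void)
{
	const size_t chunk = 64UL << 20;	/* 64M per step */

	for (;;) {
		char *p = malloc(chunk);

		if (!p)
			break;	/* strict overcommit: malloc fails instead */
		memset(p, 1, chunk);	/* touch it so it is really backed */
	}
	return 0;
}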
I also re-ran the full THP benchmarks, to verify that erroneous
looping here did not accidentally contribute to fragmentation
avoidance and thus THP success & latency rates. They were in fact not;
the improvements claimed for defrag_mode are unchanged with this fix:
VANILLA defrag_mode=1-OOMFIX
Hugealloc Time mean 52739.45 ( +0.00%) 27342.44 ( -48.15%)
Hugealloc Time stddev 56541.26 ( +0.00%) 33227.16 ( -41.23%)
Kbuild Real time 197.47 ( +0.00%) 196.32 ( -0.58%)
Kbuild User time 1240.49 ( +0.00%) 1231.89 ( -0.69%)
Kbuild System time 70.08 ( +0.00%) 58.75 ( -15.95%)
THP fault alloc 46727.07 ( +0.00%) 62669.93 ( +34.12%)
THP fault fallback 21910.60 ( +0.00%) 5966.40 ( -72.77%)
Direct compact fail 195.80 ( +0.00%) 50.53 ( -73.81%)
Direct compact success 7.93 ( +0.00%) 4.07 ( -43.28%)
Compact daemon scanned migrate 3369601.27 ( +0.00%) 1588238.93 ( -52.87%)
Compact daemon scanned free 5075474.47 ( +0.00%) 1441944.27 ( -71.59%)
Compact direct scanned migrate 161787.27 ( +0.00%) 64838.53 ( -59.92%)
Compact direct scanned free 163467.53 ( +0.00%) 37243.00 ( -77.22%)
Compact total migrate scanned 3531388.53 ( +0.00%) 1653077.47 ( -53.19%)
Compact total free scanned 5238942.00 ( +0.00%) 1479187.27 ( -71.77%)
Alloc stall 2371.07 ( +0.00%) 553.00 ( -76.64%)
Pages kswapd scanned 2160926.73 ( +0.00%) 4052539.93 ( +87.54%)
Pages kswapd reclaimed 533191.07 ( +0.00%) 765447.47 ( +43.56%)
Pages direct scanned 400450.33 ( +0.00%) 358933.93 ( -10.37%)
Pages direct reclaimed 94441.73 ( +0.00%) 26991.60 ( -71.42%)
Pages total scanned 2561377.07 ( +0.00%) 4411473.87 ( +72.23%)
Pages total reclaimed 627632.80 ( +0.00%) 792439.07 ( +26.26%)
Swap out 47959.53 ( +0.00%) 128511.80 ( +167.96%)
Swap in 7276.00 ( +0.00%) 27736.20 ( +281.16%)
File refaults 138043.00 ( +0.00%) 206198.40 ( +49.37%)
Many thanks for your careful review, Brendan.
---
From c84651a46910448c6cfaf44885644fdb215f7f6a Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Sat, 22 Mar 2025 19:21:45 -0400
Subject: [PATCH] mm: page_alloc: fix defrag_mode's retry & OOM path
Brendan points out that defrag_mode doesn't properly clear
ALLOC_NOFRAGMENT on its last-ditch attempt to allocate. But looking
closer, the problem is actually more severe: it doesn't actually
*check* whether it's already retried, and keeps looping. This means
the OOM path is never taken, and the thread can loop indefinitely.
This is verified with an intentional OOM test on defrag_mode=1, which
results in the machine hanging. After this patch, it triggers the OOM
kill reliably and recovers.
Clear ALLOC_NOFRAGMENT properly, and only retry once.
Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode")
Reported-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/page_alloc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0c01998cb3a0..582364d42906 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4543,8 +4543,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
goto retry;
/* Reclaim/compaction failed to prevent the fallback */
- if (defrag_mode) {
- alloc_flags &= ALLOC_NOFRAGMENT;
+ if (defrag_mode && (alloc_flags & ALLOC_NOFRAGMENT)) {
+ alloc_flags &= ~ALLOC_NOFRAGMENT;
goto retry;
}
--
2.49.0
On Sun Mar 23, 2025 at 4:46 AM CET, Johannes Weiner wrote:
> On Sat, Mar 22, 2025 at 09:34:09PM -0400, Johannes Weiner wrote:
> > On Sat, Mar 22, 2025 at 08:58:27PM -0400, Johannes Weiner wrote:
> > > On Sat, Mar 22, 2025 at 04:05:52PM +0100, Brendan Jackman wrote:
> > > > On Thu Mar 13, 2025 at 10:05 PM CET, Johannes Weiner wrote:
> > > > > + /* Reclaim/compaction failed to prevent the fallback */
> > > > > + if (defrag_mode) {
> > > > > + alloc_flags &= ALLOC_NOFRAGMENT;
> > > > > + goto retry;
> > > > > + }
> > > >
> > > > I can't see where ALLOC_NOFRAGMENT gets cleared, is it supposed to be
> > > > here (i.e. should this be ~ALLOC_NOFRAGMENT)?
> >
> > Please ignore my previous email, this is actually a much more severe
> > issue than I thought at first. The screwed up clearing is bad, but
> > this will also not check the flag before retrying, which means the
> > thread will retry reclaim/compaction and never reach OOM.
> >
> > This code has weeks of load testing, with workloads fine-tuned to
> > *avoid* OOM. A blatant OOM test shows this problem immediately.
> >
> > A simple fix, but I'll put it through the wringer before sending it.
>
> Ok, here is the patch. I verified this with intentional OOMing 100
> times in a loop; this would previously lock up on first try in
> defrag_mode, but kills and recovers reliably with this applied.
>
> I also re-ran the full THP benchmarks, to verify that erroneous
> looping here did not accidentally contribute to fragmentation
> avoidance and thus THP success & latency rates. They were in fact not;
> the improvements claimed for defrag_mode are unchanged with this fix:
Sounds good :)
Off topic, but could you share some details about the
tests/benchmarks you're running here? Do you have any links e.g. to
the scripts you're using to run them?
Hi Brendan,
On Sun, Mar 23, 2025 at 07:04:29PM +0100, Brendan Jackman wrote:
> On Sun Mar 23, 2025 at 4:46 AM CET, Johannes Weiner wrote:
> > On Sat, Mar 22, 2025 at 09:34:09PM -0400, Johannes Weiner wrote:
> > > On Sat, Mar 22, 2025 at 08:58:27PM -0400, Johannes Weiner wrote:
> > > > On Sat, Mar 22, 2025 at 04:05:52PM +0100, Brendan Jackman wrote:
> > > > > On Thu Mar 13, 2025 at 10:05 PM CET, Johannes Weiner wrote:
> > > > > > + /* Reclaim/compaction failed to prevent the fallback */
> > > > > > + if (defrag_mode) {
> > > > > > + alloc_flags &= ALLOC_NOFRAGMENT;
> > > > > > + goto retry;
> > > > > > + }
> > > > >
> > > > > I can't see where ALLOC_NOFRAGMENT gets cleared, is it supposed to be
> > > > > here (i.e. should this be ~ALLOC_NOFRAGMENT)?
> > >
> > > Please ignore my previous email, this is actually a much more severe
> > > issue than I thought at first. The screwed up clearing is bad, but
> > > this will also not check the flag before retrying, which means the
> > > thread will retry reclaim/compaction and never reach OOM.
> > >
> > > This code has weeks of load testing, with workloads fine-tuned to
> > > *avoid* OOM. A blatant OOM test shows this problem immediately.
> > >
> > > A simple fix, but I'll put it through the wringer before sending it.
> >
> > Ok, here is the patch. I verified this with intentional OOMing 100
> > times in a loop; this would previously lock up on first try in
> > defrag_mode, but kills and recovers reliably with this applied.
> >
> > I also re-ran the full THP benchmarks, to verify that erroneous
> > looping here did not accidentally contribute to fragmentation
> > avoidance and thus THP success & latency rates. They were in fact not;
> > the improvements claimed for defrag_mode are unchanged with this fix:
>
> Sounds good :)
>
> Off topic, but could you share some details about the
> tests/benchmarks you're running here? Do you have any links e.g. to
> the scripts you're using to run them?
Sure! The numbers I quoted here are from a dual workload of kernel
build and THP allocation bursts. The kernel build is an x86_64
defconfig, -j16 on 8 cores (no ht). I boot this machine with mem=1800M
to make sure there is some memory pressure, but not hopeless
thrashing. Filesystem and conventional swap on an older SATA SSD.
While the kernel builds, every 20s another worker mmaps 80M, madvises
for THP, measures the time to memset-fault the range in, and unmaps.
THP policy is upstream defaults: enabled=always, defrag=madvise. So
the kernel build itself will also optimistically consume THPs, but
only the burst allocations will direct reclaim/compact for them.
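A rough sketch of what such a burst worker can look like - this is an
illustrative reconstruction from the description above, not the actual
test harness:

/*
 * Illustrative THP burst worker: every 20s, mmap 80M of anonymous
 * memory, advise it for THP, time the memset fault-in, then unmap.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define BURST_SIZE	(80UL << 20)	/* 80M per burst */

static double now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	for (;;) {
		void *buf = mmap(NULL, BURST_SIZE, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		double t0;

		if (buf == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		/* with defrag=madvise, this allows direct reclaim/compaction */
		madvise(buf, BURST_SIZE, MADV_HUGEPAGE);

		t0 = now_sec();
		memset(buf, 1, BURST_SIZE);	/* fault the range in */
		printf("fault-in: %.3fs\n", now_sec() - t0);

		munmap(buf, BURST_SIZE);
		sleep(20);
	}
}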
Aside from that - and this is a lot less scientific - I just run the
patches on the machines I use every day, looking for interactivity
problems, kswapd or kcompactd going crazy, and generally paying
attention to how well they cope under pressure compared to upstream.
My desktop is an 8G ARM machine (with zswap), so it's almost always
under some form of memory pressure. It's also using 16k pages and
order-11 pageblocks (32M THPs), which adds extra spice.
On 13 Mar 2025, at 17:05, Johannes Weiner wrote:
> The page allocator groups requests by migratetype to stave off
> fragmentation. However, in practice this is routinely defeated by the
> fact that it gives up *before* invoking reclaim and compaction - which
> may well produce suitable pages. As a result, fragmentation of
> physical memory is a common ongoing process in many load scenarios.
[...]
> Subsequent patches will address defrag_mode reclaim strategy to raise
> the THP success baseline above the vanilla kernel.
All makes sense to me. But is there a better name than defrag_mode?
It sounds very similar to /sys/kernel/mm/transparent_hugepage/defrag.
Or it actually means the THP defrag mode?
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> Documentation/admin-guide/sysctl/vm.rst | 9 +++++++++
> mm/page_alloc.c | 27 +++++++++++++++++++++++--
> 2 files changed, 34 insertions(+), 2 deletions(-)
When I am checking ALLOC_NOFRAGMENT, I find that in get_page_from_freelist(),
ALLOC_NOFRAGMENT is removed when allocation goes into a remote node. I wonder
if this could reduce the anti-fragmentation effort for NUMA systems. Basically,
falling back to a remote node for allocation would fragment the remote node,
even if the remote node is trying hard to not fragment itself. Have you tested
on a NUMA system?
Thanks.
Best Regards,
Yan, Zi
On Fri, Mar 14, 2025 at 02:54:03PM -0400, Zi Yan wrote:
> On 13 Mar 2025, at 17:05, Johannes Weiner wrote:
>
> > The page allocator groups requests by migratetype to stave off
> > fragmentation. However, in practice this is routinely defeated by the
> > fact that it gives up *before* invoking reclaim and compaction - which
> > may well produce suitable pages. As a result, fragmentation of
> > physical memory is a common ongoing process in many load scenarios.
> >
> > Fragmentation deteriorates compaction's ability to produce huge
> > pages. Depending on the lifetime of the fragmenting allocations, those
> > effects can be long-lasting or even permanent, requiring drastic
> > measures like forcible idle states or even reboots as the only
> > reliable ways to recover the address space for THP production.
> >
> > In a kernel build test with supplemental THP pressure, the THP
> > allocation rate steadily declines over 15 runs:
> >
> > thp_fault_alloc
> > 61988
> > 56474
> > 57258
> > 50187
> > 52388
> > 55409
> > 52925
> > 47648
> > 43669
> > 40621
> > 36077
> > 41721
> > 36685
> > 34641
> > 33215
> >
> > This is a hurdle in adopting THP in any environment where hosts are
> > shared between multiple overlapping workloads (cloud environments),
> > and rarely experience true idle periods. To make THP a reliable and
> > predictable optimization, there needs to be a stronger guarantee to
> > avoid such fragmentation.
> >
> > Introduce defrag_mode. When enabled, reclaim/compaction is invoked to
> > its full extent *before* falling back. Specifically, ALLOC_NOFRAGMENT
> > is enforced on the allocator fastpath and the reclaiming slowpath.
> >
> > For now, fallbacks are permitted to avert OOMs. There is a plan to add
> > defrag_mode=2 to prefer OOMs over fragmentation, but this requires
> > additional prep work in compaction and the reserve management to make
> > it ready for all possible allocation contexts.
> >
> > The following test results are from a kernel build with periodic
> > bursts of THP allocations, over 15 runs:
> >
> > vanilla defrag_mode=1
> > @claimer[unmovable]: 189 103
> > @claimer[movable]: 92 103
> > @claimer[reclaimable]: 207 61
> > @pollute[unmovable from movable]: 25 0
> > @pollute[unmovable from reclaimable]: 28 0
> > @pollute[movable from unmovable]: 38835 0
> > @pollute[movable from reclaimable]: 147136 0
> > @pollute[reclaimable from unmovable]: 178 0
> > @pollute[reclaimable from movable]: 33 0
> > @steal[unmovable from movable]: 11 0
> > @steal[unmovable from reclaimable]: 5 0
> > @steal[reclaimable from unmovable]: 107 0
> > @steal[reclaimable from movable]: 90 0
> > @steal[movable from reclaimable]: 354 0
> > @steal[movable from unmovable]: 130 0
> >
> > Both types of polluting fallbacks are eliminated in this workload.
> >
> > Interestingly, whole block conversions are reduced as well. This is
> > because once a block is claimed for a type, its empty space remains
> > available for future allocations, instead of being padded with
> > fallbacks; this allows the native type to group up instead of
> > spreading out to new blocks. The assumption in the allocator has been
> > that pollution from movable allocations is less harmful than from
> > other types, since they can be reclaimed or migrated out should the
> > space be needed. However, since fallbacks occur *before*
> > reclaim/compaction is invoked, movable pollution will still cause
> > non-movable allocations to spread out and claim more blocks.
> >
> > Without fragmentation, THP rates hold steady with defrag_mode=1:
> >
> > thp_fault_alloc
> > 32478
> > 20725
> > 45045
> > 32130
> > 14018
> > 21711
> > 40791
> > 29134
> > 34458
> > 45381
> > 28305
> > 17265
> > 22584
> > 28454
> > 30850
> >
> > While the downward trend is eliminated, the keen reader will of course
> > notice that the baseline rate is much smaller than the vanilla
> > kernel's to begin with. This is due to deficiencies in how reclaim and
> > compaction are currently driven: ALLOC_NOFRAGMENT increases the extent
> > to which smaller allocations are competing with THPs for pageblocks,
> > while making no effort themselves to reclaim or compact beyond their
> > own request size. This effect already exists with the current usage of
> > ALLOC_NOFRAGMENT, but is amplified by defrag_mode insisting on whole
> > block stealing much more strongly.
> >
> > Subsequent patches will address defrag_mode reclaim strategy to raise
> > the THP success baseline above the vanilla kernel.
>
> All makes sense to me. But is there a better name than defrag_mode?
> It sounds very similar to /sys/kernel/mm/transparent_hugepage/defrag.
> Or it actually means the THP defrag mode?
Thanks for taking a look!
I'm not set on defrag_mode, but I also couldn't think of anything
better.
The proximity to the THP flag name strikes me as beneficial, since
it's an established term for "try harder to make huge pages".
Suggestions welcome :)
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> > Documentation/admin-guide/sysctl/vm.rst | 9 +++++++++
> > mm/page_alloc.c | 27 +++++++++++++++++++++++--
> > 2 files changed, 34 insertions(+), 2 deletions(-)
> >
>
> When I am checking ALLOC_NOFRAGMENT, I find that in get_page_from_freelist(),
> ALLOC_NOFRAGMENT is removed when allocation goes into a remote node. I wonder
> if this could reduce the anti-fragmentation effort for NUMA systems. Basically,
> falling back to a remote node for allocation would fragment the remote node,
> even the remote node is trying hard to not fragment itself. Have you tested
> on a NUMA system?
There is this hunk in the patch:
@@ -3480,7 +3486,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
continue;
}
- if (no_fallback && nr_online_nodes > 1 &&
+ if (no_fallback && !defrag_mode && nr_online_nodes > 1 &&
zone != zonelist_zone(ac->preferred_zoneref)) {
int local_nid;
So it shouldn't clear the flag when spilling into the next node.
Am I missing something?
On 14 Mar 2025, at 16:50, Johannes Weiner wrote:
> On Fri, Mar 14, 2025 at 02:54:03PM -0400, Zi Yan wrote:
>> On 13 Mar 2025, at 17:05, Johannes Weiner wrote:
>>
>>> The page allocator groups requests by migratetype to stave off
>>> fragmentation. However, in practice this is routinely defeated by the
>>> fact that it gives up *before* invoking reclaim and compaction - which
>>> may well produce suitable pages. As a result, fragmentation of
>>> physical memory is a common ongoing process in many load scenarios.
>>>
>>> Fragmentation deteriorates compaction's ability to produce huge
>>> pages. Depending on the lifetime of the fragmenting allocations, those
>>> effects can be long-lasting or even permanent, requiring drastic
>>> measures like forcible idle states or even reboots as the only
>>> reliable ways to recover the address space for THP production.
>>>
>>> In a kernel build test with supplemental THP pressure, the THP
>>> allocation rate steadily declines over 15 runs:
>>>
>>> thp_fault_alloc
>>> 61988
>>> 56474
>>> 57258
>>> 50187
>>> 52388
>>> 55409
>>> 52925
>>> 47648
>>> 43669
>>> 40621
>>> 36077
>>> 41721
>>> 36685
>>> 34641
>>> 33215
>>>
>>> This is a hurdle in adopting THP in any environment where hosts are
>>> shared between multiple overlapping workloads (cloud environments),
>>> and rarely experience true idle periods. To make THP a reliable and
>>> predictable optimization, there needs to be a stronger guarantee to
>>> avoid such fragmentation.
>>>
>>> Introduce defrag_mode. When enabled, reclaim/compaction is invoked to
>>> its full extent *before* falling back. Specifically, ALLOC_NOFRAGMENT
>>> is enforced on the allocator fastpath and the reclaiming slowpath.
>>>
>>> For now, fallbacks are permitted to avert OOMs. There is a plan to add
>>> defrag_mode=2 to prefer OOMs over fragmentation, but this requires
>>> additional prep work in compaction and the reserve management to make
>>> it ready for all possible allocation contexts.
>>>
>>> The following test results are from a kernel build with periodic
>>> bursts of THP allocations, over 15 runs:
>>>
>>> vanilla defrag_mode=1
>>> @claimer[unmovable]: 189 103
>>> @claimer[movable]: 92 103
>>> @claimer[reclaimable]: 207 61
>>> @pollute[unmovable from movable]: 25 0
>>> @pollute[unmovable from reclaimable]: 28 0
>>> @pollute[movable from unmovable]: 38835 0
>>> @pollute[movable from reclaimable]: 147136 0
>>> @pollute[reclaimable from unmovable]: 178 0
>>> @pollute[reclaimable from movable]: 33 0
>>> @steal[unmovable from movable]: 11 0
>>> @steal[unmovable from reclaimable]: 5 0
>>> @steal[reclaimable from unmovable]: 107 0
>>> @steal[reclaimable from movable]: 90 0
>>> @steal[movable from reclaimable]: 354 0
>>> @steal[movable from unmovable]: 130 0
>>>
>>> Both types of polluting fallbacks are eliminated in this workload.
>>>
>>> Interestingly, whole block conversions are reduced as well. This is
>>> because once a block is claimed for a type, its empty space remains
>>> available for future allocations, instead of being padded with
>>> fallbacks; this allows the native type to group up instead of
>>> spreading out to new blocks. The assumption in the allocator has been
>>> that pollution from movable allocations is less harmful than from
>>> other types, since they can be reclaimed or migrated out should the
>>> space be needed. However, since fallbacks occur *before*
>>> reclaim/compaction is invoked, movable pollution will still cause
>>> non-movable allocations to spread out and claim more blocks.
>>>
>>> Without fragmentation, THP rates hold steady with defrag_mode=1:
>>>
>>> thp_fault_alloc
>>> 32478
>>> 20725
>>> 45045
>>> 32130
>>> 14018
>>> 21711
>>> 40791
>>> 29134
>>> 34458
>>> 45381
>>> 28305
>>> 17265
>>> 22584
>>> 28454
>>> 30850
>>>
>>> While the downward trend is eliminated, the keen reader will of course
>>> notice that the baseline rate is much smaller than the vanilla
>>> kernel's to begin with. This is due to deficiencies in how reclaim and
>>> compaction are currently driven: ALLOC_NOFRAGMENT increases the extent
>>> to which smaller allocations are competing with THPs for pageblocks,
>>> while making no effort themselves to reclaim or compact beyond their
>>> own request size. This effect already exists with the current usage of
>>> ALLOC_NOFRAGMENT, but is amplified by defrag_mode insisting on whole
>>> block stealing much more strongly.
>>>
>>> Subsequent patches will address defrag_mode reclaim strategy to raise
>>> the THP success baseline above the vanilla kernel.
>>
>> All makes sense to me. But is there a better name than defrag_mode?
>> It sounds very similar to /sys/kernel/mm/transparent_hugepage/defrag.
>> Or it actually means the THP defrag mode?
>
> Thanks for taking a look!
>
> I'm not set on defrag_mode, but I also couldn't think of anything
> better.
>
> The proximity to the THP flag name strikes me as beneficial, since
> it's an established term for "try harder to make huge pages".
>
> Suggestions welcome :)
>
>>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>>> ---
>>> Documentation/admin-guide/sysctl/vm.rst | 9 +++++++++
>>> mm/page_alloc.c | 27 +++++++++++++++++++++++--
>>> 2 files changed, 34 insertions(+), 2 deletions(-)
>>>
>>
>> When I am checking ALLOC_NOFRAGMENT, I find that in get_page_from_freelist(),
>> ALLOC_NOFRAGMENT is removed when allocation goes into a remote node. I wonder
>> if this could reduce the anti-fragmentation effort for NUMA systems. Basically,
>> falling back to a remote node for allocation would fragment the remote node,
>> even the remote node is trying hard to not fragment itself. Have you tested
>> on a NUMA system?
>
> There is this hunk in the patch:
>
> @@ -3480,7 +3486,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> continue;
> }
>
> - if (no_fallback && nr_online_nodes > 1 &&
> + if (no_fallback && !defrag_mode && nr_online_nodes > 1 &&
> zone != zonelist_zone(ac->preferred_zoneref)) {
> int local_nid;
>
> So it shouldn't clear the flag when spilling into the next node.
>
> Am I missing something?
Oh, I missed that part. Thank you for pointing it out.
Best Regards,
Yan, Zi