[PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
Posted by Johannes Weiner 9 months, 1 week ago
The previous patch added pageblock_order reclaim to kswapd/kcompactd,
which helps, but produces only one block at a time. Allocation stalls
and THP failure rates are still higher than they could be.

To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change
the watermarking for kswapd & kcompactd: instead of targeting the high
watermark in order-0 pages and checking for one suitable block, simply
require that the high watermark is entirely met in pageblocks.

To this end, track the number of free pages within contiguous
pageblocks, then change pgdat_balanced() and compact_finished() to
check watermarks against this new value.

This further reduces THP latencies and allocation stalls, and improves
THP success rates against the previous patch:

                                       DEFRAGMODE-ASYNC DEFRAGMODE-ASYNC-WMARKS
Hugealloc Time mean               34300.36 (    +0.00%)   28904.00 (   -15.73%)
Hugealloc Time stddev             36390.42 (    +0.00%)   33464.37 (    -8.04%)
Kbuild Real time                    196.13 (    +0.00%)     196.59 (    +0.23%)
Kbuild User time                   1234.74 (    +0.00%)    1231.67 (    -0.25%)
Kbuild System time                   62.62 (    +0.00%)      59.10 (    -5.54%)
THP fault alloc                   57054.53 (    +0.00%)   63223.67 (   +10.81%)
THP fault fallback                11581.40 (    +0.00%)    5412.47 (   -53.26%)
Direct compact fail                 107.80 (    +0.00%)      59.07 (   -44.79%)
Direct compact success                4.53 (    +0.00%)       2.80 (   -31.33%)
Direct compact success rate %         3.20 (    +0.00%)       3.99 (   +18.66%)
Compact daemon scanned migrate  5461033.93 (    +0.00%) 2267500.33 (   -58.48%)
Compact daemon scanned free     5824897.93 (    +0.00%) 2339773.00 (   -59.83%)
Compact direct scanned migrate    58336.93 (    +0.00%)   47659.93 (   -18.30%)
Compact direct scanned free       32791.87 (    +0.00%)   40729.67 (   +24.21%)
Compact total migrate scanned   5519370.87 (    +0.00%) 2315160.27 (   -58.05%)
Compact total free scanned      5857689.80 (    +0.00%) 2380502.67 (   -59.36%)
Alloc stall                        2424.60 (    +0.00%)     638.87 (   -73.62%)
Pages kswapd scanned            2657018.33 (    +0.00%) 4002186.33 (   +50.63%)
Pages kswapd reclaimed           559583.07 (    +0.00%)  718577.80 (   +28.41%)
Pages direct scanned             722094.07 (    +0.00%)  355172.73 (   -50.81%)
Pages direct reclaimed           107257.80 (    +0.00%)   31162.80 (   -70.95%)
Pages total scanned             3379112.40 (    +0.00%) 4357359.07 (   +28.95%)
Pages total reclaimed            666840.87 (    +0.00%)  749740.60 (   +12.43%)
Swap out                          77238.20 (    +0.00%)  110084.33 (   +42.53%)
Swap in                           11712.80 (    +0.00%)   24457.00 (  +108.80%)
File refaults                    143438.80 (    +0.00%)  188226.93 (   +31.22%)

Also of note is that compaction work overall is reduced. The reason
for this is that when free pageblocks are more readily available,
allocations are also much more likely to get physically placed in LRU
order, instead of being forced to scavenge free space here and there.
This means that reclaim by itself has better chances of freeing up
whole blocks, and the system relies less on compaction.

Comparing all changes to the vanilla kernel:

                                                VANILLA DEFRAGMODE-ASYNC-WMARKS
Hugealloc Time mean               52739.45 (    +0.00%)   28904.00 (   -45.19%)
Hugealloc Time stddev             56541.26 (    +0.00%)   33464.37 (   -40.81%)
Kbuild Real time                    197.47 (    +0.00%)     196.59 (    -0.44%)
Kbuild User time                   1240.49 (    +0.00%)    1231.67 (    -0.71%)
Kbuild System time                   70.08 (    +0.00%)      59.10 (   -15.45%)
THP fault alloc                   46727.07 (    +0.00%)   63223.67 (   +35.30%)
THP fault fallback                21910.60 (    +0.00%)    5412.47 (   -75.29%)
Direct compact fail                 195.80 (    +0.00%)      59.07 (   -69.48%)
Direct compact success                7.93 (    +0.00%)       2.80 (   -57.46%)
Direct compact success rate %         3.51 (    +0.00%)       3.99 (   +10.49%)
Compact daemon scanned migrate  3369601.27 (    +0.00%) 2267500.33 (   -32.71%)
Compact daemon scanned free     5075474.47 (    +0.00%) 2339773.00 (   -53.90%)
Compact direct scanned migrate   161787.27 (    +0.00%)   47659.93 (   -70.54%)
Compact direct scanned free      163467.53 (    +0.00%)   40729.67 (   -75.08%)
Compact total migrate scanned   3531388.53 (    +0.00%) 2315160.27 (   -34.44%)
Compact total free scanned      5238942.00 (    +0.00%) 2380502.67 (   -54.56%)
Alloc stall                        2371.07 (    +0.00%)     638.87 (   -73.02%)
Pages kswapd scanned            2160926.73 (    +0.00%) 4002186.33 (   +85.21%)
Pages kswapd reclaimed           533191.07 (    +0.00%)  718577.80 (   +34.77%)
Pages direct scanned             400450.33 (    +0.00%)  355172.73 (   -11.31%)
Pages direct reclaimed            94441.73 (    +0.00%)   31162.80 (   -67.00%)
Pages total scanned             2561377.07 (    +0.00%) 4357359.07 (   +70.12%)
Pages total reclaimed            627632.80 (    +0.00%)  749740.60 (   +19.46%)
Swap out                          47959.53 (    +0.00%)  110084.33 (  +129.53%)
Swap in                            7276.00 (    +0.00%)   24457.00 (  +236.10%)
File refaults                    138043.00 (    +0.00%)  188226.93 (   +36.35%)

THP allocation latencies and %sys time are down dramatically.

THP allocation failures are down from nearly 50% to 8.5%. And to
recall previous data points, the success rates are steady and reliable
without the cumulative deterioration of fragmentation events.

Compaction work is down overall. Direct compaction work especially is
drastically reduced. As an aside, its success rate of 4% indicates
there is room for improvement. For now it's good to rely on it less.

Reclaim work is up overall; however, direct reclaim work is down. Part
of the increase can be attributed to a higher use of THPs, which due
to internal fragmentation increase the memory footprint. This is not
necessarily an unexpected side-effect for users of THP.

However, taking both points together, there may well be some
opportunities for fine-tuning in the reclaim/compaction coordination.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |  1 +
 mm/compaction.c        | 37 ++++++++++++++++++++++++++++++-------
 mm/internal.h          |  1 +
 mm/page_alloc.c        | 29 +++++++++++++++++++++++------
 mm/vmscan.c            | 15 ++++++++++++++-
 mm/vmstat.c            |  1 +
 6 files changed, 70 insertions(+), 14 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index dbb0ad69e17f..37c29f3fbca8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -138,6 +138,7 @@ enum numa_stat_item {
 enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
+	NR_FREE_PAGES_BLOCKS,
 	NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
 	NR_ZONE_INACTIVE_ANON = NR_ZONE_LRU_BASE,
 	NR_ZONE_ACTIVE_ANON,
diff --git a/mm/compaction.c b/mm/compaction.c
index 036353ef1878..4a2ccb82d0b2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2329,6 +2329,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
 	if (!pageblock_aligned(cc->migrate_pfn))
 		return COMPACT_CONTINUE;
 
+	/*
+	 * When defrag_mode is enabled, make kcompactd target
+	 * watermarks in whole pageblocks. Because they can be stolen
+	 * without polluting, no further fallback checks are needed.
+	 */
+	if (defrag_mode && !cc->direct_compaction) {
+		if (__zone_watermark_ok(cc->zone, cc->order,
+					high_wmark_pages(cc->zone),
+					cc->highest_zoneidx, cc->alloc_flags,
+					zone_page_state(cc->zone,
+							NR_FREE_PAGES_BLOCKS)))
+			return COMPACT_SUCCESS;
+
+		return COMPACT_CONTINUE;
+	}
+
 	/* Direct compactor: Is a suitable page free? */
 	ret = COMPACT_NO_SUITABLE_PAGE;
 	for (order = cc->order; order < NR_PAGE_ORDERS; order++) {
@@ -2496,13 +2512,19 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 static enum compact_result
 compaction_suit_allocation_order(struct zone *zone, unsigned int order,
 				 int highest_zoneidx, unsigned int alloc_flags,
-				 bool async)
+				 bool async, bool kcompactd)
 {
+	unsigned long free_pages;
 	unsigned long watermark;
 
+	if (kcompactd && defrag_mode)
+		free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
+	else
+		free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
 	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
-	if (zone_watermark_ok(zone, order, watermark, highest_zoneidx,
-			      alloc_flags))
+	if (__zone_watermark_ok(zone, order, watermark, highest_zoneidx,
+				alloc_flags, free_pages))
 		return COMPACT_SUCCESS;
 
 	/*
@@ -2558,7 +2580,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 		ret = compaction_suit_allocation_order(cc->zone, cc->order,
 						       cc->highest_zoneidx,
 						       cc->alloc_flags,
-						       cc->mode == MIGRATE_ASYNC);
+						       cc->mode == MIGRATE_ASYNC,
+						       !cc->direct_compaction);
 		if (ret != COMPACT_CONTINUE)
 			return ret;
 	}
@@ -3062,7 +3085,7 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
 		ret = compaction_suit_allocation_order(zone,
 				pgdat->kcompactd_max_order,
 				highest_zoneidx, ALLOC_WMARK_MIN,
-				false);
+				false, true);
 		if (ret == COMPACT_CONTINUE)
 			return true;
 	}
@@ -3085,7 +3108,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 		.mode = MIGRATE_SYNC_LIGHT,
 		.ignore_skip_hint = false,
 		.gfp_mask = GFP_KERNEL,
-		.alloc_flags = ALLOC_WMARK_MIN,
+		.alloc_flags = ALLOC_WMARK_HIGH,
 	};
 	enum compact_result ret;
 
@@ -3105,7 +3128,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 
 		ret = compaction_suit_allocation_order(zone,
 				cc.order, zoneid, cc.alloc_flags,
-				false);
+				false, true);
 		if (ret != COMPACT_CONTINUE)
 			continue;
 
diff --git a/mm/internal.h b/mm/internal.h
index 2f52a65272c1..286520a424fe 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -536,6 +536,7 @@ extern char * const zone_names[MAX_NR_ZONES];
 DECLARE_STATIC_KEY_MAYBE(CONFIG_DEBUG_VM, check_pages_enabled);
 
 extern int min_free_kbytes;
+extern int defrag_mode;
 
 void setup_per_zone_wmarks(void);
 void calculate_min_free_kbytes(void);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a0d8f871e56..c33c08e278f9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -273,7 +273,7 @@ int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
 static int watermark_boost_factor __read_mostly = 15000;
 static int watermark_scale_factor = 10;
-static int defrag_mode;
+int defrag_mode;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -660,16 +660,20 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
 				      bool tail)
 {
 	struct free_area *area = &zone->free_area[order];
+	int nr_pages = 1 << order;
 
 	VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
 		     "page type is %lu, passed migratetype is %d (nr=%d)\n",
-		     get_pageblock_migratetype(page), migratetype, 1 << order);
+		     get_pageblock_migratetype(page), migratetype, nr_pages);
 
 	if (tail)
 		list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
 	else
 		list_add(&page->buddy_list, &area->free_list[migratetype]);
 	area->nr_free++;
+
+	if (order >= pageblock_order && !is_migrate_isolate(migratetype))
+		__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, nr_pages);
 }
 
 /*
@@ -681,24 +685,34 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
 				     unsigned int order, int old_mt, int new_mt)
 {
 	struct free_area *area = &zone->free_area[order];
+	int nr_pages = 1 << order;
 
 	/* Free page moving can fail, so it happens before the type update */
 	VM_WARN_ONCE(get_pageblock_migratetype(page) != old_mt,
 		     "page type is %lu, passed migratetype is %d (nr=%d)\n",
-		     get_pageblock_migratetype(page), old_mt, 1 << order);
+		     get_pageblock_migratetype(page), old_mt, nr_pages);
 
 	list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
 
-	account_freepages(zone, -(1 << order), old_mt);
-	account_freepages(zone, 1 << order, new_mt);
+	account_freepages(zone, -nr_pages, old_mt);
+	account_freepages(zone, nr_pages, new_mt);
+
+	if (order >= pageblock_order &&
+	    is_migrate_isolate(old_mt) != is_migrate_isolate(new_mt)) {
+		if (!is_migrate_isolate(old_mt))
+			nr_pages = -nr_pages;
+		__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, nr_pages);
+	}
 }
 
 static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
 					     unsigned int order, int migratetype)
 {
+	int nr_pages = 1 << order;
+
         VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
 		     "page type is %lu, passed migratetype is %d (nr=%d)\n",
-		     get_pageblock_migratetype(page), migratetype, 1 << order);
+		     get_pageblock_migratetype(page), migratetype, nr_pages);
 
 	/* clear reported state and update reported page count */
 	if (page_reported(page))
@@ -708,6 +722,9 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
 	zone->free_area[order].nr_free--;
+
+	if (order >= pageblock_order && !is_migrate_isolate(migratetype))
+		__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, -nr_pages);
 }
 
 static inline void del_page_from_free_list(struct page *page, struct zone *zone,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3370bdca6868..b5c7dfc2b189 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6724,11 +6724,24 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
 	 * meet watermarks.
 	 */
 	for_each_managed_zone_pgdat(zone, pgdat, i, highest_zoneidx) {
+		unsigned long free_pages;
+
 		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
 			mark = promo_wmark_pages(zone);
 		else
 			mark = high_wmark_pages(zone);
-		if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
+
+		/*
+		 * In defrag_mode, watermarks must be met in whole
+		 * blocks to avoid polluting allocator fallbacks.
+		 */
+		if (defrag_mode)
+			free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
+		else
+			free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
+		if (__zone_watermark_ok(zone, order, mark, highest_zoneidx,
+					0, free_pages))
 			return true;
 	}
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 16bfe1c694dd..ed49a86348f7 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1190,6 +1190,7 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 const char * const vmstat_text[] = {
 	/* enum zone_stat_item counters */
 	"nr_free_pages",
+	"nr_free_pages_blocks",
 	"nr_zone_inactive_anon",
 	"nr_zone_active_anon",
 	"nr_zone_inactive_file",
-- 
2.48.1
Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
Posted by Vlastimil Babka 8 months, 1 week ago
On 3/13/25 22:05, Johannes Weiner wrote:
> The previous patch added pageblock_order reclaim to kswapd/kcompactd,
> which helps, but produces only one block at a time. Allocation stalls
> and THP failure rates are still higher than they could be.
> 
> To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change
> the watermarking for kswapd & kcompactd: instead of targeting the high
> watermark in order-0 pages and checking for one suitable block, simply
> require that the high watermark is entirely met in pageblocks.

Hrm.

> ---
>  include/linux/mmzone.h |  1 +
>  mm/compaction.c        | 37 ++++++++++++++++++++++++++++++-------
>  mm/internal.h          |  1 +
>  mm/page_alloc.c        | 29 +++++++++++++++++++++++------
>  mm/vmscan.c            | 15 ++++++++++++++-
>  mm/vmstat.c            |  1 +
>  6 files changed, 70 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index dbb0ad69e17f..37c29f3fbca8 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -138,6 +138,7 @@ enum numa_stat_item {
>  enum zone_stat_item {
>  	/* First 128 byte cacheline (assuming 64 bit words) */
>  	NR_FREE_PAGES,
> +	NR_FREE_PAGES_BLOCKS,
>  	NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
>  	NR_ZONE_INACTIVE_ANON = NR_ZONE_LRU_BASE,
>  	NR_ZONE_ACTIVE_ANON,
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 036353ef1878..4a2ccb82d0b2 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2329,6 +2329,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
>  	if (!pageblock_aligned(cc->migrate_pfn))
>  		return COMPACT_CONTINUE;
>  
> +	/*
> +	 * When defrag_mode is enabled, make kcompactd target
> +	 * watermarks in whole pageblocks. Because they can be stolen
> +	 * without polluting, no further fallback checks are needed.
> +	 */
> +	if (defrag_mode && !cc->direct_compaction) {
> +		if (__zone_watermark_ok(cc->zone, cc->order,
> +					high_wmark_pages(cc->zone),
> +					cc->highest_zoneidx, cc->alloc_flags,
> +					zone_page_state(cc->zone,
> +							NR_FREE_PAGES_BLOCKS)))
> +			return COMPACT_SUCCESS;
> +
> +		return COMPACT_CONTINUE;
> +	}

Wonder if this ever succeeds in practice. Is high_wmark_pages() even aligned
to pageblock size? If not, and it's X pageblocks and a half, we will rarely
have NR_FREE_PAGES_BLOCKS cover all of that? Also concurrent allocations can
put us below high wmark quickly and then we never satisfy this?

Doesn't it then happen that with defrag_mode, in practice, kcompactd
basically always runs until the scanners meet?

Maybe the check could instead e.g. compare NR_FREE_PAGES (aligned down to
pageblock size, or even with some extra slack?) with NR_FREE_PAGES_BLOCKS?
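
Something like the below, maybe - just a rough sketch of the idea, with the
ALIGN_DOWN rounding and the one pageblock of slack picked arbitrarily:

	if (defrag_mode && !cc->direct_compaction) {
		unsigned long free = zone_page_state(cc->zone, NR_FREE_PAGES);
		unsigned long free_blocks = zone_page_state(cc->zone,
						NR_FREE_PAGES_BLOCKS);

		/*
		 * Call it done once (almost) all free memory that could
		 * sit in whole pageblocks already does, give or take one
		 * block that is currently split up.
		 */
		if (free_blocks + pageblock_nr_pages >=
		    ALIGN_DOWN(free, pageblock_nr_pages))
			return COMPACT_SUCCESS;

		return COMPACT_CONTINUE;
	}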

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6724,11 +6724,24 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
>  	 * meet watermarks.
>  	 */
>  	for_each_managed_zone_pgdat(zone, pgdat, i, highest_zoneidx) {
> +		unsigned long free_pages;
> +
>  		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
>  			mark = promo_wmark_pages(zone);
>  		else
>  			mark = high_wmark_pages(zone);
> -		if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))

I think you just removed the only user of this _safe() function. Is the
cpu-drift control it does no longer necessary?

> +
> +		/*
> +		 * In defrag_mode, watermarks must be met in whole
> +		 * blocks to avoid polluting allocator fallbacks.
> +		 */
> +		if (defrag_mode)
> +			free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
> +		else
> +			free_pages = zone_page_state(zone, NR_FREE_PAGES);
> +
> +		if (__zone_watermark_ok(zone, order, mark, highest_zoneidx,
> +					0, free_pages))
>  			return true;
>  	}
>
Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
Posted by Johannes Weiner 8 months, 1 week ago
On Fri, Apr 11, 2025 at 10:19:58AM +0200, Vlastimil Babka wrote:
> On 3/13/25 22:05, Johannes Weiner wrote:
> > The previous patch added pageblock_order reclaim to kswapd/kcompactd,
> > which helps, but produces only one block at a time. Allocation stalls
> > and THP failure rates are still higher than they could be.
> > 
> > To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change
> > the watermarking for kswapd & kcompactd: instead of targeting the high
> > watermark in order-0 pages and checking for one suitable block, simply
> > require that the high watermark is entirely met in pageblocks.
> 
> Hrm.

Hrm!

> > @@ -2329,6 +2329,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
> >  	if (!pageblock_aligned(cc->migrate_pfn))
> >  		return COMPACT_CONTINUE;
> >  
> > +	/*
> > +	 * When defrag_mode is enabled, make kcompactd target
> > +	 * watermarks in whole pageblocks. Because they can be stolen
> > +	 * without polluting, no further fallback checks are needed.
> > +	 */
> > +	if (defrag_mode && !cc->direct_compaction) {
> > +		if (__zone_watermark_ok(cc->zone, cc->order,
> > +					high_wmark_pages(cc->zone),
> > +					cc->highest_zoneidx, cc->alloc_flags,
> > +					zone_page_state(cc->zone,
> > +							NR_FREE_PAGES_BLOCKS)))
> > +			return COMPACT_SUCCESS;
> > +
> > +		return COMPACT_CONTINUE;
> > +	}
> 
> Wonder if this ever succeeds in practice. Is high_wmark_pages() even aligned
> to pageblock size? If not, and it's X pageblocks and a half, we will rarely
> have NR_FREE_PAGES_BLOCKS cover all of that? Also concurrent allocations can
> put us below high wmark quickly and then we never satisfy this?

The high watermark is not aligned, but why does it have to be? It's a
binary condition: met or not met. Compaction continues until it's met.

NR_FREE_PAGES_BLOCKS moves in pageblock_nr_pages steps. This means
it'll really work until align_up(highmark, pageblock_nr_pages), as
that's when NR_FREE_PAGES_BLOCKS snaps above the (unaligned) mark. But
that seems reasonable, no?

The allocator side is using low/min, so we have the conventional
hysteresis between consumer and producer.

For illustration, on my 2G test box, the watermarks in DMA32 look like
this:

  pages free     212057
        boost    0
        min      11164		(21.8 blocks)
        low      13955		(27.3 blocks)
        high     16746		(32.7 blocks)
        promo    19537
        spanned  456704
        present  455680
        managed  431617		(843.1 blocks)

So there are several blocks between the kblahds wakeup and sleep. The
first allocation to cut into a whole free block will decrease
NR_FREE_PAGES_BLOCK by a whole block. But subsequent allocs that fill
the remaining space won't change that counter. So the distance between
the watermarks didn't fundamentally change (modulo block rounding).
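
To put a number on that rounding, using the 512-page blocks implied by the
per-block figures above:

	high watermark:        16746 pages   (32.7 blocks)
	align_up(16746, 512):  16896 pages   (33 whole blocks)

So kcompactd effectively aims for 33 free blocks instead of 32.7 - the
unaligned mark rounded up by less than half a block.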

> Doesn't it then happen that with defrag_mode, in practice, kcompactd
> basically always runs until the scanners meet?

Tracing kcompactd calls to compaction_finished() with defrag_mode:

@[COMPACT_CONTINUE]: 6955
@[COMPACT_COMPLETE]: 19
@[COMPACT_PARTIAL_SKIPPED]: 1
@[COMPACT_SUCCESS]: 17
@wakeuprequests: 3

Of course, similar to kswapd, it might not reach the watermarks and
keep running if there is a continuous stream of allocations consuming
the blocks it's making. Hence the ratio between wakeups & continues.

But when demand stops, it'll balance the high mark and quit.

> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -6724,11 +6724,24 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
> >  	 * meet watermarks.
> >  	 */
> >  	for_each_managed_zone_pgdat(zone, pgdat, i, highest_zoneidx) {
> > +		unsigned long free_pages;
> > +
> >  		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
> >  			mark = promo_wmark_pages(zone);
> >  		else
> >  			mark = high_wmark_pages(zone);
> > -		if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
> 
> I think you just removed the only user of this _safe() function. Is the
> cpu-drift control it does no longer necessary?

Ah good catch. This should actually use zone_page_state_snapshot()
below depending on z->percpu_drift_mark.

This is active when there are enough cpus for the vmstat pcp deltas to
exceed low-min. Afaics this is still necessary, but also still
requires a lot of CPUs to matter (>212 cpus with 64G of memory).

I'll send a follow-up fix.

> > +		/*
> > +		 * In defrag_mode, watermarks must be met in whole
> > +		 * blocks to avoid polluting allocator fallbacks.
> > +		 */
> > +		if (defrag_mode)
> > +			free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
> > +		else
> > +			free_pages = zone_page_state(zone, NR_FREE_PAGES);
> > +
> > +		if (__zone_watermark_ok(zone, order, mark, highest_zoneidx,
> > +					0, free_pages))
> >  			return true;
> >  	}
> >
Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
Posted by Vlastimil Babka 8 months, 1 week ago
On 4/11/25 17:39, Johannes Weiner wrote:
> On Fri, Apr 11, 2025 at 10:19:58AM +0200, Vlastimil Babka wrote:
>> On 3/13/25 22:05, Johannes Weiner wrote:
>> > The previous patch added pageblock_order reclaim to kswapd/kcompactd,
>> > which helps, but produces only one block at a time. Allocation stalls
>> > and THP failure rates are still higher than they could be.
>> > 
>> > To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change
>> > the watermarking for kswapd & kcompactd: instead of targeting the high
>> > watermark in order-0 pages and checking for one suitable block, simply
>> > require that the high watermark is entirely met in pageblocks.
>> 
>> Hrm.
> 
> Hrm!
> 
>> > @@ -2329,6 +2329,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
>> >  	if (!pageblock_aligned(cc->migrate_pfn))
>> >  		return COMPACT_CONTINUE;
>> >  
>> > +	/*
>> > +	 * When defrag_mode is enabled, make kcompactd target
>> > +	 * watermarks in whole pageblocks. Because they can be stolen
>> > +	 * without polluting, no further fallback checks are needed.
>> > +	 */
>> > +	if (defrag_mode && !cc->direct_compaction) {
>> > +		if (__zone_watermark_ok(cc->zone, cc->order,
>> > +					high_wmark_pages(cc->zone),
>> > +					cc->highest_zoneidx, cc->alloc_flags,
>> > +					zone_page_state(cc->zone,
>> > +							NR_FREE_PAGES_BLOCKS)))
>> > +			return COMPACT_SUCCESS;
>> > +
>> > +		return COMPACT_CONTINUE;
>> > +	}
>> 
>> Wonder if this ever succeeds in practice. Is high_wmark_pages() even aligned
>> to pageblock size? If not, and it's X pageblocks and a half, we will rarely
>> have NR_FREE_PAGES_BLOCKS cover all of that? Also concurrent allocations can
>> put us below high wmark quickly and then we never satisfy this?
> 
> The high watermark is not aligned, but why does it have to be? It's a
> binary condition: met or not met. Compaction continues until it's met.

What I mean is, kswapd will reclaim until the high watermark, which would be
32.7 blocks, wake up kcompactd [*] but that can only create up to 32 blocks
of NR_FREE_PAGES_BLOCKS so it has already lost at that point? (unless
there's concurrent freeing pushing it above the high wmark)

> NR_FREE_PAGES_BLOCKS moves in pageblock_nr_pages steps. This means
> it'll really work until align_up(highmark, pageblock_nr_pages), as
> that's when NR_FREE_PAGES_BLOCKS snaps above the (unaligned) mark. But
> that seems reasonable, no?

How can it snap if it doesn't have enough free pages? Unlike kswapd,
kcompactd doesn't create them, only defragments.

> The allocator side is using low/min, so we have the conventional
> hysteresis between consumer and producer.

Sure but we cap kswapd at high wmark and the hunk quoted above also uses
high wmark so there's no hysteresis happening between kswapd and kcompactd?

> For illustration, on my 2G test box, the watermarks in DMA32 look like
> this:
> 
>   pages free     212057
>         boost    0
>         min      11164		(21.8 blocks)
>         low      13955		(27.3 blocks)
>         high     16746		(32.7 blocks)
>         promo    19537
>         spanned  456704
>         present  455680
>         managed  431617		(843.1 blocks)
> 
> So there are several blocks between the kblahds wakeup and sleep. The
> first allocation to cut into a whole free block will decrease
> NR_FREE_PAGES_BLOCK by a whole block. But subsequent allocs that fill
> the remaining space won't change that counter. So the distance between
> the watermarks didn't fundamentally change (modulo block rounding).
> 
>> Doesn't it then happen that with defrag_mode, in practice, kcompactd
>> basically always runs until the scanners meet?
> 
> Tracing kcompactd calls to compaction_finished() with defrag_mode:
> 
> @[COMPACT_CONTINUE]: 6955
> @[COMPACT_COMPLETE]: 19
> @[COMPACT_PARTIAL_SKIPPED]: 1
> @[COMPACT_SUCCESS]: 17
> @wakeuprequests: 3

OK that doesn't look that bad.

> Of course, similar to kswapd, it might not reach the watermarks and
> keep running if there is a continuous stream of allocations consuming
> the blocks it's making. Hence the ratio between wakeups & continues.
> 
> But when demand stops, it'll balance the high mark and quit.

Again, since kcompactd can only defragment free space and not create it, it
may be trying in vain?

[*] now when checking the code between kswapd and kcompactd handover, I
think I found another problem?

we have:
kswapd_try_to_sleep()
  prepare_kswapd_sleep() - needs to succeed for wakeup_kcompactd()
   pgdat_balanced() - needs to be true for prepare_kswapd_sleep() to be true
    - with defrag_mode we want high watermark of NR_FREE_PAGES_BLOCKS, but
      we were only reclaiming until now and didn't wake up kcompactd and
      this actually prevents the wake up?
Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
Posted by Johannes Weiner 8 months, 1 week ago
On Fri, Apr 11, 2025 at 06:51:51PM +0200, Vlastimil Babka wrote:
> On 4/11/25 17:39, Johannes Weiner wrote:
> > On Fri, Apr 11, 2025 at 10:19:58AM +0200, Vlastimil Babka wrote:
> >> On 3/13/25 22:05, Johannes Weiner wrote:
> >> > @@ -2329,6 +2329,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
> >> >  	if (!pageblock_aligned(cc->migrate_pfn))
> >> >  		return COMPACT_CONTINUE;
> >> >  
> >> > +	/*
> >> > +	 * When defrag_mode is enabled, make kcompactd target
> >> > +	 * watermarks in whole pageblocks. Because they can be stolen
> >> > +	 * without polluting, no further fallback checks are needed.
> >> > +	 */
> >> > +	if (defrag_mode && !cc->direct_compaction) {
> >> > +		if (__zone_watermark_ok(cc->zone, cc->order,
> >> > +					high_wmark_pages(cc->zone),
> >> > +					cc->highest_zoneidx, cc->alloc_flags,
> >> > +					zone_page_state(cc->zone,
> >> > +							NR_FREE_PAGES_BLOCKS)))
> >> > +			return COMPACT_SUCCESS;
> >> > +
> >> > +		return COMPACT_CONTINUE;
> >> > +	}
> >> 
> >> Wonder if this ever succeeds in practice. Is high_wmark_pages() even aligned
> >> to pageblock size? If not, and it's X pageblocks and a half, we will rarely
> >> have NR_FREE_PAGES_BLOCKS cover all of that? Also concurrent allocations can
> >> put us below high wmark quickly and then we never satisfy this?
> > 
> > The high watermark is not aligned, but why does it have to be? It's a
> > binary condition: met or not met. Compaction continues until it's met.
> 
> What I mean is, kswapd will reclaim until the high watermark, which would be
> 32.7 blocks, wake up kcompactd [*] but that can only create up to 32 blocks
> of NR_FREE_PAGES_BLOCKS so it has already lost at that point? (unless
> there's concurrent freeing pushing it above the high wmark)

Ah, but kswapd also uses the (rounded up) NR_FREE_PAGES_BLOCKS check.

Buckle up...

> > Of course, similar to kswapd, it might not reach the watermarks and
> > keep running if there is a continuous stream of allocations consuming
> > the blocks it's making. Hence the ratio between wakeups & continues.
> > 
> > But when demand stops, it'll balance the high mark and quit.
> 
> Again, since kcompactd can only defragment free space and not create it, it
> may be trying in vain?
> 
> [*] now when checking the code between kswapd and kcompactd handover, I
> think I found another problem?
> 
> we have:
> kswapd_try_to_sleep()
>   prepare_kswapd_sleep() - needs to succeed for wakeup_kcompactd()
>    pgdat_balanced() - needs to be true for prepare_kswapd_sleep() to be true
>     - with defrag_mode we want high watermark of NR_FREE_PAGES_BLOCKS, but
>       we were only reclaiming until now and didn't wake up kcompactd and
>       this actually prevents the wake up?

Correct, so as per above, kswapd also does the NR_FREE_PAGES_BLOCKS
check. At first, at least. So it continues to produce adequate scratch
space and won't leave compaction high and dry on a watermark it cannot
meet. They are indeed coordinated in this aspect.

As far as the *handoff* to kcompactd goes, I've been pulling my hair
over this for a very long time. You're correct about the graph
above. And actually, this is the case before defrag_mode too: if you
wake kswapd with, say, an order-8, it will do pgdat_balanced() checks
against that, seemingly reclaim until the request can succeed, *then*
wake kcompactd and sleep. WTF?

But kswapd has this:

	/*
	 * Fragmentation may mean that the system cannot be rebalanced for
	 * high-order allocations. If twice the allocation size has been
	 * reclaimed then recheck watermarks only at order-0 to prevent
	 * excessive reclaim. Assume that a process requested a high-order
	 * can direct reclaim/compact.
	 */
	if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
		sc->order = 0;

Ignore the comment and just consider the code. What it does for higher
orders (whether defrag_mode is enabled or not), is reclaim a gap for
the order, ensure order-0 is met (but most likely it is), then enter
the sleep path - wake kcompactd and wait for more work.
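
For reference, compact_gap() in mm/internal.h boils down to twice the
allocation size:

	static inline unsigned long compact_gap(unsigned int order)
	{
		return 2UL << order;
	}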

Effectively, as long as there are pending higher-order requests
looping in the allocator, kswapd does this:

1) reclaim a compaction gap delta
2) wake kcompactd
3) goto 1

This pipelining seems to work *very* well in practice, especially when
there is a large number of concurrent requests.

In the original huge allocator series, I tried to convert kswapd to
use compaction_suitable() to hand over quicker. However, this ran into
scaling issues with higher allocation concurrency: maintaining just a
single, static compact gap when there could be hundreds of allocation
requests waiting for compaction results falls apart fast.

The current code has it right. The comments might be a bit dated and
maybe it could use some fine tuning. But generally, as long as there
are incoming wakeups from the allocator, it makes sense to keep making
more space for compaction as well.

I think Mel was playing 4d chess with this stuff.

[ I kept direct reclaim/compaction out of this defrag_mode series, but
  testing suggests the same is likely true for the direct path.

  Direct reclaim bails from compaction_ready() if there is a static
  compaction gap for that order. But once the gap for a given order is
  there, you can get a thundering herd of direct compactors storming
  on this gap, most of which will then fail compaction_suitable().

  A pipeline of "reclaim gap delta, direct compact, retry" seems to
  make more sense there as well. With adequate checks to prevent
  excessive reclaim in corner cases of course... ]
Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
Posted by Johannes Weiner 8 months, 1 week ago
On Fri, Apr 11, 2025 at 02:21:58PM -0400, Johannes Weiner wrote:
> On Fri, Apr 11, 2025 at 06:51:51PM +0200, Vlastimil Babka wrote:
> > [*] now when checking the code between kswapd and kcompactd handover, I
> > think I found another problem?
> > 
> > we have:
> > kswapd_try_to_sleep()
> >   prepare_kswapd_sleep() - needs to succeed for wakeup_kcompactd()
> >    pgdat_balanced() - needs to be true for prepare_kswapd_sleep() to be true
> >     - with defrag_mode we want high watermark of NR_FREE_PAGES_BLOCKS, but
> >       we were only reclaiming until now and didn't wake up kcompactd and
> >       this actually prevents the wake up?

Coming back to this, there is indeed a defrag_mode issue. My
apologies, I misunderstood what you were pointing at.

Like I said, kswapd reverts to order-0 in some other place to go to
sleep and trigger the handoff. At that point, defrag_mode also needs
to revert to NR_FREE_PAGES.

It's curious that this didn't stand out in testing. On the contrary,
kcompactd was still doing the vast majority of the compaction work. It
looks like kswapd and direct workers on their own manage to balance
NR_FREE_PAGES_BLOCKS every so often, and then kswapd hands off. Given
the low number of kcompactd wakeups, the consumers keep it going.

So testing with this:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index cc422ad830d6..c2aa0a4b67de 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6747,8 +6747,11 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
 		/*
 		 * In defrag_mode, watermarks must be met in whole
 		 * blocks to avoid polluting allocator fallbacks.
+		 *
+		 * When kswapd has compact gap, check regular
+		 * NR_FREE_PAGES and hand over to kcompactd.
 		 */
-		if (defrag_mode)
+		if (defrag_mode && order)
 			item = NR_FREE_PAGES_BLOCKS;
 		else
 			item = NR_FREE_PAGES;

I'm getting the following results:

                                fallbackspeed/STUPID-DEFRAGMODE fallbackspeed/DEFRAGMODE
Hugealloc Time mean                       79381.34 (    +0.00%)    88126.12 (   +11.02%)
Hugealloc Time stddev                     85852.16 (    +0.00%)   135366.75 (   +57.67%)
Kbuild Real time                            249.35 (    +0.00%)      226.71 (    -9.04%)
Kbuild User time                           1249.16 (    +0.00%)     1249.37 (    +0.02%)
Kbuild System time                          171.76 (    +0.00%)      166.93 (    -2.79%)
THP fault alloc                           51666.87 (    +0.00%)    52685.60 (    +1.97%)
THP fault fallback                        16970.00 (    +0.00%)    15951.87 (    -6.00%)
Direct compact fail                         166.53 (    +0.00%)      178.93 (    +7.40%)
Direct compact success                       17.13 (    +0.00%)        4.13 (   -71.69%)
Compact daemon scanned migrate          3095413.33 (    +0.00%)  9231239.53 (  +198.22%)
Compact daemon scanned free             2155966.53 (    +0.00%)  7053692.87 (  +227.17%)
Compact direct scanned migrate           265642.47 (    +0.00%)    68388.33 (   -74.26%)
Compact direct scanned free              130252.60 (    +0.00%)    55634.87 (   -57.29%)
Compact total migrate scanned           3361055.80 (    +0.00%)  9299627.87 (  +176.69%)
Compact total free scanned              2286219.13 (    +0.00%)  7109327.73 (  +210.96%)
Alloc stall                                1890.80 (    +0.00%)     6297.60 (  +232.94%)
Pages kswapd scanned                    9043558.80 (    +0.00%)  5952576.73 (   -34.18%)
Pages kswapd reclaimed                  1891708.67 (    +0.00%)  1030645.00 (   -45.52%)
Pages direct scanned                    1017090.60 (    +0.00%)  2688047.60 (  +164.29%)
Pages direct reclaimed                    92682.60 (    +0.00%)   309770.53 (  +234.22%)
Pages total scanned                    10060649.40 (    +0.00%)  8640624.33 (   -14.11%)
Pages total reclaimed                   1984391.27 (    +0.00%)  1340415.53 (   -32.45%)
Swap out                                 884585.73 (    +0.00%)   417781.93 (   -52.77%)
Swap in                                  287106.27 (    +0.00%)    95589.73 (   -66.71%)
File refaults                            551697.60 (    +0.00%)   426474.80 (   -22.70%)

Work has shifted from direct to kcompactd. In aggregate there is more
compaction happening. Meanwhile aggregate reclaim and swapping drops
quite substantially. %sys is down, so this is just more efficient.

Reclaim and swapping are down substantially, which is great. But the
reclaim work that remains has shifted somewhat to direct reclaim,
which is unfortunate. THP delivery is also a tad worse, but still much
better than !defrag_mode, so not too concerning. That part deserves a
bit more thought.

Overall, this looks good, though. I'll send a proper patch next week.

Thanks for the review, Vlastimil.
Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
Posted by Vlastimil Babka 8 months ago
On 4/13/25 04:20, Johannes Weiner wrote:
> On Fri, Apr 11, 2025 at 02:21:58PM -0400, Johannes Weiner wrote:
>> On Fri, Apr 11, 2025 at 06:51:51PM +0200, Vlastimil Babka wrote:
>> > [*] now when checking the code between kswapd and kcompactd handover, I
>> > think I found another problem?
>> > 
>> > we have:
>> > kswapd_try_to_sleep()
>> >   prepare_kswapd_sleep() - needs to succeed for wakeup_kcompactd()
>> >    pgdat_balanced() - needs to be true for prepare_kswapd_sleep() to be true
>> >     - with defrag_mode we want high watermark of NR_FREE_PAGES_BLOCKS, but
>> >       we were only reclaiming until now and didn't wake up kcompactd and
>> >       this actually prevents the wake up?
> 
> Coming back to this, there is indeed a defrag_mode issue. My
> apologies, I misunderstood what you were pointing at.
> 
> Like I said, kswapd reverts to order-0 in some other place to go to
> sleep and trigger the handoff. At that point, defrag_mode also needs
> to revert to NR_FREE_PAGES.
> 
> It's curious that this didn't stand out in testing. On the contrary,
> kcompactd was still doing the vast majority of the compaction work. It
> looks like kswapd and direct workers on their own manage to balance
> NR_FREE_PAGES_BLOCKS every so often, and then kswapd hands off. Given
> the low number of kcompactd wakeups, the consumers keep it going.
> 
> So testing with this:
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cc422ad830d6..c2aa0a4b67de 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6747,8 +6747,11 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
>  		/*
>  		 * In defrag_mode, watermarks must be met in whole
>  		 * blocks to avoid polluting allocator fallbacks.
> +		 *
> +		 * When kswapd has compact gap, check regular
> +		 * NR_FREE_PAGES and hand over to kcompactd.
>  		 */
> -		if (defrag_mode)
> +		if (defrag_mode && order)
>  			item = NR_FREE_PAGES_BLOCKS;
>  		else
>  			item = NR_FREE_PAGES;
> 
> I'm getting the following results:
> 
>                                 fallbackspeed/STUPID-DEFRAGMODE fallbackspeed/DEFRAGMODE
> Hugealloc Time mean                       79381.34 (    +0.00%)    88126.12 (   +11.02%)
> Hugealloc Time stddev                     85852.16 (    +0.00%)   135366.75 (   +57.67%)
> Kbuild Real time                            249.35 (    +0.00%)      226.71 (    -9.04%)
> Kbuild User time                           1249.16 (    +0.00%)     1249.37 (    +0.02%)
> Kbuild System time                          171.76 (    +0.00%)      166.93 (    -2.79%)
> THP fault alloc                           51666.87 (    +0.00%)    52685.60 (    +1.97%)
> THP fault fallback                        16970.00 (    +0.00%)    15951.87 (    -6.00%)
> Direct compact fail                         166.53 (    +0.00%)      178.93 (    +7.40%)
> Direct compact success                       17.13 (    +0.00%)        4.13 (   -71.69%)
> Compact daemon scanned migrate          3095413.33 (    +0.00%)  9231239.53 (  +198.22%)
> Compact daemon scanned free             2155966.53 (    +0.00%)  7053692.87 (  +227.17%)

However, this brings me back to my concern with __compact_finished()
requiring the high watermark to be met in NR_FREE_PAGES_BLOCKS. IMHO it can
very easily lead to situations where all free memory is already contiguous,
but because of one or two concurrent THP allocations we're below the high
watermark (yet not below the low watermark that would wake kswapd again), so
further compaction by kcompactd is simply wasting cpu cycles at that point.
Again, I think comparing NR_FREE_PAGES_BLOCKS to NR_FREE_PAGES would in
theory work better for determining whether all free space is as defragmented
as possible?

> Compact direct scanned migrate           265642.47 (    +0.00%)    68388.33 (   -74.26%)
> Compact direct scanned free              130252.60 (    +0.00%)    55634.87 (   -57.29%)
> Compact total migrate scanned           3361055.80 (    +0.00%)  9299627.87 (  +176.69%)
> Compact total free scanned              2286219.13 (    +0.00%)  7109327.73 (  +210.96%)
> Alloc stall                                1890.80 (    +0.00%)     6297.60 (  +232.94%)
> Pages kswapd scanned                    9043558.80 (    +0.00%)  5952576.73 (   -34.18%)
> Pages kswapd reclaimed                  1891708.67 (    +0.00%)  1030645.00 (   -45.52%)
> Pages direct scanned                    1017090.60 (    +0.00%)  2688047.60 (  +164.29%)
> Pages direct reclaimed                    92682.60 (    +0.00%)   309770.53 (  +234.22%)
> Pages total scanned                    10060649.40 (    +0.00%)  8640624.33 (   -14.11%)
> Pages total reclaimed                   1984391.27 (    +0.00%)  1340415.53 (   -32.45%)
> Swap out                                 884585.73 (    +0.00%)   417781.93 (   -52.77%)
> Swap in                                  287106.27 (    +0.00%)    95589.73 (   -66.71%)
> File refaults                            551697.60 (    +0.00%)   426474.80 (   -22.70%)
> 
> Work has shifted from direct to kcompactd. In aggregate there is more
> compaction happening. Meanwhile aggregate reclaim and swapping drops
> quite substantially. %sys is down, so this is just more efficient.
> 
> Reclaim and swapping are down substantially, which is great. But the
> reclaim work that remains has shifted somewhat to direct reclaim,
> which is unfortunate. THP delivery is also a tad worse, but still much
> better than !defrag_mode, so not too concerning. That part deserves a
> bit more thought.
> 
> Overall, this looks good, though. I'll send a proper patch next week.
> 
> Thanks for the review, Vlastimil.
Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
Posted by Vlastimil Babka 8 months ago
On 4/13/25 04:20, Johannes Weiner wrote:
> On Fri, Apr 11, 2025 at 02:21:58PM -0400, Johannes Weiner wrote:
>> On Fri, Apr 11, 2025 at 06:51:51PM +0200, Vlastimil Babka wrote:
>> > [*] now when checking the code between kswapd and kcompactd handover, I
>> > think I found another problem?
>> > 
>> > we have:
>> > kswapd_try_to_sleep()
>> >   prepare_kswapd_sleep() - needs to succeed for wakeup_kcompactd()
>> >    pgdat_balanced() - needs to be true for prepare_kswapd_sleep() to be true
>> >     - with defrag_mode we want high watermark of NR_FREE_PAGES_BLOCKS, but
>> >       we were only reclaiming until now and didn't wake up kcompactd and
>> >       this actually prevents the wake up?
> 
> Coming back to this, there is indeed a defrag_mode issue. My
> apologies, I misunderstood what you were pointing at.
> 
> Like I said, kswapd reverts to order-0 in some other place to go to
> sleep and trigger the handoff. At that point, defrag_mode also needs
> to revert to NR_FREE_PAGES.

I missed that revert to order-0 and that without it the current code also
wouldn't make sense. But I agree with the fix.

> It's curious that this didn't stand out in testing. On the contrary,
> kcompactd was still doing the vast majority of the compaction work. It
> looks like kswapd and direct workers on their own manage to balance
> NR_FREE_PAGES_BLOCKS every so often, and then kswapd hands off. Given
> the low number of kcompactd wakeups, the consumers keep it going.
> 
> So testing with this:
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cc422ad830d6..c2aa0a4b67de 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6747,8 +6747,11 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
>  		/*
>  		 * In defrag_mode, watermarks must be met in whole
>  		 * blocks to avoid polluting allocator fallbacks.
> +		 *
> +		 * When kswapd has compact gap, check regular
> +		 * NR_FREE_PAGES and hand over to kcompactd.
>  		 */
> -		if (defrag_mode)
> +		if (defrag_mode && order)
>  			item = NR_FREE_PAGES_BLOCKS;
>  		else
>  			item = NR_FREE_PAGES;
> 
> I'm getting the following results:
> 
>                                 fallbackspeed/STUPID-DEFRAGMODE fallbackspeed/DEFRAGMODE
> Hugealloc Time mean                       79381.34 (    +0.00%)    88126.12 (   +11.02%)
> Hugealloc Time stddev                     85852.16 (    +0.00%)   135366.75 (   +57.67%)
> Kbuild Real time                            249.35 (    +0.00%)      226.71 (    -9.04%)
> Kbuild User time                           1249.16 (    +0.00%)     1249.37 (    +0.02%)
> Kbuild System time                          171.76 (    +0.00%)      166.93 (    -2.79%)
> THP fault alloc                           51666.87 (    +0.00%)    52685.60 (    +1.97%)
> THP fault fallback                        16970.00 (    +0.00%)    15951.87 (    -6.00%)
> Direct compact fail                         166.53 (    +0.00%)      178.93 (    +7.40%)
> Direct compact success                       17.13 (    +0.00%)        4.13 (   -71.69%)
> Compact daemon scanned migrate          3095413.33 (    +0.00%)  9231239.53 (  +198.22%)
> Compact daemon scanned free             2155966.53 (    +0.00%)  7053692.87 (  +227.17%)
> Compact direct scanned migrate           265642.47 (    +0.00%)    68388.33 (   -74.26%)
> Compact direct scanned free              130252.60 (    +0.00%)    55634.87 (   -57.29%)
> Compact total migrate scanned           3361055.80 (    +0.00%)  9299627.87 (  +176.69%)
> Compact total free scanned              2286219.13 (    +0.00%)  7109327.73 (  +210.96%)
> Alloc stall                                1890.80 (    +0.00%)     6297.60 (  +232.94%)
> Pages kswapd scanned                    9043558.80 (    +0.00%)  5952576.73 (   -34.18%)
> Pages kswapd reclaimed                  1891708.67 (    +0.00%)  1030645.00 (   -45.52%)
> Pages direct scanned                    1017090.60 (    +0.00%)  2688047.60 (  +164.29%)
> Pages direct reclaimed                    92682.60 (    +0.00%)   309770.53 (  +234.22%)
> Pages total scanned                    10060649.40 (    +0.00%)  8640624.33 (   -14.11%)
> Pages total reclaimed                   1984391.27 (    +0.00%)  1340415.53 (   -32.45%)
> Swap out                                 884585.73 (    +0.00%)   417781.93 (   -52.77%)
> Swap in                                  287106.27 (    +0.00%)    95589.73 (   -66.71%)
> File refaults                            551697.60 (    +0.00%)   426474.80 (   -22.70%)
> 
> Work has shifted from direct to kcompactd. In aggregate there is more
> compaction happening. Meanwhile aggregate reclaim and swapping drops
> quite substantially. %sys is down, so this is just more efficient.
> 
> Reclaim and swapping are down substantially, which is great. But the
> reclaim work that remains has shifted somewhat to direct reclaim,
> which is unfortunate. THP delivery is also a tad worse, but still much
> better than !defrag_mode, so not too concerning. That part deserves a
> bit more thought.
> 
> Overall, this looks good, though. I'll send a proper patch next week.
> 
> Thanks for the review, Vlastimil.

NP!
Re: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
Posted by Johannes Weiner 9 months, 1 week ago
Andrew, could you please fold this delta patch?

---

From 3d2ff7b72df9e4f1a31b3cff2ae6a4584c06bdca Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 14 Mar 2025 11:38:41 -0400
Subject: [PATCH] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks fix

Fix squawks from rebasing that affect the behavior of !defrag_mode.

FWIW, it seems to actually have slightly helped the vanilla kernel in
the benchmark. But the point was to not change the default behavior:

                                                VANILLA        WMARKFIX-VANILLA
Hugealloc Time mean               52739.45 (    +0.00%)   62758.21 (   +19.00%)
Hugealloc Time stddev             56541.26 (    +0.00%)   76253.41 (   +34.86%)
Kbuild Real time                    197.47 (    +0.00%)     197.25 (    -0.11%)
Kbuild User time                   1240.49 (    +0.00%)    1241.33 (    +0.07%)
Kbuild System time                   70.08 (    +0.00%)      71.00 (    +1.28%)
THP fault alloc                   46727.07 (    +0.00%)   41492.73 (   -11.20%)
THP fault fallback                21910.60 (    +0.00%)   27146.53 (   +23.90%)
Direct compact fail                 195.80 (    +0.00%)     260.93 (   +33.10%)
Direct compact success                7.93 (    +0.00%)       6.67 (   -14.18%)
Direct compact success rate %         3.51 (    +0.00%)       2.76 (   -16.78%)
Compact daemon scanned migrate  3369601.27 (    +0.00%) 3827734.27 (   +13.60%)
Compact daemon scanned free     5075474.47 (    +0.00%) 5910839.73 (   +16.46%)
Compact direct scanned migrate   161787.27 (    +0.00%)  168271.13 (    +4.01%)
Compact direct scanned free      163467.53 (    +0.00%)  222558.33 (   +36.15%)
Compact total migrate scanned   3531388.53 (    +0.00%) 3996005.40 (   +13.16%)
Compact total free scanned      5238942.00 (    +0.00%) 6133398.07 (   +17.07%)
Alloc stall                        2371.07 (    +0.00%)    2478.00 (    +4.51%)
Pages kswapd scanned            2160926.73 (    +0.00%) 1726204.67 (   -20.12%)
Pages kswapd reclaimed           533191.07 (    +0.00%)  537963.73 (    +0.90%)
Pages direct scanned             400450.33 (    +0.00%)  450004.87 (   +12.37%)
Pages direct reclaimed            94441.73 (    +0.00%)   99193.07 (    +5.03%)
Pages total scanned             2561377.07 (    +0.00%) 2176209.53 (   -15.04%)
Pages total reclaimed            627632.80 (    +0.00%)  637156.80 (    +1.52%)
Swap out                          47959.53 (    +0.00%)   45186.20 (    -5.78%)
Swap in                            7276.00 (    +0.00%)    7109.40 (    -2.29%)
File refaults                    138043.00 (    +0.00%)  145238.73 (    +5.21%)

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/compaction.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 4a2ccb82d0b2..a481755791a9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -3075,6 +3075,8 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
 	struct zone *zone;
 	enum zone_type highest_zoneidx = pgdat->kcompactd_highest_zoneidx;
 	enum compact_result ret;
+	unsigned int alloc_flags = defrag_mode ?
+		ALLOC_WMARK_HIGH : ALLOC_WMARK_MIN;
 
 	for (zoneid = 0; zoneid <= highest_zoneidx; zoneid++) {
 		zone = &pgdat->node_zones[zoneid];
@@ -3084,7 +3086,7 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
 
 		ret = compaction_suit_allocation_order(zone,
 				pgdat->kcompactd_max_order,
-				highest_zoneidx, ALLOC_WMARK_MIN,
+				highest_zoneidx, alloc_flags,
 				false, true);
 		if (ret == COMPACT_CONTINUE)
 			return true;
@@ -3108,7 +3110,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 		.mode = MIGRATE_SYNC_LIGHT,
 		.ignore_skip_hint = false,
 		.gfp_mask = GFP_KERNEL,
-		.alloc_flags = ALLOC_WMARK_HIGH,
+		.alloc_flags = defrag_mode ? ALLOC_WMARK_HIGH : ALLOC_WMARK_MIN,
 	};
 	enum compact_result ret;
 
-- 
2.48.1