The previous patch added pageblock_order reclaim to kswapd/kcompactd,
which helps, but produces only one block at a time. Allocation stalls
and THP failure rates are still higher than they could be.
To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change
the watermarking for kswapd & kcompactd: instead of targeting the high
watermark in order-0 pages and checking for one suitable block, simply
require that the high watermark is entirely met in pageblocks.
To this end, track the number of free pages within contiguous
pageblocks, then change pgdat_balanced() and compact_finished() to
check watermarks against this new value.
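In rough pseudo-form, the new balance condition for kswapd/kcompactd looks
like this (an illustrative sketch only - the defrag_mode_balanced() helper
is made up for exposition; the real checks live in the pgdat_balanced()
and __compact_finished() hunks below):

	/* sketch, not part of the patch */
	static bool defrag_mode_balanced(struct zone *zone, int order,
					 int highest_zoneidx)
	{
		/* pages that sit in fully free pageblocks */
		unsigned long free = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);

		/* the high watermark must be met by whole blocks alone */
		return __zone_watermark_ok(zone, order, high_wmark_pages(zone),
					   highest_zoneidx, 0, free);
	}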
This further reduces THP latencies and allocation stalls, and improves
THP success rates against the previous patch:
DEFRAGMODE-ASYNC DEFRAGMODE-ASYNC-WMARKS
Hugealloc Time mean 34300.36 ( +0.00%) 28904.00 ( -15.73%)
Hugealloc Time stddev 36390.42 ( +0.00%) 33464.37 ( -8.04%)
Kbuild Real time 196.13 ( +0.00%) 196.59 ( +0.23%)
Kbuild User time 1234.74 ( +0.00%) 1231.67 ( -0.25%)
Kbuild System time 62.62 ( +0.00%) 59.10 ( -5.54%)
THP fault alloc 57054.53 ( +0.00%) 63223.67 ( +10.81%)
THP fault fallback 11581.40 ( +0.00%) 5412.47 ( -53.26%)
Direct compact fail 107.80 ( +0.00%) 59.07 ( -44.79%)
Direct compact success 4.53 ( +0.00%) 2.80 ( -31.33%)
Direct compact success rate % 3.20 ( +0.00%) 3.99 ( +18.66%)
Compact daemon scanned migrate 5461033.93 ( +0.00%) 2267500.33 ( -58.48%)
Compact daemon scanned free 5824897.93 ( +0.00%) 2339773.00 ( -59.83%)
Compact direct scanned migrate 58336.93 ( +0.00%) 47659.93 ( -18.30%)
Compact direct scanned free 32791.87 ( +0.00%) 40729.67 ( +24.21%)
Compact total migrate scanned 5519370.87 ( +0.00%) 2315160.27 ( -58.05%)
Compact total free scanned 5857689.80 ( +0.00%) 2380502.67 ( -59.36%)
Alloc stall 2424.60 ( +0.00%) 638.87 ( -73.62%)
Pages kswapd scanned 2657018.33 ( +0.00%) 4002186.33 ( +50.63%)
Pages kswapd reclaimed 559583.07 ( +0.00%) 718577.80 ( +28.41%)
Pages direct scanned 722094.07 ( +0.00%) 355172.73 ( -50.81%)
Pages direct reclaimed 107257.80 ( +0.00%) 31162.80 ( -70.95%)
Pages total scanned 3379112.40 ( +0.00%) 4357359.07 ( +28.95%)
Pages total reclaimed 666840.87 ( +0.00%) 749740.60 ( +12.43%)
Swap out 77238.20 ( +0.00%) 110084.33 ( +42.53%)
Swap in 11712.80 ( +0.00%) 24457.00 ( +108.80%)
File refaults 143438.80 ( +0.00%) 188226.93 ( +31.22%)
Also of note is that compaction work overall is reduced. The reason
for this is that when free pageblocks are more readily available,
allocations are also much more likely to get physically placed in LRU
order, instead of being forced to scavenge free space here and there.
This means that reclaim by itself has better chances of freeing up
whole blocks, and the system relies less on compaction.
Comparing all changes to the vanilla kernel:
VANILLA DEFRAGMODE-ASYNC-WMARKS
Hugealloc Time mean 52739.45 ( +0.00%) 28904.00 ( -45.19%)
Hugealloc Time stddev 56541.26 ( +0.00%) 33464.37 ( -40.81%)
Kbuild Real time 197.47 ( +0.00%) 196.59 ( -0.44%)
Kbuild User time 1240.49 ( +0.00%) 1231.67 ( -0.71%)
Kbuild System time 70.08 ( +0.00%) 59.10 ( -15.45%)
THP fault alloc 46727.07 ( +0.00%) 63223.67 ( +35.30%)
THP fault fallback 21910.60 ( +0.00%) 5412.47 ( -75.29%)
Direct compact fail 195.80 ( +0.00%) 59.07 ( -69.48%)
Direct compact success 7.93 ( +0.00%) 2.80 ( -57.46%)
Direct compact success rate % 3.51 ( +0.00%) 3.99 ( +10.49%)
Compact daemon scanned migrate 3369601.27 ( +0.00%) 2267500.33 ( -32.71%)
Compact daemon scanned free 5075474.47 ( +0.00%) 2339773.00 ( -53.90%)
Compact direct scanned migrate 161787.27 ( +0.00%) 47659.93 ( -70.54%)
Compact direct scanned free 163467.53 ( +0.00%) 40729.67 ( -75.08%)
Compact total migrate scanned 3531388.53 ( +0.00%) 2315160.27 ( -34.44%)
Compact total free scanned 5238942.00 ( +0.00%) 2380502.67 ( -54.56%)
Alloc stall 2371.07 ( +0.00%) 638.87 ( -73.02%)
Pages kswapd scanned 2160926.73 ( +0.00%) 4002186.33 ( +85.21%)
Pages kswapd reclaimed 533191.07 ( +0.00%) 718577.80 ( +34.77%)
Pages direct scanned 400450.33 ( +0.00%) 355172.73 ( -11.31%)
Pages direct reclaimed 94441.73 ( +0.00%) 31162.80 ( -67.00%)
Pages total scanned 2561377.07 ( +0.00%) 4357359.07 ( +70.12%)
Pages total reclaimed 627632.80 ( +0.00%) 749740.60 ( +19.46%)
Swap out 47959.53 ( +0.00%) 110084.33 ( +129.53%)
Swap in 7276.00 ( +0.00%) 24457.00 ( +236.10%)
File refaults 138043.00 ( +0.00%) 188226.93 ( +36.35%)
THP allocation latencies and %sys time are down dramatically.
THP allocation failures are down from nearly 50% to 8.5%. And to
recall previous data points, the success rates are steady and reliable
without the cumulative deterioration caused by fragmentation events.
Compaction work is down overall. Direct compaction work especially is
drastically reduced. As an aside, its success rate of 4% indicates
there is room for improvement. For now it's good to rely on it less.
Reclaim work is up overall, however direct reclaim work is down. Part
of the increase can be attributed to a higher use of THPs, which due
to internal fragmentation increase the memory footprint. This is not
necessarily an unexpected side-effect for users of THP.
Taking both points together, though, there may well be some opportunities
for fine-tuning the reclaim/compaction coordination.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
include/linux/mmzone.h | 1 +
mm/compaction.c | 37 ++++++++++++++++++++++++++++++-------
mm/internal.h | 1 +
mm/page_alloc.c | 29 +++++++++++++++++++++++------
mm/vmscan.c | 15 ++++++++++++++-
mm/vmstat.c | 1 +
6 files changed, 70 insertions(+), 14 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index dbb0ad69e17f..37c29f3fbca8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -138,6 +138,7 @@ enum numa_stat_item {
enum zone_stat_item {
/* First 128 byte cacheline (assuming 64 bit words) */
NR_FREE_PAGES,
+ NR_FREE_PAGES_BLOCKS,
NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
NR_ZONE_INACTIVE_ANON = NR_ZONE_LRU_BASE,
NR_ZONE_ACTIVE_ANON,
diff --git a/mm/compaction.c b/mm/compaction.c
index 036353ef1878..4a2ccb82d0b2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2329,6 +2329,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
if (!pageblock_aligned(cc->migrate_pfn))
return COMPACT_CONTINUE;
+ /*
+ * When defrag_mode is enabled, make kcompactd target
+ * watermarks in whole pageblocks. Because they can be stolen
+ * without polluting, no further fallback checks are needed.
+ */
+ if (defrag_mode && !cc->direct_compaction) {
+ if (__zone_watermark_ok(cc->zone, cc->order,
+ high_wmark_pages(cc->zone),
+ cc->highest_zoneidx, cc->alloc_flags,
+ zone_page_state(cc->zone,
+ NR_FREE_PAGES_BLOCKS)))
+ return COMPACT_SUCCESS;
+
+ return COMPACT_CONTINUE;
+ }
+
/* Direct compactor: Is a suitable page free? */
ret = COMPACT_NO_SUITABLE_PAGE;
for (order = cc->order; order < NR_PAGE_ORDERS; order++) {
@@ -2496,13 +2512,19 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
static enum compact_result
compaction_suit_allocation_order(struct zone *zone, unsigned int order,
int highest_zoneidx, unsigned int alloc_flags,
- bool async)
+ bool async, bool kcompactd)
{
+ unsigned long free_pages;
unsigned long watermark;
+ if (kcompactd && defrag_mode)
+ free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
+ else
+ free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
- if (zone_watermark_ok(zone, order, watermark, highest_zoneidx,
- alloc_flags))
+ if (__zone_watermark_ok(zone, order, watermark, highest_zoneidx,
+ alloc_flags, free_pages))
return COMPACT_SUCCESS;
/*
@@ -2558,7 +2580,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
ret = compaction_suit_allocation_order(cc->zone, cc->order,
cc->highest_zoneidx,
cc->alloc_flags,
- cc->mode == MIGRATE_ASYNC);
+ cc->mode == MIGRATE_ASYNC,
+ !cc->direct_compaction);
if (ret != COMPACT_CONTINUE)
return ret;
}
@@ -3062,7 +3085,7 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
ret = compaction_suit_allocation_order(zone,
pgdat->kcompactd_max_order,
highest_zoneidx, ALLOC_WMARK_MIN,
- false);
+ false, true);
if (ret == COMPACT_CONTINUE)
return true;
}
@@ -3085,7 +3108,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
.mode = MIGRATE_SYNC_LIGHT,
.ignore_skip_hint = false,
.gfp_mask = GFP_KERNEL,
- .alloc_flags = ALLOC_WMARK_MIN,
+ .alloc_flags = ALLOC_WMARK_HIGH,
};
enum compact_result ret;
@@ -3105,7 +3128,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
ret = compaction_suit_allocation_order(zone,
cc.order, zoneid, cc.alloc_flags,
- false);
+ false, true);
if (ret != COMPACT_CONTINUE)
continue;
diff --git a/mm/internal.h b/mm/internal.h
index 2f52a65272c1..286520a424fe 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -536,6 +536,7 @@ extern char * const zone_names[MAX_NR_ZONES];
DECLARE_STATIC_KEY_MAYBE(CONFIG_DEBUG_VM, check_pages_enabled);
extern int min_free_kbytes;
+extern int defrag_mode;
void setup_per_zone_wmarks(void);
void calculate_min_free_kbytes(void);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a0d8f871e56..c33c08e278f9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -273,7 +273,7 @@ int min_free_kbytes = 1024;
int user_min_free_kbytes = -1;
static int watermark_boost_factor __read_mostly = 15000;
static int watermark_scale_factor = 10;
-static int defrag_mode;
+int defrag_mode;
/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
int movable_zone;
@@ -660,16 +660,20 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
bool tail)
{
struct free_area *area = &zone->free_area[order];
+ int nr_pages = 1 << order;
VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
"page type is %lu, passed migratetype is %d (nr=%d)\n",
- get_pageblock_migratetype(page), migratetype, 1 << order);
+ get_pageblock_migratetype(page), migratetype, nr_pages);
if (tail)
list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
else
list_add(&page->buddy_list, &area->free_list[migratetype]);
area->nr_free++;
+
+ if (order >= pageblock_order && !is_migrate_isolate(migratetype))
+ __mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, nr_pages);
}
/*
@@ -681,24 +685,34 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
unsigned int order, int old_mt, int new_mt)
{
struct free_area *area = &zone->free_area[order];
+ int nr_pages = 1 << order;
/* Free page moving can fail, so it happens before the type update */
VM_WARN_ONCE(get_pageblock_migratetype(page) != old_mt,
"page type is %lu, passed migratetype is %d (nr=%d)\n",
- get_pageblock_migratetype(page), old_mt, 1 << order);
+ get_pageblock_migratetype(page), old_mt, nr_pages);
list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
- account_freepages(zone, -(1 << order), old_mt);
- account_freepages(zone, 1 << order, new_mt);
+ account_freepages(zone, -nr_pages, old_mt);
+ account_freepages(zone, nr_pages, new_mt);
+
+ if (order >= pageblock_order &&
+ is_migrate_isolate(old_mt) != is_migrate_isolate(new_mt)) {
+ if (!is_migrate_isolate(old_mt))
+ nr_pages = -nr_pages;
+ __mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, nr_pages);
+ }
}
static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
unsigned int order, int migratetype)
{
+ int nr_pages = 1 << order;
+
VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
"page type is %lu, passed migratetype is %d (nr=%d)\n",
- get_pageblock_migratetype(page), migratetype, 1 << order);
+ get_pageblock_migratetype(page), migratetype, nr_pages);
/* clear reported state and update reported page count */
if (page_reported(page))
@@ -708,6 +722,9 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
__ClearPageBuddy(page);
set_page_private(page, 0);
zone->free_area[order].nr_free--;
+
+ if (order >= pageblock_order && !is_migrate_isolate(migratetype))
+ __mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, -nr_pages);
}
static inline void del_page_from_free_list(struct page *page, struct zone *zone,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3370bdca6868..b5c7dfc2b189 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6724,11 +6724,24 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
* meet watermarks.
*/
for_each_managed_zone_pgdat(zone, pgdat, i, highest_zoneidx) {
+ unsigned long free_pages;
+
if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
mark = promo_wmark_pages(zone);
else
mark = high_wmark_pages(zone);
- if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
+
+ /*
+ * In defrag_mode, watermarks must be met in whole
+ * blocks to avoid polluting allocator fallbacks.
+ */
+ if (defrag_mode)
+ free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
+ else
+ free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
+ if (__zone_watermark_ok(zone, order, mark, highest_zoneidx,
+ 0, free_pages))
return true;
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 16bfe1c694dd..ed49a86348f7 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1190,6 +1190,7 @@ int fragmentation_index(struct zone *zone, unsigned int order)
const char * const vmstat_text[] = {
/* enum zone_stat_item counters */
"nr_free_pages",
+ "nr_free_pages_blocks",
"nr_zone_inactive_anon",
"nr_zone_active_anon",
"nr_zone_inactive_file",
--
2.48.1
On 3/13/25 22:05, Johannes Weiner wrote:
> The previous patch added pageblock_order reclaim to kswapd/kcompactd,
> which helps, but produces only one block at a time. Allocation stalls
> and THP failure rates are still higher than they could be.
>
> To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change
> the watermarking for kswapd & kcompactd: instead of targeting the high
> watermark in order-0 pages and checking for one suitable block, simply
> require that the high watermark is entirely met in pageblocks.
Hrm.
> ---
> include/linux/mmzone.h | 1 +
> mm/compaction.c | 37 ++++++++++++++++++++++++++++++-------
> mm/internal.h | 1 +
> mm/page_alloc.c | 29 +++++++++++++++++++++++------
> mm/vmscan.c | 15 ++++++++++++++-
> mm/vmstat.c | 1 +
> 6 files changed, 70 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index dbb0ad69e17f..37c29f3fbca8 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -138,6 +138,7 @@ enum numa_stat_item {
> enum zone_stat_item {
> /* First 128 byte cacheline (assuming 64 bit words) */
> NR_FREE_PAGES,
> + NR_FREE_PAGES_BLOCKS,
> NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
> NR_ZONE_INACTIVE_ANON = NR_ZONE_LRU_BASE,
> NR_ZONE_ACTIVE_ANON,
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 036353ef1878..4a2ccb82d0b2 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2329,6 +2329,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
> if (!pageblock_aligned(cc->migrate_pfn))
> return COMPACT_CONTINUE;
>
> + /*
> + * When defrag_mode is enabled, make kcompactd target
> + * watermarks in whole pageblocks. Because they can be stolen
> + * without polluting, no further fallback checks are needed.
> + */
> + if (defrag_mode && !cc->direct_compaction) {
> + if (__zone_watermark_ok(cc->zone, cc->order,
> + high_wmark_pages(cc->zone),
> + cc->highest_zoneidx, cc->alloc_flags,
> + zone_page_state(cc->zone,
> + NR_FREE_PAGES_BLOCKS)))
> + return COMPACT_SUCCESS;
> +
> + return COMPACT_CONTINUE;
> + }
Wonder if this ever succeeds in practice. Is high_wmark_pages() even aligned
to pageblock size? If not, and it's X pageblocks and a half, we will rarely
have NR_FREE_PAGES_BLOCKS cover all of that? Also concurrent allocations can
put us below high wmark quickly and then we never satisfy this?
Doesn't then happen that with defrag_mode, in practice kcompactd basically
always runs until scanners met?
Maybe the check could instead e.g. compare NR_FREE_PAGES (aligned down to
pageblock size, or even with some extra slack?) with NR_FREE_PAGES_BLOCKS?
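Something like the following, untested and purely to illustrate the idea
(not a patch against the actual code):

	/*
	 * Declare compaction done once whatever free memory exists is
	 * already sitting in whole pageblocks, independent of whether
	 * the high watermark itself is currently met.
	 */
	unsigned long free = zone_page_state(cc->zone, NR_FREE_PAGES);
	unsigned long free_blocks = zone_page_state(cc->zone, NR_FREE_PAGES_BLOCKS);

	if (free_blocks >= ALIGN_DOWN(free, pageblock_nr_pages))
		return COMPACT_SUCCESS;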
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6724,11 +6724,24 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
> * meet watermarks.
> */
> for_each_managed_zone_pgdat(zone, pgdat, i, highest_zoneidx) {
> + unsigned long free_pages;
> +
> if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
> mark = promo_wmark_pages(zone);
> else
> mark = high_wmark_pages(zone);
> - if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
I think you just removed the only user of this _safe() function. Is the
cpu-drift control it does no longer necessary?
> +
> + /*
> + * In defrag_mode, watermarks must be met in whole
> + * blocks to avoid polluting allocator fallbacks.
> + */
> + if (defrag_mode)
> + free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
> + else
> + free_pages = zone_page_state(zone, NR_FREE_PAGES);
> +
> + if (__zone_watermark_ok(zone, order, mark, highest_zoneidx,
> + 0, free_pages))
> return true;
> }
>
On Fri, Apr 11, 2025 at 10:19:58AM +0200, Vlastimil Babka wrote:
> On 3/13/25 22:05, Johannes Weiner wrote:
> > The previous patch added pageblock_order reclaim to kswapd/kcompactd,
> > which helps, but produces only one block at a time. Allocation stalls
> > and THP failure rates are still higher than they could be.
> >
> > To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change
> > the watermarking for kswapd & kcompactd: instead of targeting the high
> > watermark in order-0 pages and checking for one suitable block, simply
> > require that the high watermark is entirely met in pageblocks.
>
> Hrm.
Hrm!
> > @@ -2329,6 +2329,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
> > if (!pageblock_aligned(cc->migrate_pfn))
> > return COMPACT_CONTINUE;
> >
> > + /*
> > + * When defrag_mode is enabled, make kcompactd target
> > + * watermarks in whole pageblocks. Because they can be stolen
> > + * without polluting, no further fallback checks are needed.
> > + */
> > + if (defrag_mode && !cc->direct_compaction) {
> > + if (__zone_watermark_ok(cc->zone, cc->order,
> > + high_wmark_pages(cc->zone),
> > + cc->highest_zoneidx, cc->alloc_flags,
> > + zone_page_state(cc->zone,
> > + NR_FREE_PAGES_BLOCKS)))
> > + return COMPACT_SUCCESS;
> > +
> > + return COMPACT_CONTINUE;
> > + }
>
> Wonder if this ever succeeds in practice. Is high_wmark_pages() even aligned
> to pageblock size? If not, and it's X pageblocks and a half, we will rarely
> have NR_FREE_PAGES_BLOCKS cover all of that? Also concurrent allocations can
> put us below high wmark quickly and then we never satisfy this?
The high watermark is not aligned, but why does it have to be? It's a
binary condition: met or not met. Compaction continues until it's met.
NR_FREE_PAGES_BLOCKS moves in pageblock_nr_pages steps. This means
it'll really work until align_up(highmark, pageblock_nr_pages), as
that's when NR_FREE_PAGES_BLOCKS snaps above the (unaligned) mark. But
that seems reasonable, no?
The allocator side is using low/min, so we have the conventional
hysteresis between consumer and producer.
For illustration, on my 2G test box, the watermarks in DMA32 look like
this:
pages free 212057
boost 0
min 11164 (21.8 blocks)
low 13955 (27.3 blocks)
high 16746 (32.7 blocks)
promo 19537
spanned 456704
present 455680
managed 431617 (843.1 blocks)
So there are several blocks between the kblahds wakeup and sleep. The
first allocation to cut into a whole free block will decrease
NR_FREE_PAGES_BLOCK by a whole block. But subsequent allocs that fill
the remaining space won't change that counter. So the distance between
the watermarks didn't fundamentally change (modulo block rounding).
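To put concrete numbers on that last point (assuming 4k pages and
pageblock_order 9, i.e. pageblock_nr_pages == 512):

	/* zone starts out with 33 fully free pageblocks */
	NR_FREE_PAGES        = 16896
	NR_FREE_PAGES_BLOCKS = 16896	/* 33 * 512, above high of 16746 */

	/* one order-0 allocation splits one of those blocks */
	NR_FREE_PAGES        = 16895
	NR_FREE_PAGES_BLOCKS = 16384	/* 32 * 512, drops by a whole block */

	/* the next 511 order-0 allocations fill up that same block */
	NR_FREE_PAGES        = 16384
	NR_FREE_PAGES_BLOCKS = 16384	/* unchanged */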
> Doesn't then happen that with defrag_mode, in practice kcompactd basically
> always runs until scanners met?
Tracing kcompactd calls to compaction_finished() with defrag_mode:
@[COMPACT_CONTINUE]: 6955
@[COMPACT_COMPLETE]: 19
@[COMPACT_PARTIAL_SKIPPED]: 1
@[COMPACT_SUCCESS]: 17
@wakeuprequests: 3
Of course, similar to kswapd, it might not reach the watermarks and
keep running if there is a continuous stream of allocations consuming
the blocks it's making. Hence the ratio between wakeups & continues.
But when demand stops, it'll balance the high mark and quit.
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -6724,11 +6724,24 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
> > * meet watermarks.
> > */
> > for_each_managed_zone_pgdat(zone, pgdat, i, highest_zoneidx) {
> > + unsigned long free_pages;
> > +
> > if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
> > mark = promo_wmark_pages(zone);
> > else
> > mark = high_wmark_pages(zone);
> > - if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
>
> I think you just removed the only user of this _safe() function. Is the
> cpu-drift control it does no longer necessary?
Ah good catch. This should actually use zone_page_state_snapshot()
below depending on z->percpu_drift_mark.
This is active when there are enough cpus for the vmstat pcp deltas to
exceed low-min. Afaics this is still necessary, but also still
requires a lot of CPUs to matter (>212 cpus with 64G of memory).
I'll send a follow-up fix.
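The rough shape of it, untested:

	if (defrag_mode)
		free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
	else
		free_pages = zone_page_state(zone, NR_FREE_PAGES);

	/*
	 * Mirror what zone_watermark_ok_safe() did: when the pcp deltas
	 * are big enough to hide the low-min window, re-read the precise
	 * counter before declaring the node balanced.
	 */
	if (zone->percpu_drift_mark && free_pages < zone->percpu_drift_mark)
		free_pages = zone_page_state_snapshot(zone, defrag_mode ?
					NR_FREE_PAGES_BLOCKS : NR_FREE_PAGES);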
> > + /*
> > + * In defrag_mode, watermarks must be met in whole
> > + * blocks to avoid polluting allocator fallbacks.
> > + */
> > + if (defrag_mode)
> > + free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
> > + else
> > + free_pages = zone_page_state(zone, NR_FREE_PAGES);
> > +
> > + if (__zone_watermark_ok(zone, order, mark, highest_zoneidx,
> > + 0, free_pages))
> > return true;
> > }
> >
On 4/11/25 17:39, Johannes Weiner wrote:
> On Fri, Apr 11, 2025 at 10:19:58AM +0200, Vlastimil Babka wrote:
>> On 3/13/25 22:05, Johannes Weiner wrote:
>> > The previous patch added pageblock_order reclaim to kswapd/kcompactd,
>> > which helps, but produces only one block at a time. Allocation stalls
>> > and THP failure rates are still higher than they could be.
>> >
>> > To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change
>> > the watermarking for kswapd & kcompactd: instead of targeting the high
>> > watermark in order-0 pages and checking for one suitable block, simply
>> > require that the high watermark is entirely met in pageblocks.
>>
>> Hrm.
>
> Hrm!
>
>> > @@ -2329,6 +2329,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
>> > if (!pageblock_aligned(cc->migrate_pfn))
>> > return COMPACT_CONTINUE;
>> >
>> > + /*
>> > + * When defrag_mode is enabled, make kcompactd target
>> > + * watermarks in whole pageblocks. Because they can be stolen
>> > + * without polluting, no further fallback checks are needed.
>> > + */
>> > + if (defrag_mode && !cc->direct_compaction) {
>> > + if (__zone_watermark_ok(cc->zone, cc->order,
>> > + high_wmark_pages(cc->zone),
>> > + cc->highest_zoneidx, cc->alloc_flags,
>> > + zone_page_state(cc->zone,
>> > + NR_FREE_PAGES_BLOCKS)))
>> > + return COMPACT_SUCCESS;
>> > +
>> > + return COMPACT_CONTINUE;
>> > + }
>>
>> Wonder if this ever succeeds in practice. Is high_wmark_pages() even aligned
>> to pageblock size? If not, and it's X pageblocks and a half, we will rarely
>> have NR_FREE_PAGES_BLOCKS cover all of that? Also concurrent allocations can
>> put us below high wmark quickly and then we never satisfy this?
>
> The high watermark is not aligned, but why does it have to be? It's a
> binary condition: met or not met. Compaction continues until it's met.
What I mean is, kswapd will reclaim until the high watermark, which would be
32.7 blocks, wake up kcompactd [*] but that can only create up to 32 blocks
of NR_FREE_PAGES_BLOCKS so it has already lost at that point? (unless
there's concurrent freeing pushing it above the high wmark)
> NR_FREE_PAGES_BLOCKS moves in pageblock_nr_pages steps. This means
> it'll really work until align_up(highmark, pageblock_nr_pages), as
> that's when NR_FREE_PAGES_BLOCKS snaps above the (unaligned) mark. But
> that seems reasonable, no?
How can it snap if it doesn't have enough free pages? Unlike kswapd,
kcompactd doesn't create them, only defragments.
> The allocator side is using low/min, so we have the conventional
> hysteresis between consumer and producer.
Sure but we cap kswapd at high wmark and the hunk quoted above also uses
high wmark so there's no hysteresis happening between kswapd and kcompactd?
> For illustration, on my 2G test box, the watermarks in DMA32 look like
> this:
>
> pages free 212057
> boost 0
> min 11164 (21.8 blocks)
> low 13955 (27.3 blocks)
> high 16746 (32.7 blocks)
> promo 19537
> spanned 456704
> present 455680
> managed 431617 (843.1 blocks)
>
> So there are several blocks between the kblahds wakeup and sleep. The
> first allocation to cut into a whole free block will decrease
> NR_FREE_PAGES_BLOCKS by a whole block. But subsequent allocs that fill
> the remaining space won't change that counter. So the distance between
> the watermarks didn't fundamentally change (modulo block rounding).
>
>> Doesn't then happen that with defrag_mode, in practice kcompactd basically
>> always runs until scanners met?
>
> Tracing kcompactd calls to compaction_finished() with defrag_mode:
>
> @[COMPACT_CONTINUE]: 6955
> @[COMPACT_COMPLETE]: 19
> @[COMPACT_PARTIAL_SKIPPED]: 1
> @[COMPACT_SUCCESS]: 17
> @wakeuprequests: 3
OK that doesn't look that bad.
> Of course, similar to kswapd, it might not reach the watermarks and
> keep running if there is a continuous stream of allocations consuming
> the blocks it's making. Hence the ratio between wakeups & continues.
>
> But when demand stops, it'll balance the high mark and quit.
Again, since kcompactd can only defragment free space and not create it, it
may be trying in vain?
[*] now when checking the code between kswapd and kcompactd handover, I
think I found another problem?
we have:
kswapd_try_to_sleep()
prepare_kswapd_sleep() - needs to succeed for wakeup_kcompactd()
pgdat_balanced() - needs to be true for prepare_kswapd_sleep() to be true
- with defrag_mode we want high watermark of NR_FREE_PAGES_BLOCKS, but
we were only reclaiming until now and didn't wake up kcompactd and
this actually prevents the wake up?
On Fri, Apr 11, 2025 at 06:51:51PM +0200, Vlastimil Babka wrote:
> On 4/11/25 17:39, Johannes Weiner wrote:
> > On Fri, Apr 11, 2025 at 10:19:58AM +0200, Vlastimil Babka wrote:
> >> On 3/13/25 22:05, Johannes Weiner wrote:
> >> > @@ -2329,6 +2329,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
> >> > if (!pageblock_aligned(cc->migrate_pfn))
> >> > return COMPACT_CONTINUE;
> >> >
> >> > + /*
> >> > + * When defrag_mode is enabled, make kcompactd target
> >> > + * watermarks in whole pageblocks. Because they can be stolen
> >> > + * without polluting, no further fallback checks are needed.
> >> > + */
> >> > + if (defrag_mode && !cc->direct_compaction) {
> >> > + if (__zone_watermark_ok(cc->zone, cc->order,
> >> > + high_wmark_pages(cc->zone),
> >> > + cc->highest_zoneidx, cc->alloc_flags,
> >> > + zone_page_state(cc->zone,
> >> > + NR_FREE_PAGES_BLOCKS)))
> >> > + return COMPACT_SUCCESS;
> >> > +
> >> > + return COMPACT_CONTINUE;
> >> > + }
> >>
> >> Wonder if this ever succeeds in practice. Is high_wmark_pages() even aligned
> >> to pageblock size? If not, and it's X pageblocks and a half, we will rarely
> >> have NR_FREE_PAGES_BLOCKS cover all of that? Also concurrent allocations can
> >> put us below high wmark quickly and then we never satisfy this?
> >
> > The high watermark is not aligned, but why does it have to be? It's a
> > binary condition: met or not met. Compaction continues until it's met.
>
> What I mean is, kswapd will reclaim until the high watermark, which would be
> 32.7 blocks, wake up kcompactd [*] but that can only create up to 32 blocks
> of NR_FREE_PAGES_BLOCKS so it has already lost at that point? (unless
> there's concurrent freeing pushing it above the high wmark)
Ah, but kswapd also uses the (rounded up) NR_FREE_PAGES_BLOCKS check.
Buckle up...
> > Of course, similar to kswapd, it might not reach the watermarks and
> > keep running if there is a continuous stream of allocations consuming
> > the blocks it's making. Hence the ratio between wakeups & continues.
> >
> > But when demand stops, it'll balance the high mark and quit.
>
> Again, since kcompactd can only defragment free space and not create it, it
> may be trying in vain?
>
> [*] now when checking the code between kswapd and kcompactd handover, I
> think I found another problem?
>
> we have:
> kswapd_try_to_sleep()
> prepare_kswapd_sleep() - needs to succeed for wakeup_kcompactd()
> pgdat_balanced() - needs to be true for prepare_kswapd_sleep() to be true
> - with defrag_mode we want high watermark of NR_FREE_PAGES_BLOCKS, but
> we were only reclaiming until now and didn't wake up kcompactd and
> this actually prevents the wake up?
Correct, so as per above, kswapd also does the NR_FREE_PAGES_BLOCKS
check. At first, at least. So it continues to produce adequate scratch
space and won't leave compaction high and dry on a watermark it cannot
meet. They are indeed coordinated in this aspect.
As far as the *handoff* to kcompactd goes, I've been pulling my hair
over this for a very long time. You're correct about the graph
above. And actually, this is the case before defrag_mode too: if you
wake kswapd with, say, an order-8, it will do pgdat_balanced() checks
against that, seemingly reclaim until the request can succeed, *then*
wake kcompactd and sleep. WTF?
But kswapd has this:
/*
* Fragmentation may mean that the system cannot be rebalanced for
* high-order allocations. If twice the allocation size has been
* reclaimed then recheck watermarks only at order-0 to prevent
* excessive reclaim. Assume that a process requested a high-order
* can direct reclaim/compact.
*/
if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
sc->order = 0;
Ignore the comment and just consider the code. What it does for higher
orders (whether defrag_mode is enabled or not), is reclaim a gap for
the order, ensure order-0 is met (but most likely it is), then enter
the sleep path - wake kcompactd and wait for more work.
Effectively, as long as there are pending higher-order requests
looping in the allocator, kswapd does this:
1) reclaim a compaction gap delta
2) wake kcompactd
3) goto 1
This pipelining seems to work *very* well in practice, especially when
there is a large number of concurrent requests.
In the huge allocator original series, I tried to convert kswapd to
use compaction_suitable() to hand over quicker. However, this ran into
scaling issues with higher allocation concurrency: maintaining just a
single, static compact gap when there could be hundreds of allocation
requests waiting for compaction results falls apart fast.
The current code has it right. The comments might be a bit dated and
maybe it could use some fine tuning. But generally, as long as there
are incoming wakeups from the allocator, it makes sense to keep making
more space for compaction as well.
I think Mel was playing 4d chess with this stuff.
[ I kept direct reclaim/compaction out of this defrag_mode series, but
testing suggests the same is likely true for the direct path.
Direct reclaim bails from compaction_ready() if there is a static
compaction gap for that order. But once the gap for a given order is
there, you can get a thundering herd of direct compactors storming
on this gap, most of which will then fail compaction_suitable().
A pipeline of "reclaim gap delta, direct compact, retry" seems to
make more sense there as well. With adequate checks to prevent
excessive reclaim in corner cases of course... ]
On Fri, Apr 11, 2025 at 02:21:58PM -0400, Johannes Weiner wrote:
> On Fri, Apr 11, 2025 at 06:51:51PM +0200, Vlastimil Babka wrote:
> > [*] now when checking the code between kswapd and kcompactd handover, I
> > think I found another problem?
> >
> > we have:
> > kswapd_try_to_sleep()
> > prepare_kswapd_sleep() - needs to succeed for wakeup_kcompactd()
> > pgdat_balanced() - needs to be true for prepare_kswapd_sleep() to be true
> > - with defrag_mode we want high watermark of NR_FREE_PAGES_BLOCKS, but
> > we were only reclaiming until now and didn't wake up kcompactd and
> > this actually prevents the wake up?
Coming back to this, there is indeed a defrag_mode issue. My
apologies, I misunderstood what you were pointing at.
Like I said, kswapd reverts to order-0 in some other place to go to
sleep and trigger the handoff. At that point, defrag_mode also needs
to revert to NR_FREE_PAGES.
It's curious that this didn't stand out in testing. On the contrary,
kcompactd was still doing the vast majority of the compaction work. It
looks like kswapd and direct workers on their own manage to balance
NR_FREE_PAGES_BLOCK every so often, and then kswapd hands off. Given
the low number of kcompactd wakeups, the consumers keep it going.
So testing with this:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cc422ad830d6..c2aa0a4b67de 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6747,8 +6747,11 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
/*
* In defrag_mode, watermarks must be met in whole
* blocks to avoid polluting allocator fallbacks.
+ *
+ * When kswapd has compact gap, check regular
+ * NR_FREE_PAGES and hand over to kcompactd.
*/
- if (defrag_mode)
+ if (defrag_mode && order)
item = NR_FREE_PAGES_BLOCKS;
else
item = NR_FREE_PAGES;
I'm getting the following results:
fallbackspeed/STUPID-DEFRAGMODE fallbackspeed/DEFRAGMODE
Hugealloc Time mean 79381.34 ( +0.00%) 88126.12 ( +11.02%)
Hugealloc Time stddev 85852.16 ( +0.00%) 135366.75 ( +57.67%)
Kbuild Real time 249.35 ( +0.00%) 226.71 ( -9.04%)
Kbuild User time 1249.16 ( +0.00%) 1249.37 ( +0.02%)
Kbuild System time 171.76 ( +0.00%) 166.93 ( -2.79%)
THP fault alloc 51666.87 ( +0.00%) 52685.60 ( +1.97%)
THP fault fallback 16970.00 ( +0.00%) 15951.87 ( -6.00%)
Direct compact fail 166.53 ( +0.00%) 178.93 ( +7.40%)
Direct compact success 17.13 ( +0.00%) 4.13 ( -71.69%)
Compact daemon scanned migrate 3095413.33 ( +0.00%) 9231239.53 ( +198.22%)
Compact daemon scanned free 2155966.53 ( +0.00%) 7053692.87 ( +227.17%)
Compact direct scanned migrate 265642.47 ( +0.00%) 68388.33 ( -74.26%)
Compact direct scanned free 130252.60 ( +0.00%) 55634.87 ( -57.29%)
Compact total migrate scanned 3361055.80 ( +0.00%) 9299627.87 ( +176.69%)
Compact total free scanned 2286219.13 ( +0.00%) 7109327.73 ( +210.96%)
Alloc stall 1890.80 ( +0.00%) 6297.60 ( +232.94%)
Pages kswapd scanned 9043558.80 ( +0.00%) 5952576.73 ( -34.18%)
Pages kswapd reclaimed 1891708.67 ( +0.00%) 1030645.00 ( -45.52%)
Pages direct scanned 1017090.60 ( +0.00%) 2688047.60 ( +164.29%)
Pages direct reclaimed 92682.60 ( +0.00%) 309770.53 ( +234.22%)
Pages total scanned 10060649.40 ( +0.00%) 8640624.33 ( -14.11%)
Pages total reclaimed 1984391.27 ( +0.00%) 1340415.53 ( -32.45%)
Swap out 884585.73 ( +0.00%) 417781.93 ( -52.77%)
Swap in 287106.27 ( +0.00%) 95589.73 ( -66.71%)
File refaults 551697.60 ( +0.00%) 426474.80 ( -22.70%)
Work has shifted from direct to kcompactd, and in aggregate there is
more compaction happening. Meanwhile, reclaim and swapping are down
substantially, which is great, and %sys is down with them, so this is
simply more efficient.

The reclaim work that remains has shifted somewhat to direct reclaim,
though, which is unfortunate. THP delivery is also a tad worse, but
still much better than !defrag_mode, so not too concerning. That part
deserves a bit more thought.
Overall, this looks good, though. I'll send a proper patch next week.
Thanks for the review, Vlastimil.
On 4/13/25 04:20, Johannes Weiner wrote:
> So testing with this:
>
> -	if (defrag_mode)
> +	if (defrag_mode && order)
> 		item = NR_FREE_PAGES_BLOCKS;
> 	else
> 		item = NR_FREE_PAGES;
>
> I'm getting the following results:
>
> [...]

However this brings me back to my concern with __compact_finished()
requiring the high watermark in NR_FREE_PAGES_BLOCKS. IMHO it can really
easily lead to situations where all free memory is already contiguous,
but because of one or two concurrent THP allocations we're below the
high watermark (but not yet the low watermark to wake kswapd again), so
further compaction by kcompactd is simply wasting cpu cycles at that
point.

Again I think a comparison of NR_FREE_PAGES_BLOCKS to NR_FREE_PAGES
would in theory work better for determining if all free space is as
defragmented as possible?
On 4/13/25 04:20, Johannes Weiner wrote:
> Coming back to this, there is indeed a defrag_mode issue. My
> apologies, I misunderstood what you were pointing at.
>
> Like I said, kswapd reverts to order-0 in some other place to go to
> sleep and trigger the handoff. At that point, defrag_mode also needs
> to revert to NR_FREE_PAGES.

I missed that revert to order-0 and that without it the current code
also wouldn't make sense. But I agree with the fix.

> [...]
>
> Overall, this looks good, though. I'll send a proper patch next week.
>
> Thanks for the review, Vlastimil.

NP!
Andrew, could you please fold this delta patch?
---
From 3d2ff7b72df9e4f1a31b3cff2ae6a4584c06bdca Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 14 Mar 2025 11:38:41 -0400
Subject: [PATCH] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks fix
Fix squawks from rebasing that affect the behavior of !defrag_mode.
FWIW, it seems to actually have slightly helped the vanilla kernel in
the benchmark. But the point was to not change the default behavior:
VANILLA WMARKFIX-VANILLA
Hugealloc Time mean 52739.45 ( +0.00%) 62758.21 ( +19.00%)
Hugealloc Time stddev 56541.26 ( +0.00%) 76253.41 ( +34.86%)
Kbuild Real time 197.47 ( +0.00%) 197.25 ( -0.11%)
Kbuild User time 1240.49 ( +0.00%) 1241.33 ( +0.07%)
Kbuild System time 70.08 ( +0.00%) 71.00 ( +1.28%)
THP fault alloc 46727.07 ( +0.00%) 41492.73 ( -11.20%)
THP fault fallback 21910.60 ( +0.00%) 27146.53 ( +23.90%)
Direct compact fail 195.80 ( +0.00%) 260.93 ( +33.10%)
Direct compact success 7.93 ( +0.00%) 6.67 ( -14.18%)
Direct compact success rate % 3.51 ( +0.00%) 2.76 ( -16.78%)
Compact daemon scanned migrate 3369601.27 ( +0.00%) 3827734.27 ( +13.60%)
Compact daemon scanned free 5075474.47 ( +0.00%) 5910839.73 ( +16.46%)
Compact direct scanned migrate 161787.27 ( +0.00%) 168271.13 ( +4.01%)
Compact direct scanned free 163467.53 ( +0.00%) 222558.33 ( +36.15%)
Compact total migrate scanned 3531388.53 ( +0.00%) 3996005.40 ( +13.16%)
Compact total free scanned 5238942.00 ( +0.00%) 6133398.07 ( +17.07%)
Alloc stall 2371.07 ( +0.00%) 2478.00 ( +4.51%)
Pages kswapd scanned 2160926.73 ( +0.00%) 1726204.67 ( -20.12%)
Pages kswapd reclaimed 533191.07 ( +0.00%) 537963.73 ( +0.90%)
Pages direct scanned 400450.33 ( +0.00%) 450004.87 ( +12.37%)
Pages direct reclaimed 94441.73 ( +0.00%) 99193.07 ( +5.03%)
Pages total scanned 2561377.07 ( +0.00%) 2176209.53 ( -15.04%)
Pages total reclaimed 627632.80 ( +0.00%) 637156.80 ( +1.52%)
Swap out 47959.53 ( +0.00%) 45186.20 ( -5.78%)
Swap in 7276.00 ( +0.00%) 7109.40 ( -2.29%)
File refaults 138043.00 ( +0.00%) 145238.73 ( +5.21%)
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/compaction.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 4a2ccb82d0b2..a481755791a9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -3075,6 +3075,8 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
struct zone *zone;
enum zone_type highest_zoneidx = pgdat->kcompactd_highest_zoneidx;
enum compact_result ret;
+ unsigned int alloc_flags = defrag_mode ?
+ ALLOC_WMARK_HIGH : ALLOC_WMARK_MIN;
for (zoneid = 0; zoneid <= highest_zoneidx; zoneid++) {
zone = &pgdat->node_zones[zoneid];
@@ -3084,7 +3086,7 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
ret = compaction_suit_allocation_order(zone,
pgdat->kcompactd_max_order,
- highest_zoneidx, ALLOC_WMARK_MIN,
+ highest_zoneidx, alloc_flags,
false, true);
if (ret == COMPACT_CONTINUE)
return true;
@@ -3108,7 +3110,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
.mode = MIGRATE_SYNC_LIGHT,
.ignore_skip_hint = false,
.gfp_mask = GFP_KERNEL,
- .alloc_flags = ALLOC_WMARK_HIGH,
+ .alloc_flags = defrag_mode ? ALLOC_WMARK_HIGH : ALLOC_WMARK_MIN,
};
enum compact_result ret;
--
2.48.1