On NUMA systems without bindings, allocations check all nodes for free
space, then wake up the kswapds on all nodes and retry. This ensures
all available space is evenly used before reclaim begins. However,
when one process or certain allocations have node restrictions, they
can cause kswapds on only a subset of nodes to be woken up.
Since kswapd hysteresis targets watermarks that are *higher* than
needed for allocation, even *unrestricted* allocations can now get
suckered onto such nodes that are already pressured. This ends up
concentrating all allocations on them, even when there are idle nodes
available for the unrestricted requests.
This was observed with two numa nodes, where node0 is normal and node1
is ZONE_MOVABLE to facilitate hotplugging: a kernel allocation wakes
kswapd on node0 only (since node1 is not eligible); once kswapd0 is
active, the watermarks hover between low and high, and then even the
movable allocations end up on node0, only to be kicked out again;
meanwhile node1 is empty and idle.
Similar behavior is possible when a process with NUMA bindings is
causing selective kswapd wakeups.
To fix this, on NUMA systems augment the (misleading) watermark test
with a check for whether kswapd is already active during the first
iteration through the zonelist. If this fails to place the request,
kswapd must be running everywhere already, and the watermark test is
good enough to decide placement.
With this patch, unrestricted requests successfully make use of node1,
even while kswapd is reclaiming node0 for restricted allocations.
[gourry@gourry.net: don't retry if no kswapds were active]
Signed-off-by: Gregory Price <gourry@gourry.net>
Tested-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/page_alloc.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cf38d499e045..ffdaf5e30b58 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3735,6 +3735,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	struct pglist_data *last_pgdat = NULL;
 	bool last_pgdat_dirty_ok = false;
 	bool no_fallback;
+	bool skip_kswapd_nodes = nr_online_nodes > 1;
+	bool skipped_kswapd_nodes = false;
 
 retry:
 	/*
@@ -3797,6 +3799,19 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			}
 		}
 
+		/*
+		 * If kswapd is already active on a node, keep looking
+		 * for other nodes that might be idle. This can happen
+		 * if another process has NUMA bindings and is causing
+		 * kswapd wakeups on only some nodes. Avoid accidental
+		 * "node_reclaim_mode"-like behavior in this case.
+		 */
+		if (skip_kswapd_nodes &&
+		    !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
+			skipped_kswapd_nodes = true;
+			continue;
+		}
+
 		cond_accept_memory(zone, order, alloc_flags);
 
 		/*
@@ -3888,6 +3903,15 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		}
 	}
 
+	/*
+	 * If we skipped over nodes with active kswapds and found no
+	 * idle nodes, retry and place anywhere the watermarks permit.
+	 */
+	if (skip_kswapd_nodes && skipped_kswapd_nodes) {
+		skip_kswapd_nodes = false;
+		goto retry;
+	}
+
 	/*
 	 * It's possible on a UMA machine to get through all zones that are
 	 * fragmented. If avoiding fragmentation, reset and try again.
--
2.51.0
On 9/19/25 6:21 PM, Johannes Weiner wrote:
> On NUMA systems without bindings, allocations check all nodes for free
> space, then wake up the kswapds on all nodes and retry. This ensures
> all available space is evenly used before reclaim begins. However,
> when one process or certain allocations have node restrictions, they
> can cause kswapds on only a subset of nodes to be woken up.
>
> Since kswapd hysteresis targets watermarks that are *higher* than
> needed for allocation, even *unrestricted* allocations can now get
> suckered onto such nodes that are already pressured. This ends up
> concentrating all allocations on them, even when there are idle nodes
> available for the unrestricted requests.
>
> This was observed with two numa nodes, where node0 is normal and node1
> is ZONE_MOVABLE to facilitate hotplugging: a kernel allocation wakes
> kswapd on node0 only (since node1 is not eligible); once kswapd0 is
> active, the watermarks hover between low and high, and then even the
> movable allocations end up on node0, only to be kicked out again;
> meanwhile node1 is empty and idle.

Is this because node1 is slow tier as Zi suggested, or we're talking
about allocations that are from node0's cpu, while allocations on
node1's cpu would be fine?

Also this sounds like something that ZONELIST_ORDER_ZONE handled until
it was removed. But it wouldn't help with the NUMA binding case.

> Similar behavior is possible when a process with NUMA bindings is
> causing selective kswapd wakeups.
>
> To fix this, on NUMA systems augment the (misleading) watermark test
> with a check for whether kswapd is already active during the first
> iteration through the zonelist. If this fails to place the request,
> kswapd must be running everywhere already, and the watermark test is
> good enough to decide placement.

Suppose kswapd finished reclaim already, so this check wouldn't kick in.
Wouldn't we be over-pressuring node0 still, just somewhat less?
On Wed, Oct 01, 2025 at 04:59:02PM +0200, Vlastimil Babka wrote:
> On 9/19/25 6:21 PM, Johannes Weiner wrote:
> > This was observed with two numa nodes, where node0 is normal and node1
> > is ZONE_MOVABLE to facilitate hotplugging: a kernel allocation wakes
> > kswapd on node0 only (since node1 is not eligible); once kswapd0 is
> > active, the watermarks hover between low and high, and then even the
> > movable allocations end up on node0, only to be kicked out again;
> > meanwhile node1 is empty and idle.
>
> Is this because node1 is slow tier as Zi suggested, or we're talking
> about allocations that are from node0's cpu, while allocations on
> node1's cpu would be fine?

It applies in either case. The impetus for this fix was from behavior
in a tiered system, but this seems like a general NUMA problem to me.
Say you have a VM where you use an extra node for runtime resizing,
making it ZONE_MOVABLE to keep it hotpluggable.

> > To fix this, on NUMA systems augment the (misleading) watermark test
> > with a check for whether kswapd is already active during the first
> > iteration through the zonelist. If this fails to place the request,
> > kswapd must be running everywhere already, and the watermark test is
> > good enough to decide placement.
>
> Suppose kswapd finished reclaim already, so this check wouldn't kick in.
> Wouldn't we be over-pressuring node0 still, just somewhat less?

Yes. And we've seen that to a degree, where kswapd goes to sleep
intermittently and the occasional (high - low) batch of fresh pages
makes it into node0 until kswapd is woken up again.

It still fixed the big-picture pathological case, though, where
*everything* was just concentrated on node0. So I figured why
complicate it. But there would be room for some hysteresis.

Another option could be, instead of checking kswapds, to check the
watermarks against the high thresholds on that first zonelist
iteration. After all, that's where a recently-gone-to-sleep kswapd
would leave the watermark level.

But it would need a fudge factor too, to account for the fact that
kswapd might overreclaim past the high watermark. And the overreclaim
factor is something that has historically fluctuated quite a bit
between systems and kernel versions. So this could be too fragile.
Kswapd being active is a very definitive signal by comparison.
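For illustration only, a minimal sketch of that alternative check might look like the following. zone_looks_idle() is a hypothetical helper and the mark/8 slack is a made-up placeholder for the overreclaim fudge; neither comes from the patch or the thread.

/*
 * Hypothetical helper, illustrative only: test a zone against its
 * high watermark plus some slack for kswapd overreclaim, instead of
 * testing whether kswapd is still on its waitqueue.
 */
static bool zone_looks_idle(struct zone *zone, unsigned int order,
			    int highest_zoneidx, unsigned int alloc_flags)
{
	/* Target the high watermark rather than the allocation watermark */
	unsigned long mark = high_wmark_pages(zone);

	/* Made-up slack; picking this number is exactly the fragile part */
	mark += mark / 8;

	return zone_watermark_ok(zone, order, mark, highest_zoneidx,
				 alloc_flags);
}

The first zonelist pass would then skip zones failing this test instead of zones with a non-sleeping kswapd; as noted above, choosing the slack is the fragile part.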
On Wed, Oct 01, 2025 at 04:59:02PM +0200, Vlastimil Babka wrote:
> Is this because node1 is slow tier as Zi suggested, or we're talking
> about allocations that are from node0's cpu, while allocations on
> node1's cpu would be fine?
>
> Also this sounds like something that ZONELIST_ORDER_ZONE handled until
> it was removed. But it wouldn't help with the NUMA binding case.

node1 is a cpu-less memory node with 100% ZONE_MOVABLE memory.

Our first theory was that this was a zone-order vs node-order issue,
but we found this kswapd thrashing to be the issue instead. No
mempolicy was in use here, it's all grounded in GFP/ZONE interactions.

> Suppose kswapd finished reclaim already, so this check wouldn't kick in.
> Wouldn't we be over-pressuring node0 still, just somewhat less?

This is the current and desired behavior when nodes are not in
exclusive zones. We still want the allocations to kick kswapd to
reclaim/age/demote cold folios from the local node to the remote node.
But when that happens, and the remote node is not pressured, there's
no reason to wait for reclaim before servicing an allocation.

Once all the nodes are pressured (all kswapds are running), we end up
back in the position of preferring to wait for a page on the local
node rather than wait for a page on the remote node. There will
obviously be some transient sleep/wake of kswapd, but that's already
the case.

The key observation here is this patch allows for fallback allocations
on remote nodes when nodes have exclusive zone memberships
(node0=NORMAL, node1=MOVABLE).

~Gregory
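A rough illustration of what those exclusive zone memberships mean for the zonelist walk, assuming the two-node layout from the commit message (node0 carrying ZONE_NORMAL, node1 only ZONE_MOVABLE, allocating task on node0, no mempolicy):

/*
 * Illustrative zonelist walk for node0 = ZONE_NORMAL, node1 = ZONE_MOVABLE:
 *
 *   GFP_KERNEL            (gfp_zone() == ZONE_NORMAL):
 *       node0/Normal only          -> node1/Movable is skipped, so
 *                                     only kswapd0 can get woken
 *   GFP_HIGHUSER_MOVABLE  (gfp_zone() == ZONE_MOVABLE):
 *       node0/Normal, node1/Movable -> node0 is still tried first
 */

This is why the restricted (kernel) allocations wake only kswapd0 while the movable allocations still land on node0 first.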
On 19 Sep 2025, at 12:21, Johannes Weiner wrote:

> Since kswapd hysteresis targets watermarks that are *higher* than
> needed for allocation, even *unrestricted* allocations can now get
> suckered onto such nodes that are already pressured. This ends up
> concentrating all allocations on them, even when there are idle nodes
> available for the unrestricted requests.

This is because we build the zonelist from node 0 to the last node
and getting free pages always follows zonelist order, right?

> With this patch, unrestricted requests successfully make use of node1,
> even while kswapd is reclaiming node0 for restricted allocations.

Thinking about this from memory tiering POV, when a fast node (e.g.,
node 0, and assume node 1 is a slow node) is evicting cold pages using
kswapd, unrestricted programs will see performance degradation after
your change. Since before the change, they start from a fast node, but
now they start from a slow node.

Maybe the kernel wants to shuffle the zonelist based on the emptiness
of each zone, trying to spread allocations across all zones. For
memory tiering, spreading allocations should be done within a tier.
Even with this fix, in a case where there are 3 nodes and node 0 is
heavily used by restricted allocations, node 2 will be unused until
node 1 is full for unrestricted allocations, and an unnecessary kswapd
wake on node 1 can happen.

These are just my thoughts when reading the commit log.

The change looks reasonable to me.

Acked-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi
Sorry I missed your reply :(

On Fri, Sep 19, 2025 at 01:18:28PM -0400, Zi Yan wrote:
> This is because we build the zonelist from node 0 to the last node
> and getting free pages always follows zonelist order, right?

Yes, exactly.

> Thinking about this from memory tiering POV, when a fast node (e.g.,
> node 0, and assume node 1 is a slow node) is evicting cold pages using
> kswapd, unrestricted programs will see performance degradation after
> your change. Since before the change, they start from a fast node, but
> now they start from a slow node.

I don't think that's quite right.

The default local-first NUMA policy absent any bindings or zone
restrictions is that you first fill node0, *then* you fill node1,
*then* kswapd is woken up on both nodes - at which point new
allocations would go wherever there is room in order of preference.

I'm just making it so that iff kswapd0 is woken prematurely due to
restrictions, we still fill node1.

In either case, node1 is only filled when node0 space is exhausted.

> Maybe the kernel wants to shuffle the zonelist based on the emptiness
> of each zone, trying to spread allocations across all zones. For
> memory tiering, spreading allocations should be done within a tier.
> Even with this fix, in a case where there are 3 nodes and node 0 is
> heavily used by restricted allocations, node 2 will be unused until
> node 1 is full for unrestricted allocations, and an unnecessary kswapd
> wake on node 1 can happen.

Kswapd on node1 only wakes once node2 is watermark-full as well. This
is the intended behavior of the "local first" NUMA policy.

I'm not trying to implement interleaving, it's purely about the quirk
that watermarks alone are not reliable predictors for whether a node
is full or not if kswapd is running.

So we would expect to see

	fill node0 -> fill node1 -> fill node2 -> wake all sleeping kswapds

- without restricted allocations in the vanilla kernel
- with restricted allocations after this patch.
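A rough trace of that expected sequence for the three-node case discussed above, assuming node0 is under restricted pressure with kswapd0 running while node1 and node2 start out idle (illustrative only):

/*
 * Unrestricted allocation, kswapd0 running (node0's kswapd_wait empty):
 *
 *   fast path, pass 1: node0 skipped (kswapd0 active),
 *                      node1 above its low watermark -> allocate there
 *   ... node1 drops to its low watermark ...
 *   fast path, pass 1: node0 skipped, node1 fails the watermark check,
 *                      node2 still passes -> allocate there
 *   ... node2 drops to its low watermark too ...
 *   pass 1 places nothing -> pass 2 retries without skipping; once the
 *   low watermarks fail everywhere, the slowpath wakes the remaining
 *   kswapds and the plain watermark test decides placement.
 */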