[PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations

Posted by Dmitry Ilvokhin 4 days, 11 hours ago

When defrag_mode is enabled, ALLOC_NOFRAGMENT is enforced to prevent
migratetype fallbacks and keep pageblocks clean. The allocator relies on
reclaim and compaction to free pages of the correct type before allowing
fallback as a last resort.

However, non-reclaimable allocations such as GFP_ATOMIC cannot invoke
direct reclaim or compaction. With defrag_mode=1, these allocations hit
the !can_direct_reclaim bailout in __alloc_pages_slowpath() with
ALLOC_NOFRAGMENT still set, and fail without ever attempting a fallback.

This causes a large number of SLUB allocation failures for
skbuff_head_cache under network-heavy workloads, despite free memory
being available in other migratetype freelists.

Clear ALLOC_NOFRAGMENT and retry for allocations that request kswapd
reclaim but cannot do direct reclaim themselves (GFP_ATOMIC).  Purely
speculative allocations like GFP_TRANSHUGE_LIGHT that don't set
__GFP_KSWAPD_RECLAIM are left to fail, since they have reasonable
fallbacks and should not cause fragmentation.

Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode")

Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
Changes in v2:

- Add check for __GFP_KSWAPD_RECLAIM.
- Pick up Johannes' Acked-by tag.

v1: https://lore.kernel.org/all/20260518163736.173910-1-d@ilvokhin.com/

 mm/page_alloc.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 227d58dc3de6..c5a077de1be0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4811,8 +4811,19 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/* Caller is not willing to reclaim, we can't balance anything */
-	if (!can_direct_reclaim)
+	if (!can_direct_reclaim) {
+		/*
+		 * Reclaim/compaction cannot run, so defrag_mode's strategy
+		 * of enforcing ALLOC_NOFRAGMENT cannot be fulfilled. Allow
+		 * fallbacks rather than failing the allocation outright.
+		 */
+		if (defrag_mode && (alloc_flags & ALLOC_NOFRAGMENT) &&
+		    (gfp_mask & __GFP_KSWAPD_RECLAIM)) {
+			alloc_flags &= ~ALLOC_NOFRAGMENT;
+			goto retry;
+		}
 		goto nopage;
+	}
 
 	/* Avoid recursion of direct reclaim */
 	if (current->flags & PF_MEMALLOC)
-- 
2.53.0-Meta
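
For readers following along without the kernel tree at hand, the control flow added by the hunk above can be modelled in userspace roughly as follows. This is only a sketch, not kernel code: the MODEL_* constants, enum model_action, model_bailout() and model_demo() are hypothetical names invented for illustration, and the bit values do not match the real flag definitions. It shows the three outcomes at the bailout point, and that in this model the retry cannot loop, because the second pass finds ALLOC_NOFRAGMENT already cleared.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in bits for this model only. The real definitions
 * are in the kernel headers and have different values.
 */
#define MODEL_GFP_DIRECT_RECLAIM 0x1u
#define MODEL_GFP_KSWAPD_RECLAIM 0x2u
#define MODEL_ALLOC_NOFRAGMENT   0x1u

enum model_action {
	MODEL_CONTINUE,	/* direct reclaim allowed: stay in the slowpath */
	MODEL_RETRY,	/* NOFRAGMENT cleared: retry with fallbacks allowed */
	MODEL_NOPAGE,	/* give up and fail the allocation */
};

/* Mirrors the shape of the patched !can_direct_reclaim branch. */
static enum model_action model_bailout(unsigned int gfp_mask,
				       unsigned int *alloc_flags,
				       bool defrag_mode)
{
	bool can_direct_reclaim = (gfp_mask & MODEL_GFP_DIRECT_RECLAIM) != 0;

	if (can_direct_reclaim)
		return MODEL_CONTINUE;

	/*
	 * Only callers that asked for kswapd reclaim (the GFP_ATOMIC-like
	 * case) get their fragmentation protection dropped. Purely
	 * speculative callers without that bit are left to fail.
	 */
	if (defrag_mode && (*alloc_flags & MODEL_ALLOC_NOFRAGMENT) &&
	    (gfp_mask & MODEL_GFP_KSWAPD_RECLAIM)) {
		*alloc_flags &= ~MODEL_ALLOC_NOFRAGMENT;
		return MODEL_RETRY;
	}

	return MODEL_NOPAGE;
}

/* An atomic-like request retries exactly once, then fails normally. */
static void model_demo(void)
{
	unsigned int alloc_flags = MODEL_ALLOC_NOFRAGMENT;

	assert(model_bailout(MODEL_GFP_KSWAPD_RECLAIM, &alloc_flags, true) ==
	       MODEL_RETRY);
	assert(model_bailout(MODEL_GFP_KSWAPD_RECLAIM, &alloc_flags, true) ==
	       MODEL_NOPAGE);
}
```

A request with neither reclaim bit set (the GFP_TRANSHUGE_LIGHT-like case) returns MODEL_NOPAGE on the first pass and keeps its NOFRAGMENT bit, which matches the behaviour described in the changelog.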

Re: [PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations

Posted by Andrew Morton 3 days ago

On Wed, 20 May 2026 12:22:28 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:

> When defrag_mode is enabled, ALLOC_NOFRAGMENT is enforced to prevent
> migratetype fallbacks and keep pageblocks clean. The allocator relies on
> reclaim and compaction to free pages of the correct type before allowing
> fallback as a last resort.
> 
> However, non-reclaimable allocations such as GFP_ATOMIC cannot invoke
> direct reclaim or compaction. With defrag_mode=1, these allocations hit
> the !can_direct_reclaim bailout in __alloc_pages_slowpath() with
> ALLOC_NOFRAGMENT still set, and fail without ever attempting a fallback.
> 
> This causes a large number of SLUB allocation failures for
> skbuff_head_cache under network-heavy workloads, despite free memory
> being available in other migratetype freelists.

That sounds painful.

> Clear ALLOC_NOFRAGMENT and retry for allocations that request kswapd
> reclaim but cannot do direct reclaim themselves (GFP_ATOMIC).  Purely
> speculative allocations like GFP_TRANSHUGE_LIGHT that don't set
> __GFP_KSWAPD_RECLAIM are left to fail, since they have reasonable
> fallbacks and should not cause fragmentation.

How serious is this to our users when running real-world workloads?

> Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode")
> 
> Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Re: [PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations

Posted by Dmitry Ilvokhin 2 days, 11 hours ago

On Thu, May 21, 2026 at 04:59:10PM -0700, Andrew Morton wrote:
> On Wed, 20 May 2026 12:22:28 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> 
> > When defrag_mode is enabled, ALLOC_NOFRAGMENT is enforced to prevent
> > migratetype fallbacks and keep pageblocks clean. The allocator relies on
> > reclaim and compaction to free pages of the correct type before allowing
> > fallback as a last resort.
> > 
> > However, non-reclaimable allocations such as GFP_ATOMIC cannot invoke
> > direct reclaim or compaction. With defrag_mode=1, these allocations hit
> > the !can_direct_reclaim bailout in __alloc_pages_slowpath() with
> > ALLOC_NOFRAGMENT still set, and fail without ever attempting a fallback.
> > 
> > This causes a large number of SLUB allocation failures for
> > skbuff_head_cache under network-heavy workloads, despite free memory
> > being available in other migratetype freelists.
> 
> That sounds painful.
> 
> > Clear ALLOC_NOFRAGMENT and retry for allocations that request kswapd
> > reclaim but cannot do direct reclaim themselves (GFP_ATOMIC).  Purely
> > speculative allocations like GFP_TRANSHUGE_LIGHT that don't set
> > __GFP_KSWAPD_RECLAIM are left to fail, since they have reasonable
> > fallbacks and should not cause fragmentation.
> 
> How serious is this to our users when running real-world workloads?

We observed it on a few of the Meta workloads that adopted
defrag_mode=1.

On the service under load, dmesg logged 85509 SLUB allocation failure
messages within 2 hours. All of them were GFP_ATOMIC allocations for
skbuff_head_cache, even though free pages were available in other
migratetype freelists (~13 GB free).

Since this is the networking path, in practical terms this means
dropped packets, failed RPC requests, tail latency spikes and overall
service degradation.

> 
> > Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode")
> > 
> > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>

Re: [PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations

Posted by Andrew Morton 1 day, 21 hours ago

On Fri, 22 May 2026 13:05:36 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:

> > How serious is this to our users when running real-world workloads?
> 
> We observed it on a few of the Meta workloads that adopted
> defrag_mode=1.
> 
> On the service under load, dmesg logged 85509 SLUB allocation failure
> messages within 2 hours. All of them were GFP_ATOMIC allocations for
> skbuff_head_cache, even though free pages were available in other
> migratetype freelists (~13 GB free).

For a single machine, I assume.

> Since this is the networking path, in practical terms this means
> dropped packets, failed RPC requests, tail latency spikes and overall
> service degradation.

OK, thanks.   I assume 12 failures per second isn't a disaster, and that
there's no need to fast-track this into 7.1?
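
The 12 failures per second figure is the average implied by the numbers quoted above (85509 failures over 2 hours). A quick check of that arithmetic, as a sketch; failures_per_second() and check_rate() are hypothetical helpers, not anything from the kernel:

```c
#include <assert.h>

/* Average event rate over a window given in whole hours. */
static double failures_per_second(unsigned long failures, unsigned long hours)
{
	return (double)failures / ((double)hours * 3600.0);
}

/* 85509 failures in 2 hours is a little under 12 per second. */
static void check_rate(void)
{
	double rate = failures_per_second(85509UL, 2UL);

	assert(rate > 11.8 && rate < 11.9);
}
```

This is only an average over the window; the thread does not say whether the failures were evenly spread or arrived in bursts.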

Re: [PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations

Posted by Dmitry Ilvokhin 1 day, 10 hours ago

On Fri, May 22, 2026 at 07:54:26PM -0700, Andrew Morton wrote:
> On Fri, 22 May 2026 13:05:36 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> 
> > > How serious is this to our users when running real-world workloads?
> > 
> > We observed it on a few of the Meta workloads that adopted
> > defrag_mode=1.
> > 
> > On the service under load, dmesg logged 85509 SLUB allocation failure
> > messages within 2 hours. All of them were GFP_ATOMIC allocations for
> > skbuff_head_cache, even though free pages were available in other
> > migratetype freelists (~13 GB free).
> 
> For a single machine, I assume.

Yes, all of that data is from a single machine.

> 
> > Since this is the networking path, in practical terms this means
> > dropped packets, failed RPC requests, tail latency spikes and overall
> > service degradation.
> 
> OK, thanks.   I assume 12 failures per second isn't a disaster, and that
> there's no need to fast-track this into 7.1?

Yes, I agree. No need to fast-track this.