When defrag_mode is enabled, ALLOC_NOFRAGMENT is enforced to prevent
migratetype fallbacks and keep pageblocks clean. The allocator relies on
reclaim and compaction to free pages of the correct type before allowing
fallback as a last resort.

However, non-reclaimable allocations such as GFP_ATOMIC cannot invoke
direct reclaim or compaction. With defrag_mode=1, these allocations hit
the !can_direct_reclaim bailout in __alloc_pages_slowpath() with
ALLOC_NOFRAGMENT still set, and fail without ever attempting a fallback.

This causes a large number of SLUB allocation failures for
skbuff_head_cache under network-heavy workloads, despite free memory
being available in other migratetype freelists.

Clear ALLOC_NOFRAGMENT and retry for allocations that request kswapd
reclaim but cannot do direct reclaim themselves (GFP_ATOMIC). Purely
speculative allocations like GFP_TRANSHUGE_LIGHT that don't set
__GFP_KSWAPD_RECLAIM are left to fail, since they have reasonable
fallbacks and should not cause fragmentation.

Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode")

Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
Changes in v2:
- Add check for __GFP_KSWAPD_RECLAIM.
- Picked up Johannes' Acked-by tag.
v1: https://lore.kernel.org/all/20260518163736.173910-1-d@ilvokhin.com/
mm/page_alloc.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 227d58dc3de6..c5a077de1be0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4811,8 +4811,19 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/* Caller is not willing to reclaim, we can't balance anything */
-	if (!can_direct_reclaim)
+	if (!can_direct_reclaim) {
+		/*
+		 * Reclaim/compaction cannot run, so defrag_mode's strategy
+		 * of enforcing ALLOC_NOFRAGMENT cannot be fulfilled. Allow
+		 * fallbacks rather than failing the allocation outright.
+		 */
+		if (defrag_mode && (alloc_flags & ALLOC_NOFRAGMENT) &&
+		    (gfp_mask & __GFP_KSWAPD_RECLAIM)) {
+			alloc_flags &= ~ALLOC_NOFRAGMENT;
+			goto retry;
+		}
 		goto nopage;
+	}
 
 	/* Avoid recursion of direct reclaim */
 	if (current->flags & PF_MEMALLOC)
--
2.53.0-Meta
On Wed, 20 May 2026 12:22:28 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> When defrag_mode is enabled, ALLOC_NOFRAGMENT is enforced to prevent
> migratetype fallbacks and keep pageblocks clean. The allocator relies on
> reclaim and compaction to free pages of the correct type before allowing
> fallback as a last resort.
>
> However, non-reclaimable allocations such as GFP_ATOMIC cannot invoke
> direct reclaim or compaction. With defrag_mode=1, these allocations hit
> the !can_direct_reclaim bailout in __alloc_pages_slowpath() with
> ALLOC_NOFRAGMENT still set, and fail without ever attempting a fallback.
>
> This causes a large number of SLUB allocation failures for
> skbuff_head_cache under network-heavy workloads, despite free memory
> being available in other migratetype freelists.

That sounds painful.

> Clear ALLOC_NOFRAGMENT and retry for allocations that request kswapd
> reclaim but cannot do direct reclaim themselves (GFP_ATOMIC). Purely
> speculative allocations like GFP_TRANSHUGE_LIGHT that don't set
> __GFP_KSWAPD_RECLAIM are left to fail, since they have reasonable
> fallbacks and should not cause fragmentation.

How serious is this to our users when running real-world workloads?

> Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode")
>
> Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
On Thu, May 21, 2026 at 04:59:10PM -0700, Andrew Morton wrote:
> On Wed, 20 May 2026 12:22:28 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
>
> > When defrag_mode is enabled, ALLOC_NOFRAGMENT is enforced to prevent
> > migratetype fallbacks and keep pageblocks clean. The allocator relies on
> > reclaim and compaction to free pages of the correct type before allowing
> > fallback as a last resort.
> >
> > However, non-reclaimable allocations such as GFP_ATOMIC cannot invoke
> > direct reclaim or compaction. With defrag_mode=1, these allocations hit
> > the !can_direct_reclaim bailout in __alloc_pages_slowpath() with
> > ALLOC_NOFRAGMENT still set, and fail without ever attempting a fallback.
> >
> > This causes a large number of SLUB allocation failures for
> > skbuff_head_cache under network-heavy workloads, despite free memory
> > being available in other migratetype freelists.
>
> That sounds painful.
>
> > Clear ALLOC_NOFRAGMENT and retry for allocations that request kswapd
> > reclaim but cannot do direct reclaim themselves (GFP_ATOMIC). Purely
> > speculative allocations like GFP_TRANSHUGE_LIGHT that don't set
> > __GFP_KSWAPD_RECLAIM are left to fail, since they have reasonable
> > fallbacks and should not cause fragmentation.
>
> How serious is this to our users when running real-world workloads?

We observed it on a few of the Meta workloads that adopted
defrag_mode=1.

For the service under load, there were 85509 SLUB allocation failure
messages in dmesg within 2 hours. All of them are GFP_ATOMIC allocations
for skbuff_head_cache, despite free pages being available in other
migratetype freelists (~13 GB free).

Since this is the networking path, in practice it means dropped
packets, failed RPC requests, tail latency spikes and overall service
degradation.
>
> > Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode")
> >
> > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>
On Fri, 22 May 2026 13:05:36 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:

> > How serious is this to our users when running real-world workloads?
>
> We observed it on a few of the Meta workloads that adopted
> defrag_mode=1.
>
> For the service under load, there were 85509 SLUB allocation failure
> messages in dmesg within 2 hours. All of them are GFP_ATOMIC allocations
> for skbuff_head_cache, despite free pages being available in other
> migratetype freelists (~13 GB free).

For a single machine, I assume.

> Since this is the networking path, in practice it means dropped
> packets, failed RPC requests, tail latency spikes and overall service
> degradation.

OK, thanks. I assume 12 failures per second isn't a disaster, and that
there's no need to fast-track this into 7.1?
On Fri, May 22, 2026 at 07:54:26PM -0700, Andrew Morton wrote:
> On Fri, 22 May 2026 13:05:36 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
>
> > > How serious is this to our users when running real-world workloads?
> >
> > We observed it on a few of the Meta workloads that adopted
> > defrag_mode=1.
> >
> > For the service under load, there were 85509 SLUB allocation failure
> > messages in dmesg within 2 hours. All of them are GFP_ATOMIC allocations
> > for skbuff_head_cache, despite free pages being available in other
> > migratetype freelists (~13 GB free).
>
> For a single machine, I assume.

Yes, all of that data is from a single machine.

>
> > Since this is the networking path, in practice it means dropped
> > packets, failed RPC requests, tail latency spikes and overall service
> > degradation.
>
> OK, thanks. I assume 12 failures per second isn't a disaster, and that
> there's no need to fast-track this into 7.1?

Yes, I agree. No need to fast-track this.