[v4] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation

[PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation

Posted by Matthew Brost 1 month, 2 weeks ago

TTM allocations at higher orders can drive Xe into a pathological
reclaim loop when memory is fragmented:

kswapd → shrinker → eviction → rebind (exec ioctl) → repeat

In this state, reclaim is triggered despite substantial free memory,
but fails to produce contiguous higher-order pages. The Xe shrinker then
evicts active buffer objects, increasing faulting and rebind activity
and further feeding the loop. The result is high CPU overhead and poor
GPU forward progress.

This issue was first reported in [1] and independently observed
internally and by Google.

A simple reproducer is:

- Boot an iGPU system with mem=8G
- Launch 10 Chrome tabs running the WebGL aquarium demo
- Configure each tab with ~5k fish

Under this workload, ftrace shows a continuous loop of:

xe_shrinker_scan (kswapd)
xe_vma_rebind_exec

Performance degrades significantly, with each tab dropping to ~2 FPS on
PTL (Ubuntu 24.04).

At the same time, /proc/buddyinfo shows substantial free memory but no
higher-order availability. For example, the Normal zone:

Count: 4063 4595 3455 3400 3139 2762 2293 1655 643 0 0

This corresponds to ~2.8GB free memory, but no order-9 (2MB) blocks,
indicating severe fragmentation.
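
As a rough cross-check of the figure above, the per-order counts can be
summed directly, since each free block at order n spans 2^n base pages.
The sketch below assumes 4 KiB base pages and that the counts are listed
from order 0 to order 10; the helper names are illustrative only.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages


def free_pages(counts):
    """Total free base pages, given free-block counts indexed by order."""
    return sum(n << order for order, n in enumerate(counts))


def highest_free_order(counts):
    """Largest order with at least one free block, or -1 if there is none."""
    return max((o for o, n in enumerate(counts) if n), default=-1)


# Normal zone counts quoted above, orders 0..10.
normal_zone = [4063, 4595, 3455, 3400, 3139, 2762, 2293, 1655, 643, 0, 0]

pages = free_pages(normal_zone)
gib = pages * PAGE_SIZE / 2**30
top_order = highest_free_order(normal_zone)

print(pages, round(gib, 2), top_order)
```

For these counts the sum is 716081 free pages, about 2.7 GiB, and the
largest free block is order 8 (1MB), which is consistent with the
"plenty free, nothing at order 9" reading above.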

This series addresses the issue in two ways:

TTM: Restrict direct reclaim to beneficial_order. Larger allocations
use __GFP_NORETRY to fail quickly rather than triggering reclaim.

Xe: Introduce a heuristic in the shrinker to avoid eviction when
running under kswapd and the system appears memory-rich but
fragmented.
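
To make the "memory-rich but fragmented" condition concrete, here is a
minimal userspace sketch in Python. It is not the code from this series:
the ZoneSnapshot type, its fields, the watermark value, and both function
names are hypothetical, and checking the free-block counts directly is an
assumption made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ZoneSnapshot:
    """Hypothetical view of one zone; all fields are illustrative."""
    high_wmark_pages: int  # high watermark, in base pages (made-up value below)
    free_blocks: list      # free-block counts indexed by order

    @property
    def free_pages(self):
        # Each free block at order n spans 2^n base pages.
        return sum(n << order for order, n in enumerate(self.free_blocks))


def maybe_fragmented(zone, order):
    """True if the zone is above its high watermark yet has no free
    block of the requested order or larger."""
    if order == 0:
        return False  # an order-0 request cannot fail due to fragmentation
    memory_rich = zone.free_pages > zone.high_wmark_pages
    has_block = any(zone.free_blocks[order:])
    return memory_rich and not has_block


def should_skip_eviction(zone, order, in_kswapd):
    """Skip eviction only for kswapd-driven reclaim in a fragmented zone."""
    return in_kswapd and maybe_fragmented(zone, order)


# Buddyinfo counts from this cover letter; the watermark is an assumption.
before = ZoneSnapshot(16384, [4063, 4595, 3455, 3400, 3139, 2762, 2293, 1655, 643, 0, 0])
after = ZoneSnapshot(16384, [8526, 7067, 3092, 1959, 1292, 660, 194, 28, 20, 13, 1])

print(should_skip_eviction(before, 9, in_kswapd=True))
print(should_skip_eviction(after, 9, in_kswapd=True))
```

In this model, direct reclaim (in_kswapd=False) is never skipped, so a
caller that genuinely needs memory still gets eviction.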

With these changes, the reclaim/eviction loop is eliminated. The same
workload improves to ~10 FPS per tab (Ubuntu 24.04) or ~15 FPS per tab
(Ubuntu 24.10), and kswapd activity subsides.

Buddyinfo after applying this series shows restored higher-order
availability:

Count: 8526 7067 3092 1959 1292 660 194 28 20 13 1

Matt

v2:
 - Layer with core MM / TTM helpers (Thomas)
v4:
 - Fix build (CI)

[1] https://patchwork.freedesktop.org/patch/716404/?series=164353&rev=1

Cc: Dave Chinner <david@fromorbit.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Kairui Song <kasong@tencent.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Carlos Santa <carlos.santa@intel.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: dri-devel@lists.freedesktop.org
Cc: Daniel Colascione <dancol@dancol.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org

Matthew Brost (6):
  mm: Wire up order in shrink_control
  mm: Introduce zone_maybe_fragmented_in_shrinker()
  drm/ttm: Issue direct reclaim at beneficial_order
  drm/ttm: Introduce ttm_bo_shrink_kswap_maybe_fragmented()
  drm/xe: Set TTM device beneficial_order to 9 (2M)
  drm/xe: Avoid shrinker reclaim from kswapd under fragmentation

 drivers/gpu/drm/ttm/ttm_bo_util.c | 38 +++++++++++++++++++++++++++++++
 drivers/gpu/drm/ttm/ttm_pool.c    |  4 ++--
 drivers/gpu/drm/xe/xe_device.c    |  3 ++-
 drivers/gpu/drm/xe/xe_shrinker.c  |  3 +++
 include/drm/ttm/ttm_bo.h          |  2 ++
 include/linux/shrinker.h          |  3 +++
 include/linux/vmstat.h            | 12 ++++++++++
 mm/internal.h                     |  4 ++--
 mm/shrinker.c                     | 13 +++++++----
 mm/vmscan.c                       |  7 +++---
 10 files changed, 76 insertions(+), 13 deletions(-)

-- 
2.34.1

Re: [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation

Posted by Dave Chinner 1 month, 2 weeks ago

On Thu, Apr 30, 2026 at 12:18:03PM -0700, Matthew Brost wrote:
> TTM allocations at higher orders can drive Xe into a pathological
> reclaim loop when memory is fragmented:
> 
> kswapd → shrinker → eviction → rebind (exec ioctl) → repeat
> 
> In this state, reclaim is triggered despite substantial free memory,
> but fails to produce contiguous higher-order pages. The Xe shrinker then
> evicts active buffer objects, increasing faulting and rebind activity
> and further feeding the loop. The result is high CPU overhead and poor
> GPU forward progress.
> 
> This issue was first reported in [1] and independently observed
> internally and by Google.
> 
> A simple reproducer is:
> 
> - Boot an iGPU system with mem=8G
> - Launch 10 Chrome tabs running the WebGL aquarium demo
> - Configure each tab with ~5k fish
> 
> Under this workload, ftrace shows a continuous loop of:
> 
> xe_shrinker_scan (kswapd)
> xe_vma_rebind_exec
> 
> Performance degrades significantly, with each tab dropping to ~2 FPS on
> PTL (Ubuntu 24.04).
> 
> At the same time, /proc/buddyinfo shows substantial free memory but no
> higher-order availability. For example, the Normal zone:
> 
> Count: 4063 4595 3455 3400 3139 2762 2293 1655 643 0 0
> 
> This corresponds to ~2.8GB free memory, but no order-9 (2MB) blocks,
> indicating severe fragmentation.
> 
> This series addresses the issue in two ways:
> 
> TTM: Restrict direct reclaim to beneficial_order. Larger allocations
> use __GFP_NORETRY to fail quickly rather than triggering reclaim.

NACK.

As I have said to the people trying to hack around direct reclaim
for high order allocations being costly for the page cache, fix the
problem with direct reclaim. (e.g.
https://lore.kernel.org/linux-xfs/adLlrSZ5oRAa_Hfd@dread/)

We should not be hacking around a problem in the mm infrastructure
by changing allocation context flags at every high order allocation
call site that needs high order allocations. Understand and fix the
infrastructure problem once and for all.

> Xe: Introduce a heuristic in the shrinker to avoid eviction when
> running under kswapd and the system appears memory-rich but
> fragmented.

NACK on architectural grounds.

Custom heuristics in individual shrinkers to decide whether they
should do what the mm subsystem has asked them to do has -always-
been a mistake to allow. The mm subsystem makes the decision on how
much cache shrinkage needs to occur, the shrinkers just do what they
are told to do.

If we have a problem where a workload causes excessive shrinker
reclaim, then we need to address the problem in the infrastructure
because excessive reclaim affects the performance of -all-
subsystems with shrinkable caches, not just the TTM subsystem.

As it is, I can't review what you've actually implemented because
you only cc'd me on a single patch in the series. In future, please
cc me on the whole patchset because shrinkers need to work as a
coherent whole, not just in isolation....

-Dave.
-- 
Dave Chinner
dgc@kernel.org

Re: [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation

Posted by Matthew Brost 1 month, 2 weeks ago

On Fri, May 01, 2026 at 11:42:19AM +1000, Dave Chinner wrote:

Thanks for the feedback. I’m looking into this more, and it’s becoming
clear that this is a hard problem—one that will likely require
coordinated work between DRM and core MM to really sort out. That said,
I do think what I have in place is a reasonable short-term fix.

More below.

> On Thu, Apr 30, 2026 at 12:18:03PM -0700, Matthew Brost wrote:
> > TTM allocations at higher orders can drive Xe into a pathological
> > reclaim loop when memory is fragmented:
> > 
> > kswapd → shrinker → eviction → rebind (exec ioctl) → repeat
> > 
> > In this state, reclaim is triggered despite substantial free memory,
> > but fails to produce contiguous higher-order pages. The Xe shrinker then
> > evicts active buffer objects, increasing faulting and rebind activity
> > and further feeding the loop. The result is high CPU overhead and poor
> > GPU forward progress.
> > 
> > This issue was first reported in [1] and independently observed
> > internally and by Google.
> > 
> > A simple reproducer is:
> > 
> > - Boot an iGPU system with mem=8G
> > - Launch 10 Chrome tabs running the WebGL aquarium demo
> > - Configure each tab with ~5k fish
> > 
> > Under this workload, ftrace shows a continuous loop of:
> > 
> > xe_shrinker_scan (kswapd)
> > xe_vma_rebind_exec
> > 
> > Performance degrades significantly, with each tab dropping to ~2 FPS on
> > PTL (Ubuntu 24.04).
> > 
> > At the same time, /proc/buddyinfo shows substantial free memory but no
> > higher-order availability. For example, the Normal zone:
> > 
> > Count: 4063 4595 3455 3400 3139 2762 2293 1655 643 0 0
> > 
> > This corresponds to ~2.8GB free memory, but no order-9 (2MB) blocks,
> > indicating severe fragmentation.
> > 
> > This series addresses the issue in two ways:
> > 
> > TTM: Restrict direct reclaim to beneficial_order. Larger allocations
> > use __GFP_NORETRY to fail quickly rather than triggering reclaim.
> 
> NACK.
> 
> As I have said to the people trying to hack around direct reclaim
> for high order allocations being costly for the page cache, fix the
> problem with direct reclaim. (e.g.
> https://lore.kernel.org/linux-xfs/adLlrSZ5oRAa_Hfd@dread/)
> 

I read your response. Maybe it isn't clear what is going on here.

At beneficial_order: gfp == __GFP_RECLAIM | __GFP_NORETRY
At order zero: gfp == __GFP_RECLAIM

This is roughly the existing behavior; the exact changes are here [1].

[1] https://patchwork.freedesktop.org/patch/722247/?series=165329&rev=3

If this is truly a NACK, then we can rethink it—likely by disabling
reclaim at higher orders—but that has its own downsides for DRM and
GPUs. Ideally, you want purgeable BOs to be evicted when a higher-order
allocation fails; you really don’t want to end up in an insane kswapd
loop.

> We should not be hacking around a problem in the mm infrastructure
> by changing allocation context flags at every high order allocation
> call site that needs high order allocations. Understand and fix the
> infrastructure problem once and for all.
> 

Well, I agree that we should aim to fix this in core MM, but as the
saying goes, Rome wasn’t built in a day. The fact is that these GFP
flags do exist, and suddenly drawing a line and declaring them no longer
valid feels a bit unfair. I’ll also note that Intel—and I
personally—have an interest in fixing shrinking, so you can expect
follow-up work here.

> > Xe: Introduce a heuristic in the shrinker to avoid eviction when
> > running under kswapd and the system appears memory-rich but
> > fragmented.
> 
> NACK on architectural grounds.
> 
> Custom heuristics in individual shrinkers to decide whether they
> should do what the mm subsystem has asked them to do has -always-
> been a mistake to allow. The mm subsystem makes the decision on how

I’m not going to disagree with you about custom heuristics in individual
shrinkers, but I’d wager that most shrinkers sadly already implement
custom heuristics.

> much cache shrinkage needs to occur, the shrinkers just do what they
> are told to do.
> 
> If we have a problem where a workload causes excessive shrinker
> reclaim, then we need to address the problem in the infrastructure
> because excessive reclaim affects the performance of -all-
> subsystems with shrinkable caches, not just the TTM subsystem.
> 

Yes, I agree, and I’ve thought about the implications of simply having
TTM back off when a higher-order allocation fails, even when we actually
have enough memory, and how that would affect everyone. This series at
least fixes the “well, there goes my GUI” problem.

I do have another patch locally that prevents TTM from accidentally
fragmenting memory and triggering the kswapd loop, but under enough
pressure I can still get the GUI to lock up for periods of time. With
this series, however, I can’t reproduce that issue.

> As it is, I can't review what you've actually implemented because
> you only cc'd me on a single patch in the series. In future, please
> cc me on the whole patchset because shrinkers need to work as a
> coherent whole, not just in isolation....
> 

Sorry about this; Andrew just said the same thing. Here is the Patchwork link [2].

Or:

b4 mbox 20260430191809.2142544-1-matthew.brost@intel.com

[2] https://patchwork.freedesktop.org/series/165329/

If you have any ideas on how to fix this in the core, let’s discuss. I
have a bunch of ideas in my head, but core MM isn’t my native domain.

Matt

> -Dave.
> -- 
> Dave Chinner
> dgc@kernel.org

Re: [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation

Posted by Andrew Morton 1 month, 2 weeks ago

On Thu, 30 Apr 2026 12:18:03 -0700 Matthew Brost <matthew.brost@intel.com> wrote:

> TTM allocations at higher orders can drive Xe into a pathological
> reclaim loop when memory is fragmented:
> 
> kswapd → shrinker → eviction → rebind (exec ioctl) → repeat
> 
> In this state, reclaim is triggered despite substantial free memory,
> but fails to produce contiguous higher-order pages. The Xe shrinker then
> evicts active buffer objects, increasing faulting and rebind activity
> and further feeding the loop. The result is high CPU overhead and poor
> GPU forward progress.
> 
> ...
>
> This series addresses the issue in two ways:
> 
> TTM: Restrict direct reclaim to beneficial_order. Larger allocations
> use __GFP_NORETRY to fail quickly rather than triggering reclaim.
> 
> Xe: Introduce a heuristic in the shrinker to avoid eviction when
> running under kswapd and the system appears memory-rich but
> fragmented.

Please cc everyone on all the patches?  It's kind of annoying to have
to hunt around to find out how these proposed changes will be used. 
Personal preference, anyway.

AI review flagged a few possible issues:
	https://sashiko.dev/#/patchset/20260430191809.2142544-1-matthew.brost@intel.com

Re: [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation

Posted by Matthew Brost 1 month, 2 weeks ago

On Thu, Apr 30, 2026 at 04:01:05PM -0700, Andrew Morton wrote:
> On Thu, 30 Apr 2026 12:18:03 -0700 Matthew Brost <matthew.brost@intel.com> wrote:
> 
> > TTM allocations at higher orders can drive Xe into a pathological
> > reclaim loop when memory is fragmented:
> > 
> > kswapd → shrinker → eviction → rebind (exec ioctl) → repeat
> > 
> > In this state, reclaim is triggered despite substantial free memory,
> > but fails to produce contiguous higher-order pages. The Xe shrinker then
> > evicts active buffer objects, increasing faulting and rebind activity
> > and further feeding the loop. The result is high CPU overhead and poor
> > GPU forward progress.
> > 
> > ...
> >
> > This series addresses the issue in two ways:
> > 
> > TTM: Restrict direct reclaim to beneficial_order. Larger allocations
> > use __GFP_NORETRY to fail quickly rather than triggering reclaim.
> > 
> > Xe: Introduce a heuristic in the shrinker to avoid eviction when
> > running under kswapd and the system appears memory-rich but
> > fragmented.
> 
> Please cc everyone on all the patches?  It's kind of annoying to have
> to hunt around to find out how these proposed changes will be used. 
> Personal preference, anyway.
> 

Will do. We discussed this in the past and I thought we had landed on
Cc'ing everyone on the cover letter and then only the relevant people on
individual patches, but I will Cc everyone on everything going forward.

> AI review flagged a few possible issues:
> 	https://sashiko.dev/#/patchset/20260430191809.2142544-1-matthew.brost@intel.com

I don't know who authors Sashiko, but what would make it really nice is
if you could reply to it to talk things out.

Looking at replies...

- 'Could this global counter drift significantly'
	this looks right for multi-CPU, which isn't really the target
	here, but will adjust

- 'Additionally, does NR_FREE_PAGES implicitly include CMA pages?'
	this looks right, will adjust

- 'Can high_wmark_pages(zone) evaluate to zero during early boot'
	theoretically possible (?), but a non-issue IMO; for a GPU
	shrinker, which is the current use case, this is impossible, but
	maybe add a WARN_ON if high_wmark_pages(zone) returns zero

- 'Is this description accurate?'
	I inverted the TTM kernel doc vs the code, will fix

Matt

Re: [PATCH v4 0/6] mm, drm/ttm, drm/xe: Avoid reclaim/eviction loops under fragmentation

Posted by Andrew Morton 1 month, 2 weeks ago

On Thu, 30 Apr 2026 23:28:08 -0700 Matthew Brost <matthew.brost@intel.com> wrote:

> > AI review flagged a few possible issues:
> > 	https://sashiko.dev/#/patchset/20260430191809.2142544-1-matthew.brost@intel.com
> 
> I don't know who authors Sashiko, but what would make it really nice is
> if you could reply to it to talk things out.

It's a Gemini 3 thing, based on prompts developed by Roman
Gushchin and Chris Mason and others.  Google is making this available
to kernel developers at a non-trivial expense.

And yes, it would be great if Sashiko were able to learn from our
replies and to fine-tune its checking based on the human corrections. 
I've asked for this a few times but didn't really understand the reply
;)