[RFC PATCH 0/1] mm: Reduce direct reclaim stalls with RAM-backed swap

Matt Fleming posted 1 patch 1 month, 1 week ago
From: Matt Fleming <mfleming@cloudflare.com>

Hi,

Systems with zram-only swap can spin in direct reclaim for 20-30
minutes without ever invoking the OOM killer. We've hit this repeatedly
in production on machines with 377 GiB RAM and a 377 GiB zram device.

The problem
-----------

should_reclaim_retry() calls zone_reclaimable_pages() to estimate how
much memory is still reclaimable. That estimate includes anonymous
pages, on the assumption that swapping them out frees physical pages.

With disk-backed swap, that's true -- writing a page to disk frees a
page of RAM, and SwapFree accurately reflects how many more pages can
be written. With zram, the free slot count is misleading: a 377 GiB
zram device that is 10% used reports ~340 GiB of free swap slots, but
filling those slots requires physical RAM that the system doesn't have
-- that's why it's in direct reclaim in the first place.

The reclaimable estimate is off by orders of magnitude.
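To make the shape of the overestimate concrete, here is a user-space
sketch (illustrative only, not the kernel's actual
zone_reclaimable_pages() implementation):

```c
#include <assert.h>

/*
 * Illustrative model of how the reclaimable estimate used by
 * should_reclaim_retry() is built: file pages always count, and anon
 * pages count as long as free swap slots exist to absorb them.
 * All values are in pages.
 */
unsigned long reclaimable_estimate(unsigned long file_pages,
                                   unsigned long anon_pages,
                                   unsigned long free_swap_slots)
{
	unsigned long est = file_pages;

	/*
	 * Anon is capped by free slots, but NOT by the physical RAM a
	 * RAM-backed device would need to actually store those pages.
	 */
	if (free_swap_slots > 0)
		est += anon_pages < free_swap_slots ? anon_pages
						    : free_swap_slots;
	return est;
}
```

With the numbers above (hundreds of GiB of anon, ~340 GiB of free zram
slots, almost no free RAM), the anon term dominates and the estimate
lands orders of magnitude above what reclaim can actually deliver.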

The fix
-------

This patch introduces two new flags: BLK_FEAT_RAM_BACKED at the block
layer (set by zram and brd) and SWP_RAM_BACKED at the swap layer. When
all active swap devices are RAM-backed, should_reclaim_retry() excludes
anonymous pages from the reclaimable estimate and counts only
file-backed pages. Once file pages are exhausted the watermark check
fails and the kernel falls through to OOM.
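Roughly, the gated estimate looks like this. This is a user-space
sketch: swap_all_ram_backed follows the naming used in this cover
letter, but everything else is simplified for illustration and is not
the patch's actual kernel code:

```c
#include <assert.h>
#include <stdbool.h>

#define SWP_RAM_BACKED 0x1	/* per-device flag, as described above */

struct swap_dev {
	int flags;
};

/* True when every active swap device sits in RAM (e.g. zram, brd). */
bool swap_all_ram_backed(const struct swap_dev *devs, int n)
{
	for (int i = 0; i < n; i++)
		if (!(devs[i].flags & SWP_RAM_BACKED))
			return false;
	return n > 0;
}

/*
 * should_reclaim_retry()'s estimate: drop anon pages from the count
 * when swapping them out cannot free physical memory.
 */
unsigned long reclaimable_for_retry(unsigned long file_pages,
				    unsigned long anon_pages,
				    const struct swap_dev *devs, int n)
{
	if (swap_all_ram_backed(devs, n))
		return file_pages;
	return file_pages + anon_pages;
}
```

With a mixed zram+disk setup the gate returns false and the current
behaviour is preserved.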

Opting to OOM kill something over spinning in direct reclaim optimises
for Mean Time To Recovery (MTTR) and prevents "brownout" situations
where performance is degraded for prolonged periods (we've seen 20-30
minutes of degraded system performance).

Design choices and known limitations
------------------------------------

Why not fix zone_reclaimable_pages() globally?

  Other callers (e.g. balance_pgdat() in kswapd) use the anon-inclusive
  count for different purposes. Changing it globally risks breaking
  kswapd's reclaim decisions in ways that are hard to test. Limiting
  the change to should_reclaim_retry() keeps the blast radius small and
  squarely in the direct reclaim path.

What about mixed swap configurations (zram + disk)?

  When at least one disk-backed swap device is active,
  swap_all_ram_backed is false and the current behaviour is preserved.
  Per-device reclaimable accounting is possible but it's a much larger
  change, and mixed zram+disk configurations are uncommon in practice
  AFAIK.

Can we make zram free space accounting more accurate?

  This is possible but probably the most complicated solution. Swap
  device drivers could provide a callback which RAM-backed drivers
  would use to estimate how much physical memory they could store given
  some average compression ratio (either historic or projected given a
  list of anon pages to swap) and the amount of free physical memory.
  Plus, the estimate wouldn't be constant: it would change on every
  invocation of the callback, in line with the current compression
  ratio and the amount of free memory.

Build-testing
-------------

Built with defconfig, allnoconfig, allmodconfig, and multiple
randconfig iterations on x86_64 / 7.0-rc2.

Matt Fleming (1):
  mm: Reduce direct reclaim stalls with RAM-backed swap

 drivers/block/brd.c           |  3 ++-
 drivers/block/zram/zram_drv.c |  3 ++-
 include/linux/blkdev.h        |  8 ++++++
 include/linux/swap.h          |  9 +++++++
 mm/page_alloc.c               | 23 ++++++++++++++++-
 mm/swapfile.c                 | 47 ++++++++++++++++++++++++++++++++++-
 6 files changed, 89 insertions(+), 4 deletions(-)

-- 
2.43.0
Re: [RFC PATCH 0/1] mm: Reduce direct reclaim stalls with RAM-backed swap
Posted by Sergey Senozhatsky 4 weeks ago
On (26/03/03 11:53), Matt Fleming wrote:
> What about mixed swap configurations (zram + disk)?
> 
>   When at least one disk-backed swap device is active,
>   swap_all_ram_backed is false and the current behaviour is preserved.
>   Per-device reclaimable accounting is possible but it's a much larger
>   change, and mixed zram+disk configurations are uncommon in practice
>   AFAIK.

There are also setups where zram is configured with a backing device
(a real physical device) to which zram writes back pages, effectively
releasing the pool memory they occupied.
Re: [RFC PATCH 0/1] mm: Reduce direct reclaim stalls with RAM-backed swap
Posted by Johannes Weiner 1 month ago
On Tue, Mar 03, 2026 at 11:53:57AM +0000, Matt Fleming wrote:
> When all active swap devices are RAM-backed, should_reclaim_retry()
> excludes anonymous pages from the reclaimable estimate and counts
> only file-backed pages. Once file pages are exhausted the watermark
> check fails and the kernel falls through to OOM.

What about when anon pages *are* reclaimable through compression,
though? Then we'd declare OOM prematurely.

You could make the case that what is reclaimable should have been
reclaimed already by the time we get here. But then you could make the
same case for file pages, and then there is nothing left.

The check is meant to be an optimization. The primary OOM cutoff is
that we aren't able to reclaim anything. This reclaimable check is a
shortcut that says, even if we are reclaiming some, there is not
enough juice in that box to keep squeezing.

Have you looked at what exactly keeps resetting no_progress_loops when
the system is in this state?

I could see an argument that the two checks are not properly aligned
right now. We could be making nominal forward progress on a small,
heavily thrashing cache position only; but we'll keep looping because,
well, look at all this anon memory! (Which isn't being reclaimed.)

If that's the case, a better solution might be to split
did_some_progress into anon and file progress, and only consider the
LRU pages for which reclaim is actually making headway. And ignore
those where we fail to succeed - for whatever reason, really, not just
this particular zram situation.

And if that isn't enough, maybe pass did_some_progress as the actual
page counts instead of a bool, and only consider an LRU type
reclaimable if the last scan cycle reclaimed at least N% of it.
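Something like the following user-space sketch, where the threshold and
all names are illustrative, not existing kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the suggestion above: track reclaim progress per LRU type
 * and only treat an LRU as reclaimable when the last scan cycle freed
 * at least N% of what it scanned.
 */
#define MIN_EFFICIENCY_PCT 5	/* hypothetical N */

struct lru_progress {
	unsigned long scanned;
	unsigned long reclaimed;
};

bool lru_making_headway(const struct lru_progress *p)
{
	if (p->scanned == 0)
		return false;
	return p->reclaimed * 100 >= p->scanned * MIN_EFFICIENCY_PCT;
}

/* Reclaimable estimate that ignores LRUs where reclaim is spinning. */
unsigned long reclaimable_with_progress(unsigned long file_pages,
					const struct lru_progress *file,
					unsigned long anon_pages,
					const struct lru_progress *anon)
{
	unsigned long est = 0;

	if (lru_making_headway(file))
		est += file_pages;
	if (lru_making_headway(anon))
		est += anon_pages;
	return est;
}
```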
Re: [RFC PATCH 0/1] mm: Reduce direct reclaim stalls with RAM-backed swap
Posted by Matt Fleming 1 month ago
On Tue, Mar 03, 2026 at 02:35:12PM -0500, Johannes Weiner wrote:
> 
> What about when anon pages *are* reclaimable through compression,
> though? Then we'd declare OOM prematurely.
 
I agree this RFC is a rather blunt approach which is why I tried to
limit it to zram/brd specifically.

> You could make the case that what is reclaimable should have been
> reclaimed already by the time we get here. But then you could make the
> same case for file pages, and then there is nothing left.
> 
> The check is meant to be an optimization. The primary OOM cutoff is
> that we aren't able to reclaim anything. This reclaimable check is a
> shortcut that says, even if we are reclaiming some, there is not
> enough juice in that box to keep squeezing.
> 
> Have you looked at what exactly keeps resetting no_progress_loops when
> the system is in this state?
 
I pulled data for some of the current worst offenders, but I couldn't
catch any in this 20-30 min brownout situation. Still, I think the data
below illustrates the problem...

Across three machines, every reclaim_retry_zone event showed
no_progress_loops = 0 and wmark_check = pass. On the busiest node (141
retry events over 5 minutes), the reclaimable estimate ranged from 4.8M
to 5.3M pages (19-21 GiB). The counter never incremented once.

The reclaimable watermark check also always passes. The traced
reclaimable values (19-21 GiB per zone) trivially exceed the min
watermark (~68 MiB), so should_reclaim_retry() never falls through on
that path either.

Sample output from a bpftrace script [1] on the reclaim_retry_zone
tracepoint (LOOPS = no_progress_loops, WMARK = wmark_check):

  COMM             PID    NODE ORDER    RECLAIMABLE      AVAILABLE      MIN_WMARK LOOPS WMARK
  app1          2133536     4     0        4960156        5013010          17522     0     1
  app2          2337869     5     0        4845655        4901543          17521     0     1
  app3           339457     6     0        4823519        4838900          17522     0     1
  app4          2179800     6     0        4819201        4835085          17522     0     1
  app5          2299092     0     0        3566433        3595953          15821     0     1
  app6          2194373     7     0        5612347        5626651          17521     0     1
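In user-space terms, the check that keeps passing is roughly the
following (a simplification of should_reclaim_retry()'s watermark test,
which really goes through __zone_watermark_ok() with more inputs):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified shape of the watermark test: keep retrying reclaim while
 * the estimated available pages clear the zone's min watermark.
 */
bool wmark_check_passes(unsigned long available, unsigned long min_wmark)
{
	return available > min_wmark;
}
```

Plugging in the traced values, e.g. available = 5013010 against
min_wmark = 17522, the check passes by more than two orders of
magnitude, so this exit is never taken.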

Here are the numbers from a 5-minute bpftrace session on a node under
memory pressure:

  should_reclaim_retry:
    141 calls, no_progress_loops = 0 every time, wmark_check = pass every time
    reclaimable estimate: 4.8M - 5.3M pages (19-21 GiB)

  shrink_folio_list (mm_vmscan_lru_shrink_inactive) [2]:
    anon:  52M pages reclaimed / 244M scanned  (21% hit rate)
           53% of scan events reclaimed zero pages
    file:  33M pages reclaimed / 42M scanned   (78% hit rate)
           21% of scan events reclaimed zero pages

    priority distribution peaked at 2-3 (most aggressive levels)

[1] https://gist.github.com/mfleming/167b00bef7e1f4e686a6d32833c42079
[2] https://gist.github.com/mfleming/e31c86d3ab0a883e9053e19010150a13

A second node showed the same pattern: 18% anon scan efficiency vs 90%
file, no_progress_loops = 0, wmark always passes.

> I could see an argument that the two checks are not properly aligned
> right now. We could be making nominal forward progress on a small,
> heavily thrashing cache position only; but we'll keep looping because,
> well, look at all this anon memory! (Which isn't being reclaimed.)
>
> If that's the case, a better solution might be to split
> did_some_progress into anon and file progress, and only consider the
> LRU pages for which reclaim is actually making headway. And ignore
> those where we fail to succeed - for whatever reason, really, not just
> this particular zram situation.

Right. The mm_vmscan_lru_shrink_inactive tracepoint shows the anon LRU
being scanned aggressively at priority 1-3, but only 21% of scanned
pages are reclaimed. Meanwhile file reclaim runs at 78-90% efficiency
but there aren't enough file pages to satisfy the allocation.

> And if that isn't enough, maybe pass did_some_progress as the actual
> page counts instead of a bool, and only consider an LRU type
> reclaimable if the last scan cycle reclaimed at least N% of it.

Nice idea. I'll work on a patch.

Thanks,
Matt
Re: [RFC PATCH 0/1] mm: Reduce direct reclaim stalls with RAM-backed swap
Posted by Shakeel Butt 1 month, 1 week ago
Hi Matt,

Thanks for the report. One request I have is to avoid a cover letter
for a single patch, so as not to partition the discussion.

On Tue, Mar 03, 2026 at 11:53:57AM +0000, Matt Fleming wrote:
> From: Matt Fleming <mfleming@cloudflare.com>
> 
> Hi,
> 
> Systems with zram-only swap can spin in direct reclaim for 20-30
> minutes without ever invoking the OOM killer. We've hit this repeatedly
> in production on machines with 377 GiB RAM and a 377 GiB zram device.
> 

Have you tried zswap and if you see similar issues with zswap?

> The problem
> -----------
> 
> should_reclaim_retry() calls zone_reclaimable_pages() to estimate how
> much memory is still reclaimable. That estimate includes anonymous
> pages, on the assumption that swapping them out frees physical pages.
> 
> With disk-backed swap, that's true -- writing a page to disk frees a
> page of RAM, and SwapFree accurately reflects how many more pages can
> be written. With zram, the free slot count is inaccurate. A 377 GiB
> zram device with 10% used reports ~340 GiB of free swap slots, but
> filling those slots requires physical RAM that the system doesn't have
> -- that's why it's in direct reclaim in the first place.
> 
> The reclaimable estimate is off by orders of magnitude.
> 

Over time, we (the kernel MM community) have implicitly decided to keep
the kernel oom-killer very conservative, since adding more heuristics
to the reclaim/oom path makes the kernel less reliable, and to punt the
aggressiveness of oom-killing to userspace as policy. All major Linux
deployments have started using userspace oom-killers like systemd-oomd,
Android's LMKD, fb-oomd or some internal alternatives. That provides
more flexibility to define the aggressiveness of oom-killing based on
your business needs.

Userspace oom-killers are prone to reliability issues (the oom-killer
getting stuck in reclaim or not getting enough CPU), though, so we
(Roman) are working on adding support for a BPF-based oom-killer, where
we think oom policies can be implemented more reliably.

Anyways, I am wondering if you have tried systemd-oomd or some userspace
alternative. If you are interested in BPF oom-killer, we can help with that as
well.

thanks,
Shakeel
Re: [RFC PATCH 0/1] mm: Reduce direct reclaim stalls with RAM-backed swap
Posted by Jens Axboe 1 month ago
On 3/3/26 7:59 AM, Shakeel Butt wrote:
> Hi Matt,
> 
> Thanks for the report. One request I have is to avoid a cover letter
> for a single patch, so as not to partition the discussion.

No - cover letters are ALWAYS fine.

-- 
Jens Axboe
Re: [RFC PATCH 0/1] mm: Reduce direct reclaim stalls with RAM-backed swap
Posted by Matt Fleming 1 month ago
On Tue, Mar 03, 2026 at 06:59:04AM -0800, Shakeel Butt wrote:
> Hi Matt,
> 
> Thanks for the report. One request I have is to avoid a cover letter
> for a single patch, so as not to partition the discussion.
 
Noted.

> Have you tried zswap and if you see similar issues with zswap?
 
Yes, we've started experimenting with zswap but that's still in
progress.

> Over time, we (the kernel MM community) have implicitly decided to keep
> the kernel oom-killer very conservative, since adding more heuristics
> to the reclaim/oom path makes the kernel less reliable, and to punt the
> aggressiveness of oom-killing to userspace as policy. All major Linux
> deployments have started using userspace oom-killers like systemd-oomd,
> Android's LMKD, fb-oomd or some internal alternatives. That provides
> more flexibility to define the aggressiveness of oom-killing based on
> your business needs.
> 
> Userspace oom-killers are prone to reliability issues (the oom-killer
> getting stuck in reclaim or not getting enough CPU), though, so we
> (Roman) are working on adding support for a BPF-based oom-killer, where
> we think oom policies can be implemented more reliably.
> 
> Anyways, I am wondering if you have tried systemd-oomd or some userspace
> alternative. If you are interested in BPF oom-killer, we can help with that as
> well.

oomd is also being discussed, but we haven't experimented with it yet.

What's the status of BPF oom-killer: is this the latest?

  https://lore.kernel.org/all/20260127024421.494929-1-roman.gushchin@linux.dev/

Thanks,
Matt
Re: [RFC PATCH 0/1] mm: Reduce direct reclaim stalls with RAM-backed swap
Posted by Shakeel Butt 1 month ago
On Tue, Mar 03, 2026 at 07:37:54PM +0000, Matt Fleming wrote:
> On Tue, Mar 03, 2026 at 06:59:04AM -0800, Shakeel Butt wrote:
> > Hi Matt,
> > 
> > Thanks for the report. One request I have is to avoid a cover letter
> > for a single patch, so as not to partition the discussion.
>  
> Noted.
> 
> > Have you tried zswap and if you see similar issues with zswap?
>  
> Yes, we've started experimenting with zswap but that's still in
> progress.
> 
> > Over time, we (the kernel MM community) have implicitly decided to keep
> > the kernel oom-killer very conservative, since adding more heuristics
> > to the reclaim/oom path makes the kernel less reliable, and to punt the
> > aggressiveness of oom-killing to userspace as policy. All major Linux
> > deployments have started using userspace oom-killers like systemd-oomd,
> > Android's LMKD, fb-oomd or some internal alternatives. That provides
> > more flexibility to define the aggressiveness of oom-killing based on
> > your business needs.
> > 
> > Userspace oom-killers are prone to reliability issues (the oom-killer
> > getting stuck in reclaim or not getting enough CPU), though, so we
> > (Roman) are working on adding support for a BPF-based oom-killer, where
> > we think oom policies can be implemented more reliably.
> > 
> > Anyways, I am wondering if you have tried systemd-oomd or some userspace
> > alternative. If you are interested in BPF oom-killer, we can help with that as
> > well.
> 
> oomd is also being discussed but so far we haven't experimented with it
> yet.
> 
> What's the status of BPF oom-killer: is this the latest?
> 
>   https://lore.kernel.org/all/20260127024421.494929-1-roman.gushchin@linux.dev/

Yes, this is the latest, and I think Roman is planning to send the next
version soon.