[v4] mm: improve folio refcount scalability

[PATCH v4 0/2] mm: improve folio refcount scalability

Posted by Gladyshev Ilya 2 days, 4 hours ago

This is v4 of the series, fixing some dumb mistakes from v3:
- Fix asserts that were never firing
- Rename set_page_count_as_frozen -> set_page_count_frozen
  (
   I don't really like the proposed "init" in the function name.
   For consistency, we can rename init_page_count -> set_page_count_init()
   However, if anyone insists, I will use the proposed init_...() naming
  )
- Set proper frozen value in the second patch
- Use VM_BUG_ON_PAGE instead of VM_BUG_ON

Original cover letter posted below:

Intro
=====
This patch optimizes small-file read performance and overall folio refcount
scalability by refactoring page_ref_add_unless() [the core of
folio_try_get()]. It is an alternative to previous attempts that tried to
fix small-read performance by avoiding refcount bumps [1][2].

Overview
========
The current refcount implementation uses a zero counter as the locked
(dead/frozen) state, which requires a CAS loop for increments to avoid
temporary unlocks in the try_get functions. These CAS loops become a
serialization point for an otherwise scalable and fast read side.

The proposed implementation separates the "locked" logic from the counting,
allowing the use of an optimistic fetch_add() instead of CAS. For more
details, please refer to the commit message of the patch itself.

The proposed logic maintains the same public API as before, including all
existing memory barrier guarantees.
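
For readers who want a concrete picture of the difference, here is a minimal
userspace sketch using C11 atomics. It is not the code from the patches: the
names try_get_cas(), try_get_fetch_add(), refcount_demo() and REF_LOCKED_BIT,
and the choice of the top bit as the locked bit, are assumptions made purely
for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical: the top bit marks the counter as locked (frozen/dead). */
#define REF_LOCKED_BIT 0x80000000u

/*
 * Zero-as-locked scheme: a plain increment would turn 0 into 1 and
 * temporarily "unlock" the counter, so the increment must be a CAS loop.
 */
bool try_get_cas(atomic_uint *ref)
{
    unsigned int old = atomic_load(ref);

    do {
        if (old == 0)
            return false;   /* frozen or dead, leave it untouched */
    } while (!atomic_compare_exchange_weak(ref, &old, old + 1));

    return true;
}

/*
 * Dedicated-bit scheme: increment optimistically, then inspect the bit.
 * An increment below the locked bit cannot clear the bit itself.
 */
bool try_get_fetch_add(atomic_uint *ref)
{
    unsigned int old = atomic_fetch_add(ref, 1);

    return !(old & REF_LOCKED_BIT);
}

/* Small self-check: both variants refuse a locked counter. */
void refcount_demo(void)
{
    atomic_uint frozen_zero = 0;
    atomic_uint frozen_bit = REF_LOCKED_BIT;

    assert(!try_get_cas(&frozen_zero));
    assert(!try_get_fetch_add(&frozen_bit));
    assert(atomic_load(&frozen_bit) & REF_LOCKED_BIT);
}
```

In this sketch a failed try_get_fetch_add() leaves its increment behind, so
it assumes that whoever clears the locked bit stores a fresh count instead of
subtracting from the old one. The actual series may handle this differently.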

Performance
===========
Performance was measured using a simple custom benchmark based on
will-it-scale[3]. This benchmark spawns N pinned threads/processes that
execute the following loop:
``
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

char buf[64];
int fd = open(/* same file in tmpfs */, O_RDONLY);

while (true) {
    pread(fd, buf, /* read size = */ 64, /* offset = */ 0);
}
``
While this is a synthetic load, it does highlight an existing issue and
doesn't differ much from the benchmarking done for patch [2].

This benchmark measures operations per second in the inner loop and
aggregates the results across all workers. Performance was tested on top of
the v6.15 kernel on two platforms. Since threads and processes showed similar
performance on both systems, only the thread results are provided below. The
performance improvement scales linearly between the CPU counts shown.

Platform 1: 2 x E5-2690 v3, 12C/12T each [SMT disabled]

#threads | vanilla | patched | boost (%)
       1 | 1343381 | 1344401 |  +0.1
       2 | 2186160 | 2455837 | +12.3
       5 | 5277092 | 6108030 | +15.7
      10 | 5858123 | 7506328 | +28.1
      12 | 6484445 | 8137706 | +25.5
         /* Cross socket NUMA */
      14 | 3145860 | 4247391 | +35.0
      16 | 2350840 | 4262707 | +81.3
      18 | 2378825 | 4121415 | +73.2
      20 | 2438475 | 4683548 | +92.1
      24 | 2325998 | 4529737 | +94.7

Platform 2: 2 x AMD EPYC 9654, 96C/192T each [SMT enabled]

#threads | vanilla | patched | boost (%)
       1 | 1077276 | 1081653 |  +0.4
       5 | 4286838 | 4682513 |  +9.2
      10 | 1698095 | 1902753 | +12.1
      20 | 1662266 | 1921603 | +15.6
      49 | 1486745 | 1828926 | +23.0
      97 | 1617365 | 2052635 | +26.9
         /* Cross socket NUMA */
     105 | 1368319 | 1798862 | +31.5
     136 | 1008071 | 1393055 | +38.2
     168 |  879332 | 1245210 | +41.6
               /* SMT */
     193 |  905432 | 1294833 | +43.0
     289 |  851988 | 1313110 | +54.1
     353 |  771288 | 1347165 | +74.7
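
The boost column is not defined explicitly above; it appears to be the
relative gain of patched over vanilla throughput. A minimal sketch of that
arithmetic follows, with boost_percent() and boost_demo() as hypothetical
helper names:

```c
#include <assert.h>

/* Relative throughput gain of patched over vanilla, in percent. */
double boost_percent(double vanilla, double patched)
{
    return (patched / vanilla - 1.0) * 100.0;
}

/* Check the formula against the 2-thread row of Platform 1. */
void boost_demo(void)
{
    double b = boost_percent(2186160.0, 2455837.0);

    assert(b > 12.3 && b < 12.4);
}
```

For the 2-thread row of Platform 1 this gives (2455837 / 2186160 - 1) * 100,
which is about 12.3 and matches the table.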

[0]: https://lore.kernel.org/lkml/cover.1776350895.git.gorbunov.ivan@h-partners.com/
[1]: https://lore.kernel.org/linux-mm/CAHk-=wj00-nGmXEkxY=-=Z_qP6kiGUziSFvxHJ9N-cLWry5zpA@mail.gmail.com/
[2]: https://lore.kernel.org/linux-mm/20251017141536.577466-1-kirill@shutemov.name/
[3]: https://github.com/antonblanchard/will-it-scale

---

Link to v3: https://lore.kernel.org/linux-mm/5dabf3a748fee0c7b142c74367e7586f5db1ed1e@linux.dev/

Gladyshev Ilya (1):
  mm: implement page refcount locking via dedicated bit

Gorbunov Ivan (1):
  mm: drop page refcount zero state semantics

 drivers/pci/p2pdma.c               |  4 +-
 include/linux/mm.h                 |  2 +-
 include/linux/page-flags.h         | 13 +++++++
 include/linux/page_ref.h           | 62 +++++++++++++++++++++++++-----
 kernel/liveupdate/kexec_handover.c |  6 +--
 lib/test_hmm.c                     |  4 +-
 mm/hugetlb.c                       |  2 +-
 mm/internal.h                      |  2 +-
 mm/memremap.c                      |  4 +-
 mm/mm_init.c                       |  6 +--
 mm/page_alloc.c                    |  4 +-
 11 files changed, 82 insertions(+), 27 deletions(-)


base-commit: 2d3090a8aeb596a26935db0955d46c9a5db5c6ce
-- 
2.54.0

Re: [PATCH v4 0/2] mm: improve folio refcount scalability

Posted by Andrew Morton 2 days, 3 hours ago

On Mon, 08 Jun 2026 21:53:01 +0000 "Gladyshev Ilya" <ilya.gladyshev@linux.dev> wrote:

> This patch optimizes small file read performance and overall folio refcount
> scalability by refactoring page_ref_add_unless [core of folio_try_get].
> This is alternative approach to previous attempts to fix small read
> performance by avoiding refcount bumps [1][2].

Thanks.  Nice numbers.

AI review had some things to say:
	https://sashiko.dev/#/patchset/df26082871b4c65b2bd38d409026237c08572836@linux.dev

I'm not sure we want all those new VM_BUG_ON_PAGE() calls in the long
term.  They look like development-time assistance.  Perhaps you could
make those a standalone patch at tail-of-series so we can keep it in
linux-next for a couple of months then throw it away before any
upstreaming?

Re: [PATCH v4 0/2] mm: improve folio refcount scalability

Posted by Gladyshev Ilya 1 day, 7 hours ago

> 
> On Mon, 08 Jun 2026 21:53:01 +0000 "Gladyshev Ilya" <ilya.gladyshev@linux.dev> wrote:
> 
> > 
> > This patch optimizes small file read performance and overall folio refcount
> >  scalability by refactoring page_ref_add_unless [core of folio_try_get].
> >  This is alternative approach to previous attempts to fix small read
> >  performance by avoiding refcount bumps [1][2].
> > 
> Thanks. Nice numbers.
> 
> AI review had some things to say:
>  https://sashiko.dev/#/patchset/df26082871b4c65b2bd38d409026237c08572836@linux.dev

Will look into it, thanks

> I'm not sure we want all those new VM_BUG_ON_PAGE() calls in the long
> term. They look like development-time assistance. Perhaps you could
> make those a standalone patch at tail-of-series so we can keep it in
> linux-next for a couple of months then throw it away before any
> upstreaming?

They are cheap and can catch bugs that are very difficult to debug. So
I'd like to keep them, if possible. Maybe change to WARN_ONCE, as David
suggested.

Re: [PATCH v4 0/2] mm: improve folio refcount scalability

Posted by Pedro Falcato 1 day, 5 hours ago

On Tue, Jun 09, 2026 at 07:02:25PM +0000, Gladyshev Ilya wrote:
> > 
> > On Mon, 08 Jun 2026 21:53:01 +0000 "Gladyshev Ilya" <ilya.gladyshev@linux.dev> wrote:
> > 
> > > 
> > > This patch optimizes small file read performance and overall folio refcount
> > >  scalability by refactoring page_ref_add_unless [core of folio_try_get].
> > >  This is alternative approach to previous attempts to fix small read
> > >  performance by avoiding refcount bumps [1][2].
> > > 
> > Thanks. Nice numbers.
> > 
> > AI review had some things to say:
> >  https://sashiko.dev/#/patchset/df26082871b4c65b2bd38d409026237c08572836@linux.dev
> 
> Will look into it, thanks
> 
> > I'm not sure we want all those new VM_BUG_ON_PAGE() calls in the long
> > term. They look like development-time assistance. Perhaps you could
> > make those a standalone patch at tail-of-series so we can keep it in
> > linux-next for a couple of months then throw it away before any
> > upstreaming?
> 
> They are cheap and can catch bugs that are very difficult to debug. So
> I'd like to keep them, if possible. Maybe change to WARN_ONCE, as David
> suggested.

Fully agree. I think it makes sense to keep these assertions for DEBUG_VM=y
(which no one should be running in prod anyway..)

-- 
Pedro

Re: [PATCH v4 0/2] mm: improve folio refcount scalability

Posted by David Hildenbrand (Arm) 1 day, 15 hours ago

On 6/9/26 00:47, Andrew Morton wrote:
> On Mon, 08 Jun 2026 21:53:01 +0000 "Gladyshev Ilya" <ilya.gladyshev@linux.dev> wrote:
> 
>> This patch optimizes small file read performance and overall folio refcount
>> scalability by refactoring page_ref_add_unless [core of folio_try_get].
>> This is alternative approach to previous attempts to fix small read
>> performance by avoiding refcount bumps [1][2].
> 
> Thanks.  Nice numbers.
> 
> AI review had some things to say:
> 	https://sashiko.dev/#/patchset/df26082871b4c65b2bd38d409026237c08572836@linux.dev
> 
> I'm not sure we want all those new VM_BUG_ON_PAGE() calls in the long
> term.  They look like development-time assistance. 

VM_WARN_ON_ONCE_PAGE is usually more than sufficient to catch stuff early during
testing.

-- 
Cheers,

David