mm: improve folio refcount scalability

[RFC PATCH 0/2] mm: improve folio refcount scalability

Posted by Gladyshev Ilya 1 month, 2 weeks ago

Intro
=====
This series improves small-file read performance and overall folio refcount
scalability by refactoring page_ref_add_unless() [the core of folio_try_get()].
It is an alternative to previous attempts to fix small-read performance by
avoiding refcount bumps [1][2].

Overview
========
The current refcount implementation uses a zero counter as the locked
(dead/frozen) state, which requires a CAS loop for increments to avoid
temporary unlocks in the try_get functions. These CAS loops become a
serialization point for the otherwise scalable and fast read side.

The proposed implementation separates the "locked" logic from the counting,
allowing the use of an optimistic fetch_add() instead of CAS. For more details,
please refer to the commit message of the patch itself.

The proposed logic maintains the same public API as before, including all
existing memory barrier guarantees.
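
As a rough userspace illustration of the difference described above (a sketch
using C11 atomics, not the kernel code; LOCKED_BIT and both function names are
hypothetical stand-ins), the two increment strategies might look like this:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the patch's PAGEREF_LOCKED_BIT. */
#define LOCKED_BIT (1u << 31)

/* Zero means locked, so every increment has to go through a CAS loop. */
static bool try_get_cas(_Atomic uint32_t *ref, uint32_t nr)
{
	uint32_t old = atomic_load_explicit(ref, memory_order_relaxed);

	do {
		if (old == 0)
			return false;	/* locked (dead/frozen) */
	} while (!atomic_compare_exchange_weak(ref, &old, old + nr));

	return true;
}

/* A dedicated bit means locked, so an unconditional fetch_add is enough. */
static bool try_get_fetch_add(_Atomic uint32_t *ref, uint32_t nr)
{
	uint32_t val = atomic_fetch_add(ref, nr) + nr;

	if (val & LOCKED_BIT) {
		/* Best effort: undo the stray increment; failure is tolerated. */
		atomic_compare_exchange_strong(ref, &val, LOCKED_BIT);
		return false;
	}
	return true;
}
```

On an unlocked counter the second variant never retries, which is where the
scalability gain is expected to come from.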

Drawbacks
=========
In theory, an optimistic fetch_add can overflow the atomic_t and reset the
locked state. Currently, this is mitigated via a single CAS operation after
the "failed" fetch_add, which tries to reset the counter to a locked zero.
While this best-effort approach doesn't have any strong guarantees, it's
unrealistic that there will be 2^31 highly contended try_get calls on a locked
folio and that the CAS operation will fail in every one of them.

If this guarantee isn't sufficient, it can be improved by performing a full
CAS loop when the counter is approaching overflow.
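
To put the 2^31 figure in context, here is a small arithmetic sketch. It
assumes the locked bit is the top bit of a 32-bit counter (an assumption of
this example), and the helper name is hypothetical. It counts how many
increments, none of them ever reverted, would wrap a locked counter and clear
the bit:

```c
#include <stdint.h>

/* Assumed layout: the top bit of a 32-bit counter marks the locked state. */
#define LOCKED_BIT (UINT32_C(1) << 31)

/*
 * Number of increments of size nr that, if none were ever undone, would
 * wrap the counter past 2^32 starting from val.
 */
static uint64_t increments_until_unlock(uint32_t val, uint32_t nr)
{
	uint64_t distance = (UINT64_C(1) << 32) - val;	/* to the wrap point */

	return (distance + nr - 1) / nr;		/* round up */
}
```

With val = LOCKED_BIT and nr = 1 this gives 2^31, matching the estimate above;
larger per-call increments shrink the count proportionally.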

Performance
===========
Performance was measured using a simple custom benchmark based on
will-it-scale[3]. This benchmark spawns N pinned threads/processes that
execute the following loop:
``
char buf[64];
int fd = open(/* same file in tmpfs */, O_RDONLY);

while (true) {
    /* Always read the same 64 bytes at offset 0. */
    pread(fd, buf, /* read size = */ 64, /* offset = */ 0);
}
``
While this is a synthetic load, it does highlight an existing issue and
does not differ much from the benchmark used in patch [2].

The benchmark measures operations per second in the inner loop and sums the
results across all workers. Performance was tested on top of the v6.15 kernel [4]
on two platforms. Since threads and processes showed similar performance on
both systems, only the thread results are provided below. The performance
improvement scales linearly between the CPU counts shown.

Platform 1: 2 x E5-2690 v3, 12C/12T each [disabled SMT]

#threads | vanilla | patched | boost (%)
       1 | 1343381 | 1344401 |  +0.1
       2 | 2186160 | 2455837 | +12.3
       5 | 5277092 | 6108030 | +15.7
      10 | 5858123 | 7506328 | +28.1
      12 | 6484445 | 8137706 | +25.5
         /* Cross socket NUMA */
      14 | 3145860 | 4247391 | +35.0
      16 | 2350840 | 4262707 | +81.3
      18 | 2378825 | 4121415 | +73.2
      20 | 2438475 | 4683548 | +92.1
      24 | 2325998 | 4529737 | +94.7

Platform 2: 2 x AMD EPYC 9654, 96C/192T each [enabled SMT]

#threads | vanilla | patched | boost (%)
       1 | 1077276 | 1081653 |  +0.4
       5 | 4286838 | 4682513 |  +9.2
      10 | 1698095 | 1902753 | +12.1
      20 | 1662266 | 1921603 | +15.6
      49 | 1486745 | 1828926 | +23.0
      97 | 1617365 | 2052635 | +26.9
         /* Cross socket NUMA */
     105 | 1368319 | 1798862 | +31.5
     136 | 1008071 | 1393055 | +38.2
     168 |  879332 | 1245210 | +41.6
               /* SMT */
     193 |  905432 | 1294833 | +43.0
     289 |  851988 | 1313110 | +54.1
     353 |  771288 | 1347165 | +74.7
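
The boost column in both tables is consistent with
(patched / vanilla - 1) * 100. A small helper (hypothetical name, shown only
to make the arithmetic explicit) for recomputing it from the raw numbers:

```c
/* Relative improvement of patched over vanilla, in percent. */
static double boost_percent(double vanilla, double patched)
{
	return (patched / vanilla - 1.0) * 100.0;
}
```

For example, the 16-thread row on Platform 1 (2350840 vs 4262707 ops/s) works
out to roughly +81.3.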

[1] https://lore.kernel.org/linux-mm/CAHk-=wj00-nGmXEkxY=-=Z_qP6kiGUziSFvxHJ9N-cLWry5zpA@mail.gmail.com/
[2] https://lore.kernel.org/linux-mm/20251017141536.577466-1-kirill@shutemov.name/
[3] https://github.com/antonblanchard/will-it-scale
[4] There were no changes to page_ref.h between v6.15 and v6.18, nor any
    significant performance changes on the read side in mm/filemap.c.

Gladyshev Ilya (2):
  mm: make ref_unless functions unless_zero only
  mm: implement page refcount locking via dedicated bit

 include/linux/mm.h         |  2 +-
 include/linux/page-flags.h |  9 ++++++---
 include/linux/page_ref.h   | 35 ++++++++++++++++++++++++++---------
 3 files changed, 33 insertions(+), 13 deletions(-)

-- 
2.43.0

Re: [RFC PATCH 0/2] mm: improve folio refcount scalability

Posted by Gladyshev Ilya 3 weeks, 5 days ago

Gentle ping on this proposal


Re: [RFC PATCH 0/2] mm: improve folio refcount scalability

Posted by Kiryl Shutsemau 3 weeks, 5 days ago

On Mon, Jan 12, 2026 at 11:30:38AM +0300, Gladyshev Ilya wrote:
> Gentle ping on this proposal

I generally like the idea, but I would like to hear from folks who
actually understand serialization.

Also, do you have numbers for "a full CAS loop when the counter is
approaching overflow" thing?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Re: [RFC PATCH 0/2] mm: improve folio refcount scalability

Posted by Gladyshev Ilya 3 weeks, 5 days ago

On 1/12/2026 2:49 PM, Kiryl Shutsemau wrote:
> On Mon, Jan 12, 2026 at 11:30:38AM +0300, Gladyshev Ilya wrote:
>> Gentle ping on this proposal
> 
> I generally like the idea, but I would like to hear from folks who
> actually understand serialization.
> 
> Also, do you have numbers for "a full CAS loop when the counter is
> approaching overflow" thing?
> 
I am not sure that overflow is a real problem because you need a very 
specific race condition over a long time to achieve it... But as a 
safeguard, everything lower than 2^31 - #max concurrent accesses (~#num 
cpu) should work, so let's say 2^30

Re: [RFC PATCH 0/2] mm: improve folio refcount scalability

Posted by Kiryl Shutsemau 3 weeks, 4 days ago

On Mon, Jan 12, 2026 at 05:32:10PM +0300, Gladyshev Ilya wrote:
> On 1/12/2026 2:49 PM, Kiryl Shutsemau wrote:
> > On Mon, Jan 12, 2026 at 11:30:38AM +0300, Gladyshev Ilya wrote:
> > > Gentle ping on this proposal
> > 
> > I generally like the idea, but I would like to hear from folks who
> > actually understand serialization.
> > 
> > Also, do you have numbers for "a full CAS loop when the counter is
> > approaching overflow" thing?
> > 
> I am not sure that overflow is a real problem because you need a very
> specific race condition over a long time to achieve it...

Yes. But if the page is popular for pinning, GUP_PIN_COUNTING_BIAS can
cut the "very long time" substantially.

> But as a safeguard, everything lower than 2^31 - #max concurrent
> accesses (~#num cpu) should work, so let's say 2^30

What I meant is when we put a branch/loop in the hot path, your
performance numbers will likely not look as attractive. Am I wrong?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Re: [RFC PATCH 0/2] mm: improve folio refcount scalability

Posted by Gladyshev Ilya 3 weeks, 4 days ago

On 1/12/2026 7:17 PM, Kiryl Shutsemau wrote:
> On Mon, Jan 12, 2026 at 05:32:10PM +0300, Gladyshev Ilya wrote:
>> On 1/12/2026 2:49 PM, Kiryl Shutsemau wrote:
>>> On Mon, Jan 12, 2026 at 11:30:38AM +0300, Gladyshev Ilya wrote:
>>>> Gentle ping on this proposal
>>>
>>> I generally like the idea, but I would like to hear from folks who
>>> actually understand serialization.
>>>
>>> Also, do you have numbers for "a full CAS loop when the counter is
>>> approaching overflow" thing?
>>>
>> I am not sure that overflow is a real problem because you need a very
>> specific race condition over a long time to achieve it...
> 
> Yes. But if the page is popular for pinning, GUP_PIN_COUNTING_BIAS can
> cut the "very long time" substantially.
> 
>> But as a safeguard, everything lower than 2^31 - #max concurrent
>> accesses (~#num cpu) should work, so let's say 2^30
> 
> What I meant is when we put a branch/loop in the hot path, your
> performance numbers will likely not look as attractive. Am I wrong?
> 
It would be under the same branch as the single CAS that already exists 
in this patch:

   if (page_count_writable(page)) {
     val = atomic_add_return(nr, &page->_refcount);
     ret = !(val & PAGEREF_LOCKED_BIT);

     if (unlikely(!ret)) {
       atomic_cmpxchg_relaxed(&page->_refcount, val, PAGEREF_LOCKED_BIT);
       /* [Proposed] if (failed && big enough) { CAS loop } */
     }
   }

Unless a failed try_lock() is a hot path somewhere [1], the added branch
will be hidden under the existing, unlikely-taken branch.

[1]: Which I doubt, because a failed try_lock() usually involves a heavy
re-lookup.
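
A userspace sketch of how that proposed fallback could sit entirely inside the
cold path, using C11 atomics rather than the kernel primitives. LOCKED_BIT,
OVERFLOW_THRESHOLD (set to the 2^30 suggested earlier in the thread) and the
function name are assumptions of this example, not code from the patch:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define LOCKED_BIT         (1u << 31)	/* stand-in for PAGEREF_LOCKED_BIT */
#define OVERFLOW_THRESHOLD (1u << 30)	/* assumed safeguard value */

static bool try_get_guarded(_Atomic uint32_t *ref, uint32_t nr)
{
	uint32_t val = atomic_fetch_add(ref, nr) + nr;

	if (!(val & LOCKED_BIT))
		return true;		/* hot path: no extra branch taken */

	/* Cold path: one cheap attempt to restore "locked zero". */
	if (atomic_compare_exchange_strong(ref, &val, LOCKED_BIT))
		return false;

	/*
	 * The CAS failed and stored the current value in val. Only insist on
	 * the reset once stray increments have grown past the threshold.
	 */
	while ((val & LOCKED_BIT) && (val & ~LOCKED_BIT) >= OVERFLOW_THRESHOLD) {
		if (atomic_compare_exchange_weak(ref, &val, LOCKED_BIT))
			break;
	}
	return false;
}
```

The loop is only reachable after both the fetch_add and the single CAS have
failed, so the successful try_get path is unchanged.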