The current atomic-based page refcount implementation treats a zero
counter as dead and requires a compare-and-swap loop in folio_try_get()
to prevent incrementing a dead refcount. This CAS loop acts as a
serialization point and can become a significant bottleneck during
high-frequency file read operations.
This patch introduces FOLIO_LOCKED_BIT to distinguish between a
(temporary) zero refcount and a locked (dead/frozen) state. Because now
incrementing the counter doesn't affect its locked/unlocked state, it is
possible to use an optimistic atomic_fetch_add() in
page_ref_add_unless_zero() that operates independently of the locked bit.
The locked state is handled after the increment attempt, eliminating the
need for the CAS loop.
Co-developed-by: Gorbunov Ivan <gorbunov.ivan@h-partners.com>
Signed-off-by: Gorbunov Ivan <gorbunov.ivan@h-partners.com>
Signed-off-by: Gladyshev Ilya <gladyshev.ilya1@h-partners.com>
---
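(Not part of the patch: a rough userspace C11 sketch of the before/after
behaviour of the speculative get, for reviewers. The helper names are made
up, and kernel details such as tracepoints, the vmemmap writability check
and tail-page handling are omitted.)

#include <stdatomic.h>
#include <stdbool.h>

#define REF_LOCKED_BIT (1u << 31)       /* stands in for PAGEREF_LOCKED_BIT */

/* Before: zero means dead, so the increment must be conditional.
 * atomic_add_unless() is a compare-and-swap loop under the hood and
 * serializes all concurrent getters on the same counter. */
static bool try_get_old(atomic_uint *ref)
{
        unsigned int old = atomic_load(ref);

        do {
                if (old == 0)
                        return false;   /* dead, must not resurrect */
        } while (!atomic_compare_exchange_weak(ref, &old, old + 1));
        return true;
}

/* After: dead/frozen is a separate bit, so the increment can be a plain
 * fetch-add and the bit is only inspected afterwards. */
static bool try_get_new(atomic_uint *ref)
{
        unsigned int val = atomic_fetch_add(ref, 1) + 1;

        if (!(val & REF_LOCKED_BIT))
                return true;
        /* lost against a freeze/free: restore the pure locked value */
        atomic_compare_exchange_strong(ref, &val, REF_LOCKED_BIT);
        return false;
}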
include/linux/page-flags.h | 5 ++++-
include/linux/page_ref.h | 25 +++++++++++++++++++++----
2 files changed, 25 insertions(+), 5 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7c2195baf4c1..f2a9302104eb 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -196,6 +196,9 @@ enum pageflags {
#define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
+/* Most significant bit in page refcount */
+#define PAGEREF_LOCKED_BIT (1 << 31)
+
#ifndef __GENERATING_BOUNDS_H
#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
@@ -257,7 +260,7 @@ static __always_inline bool page_count_writable(const struct page *page)
* The refcount check also prevents modification attempts to other (r/o)
* tail pages that are not fake heads.
*/
- if (!atomic_read_acquire(&page->_refcount))
+ if (atomic_read_acquire(&page->_refcount) & PAGEREF_LOCKED_BIT)
return false;
return page_fixed_fake_head(page) == page;
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index b0e3f4a4b4b8..98717fd25306 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -64,7 +64,12 @@ static inline void __page_ref_unfreeze(struct page *page, int v)
static inline int page_ref_count(const struct page *page)
{
- return atomic_read(&page->_refcount);
+ int val = atomic_read(&page->_refcount);
+
+ if (unlikely(val & PAGEREF_LOCKED_BIT))
+ return 0;
+
+ return val;
}
/**
@@ -176,6 +181,9 @@ static inline int page_ref_sub_and_test(struct page *page, int nr)
{
int ret = atomic_sub_and_test(nr, &page->_refcount);
+ if (ret)
+ ret = !atomic_cmpxchg_relaxed(&page->_refcount, 0, PAGEREF_LOCKED_BIT);
+
if (page_ref_tracepoint_active(page_ref_mod_and_test))
__page_ref_mod_and_test(page, -nr, ret);
return ret;
@@ -204,6 +212,9 @@ static inline int page_ref_dec_and_test(struct page *page)
{
int ret = atomic_dec_and_test(&page->_refcount);
+ if (ret)
+ ret = !atomic_cmpxchg_relaxed(&page->_refcount, 0, PAGEREF_LOCKED_BIT);
+
if (page_ref_tracepoint_active(page_ref_mod_and_test))
__page_ref_mod_and_test(page, -1, ret);
return ret;
@@ -231,11 +242,17 @@ static inline int folio_ref_dec_return(struct folio *folio)
static inline bool page_ref_add_unless_zero(struct page *page, int nr)
{
bool ret = false;
+ int val;
rcu_read_lock();
/* avoid writing to the vmemmap area being remapped */
- if (page_count_writable(page))
- ret = atomic_add_unless(&page->_refcount, nr, 0);
+ if (page_count_writable(page)) {
+ val = atomic_add_return(nr, &page->_refcount);
+ ret = !(val & PAGEREF_LOCKED_BIT);
+
+ if (unlikely(!ret))
+ atomic_cmpxchg_relaxed(&page->_refcount, val, PAGEREF_LOCKED_BIT);
+ }
rcu_read_unlock();
if (page_ref_tracepoint_active(page_ref_mod_unless))
@@ -271,7 +288,7 @@ static inline bool folio_ref_try_add(struct folio *folio, int count)
static inline int page_ref_freeze(struct page *page, int count)
{
- int ret = likely(atomic_cmpxchg(&page->_refcount, count, 0) == count);
+ int ret = likely(atomic_cmpxchg(&page->_refcount, count, PAGEREF_LOCKED_BIT) == count);
if (page_ref_tracepoint_active(page_ref_freeze))
__page_ref_freeze(page, count, ret);
--
2.43.0
On Fri, Dec 19, 2025 at 12:46:39PM +0000, Gladyshev Ilya wrote:
> The current atomic-based page refcount implementation treats a zero
> counter as dead and requires a compare-and-swap loop in folio_try_get()
> to prevent incrementing a dead refcount. This CAS loop acts as a
> serialization point and can become a significant bottleneck during
> high-frequency file read operations.
>
> This patch introduces FOLIO_LOCKED_BIT to distinguish between a
> (temporary) zero refcount and a locked (dead/frozen) state. Because now
> incrementing the counter doesn't affect its locked/unlocked state, it is
> possible to use an optimistic atomic_fetch_add() in
> page_ref_add_unless_zero() that operates independently of the locked bit.
> The locked state is handled after the increment attempt, eliminating the
> need for the CAS loop.
>

Such a fundamental change needs additional validation to show there are no
obvious failures. Have you run this through a model checker to verify
the only failure condition is the 2^31 overflow condition you describe?

A single benchmark and a short changelog leaves me very uneasy about
such a change.

~Gregory
On 12/19/2025 9:17 PM, Gregory Price wrote:
> On Fri, Dec 19, 2025 at 12:46:39PM +0000, Gladyshev Ilya wrote:
>> The current atomic-based page refcount implementation treats a zero
>> counter as dead and requires a compare-and-swap loop in folio_try_get()
>> to prevent incrementing a dead refcount. This CAS loop acts as a
>> serialization point and can become a significant bottleneck during
>> high-frequency file read operations.
>>
>> This patch introduces FOLIO_LOCKED_BIT to distinguish between a
>> (temporary) zero refcount and a locked (dead/frozen) state. Because now
>> incrementing the counter doesn't affect its locked/unlocked state, it is
>> possible to use an optimistic atomic_fetch_add() in
>> page_ref_add_unless_zero() that operates independently of the locked bit.
>> The locked state is handled after the increment attempt, eliminating the
>> need for the CAS loop.
>>
>
> Such a fundamental change needs additional validation to show there are no
> obvious failures. Have you run this through a model checker to verify
> the only failure condition is the 2^31 overflow condition you describe?
Aside from extensive logical reasoning, I validated some racy situations
via tools/memory-model model checking:
1. Increment vs. free race (bad output: use-after-free | memory leak)
2. Free vs. free race (bad output: double free | memory leak)
3. Increment vs. freeze (bad output: both fail)
4. Increment vs. unfreeze (bad output: missed increment)
If there are other scenarios you are concerned about, I will model them
as well. You can find the litmus tests at the end of this email.
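(They are written against the LKMM and, assuming herd7 is installed, can be
checked with the usual flow, e.g. running
"herd7 -conf linux-kernel.cfg litmus-tests/folio_refcount/free_free_race.litmus"
from within tools/memory-model/.)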
> A single benchmark and a short changelog leaves me very uneasy about
> such a change.
This RFC submission was primarily focused on demonstrating the concept
and the performance gain for the reported bottleneck. I will improve the
changelog (and safety reasoning) for later submissions, as well as the
benchmarking side.
---
Note: I used 32 as the locked bit in the model tests for better readability.
It doesn't affect the results.
---
diff --git a/tools/memory-model/litmus-tests/folio_refcount/free_free_race.litmus b/tools/memory-model/litmus-tests/folio_refcount/free_free_race.litmus
new file mode 100644
index 000000000000..4dc7e899245b
--- /dev/null
+++ b/tools/memory-model/litmus-tests/folio_refcount/free_free_race.litmus
@@ -0,0 +1,37 @@
+C free_vs_free_race
+
+(* Result: Never
+ *
+ * Both P0 and P1 try to decrement the refcount.
+ *
+ * Expected result: only one deallocation (r0 xor r1 == 1)
+ * which is equal to r0 != r1 => bad result is r0 == r1
+*)
+
+{
+ int refcount = 2;
+}
+
+P0(int *refcount)
+{
+ int r0;
+
+ r0 = atomic_dec_and_test(refcount);
+ if (r0) {
+ r0 = atomic_cmpxchg_relaxed(refcount, 0, 32) == 0;
+ }
+}
+
+
+P1(int *refcount)
+{
+ int r1;
+
+ r1 = atomic_dec_and_test(refcount);
+ if (r1) {
+ r1 = atomic_cmpxchg_relaxed(refcount, 0, 32) == 0;
+ }
+}
+
+exists (0:r0 == 1:r1)
+
diff --git a/tools/memory-model/litmus-tests/folio_refcount/inc_free_race.litmus b/tools/memory-model/litmus-tests/folio_refcount/inc_free_race.litmus
new file mode 100644
index 000000000000..863abba48415
--- /dev/null
+++ b/tools/memory-model/litmus-tests/folio_refcount/inc_free_race.litmus
@@ -0,0 +1,34 @@
+C inc_free_race
+
+(* Result: Never
+ *
+ * P0 tries to decrement and free the object.
+ * P1 tries to acquire it.
+ * Expected result: one of them fails (r0 xor r1 == 1),
+ * so bad result is r0 == r1
+*)
+
+{
+ int refcount = 1;
+}
+
+P0(int *refcount)
+{
+ int r0;
+
+ r0 = atomic_dec_and_test(refcount);
+ if (r0) {
+ r0 = atomic_cmpxchg_relaxed(refcount, 0, 32) == 0;
+ }
+}
+
+
+P1(int *refcount)
+{
+ int r1;
+
+ r1 = atomic_add_return(1, refcount);
+ r1 = (r1 & (32)) == 0;
+}
+
+exists (0:r0 == 1:r1)
diff --git a/tools/memory-model/litmus-tests/folio_refcount/inc_freeze_race.litmus b/tools/memory-model/litmus-tests/folio_refcount/inc_freeze_race.litmus
new file mode 100644
index 000000000000..6e3a4112080c
--- /dev/null
+++ b/tools/memory-model/litmus-tests/folio_refcount/inc_freeze_race.litmus
@@ -0,0 +1,31 @@
+C inc_freeze_race
+
+(* Result: Never
+ *
+ * P0 tries to freeze the counter with value 3 (can be arbitrary).
+ * P1 tries to acquire a reference.
+ * Expected result: one of them fails (r0 xor r1 == 1),
+ * so bad result is r0 == r1 (= 0, 1).
+*)
+
+{
+ int refcount = 3;
+}
+
+P0(int *refcount)
+{
+ int r0;
+
+ r0 = atomic_cmpxchg(refcount, 3, 32);
+}
+
+
+P1(int *refcount)
+{
+ int r0;
+
+ r0 = atomic_add_return(1, refcount);
+ r0 = (r0 & (32)) == 0;
+}
+
+exists (0:r0 == 1:r0)
diff --git a/tools/memory-model/litmus-tests/folio_refcount/inc_unfreeze_race.litmus b/tools/memory-model/litmus-tests/folio_refcount/inc_unfreeze_race.litmus
new file mode 100644
index 000000000000..f7e2273fe7da
--- /dev/null
+++ b/tools/memory-model/litmus-tests/folio_refcount/inc_unfreeze_race.litmus
@@ -0,0 +1,30 @@
+C inc_unfreeze_race
+
+(* Result: Never
+ *
+ * P0 tries to unfreeze the refcount with saved value 3.
+ * P1 tries to acquire a reference.
+ *
+ * Expected result: P1 fails or in the end refcount is 4
+ * Bad result: missed refcount increment
+*)
+
+{
+ int refcount = 32;
+}
+
+P0(int *refcount)
+{
+ smp_store_release(refcount, 3);
+}
+
+
+P1(int *refcount)
+{
+ int r0;
+
+ r0 = atomic_add_return(1, refcount);
+ r0 = (r0 & (32)) == 0;
+}
+
+exists (1:r0=1 /\ refcount != 4)
On Fri, Dec 19, 2025 at 12:46:39PM +0000, Gladyshev Ilya wrote:
> The current atomic-based page refcount implementation treats a zero
> counter as dead and requires a compare-and-swap loop in folio_try_get()
> to prevent incrementing a dead refcount. This CAS loop acts as a
> serialization point and can become a significant bottleneck during
> high-frequency file read operations.
>
> This patch introduces FOLIO_LOCKED_BIT to distinguish between a

s/FOLIO_LOCKED_BIT/PAGEREF_LOCKED_BIT/

> (temporary) zero refcount and a locked (dead/frozen) state. Because now
> incrementing the counter doesn't affect its locked/unlocked state, it is
> possible to use an optimistic atomic_fetch_add() in
> page_ref_add_unless_zero() that operates independently of the locked bit.
> The locked state is handled after the increment attempt, eliminating the
> need for the CAS loop.

I don't think I follow.

Your trick with the PAGEREF_LOCKED_BIT helps with serialization against
page_ref_freeze(), but I don't think it does anything to serialize
against freeing the page under you.

Like, if the page is in the process of freeing, the page allocator sets its
refcount to zero and your version of page_ref_add_unless_zero()
successfully acquires a reference to the freed page.

How is it safe?

--
Kiryl Shutsemau / Kirill A. Shutemov
On 12/19/2025 5:50 PM, Kiryl Shutsemau wrote:
> On Fri, Dec 19, 2025 at 12:46:39PM +0000, Gladyshev Ilya wrote:
>> The current atomic-based page refcount implementation treats a zero
>> counter as dead and requires a compare-and-swap loop in folio_try_get()
>> to prevent incrementing a dead refcount. This CAS loop acts as a
>> serialization point and can become a significant bottleneck during
>> high-frequency file read operations.
>>
>> This patch introduces FOLIO_LOCKED_BIT to distinguish between a
>
> s/FOLIO_LOCKED_BIT/PAGEREF_LOCKED_BIT/

Ack, thanks

>> (temporary) zero refcount and a locked (dead/frozen) state. Because now
>> incrementing the counter doesn't affect its locked/unlocked state, it is
>> possible to use an optimistic atomic_fetch_add() in
>> page_ref_add_unless_zero() that operates independently of the locked bit.
>> The locked state is handled after the increment attempt, eliminating the
>> need for the CAS loop.
>
> I don't think I follow.
>
> Your trick with the PAGEREF_LOCKED_BIT helps with serialization against
> page_ref_freeze(), but I don't think it does anything to serialize
> against freeing the page under you.
>
> Like, if the page is in the process of freeing, the page allocator sets its
> refcount to zero and your version of page_ref_add_unless_zero()
> successfully acquires a reference to the freed page.
>
> How is it safe?

Page is freed only after a successful page_ref_dec_and_test() call, which
will set LOCKED_BIT. This bit will persist until set_page_count(1) is called
somewhere in the allocation path [alloc_pages()], and effectively block any
"use after free" users.
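(A minimal userspace C11 sketch of that lifecycle, just to illustrate the
argument; the names are mine, not the kernel API, and the counter is
unsigned here for simplicity. The get side is as in the patch.)

#include <stdatomic.h>
#include <stdbool.h>

#define REF_LOCKED_BIT (1u << 31)       /* stands in for PAGEREF_LOCKED_BIT */

/* Free side: once the count hits zero, the winner of the cmpxchg marks
 * the counter locked and is the only caller allowed to free the page. */
static bool put_and_test(atomic_uint *ref)
{
        unsigned int expected = 0;

        if (atomic_fetch_sub(ref, 1) - 1 != 0)
                return false;
        return atomic_compare_exchange_strong(ref, &expected, REF_LOCKED_BIT);
}

/* Alloc side: the set_page_count(1) equivalent; overwriting the counter
 * clears REF_LOCKED_BIT, so speculative gets can succeed again. */
static void ref_reinit(atomic_uint *ref)
{
        atomic_store(ref, 1);
}

/* Any speculative get in between sees REF_LOCKED_BIT set and fails. */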
On Fri, Dec 19, 2025 at 07:18:53PM +0300, Gladyshev Ilya wrote:
> On 12/19/2025 5:50 PM, Kiryl Shutsemau wrote:
> > On Fri, Dec 19, 2025 at 12:46:39PM +0000, Gladyshev Ilya wrote:
> > > The current atomic-based page refcount implementation treats a zero
> > > counter as dead and requires a compare-and-swap loop in folio_try_get()
> > > to prevent incrementing a dead refcount. This CAS loop acts as a
> > > serialization point and can become a significant bottleneck during
> > > high-frequency file read operations.
> > >
> > > This patch introduces FOLIO_LOCKED_BIT to distinguish between a
> >
> > s/FOLIO_LOCKED_BIT/PAGEREF_LOCKED_BIT/
> Ack, thanks
>
> > > (temporary) zero refcount and a locked (dead/frozen) state. Because now
> > > incrementing the counter doesn't affect its locked/unlocked state, it is
> > > possible to use an optimistic atomic_fetch_add() in
> > > page_ref_add_unless_zero() that operates independently of the locked bit.
> > > The locked state is handled after the increment attempt, eliminating the
> > > need for the CAS loop.
> >
> > I don't think I follow.
> >
> > Your trick with the PAGEREF_LOCKED_BIT helps with serialization against
> > page_ref_freeze(), but I don't think it does anything to serialize
> > against freeing the page under you.
> >
> > Like, if the page is in the process of freeing, the page allocator sets its
> > refcount to zero and your version of page_ref_add_unless_zero()
> > successfully acquires a reference to the freed page.
> >
> > How is it safe?
>
> Page is freed only after a successful page_ref_dec_and_test() call, which
> will set LOCKED_BIT. This bit will persist until set_page_count(1) is called
> somewhere in the allocation path [alloc_pages()], and effectively block any
> "use after free" users.

Okay, fair enough.

But what prevents the following scenario?

CPU0                                    CPU1
page_ref_dec_and_test()
  atomic_dec_and_test() // refcount=0
                                        page_ref_add_unless_zero()
                                          atomic_add_return() // refcount=1, no LOCKED_BIT
                                        page_ref_dec_and_test()
                                          atomic_dec_and_test() // refcount=0
                                          atomic_cmpxchg(0, LOCKED_BIT) // succeeds
  atomic_cmpxchg(0, LOCKED_BIT) // fails
  // return false to caller
  // Use-after-free: BOOM!

--
Kiryl Shutsemau / Kirill A. Shutemov
On 12/19/2025 8:46 PM, Kiryl Shutsemau wrote:
> On Fri, Dec 19, 2025 at 07:18:53PM +0300, Gladyshev Ilya wrote:
>> On 12/19/2025 5:50 PM, Kiryl Shutsemau wrote:
>>> On Fri, Dec 19, 2025 at 12:46:39PM +0000, Gladyshev Ilya wrote:
>>>> The current atomic-based page refcount implementation treats a zero
>>>> counter as dead and requires a compare-and-swap loop in folio_try_get()
>>>> to prevent incrementing a dead refcount. This CAS loop acts as a
>>>> serialization point and can become a significant bottleneck during
>>>> high-frequency file read operations.
>>>>
>>>> This patch introduces FOLIO_LOCKED_BIT to distinguish between a
>>>
>>> s/FOLIO_LOCKED_BIT/PAGEREF_LOCKED_BIT/
>> Ack, thanks
>>
>>>> (temporary) zero refcount and a locked (dead/frozen) state. Because now
>>>> incrementing the counter doesn't affect its locked/unlocked state, it is
>>>> possible to use an optimistic atomic_fetch_add() in
>>>> page_ref_add_unless_zero() that operates independently of the locked bit.
>>>> The locked state is handled after the increment attempt, eliminating the
>>>> need for the CAS loop.
>>>
>>> I don't think I follow.
>>>
>>> Your trick with the PAGEREF_LOCKED_BIT helps with serialization against
>>> page_ref_freeze(), but I don't think it does anything to serialize
>>> against freeing the page under you.
>>>
>>> Like, if the page is in the process of freeing, the page allocator sets its
>>> refcount to zero and your version of page_ref_add_unless_zero()
>>> successfully acquires a reference to the freed page.
>>>
>>> How is it safe?
>>
>> Page is freed only after a successful page_ref_dec_and_test() call, which
>> will set LOCKED_BIT. This bit will persist until set_page_count(1) is called
>> somewhere in the allocation path [alloc_pages()], and effectively block any
>> "use after free" users.
>
> Okay, fair enough.
>
> But what prevents the following scenario?
>
> CPU0                                    CPU1
> page_ref_dec_and_test()
>   atomic_dec_and_test() // refcount=0
>                                         page_ref_add_unless_zero()
>                                           atomic_add_return() // refcount=1, no LOCKED_BIT
>                                         page_ref_dec_and_test()
>                                           atomic_dec_and_test() // refcount=0
>                                           atomic_cmpxchg(0, LOCKED_BIT) // succeeds
>   atomic_cmpxchg(0, LOCKED_BIT) // fails
>   // return false to caller
>   // Use-after-free: BOOM!
>
But you can't trust that the page is safe to use after
page_ref_dec_and_test() returns false, if I understood your example
correctly. For example, the current implementation can also lead to this
'bug' if you slightly change the order of atomic ops in your example:
Initial refcount value: 1 from CPU 0
CPU 0                                   CPU 1
page_ref_dec_and_test()                 page_ref_add_unless_zero()
                                          atomic_add_return() [1 -> 2]
  atomic_dec_and_test() [2 -> 1]
                                        page_ref_dec_and_test()
                                          atomic_dec_and_test() [1 -> 0]
                                          /* page is logically freed here */
  return false [because 1 != 0]
  // Caller with use after free?
On Fri, Dec 19, 2025 at 10:08:54PM +0300, Gladyshev Ilya wrote:
> On 12/19/2025 8:46 PM, Kiryl Shutsemau wrote:
> > On Fri, Dec 19, 2025 at 07:18:53PM +0300, Gladyshev Ilya wrote:
> > > On 12/19/2025 5:50 PM, Kiryl Shutsemau wrote:
> > > > On Fri, Dec 19, 2025 at 12:46:39PM +0000, Gladyshev Ilya wrote:
> > > > > The current atomic-based page refcount implementation treats a zero
> > > > > counter as dead and requires a compare-and-swap loop in folio_try_get()
> > > > > to prevent incrementing a dead refcount. This CAS loop acts as a
> > > > > serialization point and can become a significant bottleneck during
> > > > > high-frequency file read operations.
> > > > >
> > > > > This patch introduces FOLIO_LOCKED_BIT to distinguish between a
> > > >
> > > > s/FOLIO_LOCKED_BIT/PAGEREF_LOCKED_BIT/
> > > Ack, thanks
> > >
> > > > > (temporary) zero refcount and a locked (dead/frozen) state. Because now
> > > > > incrementing the counter doesn't affect its locked/unlocked state, it is
> > > > > possible to use an optimistic atomic_fetch_add() in
> > > > > page_ref_add_unless_zero() that operates independently of the locked bit.
> > > > > The locked state is handled after the increment attempt, eliminating the
> > > > > need for the CAS loop.
> > > >
> > > > I don't think I follow.
> > > >
> > > > Your trick with the PAGEREF_LOCKED_BIT helps with serialization against
> > > > page_ref_freeze(), but I don't think it does anything to serialize
> > > > against freeing the page under you.
> > > >
> > > > Like, if the page is in the process of freeing, the page allocator sets its
> > > > refcount to zero and your version of page_ref_add_unless_zero()
> > > > successfully acquires a reference to the freed page.
> > > >
> > > > How is it safe?
> > >
> > > Page is freed only after a successful page_ref_dec_and_test() call, which
> > > will set LOCKED_BIT. This bit will persist until set_page_count(1) is called
> > > somewhere in the allocation path [alloc_pages()], and effectively block any
> > > "use after free" users.
> >
> > Okay, fair enough.
> >
> > But what prevents the following scenario?
> >
> > CPU0                                    CPU1
> > page_ref_dec_and_test()
> >   atomic_dec_and_test() // refcount=0
> >                                         page_ref_add_unless_zero()
> >                                           atomic_add_return() // refcount=1, no LOCKED_BIT
> >                                         page_ref_dec_and_test()
> >                                           atomic_dec_and_test() // refcount=0
> >                                           atomic_cmpxchg(0, LOCKED_BIT) // succeeds
> >   atomic_cmpxchg(0, LOCKED_BIT) // fails
> >   // return false to caller
> >   // Use-after-free: BOOM!
> >
> But you can't trust that the page is safe to use after
> page_ref_dec_and_test() returns false, if I understood your example
> correctly.

True. My bad.

--
Kiryl Shutsemau / Kirill A. Shutemov