From nobody Tue Jun 16 12:41:37 2026
From: Gorbunov Ivan
Subject: [PATCH v2 1/2] mm: drop page refcount zero state semantics
Date: Mon, 20 Apr 2026 08:01:18 +0000
Message-ID: <9fd8ebbc0f4f45be611bae0d03dd25dd994233c0.1776350895.git.gorbunov.ivan@h-partners.com>
X-Mailer: git-send-email 2.43.0
X-Mailing-List: linux-kernel@vger.kernel.org

Right now the 'zero' state can be interpreted in 2 ways:
1) Unfrozen page which right now has no explicit owner
2) Frozen page

These states can be 'logically' distinguished by operations such as
page_ref_add, page_ref_inc, etc. In the first state we would want the
counter to increase. For example one can write

	page = alloc_frozen_page(...);
	page_ref_inc(page);

But in the second state, increasing the counter of a frozen page shouldn't
be valid at all.

Another reason for the change is our other patch
(mm: implement page refcount locking via dedicated bit),
in which frozen pages do not have a 0 value in the refcount.

This patch proposes 2 changes:
1) Deprecate the invariant that the value stored in the reference count of
a frozen page is 0 (getter functions folio_ref_count/page_ref_count must
still return 0 for frozen pages)
2) Allow modification operations like page_ref_add to be used only with
pages with owners

We've looked at places where pages are allocated, and they are always
initialized via functions like set_page_count(page, 1). However, for
clarity, we've added a debug BUG_ON inside modification functions to
ensure that they are called only on pages with owners.
In the future, those checks can be improved by replacing the operations
with their result-returning analogs, if needed.

Co-developed-by: Gladyshev Ilya
Signed-off-by: Gladyshev Ilya
Signed-off-by: Gorbunov Ivan
Acked-by: Bjorn Helgaas # p2pdma.c
---
 drivers/pci/p2pdma.c               |  2 +-
 include/linux/page_ref.h           | 17 +++++++++++++++++
 kernel/liveupdate/kexec_handover.c |  2 +-
 mm/hugetlb.c                       |  2 +-
 mm/mm_init.c                       |  6 +++---
 mm/page_alloc.c                    |  4 ++--
 6 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index e0f546166eb8..e060ae7e1644 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -158,7 +158,7 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
 		 * because we don't want to trigger the
 		 * p2pdma_folio_free() path.
 		 */
-		set_page_count(page, 0);
+		set_page_count_as_frozen(page);
 		percpu_ref_put(ref);
 		return ret;
 	}
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 94d3f0e71c06..a7a07b61d2ae 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -62,6 +62,11 @@ static inline void __page_ref_unfreeze(struct page *page, int v)
 
 #endif
 
+static inline bool __page_count_is_frozen(int count)
+{
+	return count == 0;
+}
+
 static inline int page_ref_count(const struct page *page)
 {
 	return atomic_read(&page->_refcount);
@@ -115,8 +120,14 @@ static inline void init_page_count(struct page *page)
 	set_page_count(page, 1);
 }
 
+static inline void set_page_count_as_frozen(struct page *page)
+{
+	set_page_count(page, 0);
+}
+
 static inline void page_ref_add(struct page *page, int nr)
 {
+	VM_BUG_ON(__page_count_is_frozen(page_count(page)));
 	atomic_add(nr, &page->_refcount);
 	if (page_ref_tracepoint_active(page_ref_mod))
 		__page_ref_mod(page, nr);
@@ -129,6 +140,7 @@ static inline void folio_ref_add(struct folio *folio, int nr)
 
 static inline void page_ref_sub(struct page *page, int nr)
 {
+	VM_BUG_ON(__page_count_is_frozen(page_count(page)));
 	atomic_sub(nr, &page->_refcount);
 	if (page_ref_tracepoint_active(page_ref_mod))
 		__page_ref_mod(page, -nr);
@@ -142,6 +154,7 @@ static inline void folio_ref_sub(struct folio *folio, int nr)
 static inline int folio_ref_sub_return(struct folio *folio, int nr)
 {
 	int ret = atomic_sub_return(nr, &folio->_refcount);
+	VM_BUG_ON(__page_count_is_frozen(ret + nr));
 
 	if (page_ref_tracepoint_active(page_ref_mod_and_return))
 		__page_ref_mod_and_return(&folio->page, -nr, ret);
@@ -150,6 +163,7 @@ static inline int folio_ref_sub_return(struct folio *folio, int nr)
 
 static inline void page_ref_inc(struct page *page)
 {
+	VM_BUG_ON(__page_count_is_frozen(page_count(page)));
 	atomic_inc(&page->_refcount);
 	if (page_ref_tracepoint_active(page_ref_mod))
 		__page_ref_mod(page, 1);
@@ -162,6 +176,7 @@ static inline void folio_ref_inc(struct folio *folio)
 
 static inline void page_ref_dec(struct page *page)
 {
+	VM_BUG_ON(__page_count_is_frozen(page_count(page)));
 	atomic_dec(&page->_refcount);
 	if (page_ref_tracepoint_active(page_ref_mod))
 		__page_ref_mod(page, -1);
@@ -189,6 +204,7 @@ static inline int folio_ref_sub_and_test(struct folio *folio, int nr)
 static inline int page_ref_inc_return(struct page *page)
 {
 	int ret = atomic_inc_return(&page->_refcount);
+	VM_BUG_ON(__page_count_is_frozen(ret - 1));
 
 	if (page_ref_tracepoint_active(page_ref_mod_and_return))
 		__page_ref_mod_and_return(page, 1, ret);
@@ -217,6 +233,7 @@ static inline int folio_ref_dec_and_test(struct folio *folio)
 static inline int page_ref_dec_return(struct page *page)
 {
 	int ret = atomic_dec_return(&page->_refcount);
+	VM_BUG_ON(__page_count_is_frozen(ret + 1));
 
 	if (page_ref_tracepoint_active(page_ref_mod_and_return))
 		__page_ref_mod_and_return(page, -1, ret);
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index b64f36a45296..36c21f3d8250 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -390,7 +390,7 @@ static void kho_init_folio(struct page *page, unsigned int order)
 
 	/* For higher order folios, tail pages get a page count of zero. */
 	for (unsigned long i = 1; i < nr_pages; i++)
-		set_page_count(page + i, 0);
+		set_page_count_as_frozen(page + i);
 
 	if (order > 0)
 		prep_compound_page(page, order);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1d41fa3dd43e..b364fda29111 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3186,7 +3186,7 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
 	for (pfn = head_pfn + start_page_number; pfn < end_pfn; page++, pfn++) {
 		__init_single_page(page, pfn, zone, nid);
 		prep_compound_tail(page, &folio->page, order);
-		set_page_count(page, 0);
+		set_page_count_as_frozen(page);
 	}
 }
 
diff --git a/mm/mm_init.c b/mm/mm_init.c
index cec7bb758bdd..e4ec672a9f51 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1066,7 +1066,7 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	case MEMORY_DEVICE_PRIVATE:
 	case MEMORY_DEVICE_COHERENT:
 	case MEMORY_DEVICE_PCI_P2PDMA:
-		set_page_count(page, 0);
+		set_page_count_as_frozen(page);
 		break;
 
 	case MEMORY_DEVICE_GENERIC:
@@ -1112,7 +1112,7 @@ static void __ref memmap_init_compound(struct page *head,
 
 		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
 		prep_compound_tail(page, head, order);
-		set_page_count(page, 0);
+		set_page_count_as_frozen(page);
 	}
 	prep_compound_head(head, order);
 }
@@ -2250,7 +2250,7 @@ void __init init_cma_reserved_pageblock(struct page *page)
 
 	do {
 		__ClearPageReserved(p);
-		set_page_count(p, 0);
+		set_page_count_as_frozen(p);
 	} while (++p, --i);
 
 	init_pageblock_migratetype(page, MIGRATE_CMA, false);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 65e702fade61..27734cf795da 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1639,14 +1639,14 @@ void __meminit __free_pages_core(struct page *page, unsigned int order,
 		for (loop = 0; loop < nr_pages; loop++, p++) {
 			VM_WARN_ON_ONCE(PageReserved(p));
 			__ClearPageOffline(p);
-			set_page_count(p, 0);
+			set_page_count_as_frozen(p);
 		}
 
 		adjust_managed_page_count(page, nr_pages);
 	} else {
 		for (loop = 0; loop < nr_pages; loop++, p++) {
 			__ClearPageReserved(p);
-			set_page_count(p, 0);
+			set_page_count_as_frozen(p);
 		}
 
 		/* memblock adjusts totalram_pages() manually. */
-- 
2.43.0

From nobody Tue Jun 16 12:41:37 2026
From: Gorbunov Ivan
Subject: [PATCH v2 2/2] mm: implement page refcount locking via dedicated bit
Date: Mon, 20 Apr 2026 08:01:19 +0000
Message-ID: <9936cc799ac8b637ee58ae3bf6ec0e5eeb5306e9.1776350895.git.gorbunov.ivan@h-partners.com>

From: Gladyshev Ilya

The current atomic-based page refcount implementation treats a zero
counter as dead and requires a compare-and-swap loop in folio_try_get()
to prevent incrementing a dead refcount. This CAS loop acts as a
serialization point and can become a significant bottleneck during
high-frequency file read operations.

This patch introduces PAGEREF_FROZEN_BIT to distinguish between a
(temporary) zero refcount and a locked (dead/frozen) state. Because
incrementing the counter no longer affects its locked/unlocked state, it
is possible to use an optimistic atomic_add_return() in
page_ref_add_unless_zero() that operates independently of the locked bit.
The locked state is handled after the increment attempt, eliminating the
need for the CAS loop on the common path.

If the locked state is detected after the add and the counter has grown
to _PAGEREF_FROZEN_LIMIT or beyond, the pageref counter is reset to
PAGEREF_FROZEN_BIT with a CAS loop, eliminating the theoretical
possibility of overflow.
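
The difference between the two increment strategies described above can be
illustrated with a minimal user-space sketch. It is not kernel code: C11
atomics stand in for atomic_t, an unsigned counter is used to keep the bit
arithmetic simple, and the REF_* and ref_* names are hypothetical, chosen
only to mirror PAGEREF_FROZEN_BIT and _PAGEREF_FROZEN_LIMIT from this patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical user-space model, not kernel code: C11 atomics stand in
 * for atomic_t, and the REF_ and ref_ names are invented for illustration.
 */
#define REF_FROZEN_BIT   0x80000000u                   /* bit 31 */
#define REF_FROZEN_LIMIT (REF_FROZEN_BIT | (1u << 30)) /* reset threshold */

/* Old scheme: 0 means dead, so every increment goes through a CAS loop. */
static bool ref_add_unless_zero_cas(atomic_uint *ref, unsigned int nr)
{
	unsigned int old = atomic_load(ref);

	assert(nr > 0);
	do {
		if (old == 0)
			return false;	/* dead: must not be revived */
	} while (!atomic_compare_exchange_weak(ref, &old, old + nr));
	return true;
}

/* New scheme: add first, then inspect the frozen bit of the result. */
static bool ref_add_unless_frozen(atomic_uint *ref, unsigned int nr)
{
	unsigned int val;
	bool ok;

	assert(nr > 0);
	val = atomic_fetch_add(ref, nr) + nr;	/* value after the add */
	ok = !(val & REF_FROZEN_BIT);

	/*
	 * Frozen and grown large: fold the stray increments back so that
	 * repeated failed attempts can never carry out of bit 31.
	 */
	while (val >= REF_FROZEN_LIMIT) {
		if (atomic_compare_exchange_weak(ref, &val, REF_FROZEN_BIT))
			break;
	}
	return ok;
}
```

In this model a failed attempt on a frozen counter leaves its stray
increment in place until the limit is reached; that is harmless as long as
readers treat any value with the frozen bit set as zero.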

Co-developed-by: Gorbunov Ivan
Signed-off-by: Gorbunov Ivan
Signed-off-by: Gladyshev Ilya
Acked-by: Linus Torvalds
---
 include/linux/page-flags.h | 13 +++++++++++++
 include/linux/page_ref.h   | 28 ++++++++++++++++++++++++----
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 0e03d816e8b9..b3e3da91a90a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -196,6 +196,19 @@ enum pageflags {
 
 #define PAGEFLAGS_MASK		((1UL << NR_PAGEFLAGS) - 1)
 
+/* Most significant bit in page refcount */
+#define PAGEREF_FROZEN_BIT	BIT(31)
+
+/* Page reference counter can be in 3 logical states,
+ * which are described below with their value representation
+ * state                   | value
+ * (1) safe with owners    | 1...INT_MAX
+ * (2) safe with no owners | 0
+ * (3) frozen              | INT_MIN...-1
+ *
+ * State (2) can exist only temporarily, inside dec_and_test.
+ */
+
 #ifndef __GENERATING_BOUNDS_H
 
 /*
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index a7a07b61d2ae..32194e953674 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -64,12 +64,17 @@ static inline void __page_ref_unfreeze(struct page *page, int v)
 
 static inline bool __page_count_is_frozen(int count)
 {
-	return count == 0;
+	return count == 0 || (count & PAGEREF_FROZEN_BIT) != 0;
 }
 
 static inline int page_ref_count(const struct page *page)
 {
-	return atomic_read(&page->_refcount);
+	int val = atomic_read(&page->_refcount);
+
+	if (unlikely(val & PAGEREF_FROZEN_BIT))
+		return 0;
+
+	return val;
 }
 
 /**
@@ -191,6 +196,9 @@ static inline int page_ref_sub_and_test(struct page *page, int nr)
 {
 	int ret = atomic_sub_and_test(nr, &page->_refcount);
 
+	if (ret)
+		ret = !atomic_cmpxchg_relaxed(&page->_refcount, 0, PAGEREF_FROZEN_BIT);
+
 	if (page_ref_tracepoint_active(page_ref_mod_and_test))
 		__page_ref_mod_and_test(page, -nr, ret);
 	return ret;
@@ -220,6 +228,9 @@ static inline int page_ref_dec_and_test(struct page *page)
 {
 	int ret = atomic_dec_and_test(&page->_refcount);
 
+	if (ret)
+		ret = !atomic_cmpxchg_relaxed(&page->_refcount, 0, PAGEREF_FROZEN_BIT);
+
 	if (page_ref_tracepoint_active(page_ref_mod_and_test))
 		__page_ref_mod_and_test(page, -1, ret);
 	return ret;
@@ -245,9 +256,18 @@ static inline int folio_ref_dec_return(struct folio *folio)
 	return page_ref_dec_return(&folio->page);
 }
 
+#define _PAGEREF_FROZEN_LIMIT ((1 << 30) | PAGEREF_FROZEN_BIT)
+
 static inline bool page_ref_add_unless_zero(struct page *page, int nr)
 {
-	bool ret = atomic_add_unless(&page->_refcount, nr, 0);
+	bool ret = false;
+	int val = atomic_add_return(nr, &page->_refcount);
+	/* See PAGEREF_FROZEN_BIT declaration in page-flags.h for details */
+	ret = !(val & PAGEREF_FROZEN_BIT);
+
+	/* Undo atomic_add() if counter is locked and scary big */
+	while (unlikely((unsigned int)val >= _PAGEREF_FROZEN_LIMIT))
+		val = atomic_cmpxchg_relaxed(&page->_refcount, val, PAGEREF_FROZEN_BIT);
 
 	if (page_ref_tracepoint_active(page_ref_mod_unless))
 		__page_ref_mod_unless(page, nr, ret);
@@ -282,7 +302,7 @@ static inline bool folio_ref_try_add(struct folio *folio, int count)
 
 static inline int page_ref_freeze(struct page *page, int count)
 {
-	int ret = likely(atomic_cmpxchg(&page->_refcount, count, 0) == count);
+	int ret = likely(atomic_cmpxchg(&page->_refcount, count, PAGEREF_FROZEN_BIT) == count);
 
 	if (page_ref_tracepoint_active(page_ref_freeze))
 		__page_ref_freeze(page, count, ret);
-- 
2.43.0
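
The state encoding and the release path of patch 2/2 (the page-flags.h
comment, page_ref_count(), page_ref_dec_and_test() and page_ref_freeze())
can be modeled in user space as a rough sketch. The assumptions are the same
as for any such model: C11 atomics replace atomic_t, the counter is unsigned,
and every REF_* and ref_* name is hypothetical rather than kernel API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space model; names are illustrative, not kernel API. */
#define REF_FROZEN_BIT 0x80000000u	/* bit 31 */

enum ref_state { REF_OWNED, REF_UNOWNED, REF_FROZEN };

/* Map a raw counter value onto the three states of the encoding. */
static enum ref_state ref_classify(unsigned int raw)
{
	if (raw & REF_FROZEN_BIT)
		return REF_FROZEN;
	return raw ? REF_OWNED : REF_UNOWNED;
}

/* What a getter reports: any frozen value reads as 0. */
static unsigned int ref_read(atomic_uint *ref)
{
	unsigned int raw = atomic_load(ref);

	return (raw & REF_FROZEN_BIT) ? 0 : raw;
}

/* Drop one reference; true only if this caller also won 0 -> frozen. */
static bool ref_dec_and_test(atomic_uint *ref)
{
	unsigned int expected = 0;

	/* Debug check in the spirit of VM_BUG_ON: only owned counters. */
	assert(ref_classify(atomic_load(ref)) == REF_OWNED);
	if (atomic_fetch_sub(ref, 1) != 1)	/* old value 1 means we hit 0 */
		return false;
	/* Fails if an optimistic increment revived the counter meanwhile. */
	return atomic_compare_exchange_strong(ref, &expected, REF_FROZEN_BIT);
}

/* Freeze a counter that currently holds exactly `count` references. */
static bool ref_freeze(atomic_uint *ref, unsigned int count)
{
	return atomic_compare_exchange_strong(ref, &count, REF_FROZEN_BIT);
}
```

The second compare-and-swap in ref_dec_and_test() is what keeps state (2)
transient: between the decrement reaching zero and the swap to the frozen
value, a concurrent optimistic increment may still take the counter back to
state (1), in which case the caller must not free the object.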