[PATCH v3 2/3] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio

Posted by Jiaqi Yan 4 weeks ago
At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
becomes a non-HugeTLB folio and is released to the buddy allocator
as a high-order folio, e.g. a folio that contains 262144 pages
if it was a 1G HugeTLB hugepage.

This is problematic if the HugeTLB hugepage contained HWPoison
subpages. In that case, since the buddy allocator does not check
HWPoison for non-zero-order folios, the raw HWPoison page can
be handed out together with its buddy pages and re-used by either
the kernel or userspace.

Memory failure recovery (MFR) in the kernel does attempt to take
the raw HWPoison page off the buddy allocator after
dissolve_free_hugetlb_folio(). However, there is always a time
window between dissolve_free_hugetlb_folio() freeing a HWPoison
high-order folio to the buddy allocator and MFR taking the HWPoison
raw page off the buddy allocator.

One obvious way to avoid this problem is to add page sanity
checks in the page allocation or free path. However, that goes
against past efforts to reduce sanity-check overhead [1,2,3].

Introduce free_has_hwpoisoned() to free only the healthy pages
in the high-order folio and to exclude the HWPoison ones.
The idea is to iterate through the subpages of the folio to
identify contiguous ranges of healthy pages. Instead of freeing
pages one by one, decompose each healthy range into the largest
possible blocks of different orders. Every block meets the
requirements to be freed via __free_one_page().

free_has_hwpoisoned() has linear time complexity with respect to
the number of pages in the folio. While the power-of-two
decomposition ensures that the number of calls into the buddy
allocator is logarithmic for each contiguous healthy range, the
mandatory linear scan of pages to identify PageHWPoison() defines
the overall time complexity. For a 1G hugepage containing several
HWPoison pages, free_has_hwpoisoned() takes around 2ms on
average.
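
As an illustration only (not part of the patch), here is a minimal
standalone C sketch of the decomposition that free_contiguous_pages()
below performs; the starting PFN and range length are hypothetical:

/*
 * Illustration: decompose the healthy PFN range [pfn, pfn + nr_pages)
 * into the largest naturally aligned power-of-two blocks, mirroring
 * the ffs()/fls_long() order selection in free_contiguous_pages().
 */
#include <stdio.h>
#include <strings.h>	/* ffs(); assumes the PFN fits in an int here */

static unsigned int fls_long(unsigned long x)
{
	return x ? 8 * sizeof(unsigned long) - __builtin_clzl(x) : 0;
}

static void decompose(unsigned long pfn, unsigned long nr_pages)
{
	const unsigned long end_pfn = pfn + nr_pages;

	while (pfn < end_pfn) {
		unsigned int align_order = ffs(pfn) - 1;
		unsigned int size_order = fls_long(end_pfn - pfn) - 1;
		unsigned int order = align_order < size_order ?
				     align_order : size_order;

		/* in the kernel this block would go to free_one_page() */
		printf("pfn %#lx: order-%u block (%lu pages)\n",
		       pfn, order, 1UL << order);
		pfn += 1UL << order;
	}
}

int main(void)
{
	/* e.g. 511 healthy pages right after a HWPoison page at 0x100000 */
	decompose(0x100001, 511);
	return 0;
}

For this hypothetical input the blocks come out as orders 0 through 8,
i.e. 9 calls for 511 pages, matching the logarithmic bound above.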

Since free_has_hwpoisoned() has nontrivial overhead, it is
wrapped inside free_pages_prepare_has_hwpoisoned() and performed
only when PG_has_hwpoisoned indicates a HWPoison page exists and
only after free_pages_prepare() has succeeded.

[1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
[2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
[3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 mm/page_alloc.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 154 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 822e05f1a9646..9393589118604 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -215,6 +215,9 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
 unsigned int pageblock_order __read_mostly;
 #endif
 
+static bool free_pages_prepare_has_hwpoisoned(struct page *page,
+					      unsigned int order,
+					      fpi_t fpi_flags);
 static void __free_pages_ok(struct page *page, unsigned int order,
 			    fpi_t fpi_flags);
 
@@ -1568,8 +1571,10 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 	unsigned long pfn = page_to_pfn(page);
 	struct zone *zone = page_zone(page);
 
-	if (free_pages_prepare(page, order))
-		free_one_page(zone, page, pfn, order, fpi_flags);
+	if (!free_pages_prepare_has_hwpoisoned(page, order, fpi_flags))
+		return;
+
+	free_one_page(zone, page, pfn, order, fpi_flags);
 }
 
 void __meminit __free_pages_core(struct page *page, unsigned int order,
@@ -2923,6 +2928,152 @@ static bool free_frozen_page_commit(struct zone *zone,
 	return ret;
 }
 
+/*
+ * Given a range of physically contiguous pages, efficiently free them
+ * block by block. Block order is chosen to meet the PFN alignment
+ * requirement in __free_one_page().
+ */
+static void free_contiguous_pages(struct page *curr, unsigned long nr_pages,
+				  fpi_t fpi_flags)
+{
+	unsigned int order;
+	unsigned int align_order;
+	unsigned int size_order;
+	unsigned long remaining;
+	unsigned long pfn = page_to_pfn(curr);
+	const unsigned long end_pfn = pfn + nr_pages;
+	struct zone *zone = page_zone(curr);
+
+	/*
+	 * This decomposition algorithm at every iteration chooses the
+	 * order to be the minimum of two constraints:
+	 * - Alignment: the largest power-of-two that divides current pfn.
+	 * - Size: the largest power-of-two that fits in the current
+	 *   remaining number of pages.
+	 */
+	while (pfn < end_pfn) {
+		remaining = end_pfn - pfn;
+		align_order = ffs(pfn) - 1;
+		size_order = fls_long(remaining) - 1;
+		order = min(align_order, size_order);
+
+		free_one_page(zone, curr, pfn, order, fpi_flags);
+		curr += (1UL << order);
+		pfn += (1UL << order);
+	}
+
+	VM_WARN_ON(pfn != end_pfn);
+}
+
+/*
+ * Given a high-order compound page containing a certain number of HWPoison
+ * pages, free only the healthy ones to the buddy allocator.
+ *
+ * Pages must have passed free_pages_prepare(). Even when HWPoison
+ * pages are present, breaking down the compound page and updating
+ * metadata (e.g. page owner, alloc tag) can be done together during
+ * free_pages_prepare(), which simplifies the splitting here: unlike
+ * __split_unmapped_folio(), there is no need to turn split pages into
+ * a compound page or to carry metadata.
+ *
+ * It calls free_one_page() O(2^order) times and causes nontrivial overhead.
+ * So only use this when the compound page really contains HWPoison.
+ *
+ * This implementation doesn't work in memdesc world.
+ */
+static void free_has_hwpoisoned(struct page *page, unsigned int order,
+				fpi_t fpi_flags)
+{
+	struct page *curr = page;
+	struct page *next;
+	unsigned long nr_pages;
+	/*
+	 * Don't assume end points to a valid page. It is only used
+	 * here for pointer arithmetic.
+	 */
+	struct page *end = page + (1 << order);
+	unsigned long total_freed = 0;
+	unsigned long total_hwp = 0;
+
+	VM_WARN_ON(order == 0);
+	VM_WARN_ON(page->flags.f & PAGE_FLAGS_CHECK_AT_PREP);
+
+	while (curr < end) {
+		next = curr;
+		nr_pages = 0;
+
+		while (next < end && !PageHWPoison(next)) {
+			++next;
+			++nr_pages;
+		}
+
+		if (next != end && PageHWPoison(next)) {
+			clear_page_tag_ref(next);
+			++total_hwp;
+		}
+
+		free_contiguous_pages(curr, nr_pages, fpi_flags);
+		total_freed += nr_pages;
+		if (next == end)
+			break;
+
+		curr = PageHWPoison(next) ? next + 1 : next;
+	}
+
+	VM_WARN_ON(total_freed + total_hwp != (1 << order));
+	pr_info("Freed %#lx pages, excluded %lu hwpoison pages\n",
+		total_freed, total_hwp);
+}
+
+static bool compound_has_hwpoisoned(struct page *page, unsigned int order)
+{
+	if (order == 0 || !PageCompound(page))
+		return false;
+
+	return folio_test_has_hwpoisoned(page_folio(page));
+}
+
+/*
+ * Do free_has_hwpoisoned() when needed after free_pages_prepare().
+ * Returns
+ * - true: free_pages_prepare() is good and caller can proceed freeing.
+ * - false: caller should not free pages for one of the two reasons:
+ *   1. free_pages_prepare() failed so it is not safe to proceed freeing.
+ *   2. this is a compound page having some HWPoison pages, and healthy
+ *      pages are already safely freed.
+ */
+static bool free_pages_prepare_has_hwpoisoned(struct page *page,
+					      unsigned int order,
+					      fpi_t fpi_flags)
+{
+	/*
+	 * free_pages_prepare() clears PAGE_FLAGS_SECOND flags on the
+	 * first tail page of a compound page, which clears PG_has_hwpoisoned.
+	 * So this call must be before free_pages_prepare().
+	 *
+	 * Note we can't exclude PG_has_hwpoisoned from PAGE_FLAGS_SECOND:
+	 * because PG_has_hwpoisoned == PG_active, free_page_is_bad() would
+	 * get confused and complain that the first tail page is still active.
+	 */
+	bool should_fhh = compound_has_hwpoisoned(page, order);
+
+	if (!free_pages_prepare(page, order))
+		return false;
+
+	/*
+	 * After free_pages_prepare() breaks down compound page and deals
+	 * with page metadata (e.g. page owner and page alloc tags),
+	 * free_has_hwpoisoned() can directly use free_one_page() whenever
+	 * it knows the appropriate orders of page blocks to free.
+	 */
+	if (should_fhh) {
+		free_has_hwpoisoned(page, order, fpi_flags);
+		return false;
+	}
+
+	return true;
+}
+
 /*
  * Free a pcp page
  */
@@ -2940,7 +3091,7 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
 		return;
 	}
 
-	if (!free_pages_prepare(page, order))
+	if (!free_pages_prepare_has_hwpoisoned(page, order, fpi_flags))
 		return;
 
 	/*
-- 
2.52.0.457.g6b5491de43-goog
Re: [PATCH v3 2/3] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio
Posted by David Hildenbrand (Red Hat) 3 weeks, 3 days ago
On 1/12/26 01:49, Jiaqi Yan wrote:
> At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
> becomes non-HugeTLB, and it is released to buddy allocator
> as a high-order folio, e.g. a folio that contains 262144 pages
> if the folio was a 1G HugeTLB hugepage.
> 
> This is problematic if the HugeTLB hugepage contained HWPoison
> subpages. In that case, since buddy allocator does not check
> HWPoison for non-zero-order folio, the raw HWPoison page can
> be given out with its buddy page and be re-used by either
> kernel or userspace.


Do we really have to have all that complexity in free_frozen_pages()?

Can't we hook into __update_and_free_hugetlb_folio() and just free the 
chunks there?

> 
> Memory failure recovery (MFR) in kernel does attempt to take
> raw HWPoison page off buddy allocator after
> dissolve_free_hugetlb_folio(). However, there is always a time
> window between dissolve_free_hugetlb_folio() frees a HWPoison
> high-order folio to buddy allocator and MFR takes HWPoison
> raw page off buddy allocator.

I wonder whether marking the pageblock as isolated before freeing it 
could work?

In that case, nobody will be able to allocate the page before we 
un-isolate it.

Just a thought: but when you are dealing with a possible race, you can 
avoid that race by prohibiting the intermediate allocation from 
happening in the first place.


Also, this is a lot of complexity. Was this issue already hit in the 
past or is it purely theoretical?

-- 
Cheers

David
Re: [PATCH v3 2/3] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio
Posted by Jiaqi Yan 2 weeks, 1 day ago
Sorry for the late reply, Harry, Zi, Miaohe, and David. I was occupied
by some other duties in the past two weeks.

On Thu, Jan 15, 2026 at 12:43 PM David Hildenbrand (Red Hat)
<david@kernel.org> wrote:
>
> On 1/12/26 01:49, Jiaqi Yan wrote:
> > At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
> > becomes non-HugeTLB, and it is released to buddy allocator
> > as a high-order folio, e.g. a folio that contains 262144 pages
> > if the folio was a 1G HugeTLB hugepage.
> >
> > This is problematic if the HugeTLB hugepage contained HWPoison
> > subpages. In that case, since buddy allocator does not check
> > HWPoison for non-zero-order folio, the raw HWPoison page can
> > be given out with its buddy page and be re-used by either
> > kernel or userspace.
>
>
> Do we really have to have all that complexity in free_frozen_pages().
>
> Can't we hook into __update_and_free_hugetlb_folio() and just free the
> chunks there?

I don't know, but I imagine that to get chunks you will need the ability
to split an in-use (not free_huge_folio()-ed) hugetlb folio, like splitting
a THP folio? Won't that be more complicated and very specific to
hugetlb? This solution is implemented in free_frozen_pages() to avoid
that heavy-lifting splitting.

Also, doing it in __update_and_free_hugetlb_folio() definitely won't help
with [1]: handling a previously split-failed and now to-be-freed
PG_has_hwpoisoned THP. This solution works for that case.

[1] https://lore.kernel.org/linux-mm/CACw3F529=PC-pwXOX0gbNrnS7HTwXq93oVT=V74J4FHLqcZ-ug@mail.gmail.com/T/#m202f91883b9b70c0346c3076db5b341d02d3f348

>
> >
> > Memory failure recovery (MFR) in kernel does attempt to take
> > raw HWPoison page off buddy allocator after
> > dissolve_free_hugetlb_folio(). However, there is always a time
> > window between dissolve_free_hugetlb_folio() frees a HWPoison
> > high-order folio to buddy allocator and MFR takes HWPoison
> > raw page off buddy allocator.
>
> I wonder whether marking the pageblock as isolated before freeing it
> could work?

Maybe... but where can the pageblock be isolated? Is there an existing
candidate, or would we need to invent a scribble pad to stash it?

And then who should un-isolate the pageblock, and when, if we don't
want to leak it?

>
> In that case, nobody will be able to allocate the page before we
> un-isolate it.
>
> Just a thought: but when you are dealing with a possible race, you can
> avoid that race by prohibiting the intermediate allocation from
> happening in the first place.
>
>
> Also, this is a lot of complexity. Was this issue already hit in the
> past or is it purely theoretical?

Theoretical for now. But while working on [2] with William and Harry,
the issue could be reproduced fairly easily [3].

[2] https://lore.kernel.org/linux-mm/20251116013223.1557158-3-jiaqiyan@google.com/T/#m87c95ddf0b9397b409a9f8fac4e772ecc9abf209
[3] https://lore.kernel.org/linux-mm/20250919155832.1084091-1-william.roche@oracle.com


>
> --
> Cheers
>
> David
Re: [PATCH v3 2/3] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio
Posted by Miaohe Lin 3 weeks, 4 days ago
On 2026/1/12 8:49, Jiaqi Yan wrote:
> At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
> becomes non-HugeTLB, and it is released to buddy allocator
> as a high-order folio, e.g. a folio that contains 262144 pages
> if the folio was a 1G HugeTLB hugepage.
> 
> This is problematic if the HugeTLB hugepage contained HWPoison
> subpages. In that case, since buddy allocator does not check
> HWPoison for non-zero-order folio, the raw HWPoison page can
> be given out with its buddy page and be re-used by either
> kernel or userspace.
> 
> Memory failure recovery (MFR) in kernel does attempt to take
> raw HWPoison page off buddy allocator after
> dissolve_free_hugetlb_folio(). However, there is always a time
> window between dissolve_free_hugetlb_folio() frees a HWPoison
> high-order folio to buddy allocator and MFR takes HWPoison
> raw page off buddy allocator.
> 
> One obvious way to avoid this problem is to add page sanity
> checks in page allocate or free path. However, it is against
> the past efforts to reduce sanity check overhead [1,2,3].
> 
> Introduce free_has_hwpoisoned() to only free the healthy pages
> and to exclude the HWPoison ones in the high-order folio.
> The idea is to iterate through the sub-pages of the folio to
> identify contiguous ranges of healthy pages. Instead of freeing
> pages one by one, decompose healthy ranges into the largest
> possible blocks having different orders. Every block meets the
> requirements to be freed via __free_one_page().
> 
> free_has_hwpoisoned() has linear time complexity wrt the number
> of pages in the folio. While the power-of-two decomposition
> ensures that the number of calls to the buddy allocator is
> logarithmic for each contiguous healthy range, the mandatory
> linear scan of pages to identify PageHWPoison() defines the
> overall time complexity. For a 1G hugepage having several
> HWPoison pages, free_has_hwpoisoned() takes around 2ms on
> average.
> 
> Since free_has_hwpoisoned() has nontrivial overhead, it is
> wrapped inside free_pages_prepare_has_hwpoisoned() and done
> only PG_has_hwpoisoned indicates HWPoison page exists and
> after free_pages_prepare() succeeded.
> 
> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> 
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>

Thanks for your patch. This patch looks good to me. A few nits below.

> ---
>  mm/page_alloc.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 154 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 822e05f1a9646..9393589118604 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -215,6 +215,9 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
>  unsigned int pageblock_order __read_mostly;
>  #endif
>  
> +static bool free_pages_prepare_has_hwpoisoned(struct page *page,
> +					      unsigned int order,
> +					      fpi_t fpi_flags);
>  static void __free_pages_ok(struct page *page, unsigned int order,
>  			    fpi_t fpi_flags);
>  
> @@ -1568,8 +1571,10 @@ static void __free_pages_ok(struct page *page, unsigned int order,
>  	unsigned long pfn = page_to_pfn(page);
>  	struct zone *zone = page_zone(page);
>  
> -	if (free_pages_prepare(page, order))
> -		free_one_page(zone, page, pfn, order, fpi_flags);
> +	if (!free_pages_prepare_has_hwpoisoned(page, order, fpi_flags))
> +		return;
> +
> +	free_one_page(zone, page, pfn, order, fpi_flags);

It might be better to write as:

if (free_pages_prepare_has_hwpoisoned(page, order, fpi_flags))
	free_one_page(zone, page, pfn, order, fpi_flags);

just like previous one.

>  }
>  
>  void __meminit __free_pages_core(struct page *page, unsigned int order,
> @@ -2923,6 +2928,152 @@ static bool free_frozen_page_commit(struct zone *zone,
>  	return ret;
>  }
>  
> +/*
> + * Given a range of physically contiguous pages, efficiently free them
> + * block by block. Block order is chosen to meet the PFN alignment
> + * requirement in __free_one_page().
> + */
> +static void free_contiguous_pages(struct page *curr, unsigned long nr_pages,
> +				  fpi_t fpi_flags)
> +{
> +	unsigned int order;
> +	unsigned int align_order;
> +	unsigned int size_order;
> +	unsigned long remaining;
> +	unsigned long pfn = page_to_pfn(curr);
> +	const unsigned long end_pfn = pfn + nr_pages;
> +	struct zone *zone = page_zone(curr);
> +
> +	/*
> +	 * This decomposition algorithm at every iteration chooses the
> +	 * order to be the minimum of two constraints:
> +	 * - Alignment: the largest power-of-two that divides current pfn.
> +	 * - Size: the largest power-of-two that fits in the current
> +	 *   remaining number of pages.
> +	 */
> +	while (pfn < end_pfn) {
> +		remaining = end_pfn - pfn;
> +		align_order = ffs(pfn) - 1;
> +		size_order = fls_long(remaining) - 1;
> +		order = min(align_order, size_order);
> +
> +		free_one_page(zone, curr, pfn, order, fpi_flags);
> +		curr += (1UL << order);
> +		pfn += (1UL << order);
> +	}
> +
> +	VM_WARN_ON(pfn != end_pfn);
> +}
> +
> +/*
> + * Given a high-order compound page containing certain number of HWPoison
> + * pages, free only the healthy ones to buddy allocator.
> + *
> + * Pages must have passed free_pages_prepare(). Even if having HWPoison
> + * pages, breaking down compound page and updating metadata (e.g. page
> + * owner, alloc tag) can be done together during free_pages_prepare(),
> + * which simplifies the splitting here: unlike __split_unmapped_folio(),
> + * there is no need to turn split pages into a compound page or to carry
> + * metadata.
> + *
> + * It calls free_one_page O(2^order) times and cause nontrivial overhead.
> + * So only use this when the compound page really contains HWPoison.
> + *
> + * This implementation doesn't work in memdesc world.
> + */
> +static void free_has_hwpoisoned(struct page *page, unsigned int order,
> +				fpi_t fpi_flags)
> +{
> +	struct page *curr = page;
> +	struct page *next;
> +	unsigned long nr_pages;
> +	/*
> +	 * Don't assume end points to a valid page. It is only used
> +	 * here for pointer arithmetic.
> +	 */
> +	struct page *end = page + (1 << order);
> +	unsigned long total_freed = 0;
> +	unsigned long total_hwp = 0;
> +
> +	VM_WARN_ON(order == 0);
> +	VM_WARN_ON(page->flags.f & PAGE_FLAGS_CHECK_AT_PREP);
> +
> +	while (curr < end) {
> +		next = curr;
> +		nr_pages = 0;
> +
> +		while (next < end && !PageHWPoison(next)) {
> +			++next;
> +			++nr_pages;
> +		}
> +
> +		if (next != end && PageHWPoison(next)) {

A comment on why clear_page_tag_ref() is needed here would be helpful.

> +			clear_page_tag_ref(next);
> +			++total_hwp;
> +		}
> +
> +		free_contiguous_pages(curr, nr_pages, fpi_flags);
> +		total_freed += nr_pages;
> +		if (next == end)
> +			break;
> +
> +		curr = PageHWPoison(next) ? next + 1 : next;

IIUC, when the code reaches here, we must have found a hwpoison page,
since the loop already breaks when next equals end. So I think
PageHWPoison(next) is always true and the above code can be simplified to:

	curr = next + 1;

Thanks.
.
Re: [PATCH v3 2/3] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio
Posted by Jiaqi Yan 2 weeks, 1 day ago
On Wed, Jan 14, 2026 at 7:05 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2026/1/12 8:49, Jiaqi Yan wrote:
> > At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
> > becomes non-HugeTLB, and it is released to buddy allocator
> > as a high-order folio, e.g. a folio that contains 262144 pages
> > if the folio was a 1G HugeTLB hugepage.
> >
> > This is problematic if the HugeTLB hugepage contained HWPoison
> > subpages. In that case, since buddy allocator does not check
> > HWPoison for non-zero-order folio, the raw HWPoison page can
> > be given out with its buddy page and be re-used by either
> > kernel or userspace.
> >
> > Memory failure recovery (MFR) in kernel does attempt to take
> > raw HWPoison page off buddy allocator after
> > dissolve_free_hugetlb_folio(). However, there is always a time
> > window between dissolve_free_hugetlb_folio() frees a HWPoison
> > high-order folio to buddy allocator and MFR takes HWPoison
> > raw page off buddy allocator.
> >
> > One obvious way to avoid this problem is to add page sanity
> > checks in page allocate or free path. However, it is against
> > the past efforts to reduce sanity check overhead [1,2,3].
> >
> > Introduce free_has_hwpoisoned() to only free the healthy pages
> > and to exclude the HWPoison ones in the high-order folio.
> > The idea is to iterate through the sub-pages of the folio to
> > identify contiguous ranges of healthy pages. Instead of freeing
> > pages one by one, decompose healthy ranges into the largest
> > possible blocks having different orders. Every block meets the
> > requirements to be freed via __free_one_page().
> >
> > free_has_hwpoisoned() has linear time complexity wrt the number
> > of pages in the folio. While the power-of-two decomposition
> > ensures that the number of calls to the buddy allocator is
> > logarithmic for each contiguous healthy range, the mandatory
> > linear scan of pages to identify PageHWPoison() defines the
> > overall time complexity. For a 1G hugepage having several
> > HWPoison pages, free_has_hwpoisoned() takes around 2ms on
> > average.
> >
> > Since free_has_hwpoisoned() has nontrivial overhead, it is
> > wrapped inside free_pages_prepare_has_hwpoisoned() and done
> > only PG_has_hwpoisoned indicates HWPoison page exists and
> > after free_pages_prepare() succeeded.
> >
> > [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
> > [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
> > [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> >
> > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
>
> Thanks for your patch. This patch looks good to me. A few nits below.

Thanks for taking a look, Miaohe!

>
> > ---
> >  mm/page_alloc.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 154 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 822e05f1a9646..9393589118604 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -215,6 +215,9 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
> >  unsigned int pageblock_order __read_mostly;
> >  #endif
> >
> > +static bool free_pages_prepare_has_hwpoisoned(struct page *page,
> > +                                           unsigned int order,
> > +                                           fpi_t fpi_flags);
> >  static void __free_pages_ok(struct page *page, unsigned int order,
> >                           fpi_t fpi_flags);
> >
> > @@ -1568,8 +1571,10 @@ static void __free_pages_ok(struct page *page, unsigned int order,
> >       unsigned long pfn = page_to_pfn(page);
> >       struct zone *zone = page_zone(page);
> >
> > -     if (free_pages_prepare(page, order))
> > -             free_one_page(zone, page, pfn, order, fpi_flags);
> > +     if (!free_pages_prepare_has_hwpoisoned(page, order, fpi_flags))
> > +             return;
> > +
> > +     free_one_page(zone, page, pfn, order, fpi_flags);
>
> It might be better to write as:
>
> if (free_pages_prepare_has_hwpoisoned(page, order, fpi_flags))
>         free_one_page(zone, page, pfn, order, fpi_flags);
>
> just like previous one.

Ack, but I probably won't need this change after merging
free_pages_prepare_has_hwpoisoned() into free_pages_prepare().

>
> >  }
> >
> >  void __meminit __free_pages_core(struct page *page, unsigned int order,
> > @@ -2923,6 +2928,152 @@ static bool free_frozen_page_commit(struct zone *zone,
> >       return ret;
> >  }
> >
> > +/*
> > + * Given a range of physically contiguous pages, efficiently free them
> > + * block by block. Block order is chosen to meet the PFN alignment
> > + * requirement in __free_one_page().
> > + */
> > +static void free_contiguous_pages(struct page *curr, unsigned long nr_pages,
> > +                               fpi_t fpi_flags)
> > +{
> > +     unsigned int order;
> > +     unsigned int align_order;
> > +     unsigned int size_order;
> > +     unsigned long remaining;
> > +     unsigned long pfn = page_to_pfn(curr);
> > +     const unsigned long end_pfn = pfn + nr_pages;
> > +     struct zone *zone = page_zone(curr);
> > +
> > +     /*
> > +      * This decomposition algorithm at every iteration chooses the
> > +      * order to be the minimum of two constraints:
> > +      * - Alignment: the largest power-of-two that divides current pfn.
> > +      * - Size: the largest power-of-two that fits in the current
> > +      *   remaining number of pages.
> > +      */
> > +     while (pfn < end_pfn) {
> > +             remaining = end_pfn - pfn;
> > +             align_order = ffs(pfn) - 1;
> > +             size_order = fls_long(remaining) - 1;
> > +             order = min(align_order, size_order);
> > +
> > +             free_one_page(zone, curr, pfn, order, fpi_flags);
> > +             curr += (1UL << order);
> > +             pfn += (1UL << order);
> > +     }
> > +
> > +     VM_WARN_ON(pfn != end_pfn);
> > +}
> > +
> > +/*
> > + * Given a high-order compound page containing certain number of HWPoison
> > + * pages, free only the healthy ones to buddy allocator.
> > + *
> > + * Pages must have passed free_pages_prepare(). Even if having HWPoison
> > + * pages, breaking down compound page and updating metadata (e.g. page
> > + * owner, alloc tag) can be done together during free_pages_prepare(),
> > + * which simplifies the splitting here: unlike __split_unmapped_folio(),
> > + * there is no need to turn split pages into a compound page or to carry
> > + * metadata.
> > + *
> > + * It calls free_one_page O(2^order) times and cause nontrivial overhead.
> > + * So only use this when the compound page really contains HWPoison.
> > + *
> > + * This implementation doesn't work in memdesc world.
> > + */
> > +static void free_has_hwpoisoned(struct page *page, unsigned int order,
> > +                             fpi_t fpi_flags)
> > +{
> > +     struct page *curr = page;
> > +     struct page *next;
> > +     unsigned long nr_pages;
> > +     /*
> > +      * Don't assume end points to a valid page. It is only used
> > +      * here for pointer arithmetic.
> > +      */
> > +     struct page *end = page + (1 << order);
> > +     unsigned long total_freed = 0;
> > +     unsigned long total_hwp = 0;
> > +
> > +     VM_WARN_ON(order == 0);
> > +     VM_WARN_ON(page->flags.f & PAGE_FLAGS_CHECK_AT_PREP);
> > +
> > +     while (curr < end) {
> > +             next = curr;
> > +             nr_pages = 0;
> > +
> > +             while (next < end && !PageHWPoison(next)) {
> > +                     ++next;
> > +                     ++nr_pages;
> > +             }
> > +
> > +             if (next != end && PageHWPoison(next)) {
>
> A comment why clear_page_tag_ref is needed here should be helpful.

Ack.

>
> > +                     clear_page_tag_ref(next);
> > +                     ++total_hwp;
> > +             }
> > +
> > +             free_contiguous_pages(curr, nr_pages, fpi_flags);
> > +             total_freed += nr_pages;
> > +             if (next == end)
> > +                     break;
> > +
> > +             curr = PageHWPoison(next) ? next + 1 : next;
>
> IIUC, when code reaches here, we must have found a hwpoison page or next will equal to end.
> So I think PageHWPoison(next) is always true and above code can be simplified as:
>
>         curr = next + 1;

Yeah, good catch! Will simplify.

>
> Thanks.
> .
Re: [PATCH v3 2/3] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio
Posted by Harry Yoo 3 weeks, 5 days ago
On Mon, Jan 12, 2026 at 12:49:22AM +0000, Jiaqi Yan wrote:
> At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
> becomes non-HugeTLB, and it is released to buddy allocator
> as a high-order folio, e.g. a folio that contains 262144 pages
> if the folio was a 1G HugeTLB hugepage.
> 
> This is problematic if the HugeTLB hugepage contained HWPoison
> subpages. In that case, since buddy allocator does not check
> HWPoison for non-zero-order folio, the raw HWPoison page can
> be given out with its buddy page and be re-used by either
> kernel or userspace.
> 
> Memory failure recovery (MFR) in kernel does attempt to take
> raw HWPoison page off buddy allocator after
> dissolve_free_hugetlb_folio(). However, there is always a time
> window between dissolve_free_hugetlb_folio() frees a HWPoison
> high-order folio to buddy allocator and MFR takes HWPoison
> raw page off buddy allocator.

I wonder if this is something we want to backport to -stable.

> One obvious way to avoid this problem is to add page sanity
> checks in page allocate or free path. However, it is against
> the past efforts to reduce sanity check overhead [1,2,3].
> 
> Introduce free_has_hwpoisoned() to only free the healthy pages
> and to exclude the HWPoison ones in the high-order folio.
> The idea is to iterate through the sub-pages of the folio to
> identify contiguous ranges of healthy pages. Instead of freeing
> pages one by one, decompose healthy ranges into the largest
> possible blocks having different orders. Every block meets the
> requirements to be freed via __free_one_page().
> 
> free_has_hwpoisoned() has linear time complexity wrt the number
> of pages in the folio. While the power-of-two decomposition
> ensures that the number of calls to the buddy allocator is
> logarithmic for each contiguous healthy range, the mandatory
> linear scan of pages to identify PageHWPoison() defines the
> overall time complexity. For a 1G hugepage having several
> HWPoison pages, free_has_hwpoisoned() takes around 2ms on
> average.
> 
> Since free_has_hwpoisoned() has nontrivial overhead, it is
> wrapped inside free_pages_prepare_has_hwpoisoned() and done
> only PG_has_hwpoisoned indicates HWPoison page exists and
> after free_pages_prepare() succeeded.
> 
> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> 
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
>
> ---
>  mm/page_alloc.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 154 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 822e05f1a9646..9393589118604 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2923,6 +2928,152 @@ static bool free_frozen_page_commit(struct zone *zone,
>  	return ret;
>  }

From a correctness point of view, this looks good to me.
Let's see what the page allocator folks say.

A few nits below.

> +static bool compound_has_hwpoisoned(struct page *page, unsigned int order)
> +{
> +	if (order == 0 || !PageCompound(page))
> +		return false;

nit: since an order-0 compound page is not a thing, the
!PageCompound(page) check should cover the order == 0 case.

> +	return folio_test_has_hwpoisoned(page_folio(page));
> +}
> +
> +/*
> + * Do free_has_hwpoisoned() when needed after free_pages_prepare().
> + * Returns
> + * - true: free_pages_prepare() is good and caller can proceed freeing.
> + * - false: caller should not free pages for one of the two reasons:
> + *   1. free_pages_prepare() failed so it is not safe to proceed freeing.
> + *   2. this is a compound page having some HWPoison pages, and healthy
> + *      pages are already safely freed.
> + */
> +static bool free_pages_prepare_has_hwpoisoned(struct page *page,
> +					      unsigned int order,
> +					      fpi_t fpi_flags)

nit: Hope we'll come up with a better name than
free_pages_prepare_has_hwpoisoned(), but I don't have any better
suggestion... :)

And I hope somebody familiar with compaction (as compaction_free() calls
free_pages_prepare() and ignores its return value) could confirm that
it is safe to do a compound_has_hwpoisoned() check and, when it returns
true, call free_has_hwpoisoned() in free_pages_prepare(),
so that we won't need a separate function to do this.

-- 
Cheers,
Harry / Hyeonggon
Re: [PATCH v3 2/3] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio
Posted by Jiaqi Yan 2 weeks, 1 day ago
On Mon, Jan 12, 2026 at 9:39 PM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Mon, Jan 12, 2026 at 12:49:22AM +0000, Jiaqi Yan wrote:
> > At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
> > becomes non-HugeTLB, and it is released to buddy allocator
> > as a high-order folio, e.g. a folio that contains 262144 pages
> > if the folio was a 1G HugeTLB hugepage.
> >
> > This is problematic if the HugeTLB hugepage contained HWPoison
> > subpages. In that case, since buddy allocator does not check
> > HWPoison for non-zero-order folio, the raw HWPoison page can
> > be given out with its buddy page and be re-used by either
> > kernel or userspace.
> >
> > Memory failure recovery (MFR) in kernel does attempt to take
> > raw HWPoison page off buddy allocator after
> > dissolve_free_hugetlb_folio(). However, there is always a time
> > window between dissolve_free_hugetlb_folio() frees a HWPoison
> > high-order folio to buddy allocator and MFR takes HWPoison
> > raw page off buddy allocator.
>
> I wonder if this is something we want to backport to -stable.
>
> > One obvious way to avoid this problem is to add page sanity
> > checks in page allocate or free path. However, it is against
> > the past efforts to reduce sanity check overhead [1,2,3].
> >
> > Introduce free_has_hwpoisoned() to only free the healthy pages
> > and to exclude the HWPoison ones in the high-order folio.
> > The idea is to iterate through the sub-pages of the folio to
> > identify contiguous ranges of healthy pages. Instead of freeing
> > pages one by one, decompose healthy ranges into the largest
> > possible blocks having different orders. Every block meets the
> > requirements to be freed via __free_one_page().
> >
> > free_has_hwpoisoned() has linear time complexity wrt the number
> > of pages in the folio. While the power-of-two decomposition
> > ensures that the number of calls to the buddy allocator is
> > logarithmic for each contiguous healthy range, the mandatory
> > linear scan of pages to identify PageHWPoison() defines the
> > overall time complexity. For a 1G hugepage having several
> > HWPoison pages, free_has_hwpoisoned() takes around 2ms on
> > average.
> >
> > Since free_has_hwpoisoned() has nontrivial overhead, it is
> > wrapped inside free_pages_prepare_has_hwpoisoned() and done
> > only PG_has_hwpoisoned indicates HWPoison page exists and
> > after free_pages_prepare() succeeded.
> >
> > [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
> > [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
> > [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> >
> > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> >
> > ---
> >  mm/page_alloc.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 154 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 822e05f1a9646..9393589118604 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2923,6 +2928,152 @@ static bool free_frozen_page_commit(struct zone *zone,
> >       return ret;
> >  }
>
> From correctness point of view I think it looks good to me.

Thanks, Harry!

> Let's see what the page allocator folks say.
>
> A few nits below.
>
> > +static bool compound_has_hwpoisoned(struct page *page, unsigned int order)
> > +{
> > +     if (order == 0 || !PageCompound(page))
> > +             return false;
>
> nit: since order-0 compound page is not a thing,
> !PageCompound(page) check should cover order == 0 case.

ack, will simplify to something like PageCompound && folio_test_has_hwpoisoned.
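
For reference only, a minimal sketch of that simplification (the order
argument would become unused and could be dropped from callers as well):

static bool compound_has_hwpoisoned(struct page *page, unsigned int order)
{
	/* an order-0 page can never be a compound page */
	return PageCompound(page) &&
	       folio_test_has_hwpoisoned(page_folio(page));
}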

>
> > +     return folio_test_has_hwpoisoned(page_folio(page));
> > +}
> > +
> > +/*
> > + * Do free_has_hwpoisoned() when needed after free_pages_prepare().
> > + * Returns
> > + * - true: free_pages_prepare() is good and caller can proceed freeing.
> > + * - false: caller should not free pages for one of the two reasons:
> > + *   1. free_pages_prepare() failed so it is not safe to proceed freeing.
> > + *   2. this is a compound page having some HWPoison pages, and healthy
> > + *      pages are already safely freed.
> > + */
> > +static bool free_pages_prepare_has_hwpoisoned(struct page *page,
> > +                                           unsigned int order,
> > +                                           fpi_t fpi_flags)
>
> nit: Hope we'll come up with a better name than
> free_pages_prepare_has_poisoned(), but I don't have any better
> suggestion... :)
>
> And I hope somebody familiar with compaction (as compaction_free() calls
> free_pages_prepare() and ignores its return value) could confirm that
> it is safe to do a compound_has_hwpoisoned() check and, when it returns
> true, call free_has_hwpoisoned() in free_pages_prepare(),
> so that we won't need a separate function to do this.
>
> --
> Cheers,
> Harry / Hyeonggon
Re: [PATCH v3 2/3] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio
Posted by Miaohe Lin 3 weeks, 4 days ago
On 2026/1/13 13:39, Harry Yoo wrote:
> On Mon, Jan 12, 2026 at 12:49:22AM +0000, Jiaqi Yan wrote:
>> At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
>> becomes non-HugeTLB, and it is released to buddy allocator
>> as a high-order folio, e.g. a folio that contains 262144 pages
>> if the folio was a 1G HugeTLB hugepage.
>>
>> This is problematic if the HugeTLB hugepage contained HWPoison
>> subpages. In that case, since buddy allocator does not check
>> HWPoison for non-zero-order folio, the raw HWPoison page can
>> be given out with its buddy page and be re-used by either
>> kernel or userspace.
>>
>> Memory failure recovery (MFR) in kernel does attempt to take
>> raw HWPoison page off buddy allocator after
>> dissolve_free_hugetlb_folio(). However, there is always a time
>> window between dissolve_free_hugetlb_folio() frees a HWPoison
>> high-order folio to buddy allocator and MFR takes HWPoison
>> raw page off buddy allocator.
> 
> I wonder if this is something we want to backport to -stable.
> 
>> One obvious way to avoid this problem is to add page sanity
>> checks in page allocate or free path. However, it is against
>> the past efforts to reduce sanity check overhead [1,2,3].
>>
>> Introduce free_has_hwpoisoned() to only free the healthy pages
>> and to exclude the HWPoison ones in the high-order folio.
>> The idea is to iterate through the sub-pages of the folio to
>> identify contiguous ranges of healthy pages. Instead of freeing
>> pages one by one, decompose healthy ranges into the largest
>> possible blocks having different orders. Every block meets the
>> requirements to be freed via __free_one_page().
>>
>> free_has_hwpoisoned() has linear time complexity wrt the number
>> of pages in the folio. While the power-of-two decomposition
>> ensures that the number of calls to the buddy allocator is
>> logarithmic for each contiguous healthy range, the mandatory
>> linear scan of pages to identify PageHWPoison() defines the
>> overall time complexity. For a 1G hugepage having several
>> HWPoison pages, free_has_hwpoisoned() takes around 2ms on
>> average.
>>
>> Since free_has_hwpoisoned() has nontrivial overhead, it is
>> wrapped inside free_pages_prepare_has_hwpoisoned() and done
>> only PG_has_hwpoisoned indicates HWPoison page exists and
>> after free_pages_prepare() succeeded.
>>
>> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
>> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
>> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
>>
>> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
>>
>> ---
>>  mm/page_alloc.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 154 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 822e05f1a9646..9393589118604 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2923,6 +2928,152 @@ static bool free_frozen_page_commit(struct zone *zone,
>>  	return ret;
>>  }
> 
> From correctness point of view I think it looks good to me.
> Let's see what the page allocator folks say.
> 
> A few nits below.
> 
>> +static bool compound_has_hwpoisoned(struct page *page, unsigned int order)
>> +{
>> +	if (order == 0 || !PageCompound(page))
>> +		return false;
> 
> nit: since order-0 compound page is not a thing,
> !PageCompound(page) check should cover order == 0 case.
> 
>> +	return folio_test_has_hwpoisoned(page_folio(page));
>> +}
>> +
>> +/*
>> + * Do free_has_hwpoisoned() when needed after free_pages_prepare().
>> + * Returns
>> + * - true: free_pages_prepare() is good and caller can proceed freeing.
>> + * - false: caller should not free pages for one of the two reasons:
>> + *   1. free_pages_prepare() failed so it is not safe to proceed freeing.
>> + *   2. this is a compound page having some HWPoison pages, and healthy
>> + *      pages are already safely freed.
>> + */
>> +static bool free_pages_prepare_has_hwpoisoned(struct page *page,
>> +					      unsigned int order,
>> +					      fpi_t fpi_flags)
> 
> nit: Hope we'll come up with a better name than
> free_pages_prepare_has_poisoned(), but I don't have any better
> suggestion... :)

What about something like free_healthy_pages_prepare?

Thanks both.
.
Re: [PATCH v3 2/3] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio
Posted by Zi Yan 3 weeks, 5 days ago
On 13 Jan 2026, at 0:39, Harry Yoo wrote:

> On Mon, Jan 12, 2026 at 12:49:22AM +0000, Jiaqi Yan wrote:
>> At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
>> becomes non-HugeTLB, and it is released to buddy allocator
>> as a high-order folio, e.g. a folio that contains 262144 pages
>> if the folio was a 1G HugeTLB hugepage.
>>
>> This is problematic if the HugeTLB hugepage contained HWPoison
>> subpages. In that case, since buddy allocator does not check
>> HWPoison for non-zero-order folio, the raw HWPoison page can
>> be given out with its buddy page and be re-used by either
>> kernel or userspace.
>>
>> Memory failure recovery (MFR) in kernel does attempt to take
>> raw HWPoison page off buddy allocator after
>> dissolve_free_hugetlb_folio(). However, there is always a time
>> window between dissolve_free_hugetlb_folio() frees a HWPoison
>> high-order folio to buddy allocator and MFR takes HWPoison
>> raw page off buddy allocator.
>
> I wonder if this is something we want to backport to -stable.
>
>> One obvious way to avoid this problem is to add page sanity
>> checks in page allocate or free path. However, it is against
>> the past efforts to reduce sanity check overhead [1,2,3].
>>
>> Introduce free_has_hwpoisoned() to only free the healthy pages
>> and to exclude the HWPoison ones in the high-order folio.
>> The idea is to iterate through the sub-pages of the folio to
>> identify contiguous ranges of healthy pages. Instead of freeing
>> pages one by one, decompose healthy ranges into the largest
>> possible blocks having different orders. Every block meets the
>> requirements to be freed via __free_one_page().
>>
>> free_has_hwpoisoned() has linear time complexity wrt the number
>> of pages in the folio. While the power-of-two decomposition
>> ensures that the number of calls to the buddy allocator is
>> logarithmic for each contiguous healthy range, the mandatory
>> linear scan of pages to identify PageHWPoison() defines the
>> overall time complexity. For a 1G hugepage having several
>> HWPoison pages, free_has_hwpoisoned() takes around 2ms on
>> average.
>>
>> Since free_has_hwpoisoned() has nontrivial overhead, it is
>> wrapped inside free_pages_prepare_has_hwpoisoned() and done
>> only PG_has_hwpoisoned indicates HWPoison page exists and
>> after free_pages_prepare() succeeded.
>>
>> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
>> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
>> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
>>
>> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
>>
>> ---
>>  mm/page_alloc.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 154 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 822e05f1a9646..9393589118604 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2923,6 +2928,152 @@ static bool free_frozen_page_commit(struct zone *zone,
>>  	return ret;
>>  }
>
> From correctness point of view I think it looks good to me.
> Let's see what the page allocator folks say.
>
> A few nits below.
>
>> +static bool compound_has_hwpoisoned(struct page *page, unsigned int order)
>> +{
>> +	if (order == 0 || !PageCompound(page))
>> +		return false;
>
> nit: since order-0 compound page is not a thing,
> !PageCompound(page) check should cover order == 0 case.
>
>> +	return folio_test_has_hwpoisoned(page_folio(page));
>> +}
>> +
>> +/*
>> + * Do free_has_hwpoisoned() when needed after free_pages_prepare().
>> + * Returns
>> + * - true: free_pages_prepare() is good and caller can proceed freeing.
>> + * - false: caller should not free pages for one of the two reasons:
>> + *   1. free_pages_prepare() failed so it is not safe to proceed freeing.
>> + *   2. this is a compound page having some HWPoison pages, and healthy
>> + *      pages are already safely freed.
>> + */
>> +static bool free_pages_prepare_has_hwpoisoned(struct page *page,
>> +					      unsigned int order,
>> +					      fpi_t fpi_flags)
>
> nit: Hope we'll come up with a better name than
> free_pages_prepare_has_poisoned(), but I don't have any better
> suggestion... :)
>
> And I hope somebody familiar with compaction (as compaction_free() calls
> free_pages_prepare() and ignores its return value) could confirm that
> it is safe to do a compound_has_hwpoisoned() check and, when it returns
> true, call free_has_hwpoisoned() in free_pages_prepare(),
> so that we won't need a separate function to do this.

I wrote the function. compact_control->freepages[] is an array of free
pages isolated from the buddy allocator as a temporary free page buffer.
I wonder if compaction_free() can get a HWPoisoned folio or not, since
to get to compaction_free(), the folio needs to be copied to a new folio.
If the memory goes bad, folio_mc_copy() will return -EHWPOISON to abort
the migration, but I do not see any code marking the folio has_hwpoisoned.
When it reaches compaction_free(), there is no way to detect the error.
I think the corrupted page will be put back to buddy and only be marked
as has_hwpoisoned after it is allocated and accessed by an application.

You might need to add has_hwpoisoned code in the migration code path to
catch folio_mc_copy() errors.


Best Regards,
Yan, Zi
Re: [PATCH v3 2/3] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio
Posted by Jiaqi Yan 2 weeks, 1 day ago
On Tue, Jan 13, 2026 at 2:02 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 13 Jan 2026, at 0:39, Harry Yoo wrote:
>
> > On Mon, Jan 12, 2026 at 12:49:22AM +0000, Jiaqi Yan wrote:
> >> At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
> >> becomes non-HugeTLB, and it is released to buddy allocator
> >> as a high-order folio, e.g. a folio that contains 262144 pages
> >> if the folio was a 1G HugeTLB hugepage.
> >>
> >> This is problematic if the HugeTLB hugepage contained HWPoison
> >> subpages. In that case, since buddy allocator does not check
> >> HWPoison for non-zero-order folio, the raw HWPoison page can
> >> be given out with its buddy page and be re-used by either
> >> kernel or userspace.
> >>
> >> Memory failure recovery (MFR) in kernel does attempt to take
> >> raw HWPoison page off buddy allocator after
> >> dissolve_free_hugetlb_folio(). However, there is always a time
> >> window between dissolve_free_hugetlb_folio() frees a HWPoison
> >> high-order folio to buddy allocator and MFR takes HWPoison
> >> raw page off buddy allocator.
> >
> > I wonder if this is something we want to backport to -stable.
> >
> >> One obvious way to avoid this problem is to add page sanity
> >> checks in page allocate or free path. However, it is against
> >> the past efforts to reduce sanity check overhead [1,2,3].
> >>
> >> Introduce free_has_hwpoisoned() to only free the healthy pages
> >> and to exclude the HWPoison ones in the high-order folio.
> >> The idea is to iterate through the sub-pages of the folio to
> >> identify contiguous ranges of healthy pages. Instead of freeing
> >> pages one by one, decompose healthy ranges into the largest
> >> possible blocks having different orders. Every block meets the
> >> requirements to be freed via __free_one_page().
> >>
> >> free_has_hwpoisoned() has linear time complexity wrt the number
> >> of pages in the folio. While the power-of-two decomposition
> >> ensures that the number of calls to the buddy allocator is
> >> logarithmic for each contiguous healthy range, the mandatory
> >> linear scan of pages to identify PageHWPoison() defines the
> >> overall time complexity. For a 1G hugepage having several
> >> HWPoison pages, free_has_hwpoisoned() takes around 2ms on
> >> average.
> >>
> >> Since free_has_hwpoisoned() has nontrivial overhead, it is
> >> wrapped inside free_pages_prepare_has_hwpoisoned() and done
> >> only PG_has_hwpoisoned indicates HWPoison page exists and
> >> after free_pages_prepare() succeeded.
> >>
> >> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
> >> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
> >> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> >>
> >> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> >>
> >> ---
> >>  mm/page_alloc.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++-
> >>  1 file changed, 154 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index 822e05f1a9646..9393589118604 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -2923,6 +2928,152 @@ static bool free_frozen_page_commit(struct zone *zone,
> >>      return ret;
> >>  }
> >
> > From correctness point of view I think it looks good to me.
> > Let's see what the page allocator folks say.
> >
> > A few nits below.
> >
> >> +static bool compound_has_hwpoisoned(struct page *page, unsigned int order)
> >> +{
> >> +    if (order == 0 || !PageCompound(page))
> >> +            return false;
> >
> > nit: since order-0 compound page is not a thing,
> > !PageCompound(page) check should cover order == 0 case.
> >
> >> +    return folio_test_has_hwpoisoned(page_folio(page));
> >> +}
> >> +
> >> +/*
> >> + * Do free_has_hwpoisoned() when needed after free_pages_prepare().
> >> + * Returns
> >> + * - true: free_pages_prepare() is good and caller can proceed freeing.
> >> + * - false: caller should not free pages for one of the two reasons:
> >> + *   1. free_pages_prepare() failed so it is not safe to proceed freeing.
> >> + *   2. this is a compound page having some HWPoison pages, and healthy
> >> + *      pages are already safely freed.
> >> + */
> >> +static bool free_pages_prepare_has_hwpoisoned(struct page *page,
> >> +                                          unsigned int order,
> >> +                                          fpi_t fpi_flags)
> >
> > nit: Hope we'll come up with a better name than
> > free_pages_prepare_has_poisoned(), but I don't have any better
> > suggestion... :)
> >
> > And I hope somebody familiar with compaction (as compaction_free() calls
> > free_pages_prepare() and ignores its return value) could confirm that
> > it is safe to do a compound_has_hwpoisoned() check and, when it returns
> > true, call free_has_hwpoisoned() in free_pages_prepare(),
> > so that we won't need a separate function to do this.
>
> I wrote the function. compact_control->freepages[] is an array of free
> pages isolated from buddy allocator as a temporary free page buffer.

Thanks for the context, Zi!

> I wonder if compaction_free() can get a HWPoisoned folio or not, since
> to get to compaction_free(), the folio needs to be copied to a new folio.
> If the memory goes bad, folio_mc_copy() will return -EHWPOISON to abort
> the migration, but I do not see any code marking the folio has_hwpoisoned.
> When it reaches to compact_free(), there is no way to detect the error.
> I think the corrupted page will be put back to buddy and be marked as
> has_hwpoison after it is allocated and accessed by an application.

IIRC compaction_free() is used to deal with the *target* folio (i.e.
dst) if migration fails for any reason, say the src folio is hardware
corrupted and folio_mc_copy() returns -EHWPOISON. folio_mc_copy() tells
us nothing about the integrity of the target folio, hence there is no
marking of PG_has_hwpoisoned for the target page. The src folio should
already be marked asynchronously via memory_failure_queue() in
copy_mc_highpage().

So I think compaction_free() always deals with non-PG_has_hwpoisoned
folios, and there isn't any difference between
free_pages_prepare_has_hwpoisoned() and free_pages_prepare(). I can
merge them in v4.

>
> You might need to add has_hwpoisoned code in migration code path to
> catch folio_mc_copy() errors.
>
>
> Best Regards,
> Yan, Zi