At the end of dissolve_free_hugetlb_folio, when a free HugeTLB
folio becomes non-HugeTLB, it is released to the buddy allocator
as a high-order folio, e.g. a folio that contains 262144 pages
if the folio was a 1G HugeTLB hugepage.
This is problematic if the HugeTLB hugepage contained HWPoison
subpages. In that case, since the buddy allocator does not check
HWPoison for non-zero-order folios, the raw HWPoison page can
be handed out together with its buddy pages and be re-used by
either the kernel or userspace.
Memory failure recovery (MFR) in the kernel does attempt to take
the raw HWPoison page off the buddy allocator after
dissolve_free_hugetlb_folio. However, there is always a time
window between the moment dissolve_free_hugetlb_folio frees a
HWPoison high-order folio to the buddy allocator and the moment
MFR takes the raw HWPoison page off the buddy allocator.
One obvious way to avoid this problem is to add page sanity
checks to the page allocation or free path. However, that goes
against past efforts to reduce sanity check overhead [1,2,3].
Introduce free_has_hwpoison_pages to free only the healthy
pages in the high-order folio and exclude the HWPoison ones.
The idea is to iterate through the sub-pages of the folio to
identify contiguous ranges of healthy pages. Instead of freeing
pages one by one, decompose each healthy range into the largest
possible blocks, where every block meets the requirements to be
freed to the buddy allocator (__free_frozen_pages).
free_has_hwpoison_pages has linear time complexity O(N) wrt the
number of pages in the folio. While the power-of-two decomposition
ensures that the number of calls to the buddy allocator is
logarithmic for each contiguous healthy range, the mandatory
linear scan of pages to identify PageHWPoison determines the
overall time complexity.
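For illustration, a healthy range that starts at an unaligned pfn
decomposes as follows. This is a user-space sketch of the same
ffs/fls logic; the pfn values and helper names are made up for
demonstration and are not part of the patch:

#include <stdio.h>

/* largest power-of-two that divides pfn (kernel: ffs(pfn) - 1) */
static unsigned int align_order(unsigned long pfn)
{
	return __builtin_ctzl(pfn);
}

/* largest power-of-two that fits in remaining (kernel: fls_long() - 1) */
static unsigned int size_order(unsigned long remaining)
{
	return (8 * sizeof(long) - 1) - __builtin_clzl(remaining);
}

int main(void)
{
	/*
	 * Say pfn 0x40001 of a 1G folio [0x40000, 0x80000) is HWPoison:
	 * the second healthy range is [0x40002, 0x80000).
	 */
	unsigned long pfn = 0x40002, end_pfn = 0x80000;

	while (pfn < end_pfn) {
		unsigned int order = align_order(pfn);
		unsigned int sorder = size_order(end_pfn - pfn);

		if (sorder < order)
			order = sorder;
		/* prints 17 blocks: orders 1..16, then one order-17 block */
		printf("free pfn %#lx order %u\n", pfn, order);
		pfn += 1UL << order;
	}
	return 0;
}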
[1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
[2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
[3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 101 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 822e05f1a9646..20c8862ce594e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2976,8 +2976,109 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
}
}
+static void prepare_compound_page_to_free(struct page *new_head,
+ unsigned int order,
+ unsigned long flags)
+{
+ new_head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
+ new_head->mapping = NULL;
+ new_head->private = 0;
+
+ clear_compound_head(new_head);
+ if (order)
+ prep_compound_page(new_head, order);
+}
+
+/*
+ * Given a range of physically contiguous pages, efficiently
+ * free them in blocks that meet __free_frozen_pages's requirements.
+ */
+static void free_contiguous_pages(struct page *curr, struct page *next,
+ unsigned long flags)
+{
+ unsigned int order;
+ unsigned int align_order;
+ unsigned int size_order;
+ unsigned long pfn;
+ unsigned long end_pfn = page_to_pfn(next);
+ unsigned long remaining;
+
+ /*
+ * This decomposition algorithm at every iteration chooses the
+ * order to be the minimum of two constraints:
+ * - Alignment: the largest power-of-two that divides current pfn.
+ * - Size: the largest power-of-two that fits in the
+ * current remaining number of pages.
+ */
+ while (curr < next) {
+ pfn = page_to_pfn(curr);
+ remaining = end_pfn - pfn;
+
+ align_order = ffs(pfn) - 1;
+ size_order = fls_long(remaining) - 1;
+ order = min(align_order, size_order);
+
+ prepare_compound_page_to_free(curr, order, flags);
+ __free_frozen_pages(curr, order, FPI_NONE);
+ curr += (1UL << order);
+ }
+
+ VM_WARN_ON(curr != next);
+}
+
+/*
+ * Given a high-order compound page containing a certain number of HWPoison
+ * pages, free only the healthy ones to the buddy allocator.
+ *
+ * It calls __free_frozen_pages up to O(2^order) times and causes nontrivial
+ * overhead. So only use this when the compound page really contains HWPoison.
+ *
+ * This implementation doesn't work in the memdesc world.
+ */
+static void free_has_hwpoison_pages(struct page *page, unsigned int order)
+{
+ struct page *curr = page;
+ struct page *end = page + (1 << order);
+ struct page *next;
+ unsigned long flags = page->flags.f;
+ unsigned long nr_pages;
+ unsigned long total_freed = 0;
+ unsigned long total_hwp = 0;
+
+ VM_WARN_ON(flags & PAGE_FLAGS_CHECK_AT_FREE);
+
+ while (curr < end) {
+ next = curr;
+ nr_pages = 0;
+
+ while (next < end && !PageHWPoison(next)) {
+ ++next;
+ ++nr_pages;
+ }
+
+ if (PageHWPoison(next))
+ ++total_hwp;
+
+ free_contiguous_pages(curr, next, flags);
+
+ total_freed += nr_pages;
+ curr = PageHWPoison(next) ? next + 1 : next;
+ }
+
+ pr_info("Excluded %lu hwpoison pages from folio\n", total_hwp);
+ pr_info("Freed %#lx pages from folio\n", total_freed);
+}
+
void free_frozen_pages(struct page *page, unsigned int order)
{
+ struct folio *folio = page_folio(page);
+
+ if (order > 0 && unlikely(folio_test_has_hwpoisoned(folio))) {
+ folio_clear_has_hwpoisoned(folio);
+ free_has_hwpoison_pages(page, order);
+ return;
+ }
+
__free_frozen_pages(page, order, FPI_NONE);
}
--
2.52.0.322.g1dd061c0dc-goog
On 2025/12/20 2:33, Jiaqi Yan wrote:
> At the end of dissolve_free_hugetlb_folio that a free HugeTLB
> folio becomes non-HugeTLB, it is released to buddy allocator
> as a high-order folio, e.g. a folio that contains 262144 pages
> if the folio was a 1G HugeTLB hugepage.
>
> This is problematic if the HugeTLB hugepage contained HWPoison
> subpages. In that case, since buddy allocator does not check
> HWPoison for non-zero-order folio, the raw HWPoison page can
> be given out with its buddy page and be re-used by either
> kernel or userspace.
>
> Memory failure recovery (MFR) in kernel does attempt to take
> raw HWPoison page off buddy allocator after
> dissolve_free_hugetlb_folio. However, there is always a time
> window between dissolve_free_hugetlb_folio frees a HWPoison
> high-order folio to buddy allocator and MFR takes HWPoison
> raw page off buddy allocator.
>
> One obvious way to avoid this problem is to add page sanity
> checks in page allocate or free path. However, it is against
> the past efforts to reduce sanity check overhead [1,2,3].
>
> Introduce free_has_hwpoison_pages to only free the healthy
> pages and excludes the HWPoison ones in the high-order folio.
> The idea is to iterate through the sub-pages of the folio to
> identify contiguous ranges of healthy pages. Instead of freeing
> pages one by one, decompose healthy ranges into the largest
> possible blocks. Each block meets the requirements to be freed
> to buddy allocator (__free_frozen_pages).
>
> free_has_hwpoison_pages has linear time complexity O(N) wrt the
> number of pages in the folio. While the power-of-two decomposition
> ensures that the number of calls to the buddy allocator is
> logarithmic for each contiguous healthy range, the mandatory
> linear scan of pages to identify PageHWPoison defines the
> overall time complexity.
>
Thanks for your patch.
> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
>
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> ---
> mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 101 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 822e05f1a9646..20c8862ce594e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2976,8 +2976,109 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
> }
> }
>
> +static void prepare_compound_page_to_free(struct page *new_head,
> + unsigned int order,
> + unsigned long flags)
> +{
> + new_head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> + new_head->mapping = NULL;
> + new_head->private = 0;
> +
> + clear_compound_head(new_head);
> + if (order)
> + prep_compound_page(new_head, order);
> +}
> +
> +/*
> + * Given a range of pages physically contiguous physical, efficiently
> + * free them in blocks that meet __free_frozen_pages's requirements.
> + */
> +static void free_contiguous_pages(struct page *curr, struct page *next,
> + unsigned long flags)
> +{
> + unsigned int order;
> + unsigned int align_order;
> + unsigned int size_order;
> + unsigned long pfn;
> + unsigned long end_pfn = page_to_pfn(next);
> + unsigned long remaining;
> +
> + /*
> + * This decomposition algorithm at every iteration chooses the
> + * order to be the minimum of two constraints:
> + * - Alignment: the largest power-of-two that divides current pfn.
> + * - Size: the largest power-of-two that fits in the
> + * current remaining number of pages.
> + */
> + while (curr < next) {
> + pfn = page_to_pfn(curr);
> + remaining = end_pfn - pfn;
> +
> + align_order = ffs(pfn) - 1;
> + size_order = fls_long(remaining) - 1;
> + order = min(align_order, size_order);
> +
> +		prepare_compound_page_to_free(curr, order, flags);
> +		__free_frozen_pages(curr, order, FPI_NONE);
> + curr += (1UL << order);
For hwpoisoned pages, nothing is done. I think we should run at least
some portion of the code snippet from free_pages_prepare():
if (unlikely(PageHWPoison(page)) && !order) {
/* Do not let hwpoison pages hit pcplists/buddy */
reset_page_owner(page, order);
page_table_check_free(page, order);
pgalloc_tag_sub(page, 1 << order);
/*
* The page is isolated and accounted for.
* Mark the codetag as empty to avoid accounting error
* when the page is freed by unpoison_memory().
*/
clear_page_tag_ref(page);
return false;
}
> + }
> +
> + VM_WARN_ON(curr != next);
> +}
> +
> +/*
> + * Given a high-order compound page containing certain number of HWPoison
> + * pages, free only the healthy ones to buddy allocator.
> + *
> + * It calls __free_frozen_pages O(2^order) times and cause nontrivial
> + * overhead. So only use this when compound page really contains HWPoison.
> + *
> + * This implementation doesn't work in memdesc world.
> + */
> +static void free_has_hwpoison_pages(struct page *page, unsigned int order)
> +{
> + struct page *curr = page;
> + struct page *end = page + (1 << order);
> + struct page *next;
> + unsigned long flags = page->flags.f;
> + unsigned long nr_pages;
> + unsigned long total_freed = 0;
> + unsigned long total_hwp = 0;
> +
> + VM_WARN_ON(flags & PAGE_FLAGS_CHECK_AT_FREE);
> +
> + while (curr < end) {
> + next = curr;
> + nr_pages = 0;
> +
> + while (next < end && !PageHWPoison(next)) {
> + ++next;
> + ++nr_pages;
> + }
> +
> + if (PageHWPoison(next))
Would it be possible that next points to end? In that case, an irrelevant,
even nonexistent, page will be accessed?
Thanks.
.
On Mon, Dec 22, 2025 at 11:45 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2025/12/20 2:33, Jiaqi Yan wrote:
> > At the end of dissolve_free_hugetlb_folio that a free HugeTLB
> > folio becomes non-HugeTLB, it is released to buddy allocator
> > as a high-order folio, e.g. a folio that contains 262144 pages
> > if the folio was a 1G HugeTLB hugepage.
> >
> > This is problematic if the HugeTLB hugepage contained HWPoison
> > subpages. In that case, since buddy allocator does not check
> > HWPoison for non-zero-order folio, the raw HWPoison page can
> > be given out with its buddy page and be re-used by either
> > kernel or userspace.
> >
> > Memory failure recovery (MFR) in kernel does attempt to take
> > raw HWPoison page off buddy allocator after
> > dissolve_free_hugetlb_folio. However, there is always a time
> > window between dissolve_free_hugetlb_folio frees a HWPoison
> > high-order folio to buddy allocator and MFR takes HWPoison
> > raw page off buddy allocator.
> >
> > One obvious way to avoid this problem is to add page sanity
> > checks in page allocate or free path. However, it is against
> > the past efforts to reduce sanity check overhead [1,2,3].
> >
> > Introduce free_has_hwpoison_pages to only free the healthy
> > pages and excludes the HWPoison ones in the high-order folio.
> > The idea is to iterate through the sub-pages of the folio to
> > identify contiguous ranges of healthy pages. Instead of freeing
> > pages one by one, decompose healthy ranges into the largest
> > possible blocks. Each block meets the requirements to be freed
> > to buddy allocator (__free_frozen_pages).
> >
> > free_has_hwpoison_pages has linear time complexity O(N) wrt the
> > number of pages in the folio. While the power-of-two decomposition
> > ensures that the number of calls to the buddy allocator is
> > logarithmic for each contiguous healthy range, the mandatory
> > linear scan of pages to identify PageHWPoison defines the
> > overall time complexity.
> >
>
> Thanks for your patch.
Thanks for your review/comments!
>
> > [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> > [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> > [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> >
> > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > ---
> > mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 101 insertions(+)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 822e05f1a9646..20c8862ce594e 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2976,8 +2976,109 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
> > }
> > }
> >
> > +static void prepare_compound_page_to_free(struct page *new_head,
> > + unsigned int order,
> > + unsigned long flags)
> > +{
> > + new_head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> > + new_head->mapping = NULL;
> > + new_head->private = 0;
> > +
> > + clear_compound_head(new_head);
> > + if (order)
> > + prep_compound_page(new_head, order);
> > +}
> > +
> > +/*
> > + * Given a range of pages physically contiguous physical, efficiently
> > + * free them in blocks that meet __free_frozen_pages's requirements.
> > + */
> > +static void free_contiguous_pages(struct page *curr, struct page *next,
> > + unsigned long flags)
> > +{
> > + unsigned int order;
> > + unsigned int align_order;
> > + unsigned int size_order;
> > + unsigned long pfn;
> > + unsigned long end_pfn = page_to_pfn(next);
> > + unsigned long remaining;
> > +
> > + /*
> > + * This decomposition algorithm at every iteration chooses the
> > + * order to be the minimum of two constraints:
> > + * - Alignment: the largest power-of-two that divides current pfn.
> > + * - Size: the largest power-of-two that fits in the
> > + * current remaining number of pages.
> > + */
> > + while (curr < next) {
> > + pfn = page_to_pfn(curr);
> > + remaining = end_pfn - pfn;
> > +
> > + align_order = ffs(pfn) - 1;
> > + size_order = fls_long(remaining) - 1;
> > + order = min(align_order, size_order);
> > +
> > + prepare_compound_page_to_free(curr, order, flags);
> > + __free_frozen_pages(curr, order, FPI_NONE);
> > + curr += (1UL << order);
>
> For hwpoisoned pages, nothing is done for them. I think we should run at least
> some portion of code snippet from free_pages_prepare():
Agreed, will add in v3.
>
> if (unlikely(PageHWPoison(page)) && !order) {
> /* Do not let hwpoison pages hit pcplists/buddy */
> reset_page_owner(page, order);
> page_table_check_free(page, order);
> pgalloc_tag_sub(page, 1 << order);
>
> /*
> * The page is isolated and accounted for.
> * Mark the codetag as empty to avoid accounting error
> * when the page is freed by unpoison_memory().
> */
> clear_page_tag_ref(page);
> return false;
> }
>
> > + }
> > +
> > + VM_WARN_ON(curr != next);
> > +}
> > +
> > +/*
> > + * Given a high-order compound page containing certain number of HWPoison
> > + * pages, free only the healthy ones to buddy allocator.
> > + *
> > + * It calls __free_frozen_pages O(2^order) times and cause nontrivial
> > + * overhead. So only use this when compound page really contains HWPoison.
> > + *
> > + * This implementation doesn't work in memdesc world.
> > + */
> > +static void free_has_hwpoison_pages(struct page *page, unsigned int order)
> > +{
> > + struct page *curr = page;
> > + struct page *end = page + (1 << order);
> > + struct page *next;
> > + unsigned long flags = page->flags.f;
> > + unsigned long nr_pages;
> > + unsigned long total_freed = 0;
> > + unsigned long total_hwp = 0;
> > +
> > + VM_WARN_ON(flags & PAGE_FLAGS_CHECK_AT_FREE);
> > +
> > + while (curr < end) {
> > + next = curr;
> > + nr_pages = 0;
> > +
> > + while (next < end && !PageHWPoison(next)) {
> > + ++next;
> > + ++nr_pages;
> > + }
> > +
> > + if (PageHWPoison(next))
> Would it be possible next points to end? In that case, irrelevant even nonexistent page
> will be accessed ?
Thanks for catching that. Let me avoid accessing end as a page at all,
both here and in free_contiguous_pages.
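For example, the scan could be restructured so that *end is never
dereferenced, along these lines (untested sketch, reusing the
variables already in free_has_hwpoison_pages):

	while (curr < end) {
		next = curr;
		while (next < end && !PageHWPoison(next))
			++next;

		free_contiguous_pages(curr, next, flags);
		total_freed += next - curr;

		if (next < end) {
			/* next is a real HWPoison page; count and skip it */
			++total_hwp;
			curr = next + 1;
		} else {
			curr = next;
		}
	}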
>
> Thanks.
> .
On Fri, Dec 19, 2025 at 06:33:45PM +0000, Jiaqi Yan wrote:
> At the end of dissolve_free_hugetlb_folio that a free HugeTLB
> folio becomes non-HugeTLB, it is released to buddy allocator
> as a high-order folio, e.g. a folio that contains 262144 pages
> if the folio was a 1G HugeTLB hugepage.
>
> This is problematic if the HugeTLB hugepage contained HWPoison
> subpages. In that case, since buddy allocator does not check
> HWPoison for non-zero-order folio, the raw HWPoison page can
> be given out with its buddy page and be re-used by either
> kernel or userspace.
>
> Memory failure recovery (MFR) in kernel does attempt to take
> raw HWPoison page off buddy allocator after
> dissolve_free_hugetlb_folio. However, there is always a time
> window between dissolve_free_hugetlb_folio frees a HWPoison
> high-order folio to buddy allocator and MFR takes HWPoison
> raw page off buddy allocator.
>
> One obvious way to avoid this problem is to add page sanity
> checks in page allocate or free path. However, it is against
> the past efforts to reduce sanity check overhead [1,2,3].
>
> Introduce free_has_hwpoison_pages to only free the healthy
> pages and excludes the HWPoison ones in the high-order folio.
> The idea is to iterate through the sub-pages of the folio to
> identify contiguous ranges of healthy pages. Instead of freeing
> pages one by one, decompose healthy ranges into the largest
> possible blocks. Each block meets the requirements to be freed
> to buddy allocator (__free_frozen_pages).
>
> free_has_hwpoison_pages has linear time complexity O(N) wrt the
> number of pages in the folio. While the power-of-two decomposition
> ensures that the number of calls to the buddy allocator is
> logarithmic for each contiguous healthy range, the mandatory
> linear scan of pages to identify PageHWPoison defines the
> overall time complexity.
Hi Jiaqi, thanks for the patch!
Have you tried measuring the latency of free_has_hwpoison_pages() when
a few pages in a 1GB folio are hwpoisoned?
Just wanted to make sure we don't introduce a possible soft lockup...
Or am I worrying too much?
> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
>
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> ---
> mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 101 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 822e05f1a9646..20c8862ce594e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2976,8 +2976,109 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
> }
> }
>
> +static void prepare_compound_page_to_free(struct page *new_head,
> + unsigned int order,
> + unsigned long flags)
> +{
> + new_head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> + new_head->mapping = NULL;
> + new_head->private = 0;
> +
> + clear_compound_head(new_head);
> + if (order)
> + prep_compound_page(new_head, order);
> +}
Not sure why it's building compound pages, just to decompose them
when freeing via __free_frozen_pages()?
If you intended to reset compound head & tails, I think it's more
readable to decompose the whole compound page at once and not build
compound pages when freeing it?
> +/*
> + * Given a range of pages physically contiguous physical, efficiently
> + * free them in blocks that meet __free_frozen_pages's requirements.
> + */
> +static void free_contiguous_pages(struct page *curr, struct page *next,
> + unsigned long flags)
> +{
> + unsigned int order;
> + unsigned int align_order;
> + unsigned int size_order;
> + unsigned long pfn;
> + unsigned long end_pfn = page_to_pfn(next);
> + unsigned long remaining;
> +
> + /*
> + * This decomposition algorithm at every iteration chooses the
> + * order to be the minimum of two constraints:
> + * - Alignment: the largest power-of-two that divides current pfn.
> + * - Size: the largest power-of-two that fits in the
> + * current remaining number of pages.
> + */
> + while (curr < next) {
> + pfn = page_to_pfn(curr);
> + remaining = end_pfn - pfn;
> +
> + align_order = ffs(pfn) - 1;
> + size_order = fls_long(remaining) - 1;
> + order = min(align_order, size_order);
> +
> + prepare_compound_page_to_free(curr, order, flags);
> + __free_frozen_pages(curr, order, FPI_NONE);
> + curr += (1UL << order);
> + }
> +
> + VM_WARN_ON(curr != next);
> +}
> +
> +/*
> + * Given a high-order compound page containing certain number of HWPoison
> + * pages, free only the healthy ones to buddy allocator.
> + *
> + * It calls __free_frozen_pages O(2^order) times and cause nontrivial
> + * overhead. So only use this when compound page really contains HWPoison.
> + *
> + * This implementation doesn't work in memdesc world.
> + */
> +static void free_has_hwpoison_pages(struct page *page, unsigned int order)
> +{
> + struct page *curr = page;
> + struct page *end = page + (1 << order);
> + struct page *next;
> + unsigned long flags = page->flags.f;
> + unsigned long nr_pages;
> + unsigned long total_freed = 0;
> + unsigned long total_hwp = 0;
> +
> + VM_WARN_ON(flags & PAGE_FLAGS_CHECK_AT_FREE);
> +
> + while (curr < end) {
> + next = curr;
> + nr_pages = 0;
> +
> + while (next < end && !PageHWPoison(next)) {
> + ++next;
> + ++nr_pages;
> + }
> +
> + if (PageHWPoison(next))
> + ++total_hwp;
> +
> + free_contiguous_pages(curr, next, flags);
page_owner, memory profiling (anything else?) will be confused
because it was allocated as a larger size, but we're freeing only
some portion of it.
Perhaps we need to run some portion of this code snippet
(from free_pages_prepare()), before freeing portions of it:
page_cpupid_reset_last(page);
page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
reset_page_owner(page, order);
page_table_check_free(page, order);
pgalloc_tag_sub(page, 1 << order);
> + total_freed += nr_pages;
> + curr = PageHWPoison(next) ? next + 1 : next;
> + }
> +
> + pr_info("Excluded %lu hwpoison pages from folio\n", total_hwp);
> + pr_info("Freed %#lx pages from folio\n", total_freed);
> +}
> +
> void free_frozen_pages(struct page *page, unsigned int order)
> {
> + struct folio *folio = page_folio(page);
> +
> + if (order > 0 && unlikely(folio_test_has_hwpoisoned(folio))) {
> + folio_clear_has_hwpoisoned(folio);
> + free_has_hwpoison_pages(page, order);
> + return;
> + }
> +
It feels like it's a bit random place to do has_hwpoisoned check.
Can we move this to free_pages_prepare() where we have some
sanity checks (and also order-0 hwpoison page handling)?
> __free_frozen_pages(page, order, FPI_NONE);
> }
>
> --
> 2.52.0.322.g1dd061c0dc-goog
--
Cheers,
Harry / Hyeonggon
On Mon, Dec 22, 2025 at 9:14 PM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Fri, Dec 19, 2025 at 06:33:45PM +0000, Jiaqi Yan wrote:
> > At the end of dissolve_free_hugetlb_folio that a free HugeTLB
> > folio becomes non-HugeTLB, it is released to buddy allocator
> > as a high-order folio, e.g. a folio that contains 262144 pages
> > if the folio was a 1G HugeTLB hugepage.
> >
> > This is problematic if the HugeTLB hugepage contained HWPoison
> > subpages. In that case, since buddy allocator does not check
> > HWPoison for non-zero-order folio, the raw HWPoison page can
> > be given out with its buddy page and be re-used by either
> > kernel or userspace.
> >
> > Memory failure recovery (MFR) in kernel does attempt to take
> > raw HWPoison page off buddy allocator after
> > dissolve_free_hugetlb_folio. However, there is always a time
> > window between dissolve_free_hugetlb_folio frees a HWPoison
> > high-order folio to buddy allocator and MFR takes HWPoison
> > raw page off buddy allocator.
> >
> > One obvious way to avoid this problem is to add page sanity
> > checks in page allocate or free path. However, it is against
> > the past efforts to reduce sanity check overhead [1,2,3].
> >
> > Introduce free_has_hwpoison_pages to only free the healthy
> > pages and excludes the HWPoison ones in the high-order folio.
> > The idea is to iterate through the sub-pages of the folio to
> > identify contiguous ranges of healthy pages. Instead of freeing
> > pages one by one, decompose healthy ranges into the largest
> > possible blocks. Each block meets the requirements to be freed
> > to buddy allocator (__free_frozen_pages).
> >
> > free_has_hwpoison_pages has linear time complexity O(N) wrt the
> > number of pages in the folio. While the power-of-two decomposition
> > ensures that the number of calls to the buddy allocator is
> > logarithmic for each contiguous healthy range, the mandatory
> > linear scan of pages to identify PageHWPoison defines the
> > overall time complexity.
>
> Hi Jiaqi, thanks for the patch!
Thanks for your review/comments!
>
> Have you tried measuring the latency of free_has_hwpoison_pages() when
> a few pages in a 1GB folio are hwpoisoned?
>
> Just wanted to make sure we don't introduce a possible soft lockup...
> Or am I worrying too much?
In my local tests, freeing a 1GB folio with 1 / 3 / 8 HWPoison pages,
I never ran into a soft lockup. The 8-HWPoison-page case takes more
time than the other cases, meaning that handling each additional
HWPoison page adds to the time cost.
After adding some instrumentation code, 10 sample runs of
free_has_hwpoison_pages with 8 HWPoison pages:
- observed mean is 7.03 ms (5.97 ms when 3 HWPoison pages)
- observed standard deviation is 0.76 ms (0.18 ms when 3 HWPoison pages)
In comparison, freeing a 1G folio without any HWPoison pages 10 times
(with same kernel config):
- observed mean is 3.39 ms
- observed standard deviation is 0.16ms
So it's around twice the baseline. It should be far from triggering a
soft lockup, and the cost seems fair for handling exceptional hardware
memory errors.
I can add these measurements in future revisions.
>
> > [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> > [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> > [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> >
> > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > ---
> > mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 101 insertions(+)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 822e05f1a9646..20c8862ce594e 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2976,8 +2976,109 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
> > }
> > }
> >
> > +static void prepare_compound_page_to_free(struct page *new_head,
> > + unsigned int order,
> > + unsigned long flags)
> > +{
> > + new_head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> > + new_head->mapping = NULL;
> > + new_head->private = 0;
> > +
> > + clear_compound_head(new_head);
> > + if (order)
> > + prep_compound_page(new_head, order);
> > +}
>
> Not sure why it's building compound pages, just to decompose them
> when freeing via __free_frozen_pages()?
prepare_compound_page_to_free() borrowed the idea from
__split_folio_to_order(). Conceptually the original folio is split
into new compound pages with different orders; here this is done on
the fly in free_contiguous_pages() when order is decided.
__free_frozen_pages() is also happy with a compound page when order >
0, as I tested with free_pages_prepare before calling
__free_frozen_pages().
>
> If you intended to reset compound head & tails, I think it's more
> readable to decompose the whole compound page at once and not build
> compound pages when freeing it?
I don't think prepare_compound_page_to_free() is that hard to read,
but I'm open to more opinions.
>
> > +/*
> > + * Given a range of pages physically contiguous physical, efficiently
> > + * free them in blocks that meet __free_frozen_pages's requirements.
> > + */
> > +static void free_contiguous_pages(struct page *curr, struct page *next,
> > + unsigned long flags)
> > +{
> > + unsigned int order;
> > + unsigned int align_order;
> > + unsigned int size_order;
> > + unsigned long pfn;
> > + unsigned long end_pfn = page_to_pfn(next);
> > + unsigned long remaining;
> > +
> > + /*
> > + * This decomposition algorithm at every iteration chooses the
> > + * order to be the minimum of two constraints:
> > + * - Alignment: the largest power-of-two that divides current pfn.
> > + * - Size: the largest power-of-two that fits in the
> > + * current remaining number of pages.
> > + */
> > + while (curr < next) {
> > + pfn = page_to_pfn(curr);
> > + remaining = end_pfn - pfn;
> > +
> > + align_order = ffs(pfn) - 1;
> > + size_order = fls_long(remaining) - 1;
> > + order = min(align_order, size_order);
> > +
> > + prepare_compound_page_to_free(curr, order, flags);
> > + __free_frozen_pages(curr, order, FPI_NONE);
> > + curr += (1UL << order);
> > + }
> > +
> > + VM_WARN_ON(curr != next);
> > +}
> > +
> > +/*
> > + * Given a high-order compound page containing certain number of HWPoison
> > + * pages, free only the healthy ones to buddy allocator.
> > + *
> > + * It calls __free_frozen_pages O(2^order) times and cause nontrivial
> > + * overhead. So only use this when compound page really contains HWPoison.
> > + *
> > + * This implementation doesn't work in memdesc world.
> > + */
> > +static void free_has_hwpoison_pages(struct page *page, unsigned int order)
> > +{
> > + struct page *curr = page;
> > + struct page *end = page + (1 << order);
> > + struct page *next;
> > + unsigned long flags = page->flags.f;
> > + unsigned long nr_pages;
> > + unsigned long total_freed = 0;
> > + unsigned long total_hwp = 0;
> > +
> > + VM_WARN_ON(flags & PAGE_FLAGS_CHECK_AT_FREE);
> > +
> > + while (curr < end) {
> > + next = curr;
> > + nr_pages = 0;
> > +
> > + while (next < end && !PageHWPoison(next)) {
> > + ++next;
> > + ++nr_pages;
> > + }
> > +
> > + if (PageHWPoison(next))
> > + ++total_hwp;
> > +
> > + free_contiguous_pages(curr, next, flags);
>
> page_owner, memory profiling (anything else?) will be confused
> because it was allocated as a larger size, but we're freeing only
> some portion of it.
I am not sure, but looking at __split_unmapped_folio, it calls
pgalloc_tag_split(folio, old_order, split_order) when splitting an
old_order-order folio into a new split_order.
Maybe prepare_compound_page_to_free() really should call
update_page_tag_ref(); I need to take a closer look at this with
CONFIG_MEM_ALLOC_PROFILING (not something I usually enable).
>
> Perhaps we need to run some portion of this code snippet
> (from free_pages_prepare()), before freeing portions of it:
>
> page_cpupid_reset_last(page);
> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> reset_page_owner(page, order);
> page_table_check_free(page, order);
> pgalloc_tag_sub(page, 1 << order);
Since they come from free_pages_prepare, I believe these lines are
already executed via free_contiguous_pages()=> __free_frozen_pages()=>
free_pages_prepare(), right? Or am I missing something?
>
> > + total_freed += nr_pages;
> > + curr = PageHWPoison(next) ? next + 1 : next;
> > + }
> > +
> > + pr_info("Excluded %lu hwpoison pages from folio\n", total_hwp);
> > + pr_info("Freed %#lx pages from folio\n", total_freed);
> > +}
> > +
> > void free_frozen_pages(struct page *page, unsigned int order)
> > {
> > + struct folio *folio = page_folio(page);
> > +
> > + if (order > 0 && unlikely(folio_test_has_hwpoisoned(folio))) {
> > + folio_clear_has_hwpoisoned(folio);
> > + free_has_hwpoison_pages(page, order);
> > + return;
> > + }
> > +
>
> It feels like it's a bit random place to do has_hwpoisoned check.
> Can we move this to free_pages_prepare() where we have some
> sanity checks (and also order-0 hwpoison page handling)?
While free_pages_prepare() seems to be a better place to do the
has_hwpoisoned check, it is not a good place to do
free_has_hwpoison_pages().
>
> > __free_frozen_pages(page, order, FPI_NONE);
> > }
> >
> > --
> > 2.52.0.322.g1dd061c0dc-goog
>
> --
> Cheers,
> Harry / Hyeonggon
On Fri, Dec 26, 2025 at 05:50:59PM -0800, Jiaqi Yan wrote:
> On Mon, Dec 22, 2025 at 9:14 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> >
> > On Fri, Dec 19, 2025 at 06:33:45PM +0000, Jiaqi Yan wrote:
> > > At the end of dissolve_free_hugetlb_folio that a free HugeTLB
> > > folio becomes non-HugeTLB, it is released to buddy allocator
> > > as a high-order folio, e.g. a folio that contains 262144 pages
> > > if the folio was a 1G HugeTLB hugepage.
> > >
> > > This is problematic if the HugeTLB hugepage contained HWPoison
> > > subpages. In that case, since buddy allocator does not check
> > > HWPoison for non-zero-order folio, the raw HWPoison page can
> > > be given out with its buddy page and be re-used by either
> > > kernel or userspace.
> > >
> > > Memory failure recovery (MFR) in kernel does attempt to take
> > > raw HWPoison page off buddy allocator after
> > > dissolve_free_hugetlb_folio. However, there is always a time
> > > window between dissolve_free_hugetlb_folio frees a HWPoison
> > > high-order folio to buddy allocator and MFR takes HWPoison
> > > raw page off buddy allocator.
> > >
> > > One obvious way to avoid this problem is to add page sanity
> > > checks in page allocate or free path. However, it is against
> > > the past efforts to reduce sanity check overhead [1,2,3].
> > >
> > > Introduce free_has_hwpoison_pages to only free the healthy
> > > pages and excludes the HWPoison ones in the high-order folio.
> > > The idea is to iterate through the sub-pages of the folio to
> > > identify contiguous ranges of healthy pages. Instead of freeing
> > > pages one by one, decompose healthy ranges into the largest
> > > possible blocks. Each block meets the requirements to be freed
> > > to buddy allocator (__free_frozen_pages).
> > >
> > > free_has_hwpoison_pages has linear time complexity O(N) wrt the
> > > number of pages in the folio. While the power-of-two decomposition
> > > ensures that the number of calls to the buddy allocator is
> > > logarithmic for each contiguous healthy range, the mandatory
> > > linear scan of pages to identify PageHWPoison defines the
> > > overall time complexity.
> >
> > Hi Jiaqi, thanks for the patch!
>
> Thanks for your review/comments!
>
> >
> > Have you tried measuring the latency of free_has_hwpoison_pages() when
> > a few pages in a 1GB folio are hwpoisoned?
> >
> > Just wanted to make sure we don't introduce a possible soft lockup...
> > Or am I worrying too much?
>
> In my local tests, freeing a 1GB folio with 1 / 3 / 8 HWPoison pages,
> I never run into a soft lockup. The 8-HWPoison-page case takes more
> time than other cases, meaning that handling the additional HWPoison
> page adds to the time cost.
>
> After adding some instrument code, 10 sample runs of
> free_has_hwpoison_pages with 8 HWPoison pages:
> - observed mean is 7.03 ms (5.97 ms when 3 HWPoison pages)
> - observed standard deviation is 0.76 ms (0.18 ms when 3 HWPoison pages)
>
> In comparison, freeing a 1G folio without any HWPoison pages 10 times
> (with same kernel config):
> - observed mean is 3.39 ms
> - observed standard deviation is 0.16ms
Thanks for the measurement!
> So it's around twice the baseline. It should be far from triggering a
> soft lockup, and the cost seems fair for handling exceptional hardware
> memory errors.
Yeah it looks fine to me.
> I can add these measurements in future revisions.
That would be nice, thanks.
> > > [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> > > [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> > > [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> > >
> > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > > ---
> > > mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 101 insertions(+)
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 822e05f1a9646..20c8862ce594e 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2976,8 +2976,109 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
> > > }
> > > }
> > >
> > > +static void prepare_compound_page_to_free(struct page *new_head,
> > > + unsigned int order,
> > > + unsigned long flags)
> > > +{
> > > + new_head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> > > + new_head->mapping = NULL;
> > > + new_head->private = 0;
> > > +
> > > + clear_compound_head(new_head);
> > > + if (order)
> > > + prep_compound_page(new_head, order);
> > > +}
> >
> > Not sure why it's building compound pages, just to decompose them
> > when freeing via __free_frozen_pages()?
>
> prepare_compound_page_to_free() borrowed the idea from
> __split_folio_to_order(). Conceptually the original folio is split
> into new compound pages with different orders;
I see, and per the previous discussion we don't want to split it
to 262,144 4K pages in the future, anyway...
> here this is done on
> the fly in free_contiguous_pages() when order is decided.
>
> > > +/*
> > > + * Given a range of pages physically contiguous physical, efficiently
> > > + * free them in blocks that meet __free_frozen_pages's requirements.
> > > + */
> > > +static void free_contiguous_pages(struct page *curr, struct page *next,
> > > + unsigned long flags)
> > > +{
> > > + unsigned int order;
> > > + unsigned int align_order;
> > > + unsigned int size_order;
> > > + unsigned long pfn;
> > > + unsigned long end_pfn = page_to_pfn(next);
> > > + unsigned long remaining;
> > > +
> > > + /*
> > > + * This decomposition algorithm at every iteration chooses the
> > > + * order to be the minimum of two constraints:
> > > + * - Alignment: the largest power-of-two that divides current pfn.
> > > + * - Size: the largest power-of-two that fits in the
> > > + * current remaining number of pages.
> > > + */
> > > + while (curr < next) {
> > > + pfn = page_to_pfn(curr);
> > > + remaining = end_pfn - pfn;
> > > +
> > > + align_order = ffs(pfn) - 1;
> > > + size_order = fls_long(remaining) - 1;
> > > + order = min(align_order, size_order);
> > > +
> > > + prepare_compound_page_to_free(curr, order, flags);
> > > + __free_frozen_pages(curr, order, FPI_NONE);
> > > + curr += (1UL << order);
> > > + }
> > > +
> > > + VM_WARN_ON(curr != next);
> > > +}
> > > +
> > > +/*
> > > + * Given a high-order compound page containing certain number of HWPoison
> > > + * pages, free only the healthy ones to buddy allocator.
> > > + *
> > > + * It calls __free_frozen_pages O(2^order) times and cause nontrivial
> > > + * overhead. So only use this when compound page really contains HWPoison.
> > > + *
> > > + * This implementation doesn't work in memdesc world.
> > > + */
> > > +static void free_has_hwpoison_pages(struct page *page, unsigned int order)
> > > +{
> > > + struct page *curr = page;
> > > + struct page *end = page + (1 << order);
> > > + struct page *next;
> > > + unsigned long flags = page->flags.f;
> > > + unsigned long nr_pages;
> > > + unsigned long total_freed = 0;
> > > + unsigned long total_hwp = 0;
> > > +
> > > + VM_WARN_ON(flags & PAGE_FLAGS_CHECK_AT_FREE);
> > > +
> > > + while (curr < end) {
> > > + next = curr;
> > > + nr_pages = 0;
> > > +
> > > + while (next < end && !PageHWPoison(next)) {
> > > + ++next;
> > > + ++nr_pages;
> > > + }
> > > +
> > > + if (PageHWPoison(next))
> > > + ++total_hwp;
> > > +
> > > + free_contiguous_pages(curr, next, flags);
> >
> > page_owner, memory profiling (anything else?) will be confused
> > because it was allocated as a larger size, but we're freeing only
> > some portion of it.
>
> I am not sure, but looking at __split_unmapped_folio, it calls
> pgalloc_tag_split(folio, old_order, split_order) when splitting an
> old_order-order folio into a new split_order.
>
> Maybe prepare_compound_page_to_free really should
> update_page_tag_ref(), I need to take a closer look at this with
> CONFIG_MEM_ALLOC_PROFILING (not something I usually enable).
>
> > Perhaps we need to run some portion of this code snippet
> > (from free_pages_prepare()), before freeing portions of it:
> >
> > page_cpupid_reset_last(page);
> > page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > reset_page_owner(page, order);
> > page_table_check_free(page, order);
> > pgalloc_tag_sub(page, 1 << order);
>
> Since they come from free_pages_prepare, I believe these lines are
> already executed via free_contiguous_pages()=> __free_frozen_pages()=>
> free_pages_prepare(), right? Or am I missing something?
But they're called with an order that is smaller than the original order.
That could be problematic; for example, memory profiling stores metadata
only on the first page. If you pass anything other than the first page
to free_pages_prepare(), it will not recognize that metadata was stored
during allocation.
In general, I think they're not designed to handle cases where
the allocation order and the free order differ (unless we split
metadata like __split_unmapped_folio() does).
> > > + total_freed += nr_pages;
> > > + curr = PageHWPoison(next) ? next + 1 : next;
> > > + }
> > > +
> > > + pr_info("Excluded %lu hwpoison pages from folio\n", total_hwp);
> > > + pr_info("Freed %#lx pages from folio\n", total_freed);
> > > +}
> > > +
> > > void free_frozen_pages(struct page *page, unsigned int order)
> > > {
> > > + struct folio *folio = page_folio(page);
> > > +
> > > + if (order > 0 && unlikely(folio_test_has_hwpoisoned(folio))) {
> > > + folio_clear_has_hwpoisoned(folio);
> > > + free_has_hwpoison_pages(page, order);
> > > + return;
> > > + }
> > > +
> >
> > It feels like it's a bit random place to do has_hwpoisoned check.
> > Can we move this to free_pages_prepare() where we have some
> > sanity checks (and also order-0 hwpoison page handling)?
>
> While free_pages_prepare() seems to be a better place to do the
> has_hwpoisoned check, it is not a good place to do
> free_has_hwpoison_pages().
Why is it not a good place for free_has_hwpoison_pages()?
Callers of free_pages_prepare() are supposed to avoid freeing it back to
the buddy or using the page when it returns false.
...except compaction_free(), which I don't have much idea what it's
doing.
> > > __free_frozen_pages(page, order, FPI_NONE);
> > > }
--
Cheers,
Harry / Hyeonggon
On Sun, Dec 28, 2025 at 5:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Fri, Dec 26, 2025 at 05:50:59PM -0800, Jiaqi Yan wrote:
> > On Mon, Dec 22, 2025 at 9:14 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > >
> > > On Fri, Dec 19, 2025 at 06:33:45PM +0000, Jiaqi Yan wrote:
> > > > At the end of dissolve_free_hugetlb_folio that a free HugeTLB
> > > > folio becomes non-HugeTLB, it is released to buddy allocator
> > > > as a high-order folio, e.g. a folio that contains 262144 pages
> > > > if the folio was a 1G HugeTLB hugepage.
> > > >
> > > > This is problematic if the HugeTLB hugepage contained HWPoison
> > > > subpages. In that case, since buddy allocator does not check
> > > > HWPoison for non-zero-order folio, the raw HWPoison page can
> > > > be given out with its buddy page and be re-used by either
> > > > kernel or userspace.
> > > >
> > > > Memory failure recovery (MFR) in kernel does attempt to take
> > > > raw HWPoison page off buddy allocator after
> > > > dissolve_free_hugetlb_folio. However, there is always a time
> > > > window between dissolve_free_hugetlb_folio frees a HWPoison
> > > > high-order folio to buddy allocator and MFR takes HWPoison
> > > > raw page off buddy allocator.
> > > >
> > > > One obvious way to avoid this problem is to add page sanity
> > > > checks in page allocate or free path. However, it is against
> > > > the past efforts to reduce sanity check overhead [1,2,3].
> > > >
> > > > Introduce free_has_hwpoison_pages to only free the healthy
> > > > pages and excludes the HWPoison ones in the high-order folio.
> > > > The idea is to iterate through the sub-pages of the folio to
> > > > identify contiguous ranges of healthy pages. Instead of freeing
> > > > pages one by one, decompose healthy ranges into the largest
> > > > possible blocks. Each block meets the requirements to be freed
> > > > to buddy allocator (__free_frozen_pages).
> > > >
> > > > free_has_hwpoison_pages has linear time complexity O(N) wrt the
> > > > number of pages in the folio. While the power-of-two decomposition
> > > > ensures that the number of calls to the buddy allocator is
> > > > logarithmic for each contiguous healthy range, the mandatory
> > > > linear scan of pages to identify PageHWPoison defines the
> > > > overall time complexity.
> > >
> > > Hi Jiaqi, thanks for the patch!
> >
> > Thanks for your review/comments!
> >
> > >
> > > Have you tried measuring the latency of free_has_hwpoison_pages() when
> > > a few pages in a 1GB folio are hwpoisoned?
> > >
> > > Just wanted to make sure we don't introduce a possible soft lockup...
> > > Or am I worrying too much?
> >
> > In my local tests, freeing a 1GB folio with 1 / 3 / 8 HWPoison pages,
> > I never run into a soft lockup. The 8-HWPoison-page case takes more
> > time than other cases, meaning that handling the additional HWPoison
> > page adds to the time cost.
> >
> > After adding some instrument code, 10 sample runs of
> > free_has_hwpoison_pages with 8 HWPoison pages:
> > - observed mean is 7.03 ms (5.97 ms when 3 HWPoison pages)
> > - observed standard deviation is 0.76 ms (0.18 ms when 3 HWPoison pages)
> >
> > In comparison, freeing a 1G folio without any HWPoison pages 10 times
> > (with same kernel config):
> > - observed mean is 3.39 ms
> > - observed standard deviation is 0.16ms
>
> Thanks for the measurement!
>
> > So it's around twice the baseline. It should be far from triggering a
> > soft lockup, and the cost seems fair for handling exceptional hardware
> > memory errors.
>
> Yeah it looks fine to me.
>
> > I can add these measurements in future revisions.
>
> That would be nice, thanks.
>
> > > > [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> > > > [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> > > > [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> > > >
> > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > > > ---
> > > > mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
> > > > 1 file changed, 101 insertions(+)
> > > >
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 822e05f1a9646..20c8862ce594e 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -2976,8 +2976,109 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
> > > > }
> > > > }
> > > >
> > > > +static void prepare_compound_page_to_free(struct page *new_head,
> > > > + unsigned int order,
> > > > + unsigned long flags)
> > > > +{
> > > > + new_head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> > > > + new_head->mapping = NULL;
> > > > + new_head->private = 0;
> > > > +
> > > > + clear_compound_head(new_head);
> > > > + if (order)
> > > > + prep_compound_page(new_head, order);
> > > > +}
> > >
> > > Not sure why it's building compound pages, just to decompose them
> > > when freeing via __free_frozen_pages()?
> >
> > prepare_compound_page_to_free() borrowed the idea from
> > __split_folio_to_order(). Conceptually the original folio is split
> > into new compound pages with different orders;
>
> I see, and per the previous discussion we don't want to split it
> to 262,144 4K pages in the future, anyway...
>
> > here this is done on
> > the fly in free_contiguous_pages() when order is decided.
> >
> > > > +/*
> > > > + * Given a range of pages physically contiguous physical, efficiently
> > > > + * free them in blocks that meet __free_frozen_pages's requirements.
> > > > + */
> > > > +static void free_contiguous_pages(struct page *curr, struct page *next,
> > > > + unsigned long flags)
> > > > +{
> > > > + unsigned int order;
> > > > + unsigned int align_order;
> > > > + unsigned int size_order;
> > > > + unsigned long pfn;
> > > > + unsigned long end_pfn = page_to_pfn(next);
> > > > + unsigned long remaining;
> > > > +
> > > > + /*
> > > > + * This decomposition algorithm at every iteration chooses the
> > > > + * order to be the minimum of two constraints:
> > > > + * - Alignment: the largest power-of-two that divides current pfn.
> > > > + * - Size: the largest power-of-two that fits in the
> > > > + * current remaining number of pages.
> > > > + */
> > > > + while (curr < next) {
> > > > + pfn = page_to_pfn(curr);
> > > > + remaining = end_pfn - pfn;
> > > > +
> > > > + align_order = ffs(pfn) - 1;
> > > > + size_order = fls_long(remaining) - 1;
> > > > + order = min(align_order, size_order);
> > > > +
> > > > + prepare_compound_page_to_free(curr, order, flags);
> > > > + __free_frozen_pages(curr, order, FPI_NONE);
> > > > + curr += (1UL << order);
> > > > + }
> > > > +
> > > > + VM_WARN_ON(curr != next);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Given a high-order compound page containing certain number of HWPoison
> > > > + * pages, free only the healthy ones to buddy allocator.
> > > > + *
> > > > + * It calls __free_frozen_pages O(2^order) times and cause nontrivial
> > > > + * overhead. So only use this when compound page really contains HWPoison.
> > > > + *
> > > > + * This implementation doesn't work in memdesc world.
> > > > + */
> > > > +static void free_has_hwpoison_pages(struct page *page, unsigned int order)
> > > > +{
> > > > + struct page *curr = page;
> > > > + struct page *end = page + (1 << order);
> > > > + struct page *next;
> > > > + unsigned long flags = page->flags.f;
> > > > + unsigned long nr_pages;
> > > > + unsigned long total_freed = 0;
> > > > + unsigned long total_hwp = 0;
> > > > +
> > > > + VM_WARN_ON(flags & PAGE_FLAGS_CHECK_AT_FREE);
> > > > +
> > > > + while (curr < end) {
> > > > + next = curr;
> > > > + nr_pages = 0;
> > > > +
> > > > + while (next < end && !PageHWPoison(next)) {
> > > > + ++next;
> > > > + ++nr_pages;
> > > > + }
> > > > +
> > > > + if (PageHWPoison(next))
> > > > + ++total_hwp;
> > > > +
> > > > + free_contiguous_pages(curr, next, flags);
> > >
> > > page_owner, memory profiling (anything else?) will be confused
> > > because it was allocated as a larger size, but we're freeing only
> > > some portion of it.
> >
> > I am not sure, but looking at __split_unmapped_folio, it calls
> > pgalloc_tag_split(folio, old_order, split_order) when splitting an
> > old_order-order folio into a new split_order.
> >
> > Maybe prepare_compound_page_to_free really should
> > update_page_tag_ref(), I need to take a closer look at this with
> > CONFIG_MEM_ALLOC_PROFILING (not something I usually enable).
> >
> > > Perhaps we need to run some portion of this code snippet
> > > (from free_pages_prepare()), before freeing portions of it:
> > >
> > > page_cpupid_reset_last(page);
> > > page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > > reset_page_owner(page, order);
> > > page_table_check_free(page, order);
> > > pgalloc_tag_sub(page, 1 << order);
> >
> > Since they come from free_pages_prepare, I believe these lines are
> > already executed via free_contiguous_pages()=> __free_frozen_pages()=>
> > free_pages_prepare(), right? Or am I missing something?
>
> But they're called with order that is smaller than the original order.
> That's could be problematic; for example, memory profiling stores metadata
> only on the first page. If you pass anything other than the first page
> to free_pages_prepare(), it will not recognize that metadata was stored
> during allocation.
>
Right, with MEM_ALLOC_PROFILING enabled, I ran into the following
WARNING when freeing all blocks except the 1st one (which contains the
original head page):
[ 2101.713669] ------------[ cut here ]------------
[ 2101.713670] alloc_tag was not set
[ 2101.713671] WARNING: ./include/linux/alloc_tag.h:164 at
__pgalloc_tag_sub+0xdf/0x160, CPU#18: hugetlb-mfr-3pa/33675
[ 2101.713693] CPU: 18 UID: 0 PID: 33675 Comm: hugetlb-mfr-3pa
Tainted: G S W O 6.19.0-smp-DEV #2 NONE
[ 2101.713698] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN, [O]=OOT_MODULE
[ 2101.713702] RIP: 0010:__pgalloc_tag_sub+0xdf/0x160
...
[ 2101.713723] Call Trace:
[ 2101.713725] <TASK>
[ 2101.713727] free_has_hwpoison_pages+0xbc/0x370
[ 2101.713731] free_frozen_pages+0xb3/0x100
[ 2101.713733] __folio_put+0xd5/0x100
[ 2101.713739] dissolve_free_hugetlb_folio+0x17f/0x1a0
[ 2101.713743] filemap_offline_hwpoison_folio+0x193/0x4c0
[ 2101.713747] ? __pfx_workingset_update_node+0x10/0x10
[ 2101.713751] remove_inode_hugepages+0x209/0x690
[ 2101.713757] ? on_each_cpu_cond_mask+0x1a/0x20
[ 2101.713760] ? __cond_resched+0x23/0x60
[ 2101.713768] ? n_tty_write+0x4c7/0x500
[ 2101.713773] hugetlbfs_setattr+0x127/0x170
[ 2101.713776] notify_change+0x32e/0x390
[ 2101.713781] do_ftruncate+0x12c/0x1a0
[ 2101.713786] __x64_sys_ftruncate+0x3e/0x70
[ 2101.713789] do_syscall_64+0x6f/0x890
[ 2101.713792] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 2101.713811] </TASK>
[ 2101.713812] ---[ end trace 0000000000000000 ]---
This is because in free_pages_prepare(), pgalloc_tag_sub() found there
was no alloc tag on the compound page being freed.
> In general, I think they're not designed to handle cases where
> the allocation order and the free order differ (unless we split
> metadata like __split_unmapped_folio() does).
I believe the proper way to fix this is to do something similar to
pgalloc_tag_split(), used by __split_unmapped_folio(). When we split a
new block from the original folio, we create a compound page from the
block (just the way prep_compound_page_to_free does) and link the
alloc tag of the original head page to the head of the new compound
page. Something like the copy_alloc_tag() below (to be added in v3)
demonstrates my idea, assuming tag = pgalloc_tag_get(original head page):
+/*
+ * Point page's alloc tag to an existing one.
+ */
+static void copy_alloc_tag(struct page *page, struct alloc_tag *tag)
+{
+ union pgtag_ref_handle handle;
+ union codetag_ref ref;
+ unsigned long pfn = page_to_pfn(page);
+
+ if (!mem_alloc_profiling_enabled())
+ return;
+
+ /* tag is NULL if HugeTLB page is allocated in boot process. */
+ if (!tag)
+ return;
+
+ if (!get_page_tag_ref(page, &ref, &handle))
+ return;
+
+ /* Avoid overriding existing alloc tag from page. */
+ if (!ref.ct || is_codetag_empty(&ref)) {
+ alloc_tag_ref_set(&ref, tag);
+ update_page_tag_ref(handle, &ref);
+ }
+ put_page_tag_ref(handle);
+}
+
+static void prep_compound_page_to_free(struct page *head, unsigned int order,
+                                      unsigned long flags, struct alloc_tag *tag)
+{
+ head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
+ head->mapping = NULL;
+ head->private = 0;
+
+ clear_compound_head(head);
+ if (order)
+ prep_compound_page(head, order);
+
+ copy_alloc_tag(head, tag);
+}
I tested that the WARNING from include/linux/alloc_tag.h:164 is now
gone for both 2M and 1G pages. BTW we also need to copy_alloc_tag()
for HWPoison pages before pgalloc_tag_sub().
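Roughly, for the HWPoison page we leave behind (untested sketch that
combines this with the free_pages_prepare() snippet Miaohe quoted
earlier; tag and next are the variables from the sketches above):

	if (next < end) {
		/* give the HWPoison page its share of the alloc tag first */
		copy_alloc_tag(next, tag);
		/* mirror the order-0 HWPoison handling in free_pages_prepare() */
		reset_page_owner(next, 0);
		page_table_check_free(next, 0);
		pgalloc_tag_sub(next, 1);
		/*
		 * Mark the codetag as empty so a later free via
		 * unpoison_memory() doesn't trip accounting again.
		 */
		clear_page_tag_ref(next);
	}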
>
> > > > + total_freed += nr_pages;
> > > > + curr = PageHWPoison(next) ? next + 1 : next;
> > > > + }
> > > > +
> > > > + pr_info("Excluded %lu hwpoison pages from folio\n", total_hwp);
> > > > + pr_info("Freed %#lx pages from folio\n", total_freed);
> > > > +}
> > > > +
> > > > void free_frozen_pages(struct page *page, unsigned int order)
> > > > {
> > > > + struct folio *folio = page_folio(page);
> > > > +
> > > > + if (order > 0 && unlikely(folio_test_has_hwpoisoned(folio))) {
> > > > + folio_clear_has_hwpoisoned(folio);
> > > > + free_has_hwpoison_pages(page, order);
> > > > + return;
> > > > + }
> > > > +
> > >
> > > It feels like it's a bit random place to do has_hwpoisoned check.
> > > Can we move this to free_pages_prepare() where we have some
> > > sanity checks (and also order-0 hwpoison page handling)?
> >
> > While free_pages_prepare() seems to be a better place to do the
> > has_hwpoisoned check, it is not a good place to do
> > free_has_hwpoison_pages().
>
> Why is it not a good place for free_has_hwpoison_pages()?
>
> Callers of free_pages_prepare() are supposed to avoid freeing it back to
> the buddy or using the page when it returns false.
What I mean is, callers of free_pages_prepare() wouldn't know from the
single false return value whether 1) they should completely bail out
or 2) they should retry with free_has_hwpoison_pages. So for now I
think it'd be better that free_frozen_pages does the check and acts upon
the check result. Not sure if there is a better place, or whether it's
worth changing free_pages_prepare()'s return type?
>
> ...except compaction_free(), which I don't have much idea what it's
> doing.
>
> > > > __free_frozen_pages(page, order, FPI_NONE);
> > > > }
>
> --
> Cheers,
> Harry / Hyeonggon
On Tue, Dec 30, 2025 at 04:19:50PM -0800, Jiaqi Yan wrote:
> On Sun, Dec 28, 2025 at 5:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > On Fri, Dec 26, 2025 at 05:50:59PM -0800, Jiaqi Yan wrote:
> > > On Mon, Dec 22, 2025 at 9:14 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > > On Fri, Dec 19, 2025 at 06:33:45PM +0000, Jiaqi Yan wrote:
> > > > > ---
> > > > > mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > 1 file changed, 101 insertions(+)
> > > > >
> > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > index 822e05f1a9646..20c8862ce594e 100644
> > > > > --- a/mm/page_alloc.c
> > > > > +++ b/mm/page_alloc.c
> > > > > +/*
> > > > > + * Given a high-order compound page containing certain number of HWPoison
> > > > > + * pages, free only the healthy ones to buddy allocator.
> > > > > + *
> > > > > + * It calls __free_frozen_pages O(2^order) times and cause nontrivial
> > > > > + * overhead. So only use this when compound page really contains HWPoison.
> > > > > + *
> > > > > + * This implementation doesn't work in memdesc world.
> > > > > + */
> > > > > +static void free_has_hwpoison_pages(struct page *page, unsigned int order)
> > > > > +{
> > > > > + struct page *curr = page;
> > > > > + struct page *end = page + (1 << order);
> > > > > + struct page *next;
> > > > > + unsigned long flags = page->flags.f;
> > > > > + unsigned long nr_pages;
> > > > > + unsigned long total_freed = 0;
> > > > > + unsigned long total_hwp = 0;
> > > > > +
> > > > > + VM_WARN_ON(flags & PAGE_FLAGS_CHECK_AT_FREE);
> > > > > +
> > > > > + while (curr < end) {
> > > > > + next = curr;
> > > > > + nr_pages = 0;
> > > > > +
> > > > > + while (next < end && !PageHWPoison(next)) {
> > > > > + ++next;
> > > > > + ++nr_pages;
> > > > > + }
> > > > > +
> > > > > + if (PageHWPoison(next))
> > > > > + ++total_hwp;
> > > > > +
> > > > > + free_contiguous_pages(curr, next, flags);
> > > >
> > > > page_owner, memory profiling (anything else?) will be confused
> > > > because it was allocated as a larger size, but we're freeing only
> > > > some portion of it.
> > >
> > > I am not sure, but looking at __split_unmapped_folio, it calls
> > > pgalloc_tag_split(folio, old_order, split_order) when splitting an
> > > old_order-order folio into a new split_order.
> > >
> > > Maybe prepare_compound_page_to_free really should
> > > update_page_tag_ref(), I need to take a closer look at this with
> > > CONFIG_MEM_ALLOC_PROFILING (not something I usually enable).
> > >
> > > > Perhaps we need to run some portion of this code snippet
> > > > (from free_pages_prepare()), before freeing portions of it:
> > > >
> > > > page_cpupid_reset_last(page);
> > > > page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > > > reset_page_owner(page, order);
> > > > page_table_check_free(page, order);
> > > > pgalloc_tag_sub(page, 1 << order);
> > >
> > > Since they come from free_pages_prepare, I believe these lines are
> > > already executed via free_contiguous_pages()=> __free_frozen_pages()=>
> > > free_pages_prepare(), right? Or am I missing something?
> >
> > But they're called with order that is smaller than the original order.
> > That could be problematic; for example, memory profiling stores metadata
> > only on the first page. If you pass anything other than the first page
> > to free_pages_prepare(), it will not recognize that metadata was stored
> > during allocation.
> >
>
> Right, with MEM_ALLOC_PROFILING enabled, I ran into the following
> WARNING when freeing all blocks except the 1st one (which contains the
> original head page):
>
> [ 2101.713669] ------------[ cut here ]------------
> [ 2101.713670] alloc_tag was not set
> [ 2101.713671] WARNING: ./include/linux/alloc_tag.h:164 at
> __pgalloc_tag_sub+0xdf/0x160, CPU#18: hugetlb-mfr-3pa/33675
> [ 2101.713693] CPU: 18 UID: 0 PID: 33675 Comm: hugetlb-mfr-3pa
> Tainted: G S W O 6.19.0-smp-DEV #2 NONE
> [ 2101.713698] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN, [O]=OOT_MODULE
> [ 2101.713702] RIP: 0010:__pgalloc_tag_sub+0xdf/0x160
> ...
> [ 2101.713723] Call Trace:
> [ 2101.713725] <TASK>
> [ 2101.713727] free_has_hwpoison_pages+0xbc/0x370
> [ 2101.713731] free_frozen_pages+0xb3/0x100
> [ 2101.713733] __folio_put+0xd5/0x100
> [ 2101.713739] dissolve_free_hugetlb_folio+0x17f/0x1a0
> [ 2101.713743] filemap_offline_hwpoison_folio+0x193/0x4c0
> [ 2101.713747] ? __pfx_workingset_update_node+0x10/0x10
> [ 2101.713751] remove_inode_hugepages+0x209/0x690
> [ 2101.713757] ? on_each_cpu_cond_mask+0x1a/0x20
> [ 2101.713760] ? __cond_resched+0x23/0x60
> [ 2101.713768] ? n_tty_write+0x4c7/0x500
> [ 2101.713773] hugetlbfs_setattr+0x127/0x170
> [ 2101.713776] notify_change+0x32e/0x390
> [ 2101.713781] do_ftruncate+0x12c/0x1a0
> [ 2101.713786] __x64_sys_ftruncate+0x3e/0x70
> [ 2101.713789] do_syscall_64+0x6f/0x890
> [ 2101.713792] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 2101.713811] </TASK>
> [ 2101.713812] ---[ end trace 0000000000000000 ]---
>
> This is because in free_pages_prepare(), pgalloc_tag_sub() found there
> is no alloc tag on the compound page being freeing.
>
> > In general, I think they're not designed to handle cases where
> > the allocation order and the free order differ (unless we split
> > metadata like __split_unmapped_folio() does).
>
> I believe the proper way to fix this is to do something similar to
> pgalloc_tag_split(), used by __split_unmapped_folio().
Yeah, that will work.
The difference is that pgalloc_tag_split(), split_page_owner(), and
split_page_memcg() only support splitting the whole page into many (> 1)
smaller pieces, but we're trying to create only one smaller page from the
original page.
> When we split a
> new block from the original folio, we create a compound page from the
> block (just the way prep_compound_page_to_free does) and link the
> alloc tag of the original head page to the head of the new compound
> page.
>
> Something like copy_alloc_tag (to be added in v3) below to demo
> my idea, assuming tag=pgalloc_tag_get(original head page):
>
> +/*
> + * Point page's alloc tag to an existing one.
> + */
> +static void copy_alloc_tag(struct page *page, struct alloc_tag *tag)
This should be defined in lib/alloc_tag.c
> +{
> + union pgtag_ref_handle handle;
> + union codetag_ref ref;
> + unsigned long pfn = page_to_pfn(page);
> +
> + if (!mem_alloc_profiling_enabled())
> + return;
> +
> + /* tag is NULL if HugeTLB page is allocated in boot process. */
> + if (!tag)
> + return;
> +
> + if (!get_page_tag_ref(page, &ref, &handle))
> + return;
> +
> + /* Avoid overriding existing alloc tag from page. */
> + if (!ref.ct || is_codetag_empty(&ref)) {
Is it an error if the page already has a valid tag?
> + alloc_tag_ref_set(&ref, tag);
> + update_page_tag_ref(handle, &ref);
> + }
> + put_page_tag_ref(handle);
> +}
> +
> +static void prep_compound_page_to_free(struct page *head, unsigned int order,
> + unsigned long flags, struct
> alloc_tag *tag)
> +{
> + head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> + head->mapping = NULL;
> + head->private = 0;
> +
> + clear_compound_head(head);
> + if (order)
> + prep_compound_page(head, order);
> +
> + copy_alloc_tag(head, tag);
Do we need a similar thing for memcg? Probably not, since it should have
been uncharged before freeing (as long as we're not using it for kmem pages).
Perhaps a comment on why we don't need to split memcg will be enough.
Do we need a similar thing for page_owner? I think yes, although it won't
crash or cause a warning, it will be inconsistent if we split page_owner in
split_page()/__split_unmapped_folio()/etc but not in
prep_compound_page_to_free().
> +}
>
> I tested now the WARNING from include/linux/alloc_tag.h:164 is gone
> for both 2M and 1G pages. BTW we also need to copy_alloc_tag() for
> HWPoison pages before pgalloc_tag_sub().
>
> > > > > + total_freed += nr_pages;
> > > > > + curr = PageHWPoison(next) ? next + 1 : next;
> > > > > + }
> > > > > +
> > > > > + pr_info("Excluded %lu hwpoison pages from folio\n", total_hwp);
> > > > > + pr_info("Freed %#lx pages from folio\n", total_freed);
> > > > > +}
> > > > > +
> > > > > void free_frozen_pages(struct page *page, unsigned int order)
> > > > > {
> > > > > + struct folio *folio = page_folio(page);
> > > > > +
> > > > > + if (order > 0 && unlikely(folio_test_has_hwpoisoned(folio))) {
> > > > > + folio_clear_has_hwpoisoned(folio);
> > > > > + free_has_hwpoison_pages(page, order);
> > > > > + return;
> > > > > + }
> > > > > +
> > > >
> > > > It feels like it's a bit random place to do has_hwpoisoned check.
> > > > Can we move this to free_pages_prepare() where we have some
> > > > sanity checks (and also order-0 hwpoison page handling)?
> > >
> > > While free_pages_prepare() seems to be a better place to do the
> > > has_hwpoisoned check, it is not a good place to do
> > > free_has_hwpoison_pages().
> >
> > Why is it not a good place for free_has_hwpoison_pages()?
> >
> > Callers of free_pages_prepare() are supposed to avoid freeing it back to
> > the buddy or using the page when it returns false.
>
> What I mean is, callers of free_pages_prepare() wouldn't know from the
> single false return value whether 1) they should completely bail out
> or 2) they should retry with free_has_hwpoison_pages.
I was thinking that once free_pages_prepare() returns false, it already
has done all the work to isolate HWPoison pages and freed healthy portions,
so the caller doesn't have to do anything and can bail out completely.
> So for now I
> think it'd better that free_frozen_pages does the check and act upon
> the check result. Not sure if there is a better place, or worthy to
> change free_pages_prepare()'s return type?
>
> > ...except compaction_free(), which I don't have much idea what it's
> > doing.
> >
> > > > > __free_frozen_pages(page, order, FPI_NONE);
> > > > > }
--
Cheers,
Harry / Hyeonggon
On Tue, Dec 30, 2025 at 8:37 PM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Tue, Dec 30, 2025 at 04:19:50PM -0800, Jiaqi Yan wrote:
> > On Sun, Dec 28, 2025 at 5:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > On Fri, Dec 26, 2025 at 05:50:59PM -0800, Jiaqi Yan wrote:
> > > > On Mon, Dec 22, 2025 at 9:14 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > > > On Fri, Dec 19, 2025 at 06:33:45PM +0000, Jiaqi Yan wrote:
> > > > > > ---
> > > > > > mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > 1 file changed, 101 insertions(+)
> > > > > >
> > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > > index 822e05f1a9646..20c8862ce594e 100644
> > > > > > --- a/mm/page_alloc.c
> > > > > > +++ b/mm/page_alloc.c
> > > > > > +/*
> > > > > > + * Given a high-order compound page containing certain number of HWPoison
> > > > > > + * pages, free only the healthy ones to buddy allocator.
> > > > > > + *
> > > > > > + * It calls __free_frozen_pages O(2^order) times and cause nontrivial
> > > > > > + * overhead. So only use this when compound page really contains HWPoison.
> > > > > > + *
> > > > > > + * This implementation doesn't work in memdesc world.
> > > > > > + */
> > > > > > +static void free_has_hwpoison_pages(struct page *page, unsigned int order)
> > > > > > +{
> > > > > > + struct page *curr = page;
> > > > > > + struct page *end = page + (1 << order);
> > > > > > + struct page *next;
> > > > > > + unsigned long flags = page->flags.f;
> > > > > > + unsigned long nr_pages;
> > > > > > + unsigned long total_freed = 0;
> > > > > > + unsigned long total_hwp = 0;
> > > > > > +
> > > > > > + VM_WARN_ON(flags & PAGE_FLAGS_CHECK_AT_FREE);
> > > > > > +
> > > > > > + while (curr < end) {
> > > > > > + next = curr;
> > > > > > + nr_pages = 0;
> > > > > > +
> > > > > > + while (next < end && !PageHWPoison(next)) {
> > > > > > + ++next;
> > > > > > + ++nr_pages;
> > > > > > + }
> > > > > > +
> > > > > > + if (PageHWPoison(next))
> > > > > > + ++total_hwp;
> > > > > > +
> > > > > > + free_contiguous_pages(curr, next, flags);
> > > > >
> > > > > page_owner, memory profiling (anything else?) will be confused
> > > > > because it was allocated as a larger size, but we're freeing only
> > > > > some portion of it.
> > > >
> > > > I am not sure, but looking at __split_unmapped_folio, it calls
> > > > pgalloc_tag_split(folio, old_order, split_order) when splitting an
> > > > old_order-order folio into a new split_order.
> > > >
> > > > Maybe prepare_compound_page_to_free really should
> > > > update_page_tag_ref(), I need to take a closer look at this with
> > > > CONFIG_MEM_ALLOC_PROFILING (not something I usually enable).
> > > >
> > > > > Perhaps we need to run some portion of this code snippet
> > > > > (from free_pages_prepare()), before freeing portions of it:
> > > > >
> > > > > page_cpupid_reset_last(page);
> > > > > page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > > > > reset_page_owner(page, order);
> > > > > page_table_check_free(page, order);
> > > > > pgalloc_tag_sub(page, 1 << order);
> > > >
> > > > Since they come from free_pages_prepare, I believe these lines are
> > > > already executed via free_contiguous_pages()=> __free_frozen_pages()=>
> > > > free_pages_prepare(), right? Or am I missing something?
> > >
> > > But they're called with order that is smaller than the original order.
> > > That could be problematic; for example, memory profiling stores metadata
> > > only on the first page. If you pass anything other than the first page
> > > to free_pages_prepare(), it will not recognize that metadata was stored
> > > during allocation.
> > >
> >
> > Right, with MEM_ALLOC_PROFILING enabled, I ran into the following
> > WARNING when freeing all blocks except the 1st one (which contains the
> > original head page):
> >
> > [ 2101.713669] ------------[ cut here ]------------
> > [ 2101.713670] alloc_tag was not set
> > [ 2101.713671] WARNING: ./include/linux/alloc_tag.h:164 at
> > __pgalloc_tag_sub+0xdf/0x160, CPU#18: hugetlb-mfr-3pa/33675
> > [ 2101.713693] CPU: 18 UID: 0 PID: 33675 Comm: hugetlb-mfr-3pa
> > Tainted: G S W O 6.19.0-smp-DEV #2 NONE
> > [ 2101.713698] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN, [O]=OOT_MODULE
> > [ 2101.713702] RIP: 0010:__pgalloc_tag_sub+0xdf/0x160
> > ...
> > [ 2101.713723] Call Trace:
> > [ 2101.713725] <TASK>
> > [ 2101.713727] free_has_hwpoison_pages+0xbc/0x370
> > [ 2101.713731] free_frozen_pages+0xb3/0x100
> > [ 2101.713733] __folio_put+0xd5/0x100
> > [ 2101.713739] dissolve_free_hugetlb_folio+0x17f/0x1a0
> > [ 2101.713743] filemap_offline_hwpoison_folio+0x193/0x4c0
> > [ 2101.713747] ? __pfx_workingset_update_node+0x10/0x10
> > [ 2101.713751] remove_inode_hugepages+0x209/0x690
> > [ 2101.713757] ? on_each_cpu_cond_mask+0x1a/0x20
> > [ 2101.713760] ? __cond_resched+0x23/0x60
> > [ 2101.713768] ? n_tty_write+0x4c7/0x500
> > [ 2101.713773] hugetlbfs_setattr+0x127/0x170
> > [ 2101.713776] notify_change+0x32e/0x390
> > [ 2101.713781] do_ftruncate+0x12c/0x1a0
> > [ 2101.713786] __x64_sys_ftruncate+0x3e/0x70
> > [ 2101.713789] do_syscall_64+0x6f/0x890
> > [ 2101.713792] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 2101.713811] </TASK>
> > [ 2101.713812] ---[ end trace 0000000000000000 ]---
> >
> > This is because in free_pages_prepare(), pgalloc_tag_sub() found there
> > is no alloc tag on the compound page being freeing.
> >
> > > In general, I think they're not designed to handle cases where
> > > the allocation order and the free order differ (unless we split
> > > metadata like __split_unmapped_folio() does).
> >
> > I believe the proper way to fix this is to do something similar to
> > pgalloc_tag_split(), used by __split_unmapped_folio().
>
> Yeah, that will work.
>
> The difference is that pgalloc_tag_split(), split_page_owner(), and
> split_page_memcg() only support splitting the whole page into many (> 1)
> smaller pieces, but we're trying to create only one smaller page from the
> original page.
>
> > When we split a
> > new block from the original folio, we create a compound page from the
> > block (just the way prep_compound_page_to_free does) and link the
> > alloc tag of the original head page to the head of the new compound
> > page.
> >
> > Something like copy_alloc_tag (to be added in v3) below to demo
> > my idea, assuming tag=pgalloc_tag_get(original head page):
Actually I had another idea for page metadata like page alloc tag and
page owner - how about we do the special handling AFTER
free_pages_prepare()?
Looking at free_pages_prepare(), it actually breaks down the compound page
(if it is compound, using free_tail_page_prepare()), adjusts page flags,
and deals with the page metadata:
for (i = 1; i < (1 << order); i++) {
if (compound)
bad += free_tail_page_prepare(page, page + i);
...
page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
reset_page_owner(page, order);
page_table_check_free(page, order);
pgalloc_tag_sub(page, 1 << order);
Even if the original compound page has some HWPoison pages, these
operations should be done in the same way as when no HWPoison page exists.
The compound structure will be gone after free, the alloc tag will be
reduced after free, the page owner will be reset after free. So
instead of splitting this metadata into temporary compound pages, why
not get rid of it in the first place with free_pages_prepare()?
This is unique to the freeing path, so we don't need to mimic
everything __split_unmapped_folio() does.
At high level, applying the idea to __free_pages_ok (or
__free_frozen_pages) looks like this:
+static bool compound_has_hwpoisoned(struct page *page, unsigned int order)
{
+ if (order == 0 || !PageCompound(page))
+ return false;
+
+ return folio_test_has_hwpoisoned(page_folio(page));
+}
static void __free_pages_ok(struct page *page, unsigned int order,
			    fpi_t fpi_flags)
{
unsigned long pfn = page_to_pfn(page);
struct zone *zone = page_zone(page);
+ bool should_fhh = compound_has_hwpoisoned(page, order);
- if (free_pages_prepare(page, order))
- free_one_page(zone, page, pfn, order, fpi_flags);
+ if (!free_pages_prepare(page, order))
+ return;
+
+ if (should_fhh) {
+ free_has_hwpoisoned(page, order, fpi_flags);
+ return;
+ }
+
+ free_one_page(zone, page, pfn, order, fpi_flags);
}
Under this idea, free_has_hwpoisoned() doesn't need
prepare_compound_page_to_free() anymore. It can just directly call
free_one_page() when it selects the right order.
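For illustration only (not the final v3 code), the decomposition loop for a
healthy range could then become roughly:

	while (curr < next) {
		pfn = page_to_pfn(curr);
		align_order = ffs(pfn) - 1;
		size_order = fls_long(page_to_pfn(next) - pfn) - 1;
		order = min(align_order, size_order);

		/*
		 * No prep_compound_page_to_free() here: free_pages_prepare()
		 * has already torn down the compound structure and the page
		 * metadata (alloc tag, page owner, ...).
		 */
		free_one_page(zone, curr, pfn, order, fpi_flags);
		curr += 1UL << order;
	}

with the same order selection as free_contiguous_pages() today.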
The only imperfect part is compound_has_hwpoisoned(): we need to decide
whether or not to call free_has_hwpoisoned() earlier than
free_pages_prepare(). free_pages_prepare() clears the PAGE_FLAGS_SECOND
flags on the first tail page of a compound, which clears
PG_has_hwpoisoned. On the other hand, we can't easily exclude
PG_has_hwpoisoned from PAGE_FLAGS_SECOND: because PG_has_hwpoisoned ==
PG_active, free_page_is_bad() will get confused and complain that the
first tail page is still active.
I have implemented a working prototype and it tests just fine. I can
send it out as v3 after cleaning it up.
> >
> > +/*
> > + * Point page's alloc tag to an existing one.
> > + */
> > +static void copy_alloc_tag(struct page *page, struct alloc_tag *tag)
>
> This should be defined in lib/alloc_tag.c
>
> > +{
> > + union pgtag_ref_handle handle;
> > + union codetag_ref ref;
> > + unsigned long pfn = page_to_pfn(page);
> > +
> > + if (!mem_alloc_profiling_enabled())
> > + return;
> > +
> > + /* tag is NULL if HugeTLB page is allocated in boot process. */
> > + if (!tag)
> > + return;
> > +
> > + if (!get_page_tag_ref(page, &ref, &handle))
> > + return;
> > +
> > + /* Avoid overriding existing alloc tag from page. */
> > + if (!ref.ct || is_codetag_empty(&ref)) {
>
> Is it an error if the page already has a valid tag?
Probably not important anymore, but I don't think it is an error. The
split block containing the original folio's head page will already have
a valid tag. We don't need to set + update the tag for this block;
otherwise alloc_tag_add_check() will WARN().
>
> > + alloc_tag_ref_set(&ref, tag);
> > + update_page_tag_ref(handle, &ref);
> > + }
> > + put_page_tag_ref(handle);
> > +}
> > +
> > +static void prep_compound_page_to_free(struct page *head, unsigned int order,
> > + unsigned long flags, struct
> > alloc_tag *tag)
> > +{
> > + head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> > + head->mapping = NULL;
> > + head->private = 0;
> > +
> > + clear_compound_head(head);
> > + if (order)
> > + prep_compound_page(head, order);
> > +
> > + copy_alloc_tag(head, tag);
>
> Do we need a similar thing for memcg? Probably not, since it should have
Yes, I think we don't need to do anything for memcg, it is already
uncharged in __folio_put().
> been uncharged before freeing (as long as we're not using it for kmem pages)
You mean the folio was charged to kernel memory, right? From my
reading of uncharge_folio(), it covers the kmem case.
> Perhaps a comment on why we don't need to split memcg will be enough.
>
> Do we need a similar thing for page_owner? I think yes, although it won't
> crash or cause a warning, it will be inconsistent if we split page_owner in
> split_page()/__split_unmapped_folio()/etc but not in
> prep_compound_page_to_free().
>
> > +}
> >
> > I tested now the WARNING from include/linux/alloc_tag.h:164 is gone
> > for both 2M and 1G pages. BTW we also need to copy_alloc_tag() for
> > HWPoison pages before pgalloc_tag_sub().
> >
> > > > > > + total_freed += nr_pages;
> > > > > > + curr = PageHWPoison(next) ? next + 1 : next;
> > > > > > + }
> > > > > > +
> > > > > > + pr_info("Excluded %lu hwpoison pages from folio\n", total_hwp);
> > > > > > + pr_info("Freed %#lx pages from folio\n", total_freed);
> > > > > > +}
> > > > > > +
> > > > > > void free_frozen_pages(struct page *page, unsigned int order)
> > > > > > {
> > > > > > + struct folio *folio = page_folio(page);
> > > > > > +
> > > > > > + if (order > 0 && unlikely(folio_test_has_hwpoisoned(folio))) {
> > > > > > + folio_clear_has_hwpoisoned(folio);
> > > > > > + free_has_hwpoison_pages(page, order);
> > > > > > + return;
> > > > > > + }
> > > > > > +
> > > > >
> > > > > It feels like it's a bit random place to do has_hwpoisoned check.
> > > > > Can we move this to free_pages_prepare() where we have some
> > > > > sanity checks (and also order-0 hwpoison page handling)?
> > > >
> > > > While free_pages_prepare() seems to be a better place to do the
> > > > has_hwpoisoned check, it is not a good place to do
> > > > free_has_hwpoison_pages().
> > >
> > > Why is it not a good place for free_has_hwpoison_pages()?
> > >
> > > Callers of free_pages_prepare() are supposed to avoid freeing it back to
> > > the buddy or using the page when it returns false.
> >
> > What I mean is, callers of free_pages_prepare() wouldn't know from the
> > single false return value whether 1) they should completely bail out
> > or 2) they should retry with free_has_hwpoison_pages.
>
> I was thinking that once free_pages_prepare() returns false, it already
> has done all the work to isolate HWPoison pages and freed healthy portions,
> so the caller doesn't have to do anything and can bail out completely.
Does this change the meaning of free_pages_prepare()'s return value?
My first take is that it returns whether the preparation + check work is
good. I don't know if doing the freeing work in free_pages_prepare() is a
good idea. But if it is fine, we can actually hide the ugly
compound_has_hwpoisoned() entirely inside free_pages_prepare():
- true means preparation + check work are all good, but the caller needs
  to free the prepared+checked pages itself
- false means one of the two:
  - some check failed and it is impossible to safely free the pages for
    callers
  - this is a compound page with some HWPoison pages, and the healthy
    pages were already freed safely
Doesn't seem very clean with a boolean return type...
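If we did change the return type, a purely illustrative shape (names made
up here) could be:

	enum free_prepare_result {
		FREE_PREPARE_OK,	/* checks passed, caller frees the pages */
		FREE_PREPARE_BAD,	/* a check failed, caller must not free */
		FREE_PREPARE_DONE,	/* healthy sub-pages already freed here */
	};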
>
> > So for now I
> > think it'd better that free_frozen_pages does the check and act upon
> > the check result. Not sure if there is a better place, or worthy to
> > change free_pages_prepare()'s return type?
> >
> > > ...except compaction_free(), which I don't have much idea what it's
> > > doing.
> > >
> > > > > > __free_frozen_pages(page, order, FPI_NONE);
> > > > > > }
>
> --
> Cheers,
> Harry / Hyeonggon
On Mon, Jan 05, 2026 at 07:40:00PM -0800, Jiaqi Yan wrote:
> On Tue, Dec 30, 2025 at 8:37 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > On Tue, Dec 30, 2025 at 04:19:50PM -0800, Jiaqi Yan wrote:
> > > On Sun, Dec 28, 2025 at 5:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > > On Fri, Dec 26, 2025 at 05:50:59PM -0800, Jiaqi Yan wrote:
> > > > > On Mon, Dec 22, 2025 at 9:14 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > > > > On Fri, Dec 19, 2025 at 06:33:45PM +0000, Jiaqi Yan wrote:
> > > > In general, I think they're not designed to handle cases where
> > > > the allocation order and the free order differ (unless we split
> > > > metadata like __split_unmapped_folio() does).
> > >
> > > I believe the proper way to fix this is to do something similar to
> > > pgalloc_tag_split(), used by __split_unmapped_folio().
> >
> > Yeah, that will work.
> >
> > The difference is that pgalloc_tag_split(), split_page_owner(), and
> > split_page_memcg() only support splitting the whole page into many (> 1)
> > smaller pieces, but we're trying to create only one smaller page from the
> > original page.
> >
> > > When we split a
> > > new block from the original folio, we create a compound page from the
> > > block (just the way prep_compound_page_to_free does) and link the
> > > alloc tag of the original head page to the head of the new compound
> > > page.
> > >
> > > Something like copy_alloc_tag (to be added in v3) below to demo
> > > my idea, assuming tag=pgalloc_tag_get(original head page):
>
> Actually I had another idea for page metadata like page alloc tag and
> page owner - how about we do the special handling AFTER
> free_pages_prepare()?
Sounds good to me.
> Looking at free_pages_prepare, it actually breaks down compound page
> (if it is compound, using free_tail_page_prepare), adjust page flags,
> and deals page metadata:
>
> for (i = 1; i < (1 << order); i++) {
> if (compound)
> bad += free_tail_page_prepare(page, page + i);
> ...
> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> reset_page_owner(page, order);
> page_table_check_free(page, order);
> pgalloc_tag_sub(page, 1 << order);
>
> Even the original compound page has some HWPoison pages, these
> operations should be done in the same way when no HWPoison page exist.
and we're not losing hwpoison flag after free_pages_prepare()...
that actually sounds fine.
> The compound structure will be gone after free, the alloc tag will be
> reduced after free, the page owner will be reset after free.
Right.
> So instead of splitting these stuff into temporary compound pages, why
> not getting rid of them in the first place with free_pages_prepare()?
>
> This is unique to the freeing path, so we don't need to mimic
> everything __split_unmapped_folio() does.
Not 100% sure how it'd work in the memdesc world,
but for today, right.
> At high level, applying the idea to __free_pages_ok (or
> __free_frozen_pages) looks like this:
>
> +static bool compound_has_hwpoisoned(struct page *page, unsigned int order)
> {
> + if (order == 0 || !PageCompound(page))
> + return false;
> +
> + return folio_test_has_hwpoisoned(page_folio(page));
> +}
>
> static void __free_pages_ok(struct page *page, unsigned int order,
> {
> unsigned long pfn = page_to_pfn(page);
> struct zone *zone = page_zone(page);
> + bool should_fhh = compound_has_hwpoisoned(page, order);
>
> - if (free_pages_prepare(page, order))
> - free_one_page(zone, page, pfn, order, fpi_flags);
> + if (!free_pages_prepare(page, order))
> + return;
> +
> + if (should_fhh) {
> + free_has_hwpoisoned(page, order, fpi_flags);
> + return;
> + }
> +
> + free_one_page(zone, page, pfn, order, fpi_flags);
> }
>
> Under this idea, free_has_hwpoisoned() doesn't need
> prepare_compound_page_to_free() anymore. It can just directly call
> free_one_page() when it selects the right order.
Right.
> The only imperfect part is compound_has_hwpoisoned(), that we need to
> make a call to whether free_has_hwpoisoned() or not ealier than
> free_pages_prepare().
Right.
> free_pages_prepare() clears PAGE_FLAGS_SECOND
> flags on the first tail page of a compound, which clears
> PG_has_hwpoisoned. On the other hand, we can't easily exclude
> PG_has_hwpoisoned from PAGE_FLAGS_SECOND. Because PG_has_hwpoisoned ==
> PG_active, free_page_is_bad() will confuse and complaint that the
> first tail page is still active.
Adding a comment like this wouldn't hurt:
/*
* Should be called before decomposing compound page in
* free_pages_prepare().
*/
bool should_fhh = compound_has_hwpoisoned(page, order);
> I have implemented a working prototype and tested just fine. I can
> send it out as v3 after clean it up.
>
> > > +static void prep_compound_page_to_free(struct page *head, unsigned int order,
> > > + unsigned long flags, struct
> > > alloc_tag *tag)
> > > +{
> > > + head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> > > + head->mapping = NULL;
> > > + head->private = 0;
> > > +
> > > + clear_compound_head(head);
> > > + if (order)
> > > + prep_compound_page(head, order);
> > > +
> > > + copy_alloc_tag(head, tag);
> >
> > Do we need a similar thing for memcg? Probably not, since it should have
>
> Yes, I think we don't need to do anything for memgc, it is already
> uncharged in __folio_put().
>
> > been uncharged before freeing (as long as we're not using it for kmem pages)
>
> You mean the folio was charged to kernel memory, right? From my
> reading of uncharge_folio(), it covers the kmem case.
Yes, but I think not all kmem pages go through uncharge_folio(), given
that free_pages_prepare() handles uncharging kmem pages.
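The part of free_pages_prepare() I'm thinking of is roughly (quoting from
memory, the exact form may differ):

	if (memcg_kmem_online() && PageMemcgKmem(page))
		__memcg_kmem_uncharge_page(page, order);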
> > Perhaps a comment on why we don't need to split memcg will be enough.
> >
> > Do we need a similar thing for page_owner? I think yes, although it won't
> > crash or cause a warning, it will be inconsistent if we split page_owner in
> > split_page()/__split_unmapped_folio()/etc but not in
> > prep_compound_page_to_free().
> >
> > > +}
> > >
> > > I tested now the WARNING from include/linux/alloc_tag.h:164 is gone
> > > for both 2M and 1G pages. BTW we also need to copy_alloc_tag() for
> > > HWPoison pages before pgalloc_tag_sub().
> > >
> > > > > > > + total_freed += nr_pages;
> > > > > > > + curr = PageHWPoison(next) ? next + 1 : next;
> > > > > > > + }
> > > > > > > +
> > > > > > > + pr_info("Excluded %lu hwpoison pages from folio\n", total_hwp);
> > > > > > > + pr_info("Freed %#lx pages from folio\n", total_freed);
> > > > > > > +}
> > > > > > > +
> > > > > > > void free_frozen_pages(struct page *page, unsigned int order)
> > > > > > > {
> > > > > > > + struct folio *folio = page_folio(page);
> > > > > > > +
> > > > > > > + if (order > 0 && unlikely(folio_test_has_hwpoisoned(folio))) {
> > > > > > > + folio_clear_has_hwpoisoned(folio);
> > > > > > > + free_has_hwpoison_pages(page, order);
> > > > > > > + return;
> > > > > > > + }
> > > > > > > +
> > > > > >
> > > > > > It feels like it's a bit random place to do has_hwpoisoned check.
> > > > > > Can we move this to free_pages_prepare() where we have some
> > > > > > sanity checks (and also order-0 hwpoison page handling)?
> > > > >
> > > > > While free_pages_prepare() seems to be a better place to do the
> > > > > has_hwpoisoned check, it is not a good place to do
> > > > > free_has_hwpoison_pages().
> > > >
> > > > Why is it not a good place for free_has_hwpoison_pages()?
> > > >
> > > > Callers of free_pages_prepare() are supposed to avoid freeing it back to
> > > > the buddy or using the page when it returns false.
> > >
> > > What I mean is, callers of free_pages_prepare() wouldn't know from the
> > > single false return value whether 1) they should completely bail out
> > > or 2) they should retry with free_has_hwpoison_pages.
> >
> > I was thinking that once free_pages_prepare() returns false, it already
> > has done all the work to isolate HWPoison pages and freed healthy portions,
> > so the caller doesn't have to do anything and can bail out completely.
>
> Does this change the meaning of free_pages_prepare()'s return value?
> My first take is it returns if the preparation+check works are good.
>
> I don't know if doing the freeing work in free_pages_prepare() is a good
> idea.
As long as it's documented in a comment, why not.
It doesn't make callers' code less readable. The semantic is the same;
callers should free the page only if free_pages_prepare() returns true.
They don't have to care why they should not free it.
(either it should not be freed back to buddy because it's a bad page,
or it had hwpoison page(s) and healthy portions are already freed)
> But if it is fine, we can actually hide the ugly
> compound_has_hwpoisoned() entirely inside free_pages_prepare():
> - true means preparation + check work are all good but caller needs to
> free prepared+checked pages itself
> - false means one of the two:
> - some check failed and it is impossible to safely free pages for callers
> - this is a compound page with some HWPoison pages, and healthy
> pages are already freed safely
Yes, something like this could be documented in free_pages_prepare().
> Doesn't seem very clean with a boolean return type...
The return type could be an enum type, but that wouldn't affect
the caller's behavior (bailing out) anyway?
No strong opinion from me on the type.
--
Cheers,
Harry / Hyeonggon
On Mon, Jan 5, 2026 at 11:35 PM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Mon, Jan 05, 2026 at 07:40:00PM -0800, Jiaqi Yan wrote:
> > On Tue, Dec 30, 2025 at 8:37 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > On Tue, Dec 30, 2025 at 04:19:50PM -0800, Jiaqi Yan wrote:
> > > > On Sun, Dec 28, 2025 at 5:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > > > On Fri, Dec 26, 2025 at 05:50:59PM -0800, Jiaqi Yan wrote:
> > > > > > On Mon, Dec 22, 2025 at 9:14 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > > > > > On Fri, Dec 19, 2025 at 06:33:45PM +0000, Jiaqi Yan wrote:
> > > > > In general, I think they're not designed to handle cases where
> > > > > the allocation order and the free order differ (unless we split
> > > > > metadata like __split_unmapped_folio() does).
> > > >
> > > > I believe the proper way to fix this is to do something similar to
> > > > pgalloc_tag_split(), used by __split_unmapped_folio().
> > >
> > > Yeah, that will work.
> > >
> > > The difference is that pgalloc_tag_split(), split_page_owner(), and
> > > split_page_memcg() only support splitting the whole page into many (> 1)
> > > smaller pieces, but we're trying to create only one smaller page from the
> > > original page.
> > >
> > > > When we split a
> > > > new block from the original folio, we create a compound page from the
> > > > block (just the way prep_compound_page_to_free does) and link the
> > > > alloc tag of the original head page to the head of the new compound
> > > > page.
> > > >
> > > > Something like copy_alloc_tag (to be added in v3) below to demo
> > > > my idea, assuming tag=pgalloc_tag_get(original head page):
> >
> > Actually I had another idea for page metadata like page alloc tag and
> > page owner - how about we do the special handling AFTER
> > free_pages_prepare()?
>
> Sounds good to me.
>
> > Looking at free_pages_prepare, it actually breaks down compound page
> > (if it is compound, using free_tail_page_prepare), adjust page flags,
> > and deals page metadata:
> >
> > for (i = 1; i < (1 << order); i++) {
> > if (compound)
> > bad += free_tail_page_prepare(page, page + i);
> > ...
> > page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > reset_page_owner(page, order);
> > page_table_check_free(page, order);
> > pgalloc_tag_sub(page, 1 << order);
> >
> > Even the original compound page has some HWPoison pages, these
> > operations should be done in the same way when no HWPoison page exist.
>
> and we're not losing hwpoison flag after free_pages_prepare()...
> that actually sounds fine.
>
> > The compound structure will be gone after free, the alloc tag will be
> > reduced after free, the page owner will be reset after free.
>
> Right.
>
> > So instead of splitting these stuff into temporary compound pages, why
> > not getting rid of them in the first place with free_pages_prepare()?
> >
> > This is unique to the freeing path, so we don't need to mimic
> > everything __split_unmapped_folio() does.
>
> Not 100% sure how it'd work in the memdesc world,
> but for today, right.
>
> > At high level, applying the idea to __free_pages_ok (or
> > __free_frozen_pages) looks like this:
> >
> > +static bool compound_has_hwpoisoned(struct page *page, unsigned int order)
> > {
> > + if (order == 0 || !PageCompound(page))
> > + return false;
> > +
> > + return folio_test_has_hwpoisoned(page_folio(page));
> > +}
> >
> > static void __free_pages_ok(struct page *page, unsigned int order,
> > {
> > unsigned long pfn = page_to_pfn(page);
> > struct zone *zone = page_zone(page);
> > + bool should_fhh = compound_has_hwpoisoned(page, order);
> >
> > - if (free_pages_prepare(page, order))
> > - free_one_page(zone, page, pfn, order, fpi_flags);
> > + if (!free_pages_prepare(page, order))
> > + return;
> > +
> > + if (should_fhh) {
> > + free_has_hwpoisoned(page, order, fpi_flags);
> > + return;
> > + }
> > +
> > + free_one_page(zone, page, pfn, order, fpi_flags);
> > }
> >
> > Under this idea, free_has_hwpoisoned() doesn't need
> > prepare_compound_page_to_free() anymore. It can just directly call
> > free_one_page() when it selects the right order.
>
> Right.
>
> > The only imperfect part is compound_has_hwpoisoned(), that we need to
> > make a call to whether free_has_hwpoisoned() or not ealier than
> > free_pages_prepare().
>
> Right.
>
> > free_pages_prepare() clears PAGE_FLAGS_SECOND
> > flags on the first tail page of a compound, which clears
> > PG_has_hwpoisoned. On the other hand, we can't easily exclude
> > PG_has_hwpoisoned from PAGE_FLAGS_SECOND. Because PG_has_hwpoisoned ==
> > PG_active, free_page_is_bad() will confuse and complaint that the
> > first tail page is still active.
>
> Adding a comment like this wouldn't hurt:
ack.
>
> /*
> * Should be called before decomposing compound page in
> * free_pages_prepare().
> */
> bool should_fhh = compound_has_hwpoisoned(page, order);
>
> > I have implemented a working prototype and tested just fine. I can
> > send it out as v3 after clean it up.
> >
> > > > +static void prep_compound_page_to_free(struct page *head, unsigned int order,
> > > > + unsigned long flags, struct
> > > > alloc_tag *tag)
> > > > +{
> > > > + head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> > > > + head->mapping = NULL;
> > > > + head->private = 0;
> > > > +
> > > > + clear_compound_head(head);
> > > > + if (order)
> > > > + prep_compound_page(head, order);
> > > > +
> > > > + copy_alloc_tag(head, tag);
> > >
> > > Do we need a similar thing for memcg? Probably not, since it should have
> >
> > Yes, I think we don't need to do anything for memgc, it is already
> > uncharged in __folio_put().
> >
> > > been uncharged before freeing (as long as we're not using it for kmem pages)
> >
> > You mean the folio was charged to kernel memory, right? From my
> > reading of uncharge_folio(), it covers the kmem case.
>
> Yes but I think not all kmem pages go through uncharge_folio() given
> free_pages_prepare() handles uncharging kmem pages.
I see. Then doing free_has_hwpoisoned() *after* free_pages_prepare()
also addresses this concern, right?
>
> > > Perhaps a comment on why we don't need to split memcg will be enough.
> > >
> > > Do we need a similar thing for page_owner? I think yes, although it won't
> > > crash or cause a warning, it will be inconsistent if we split page_owner in
> > > split_page()/__split_unmapped_folio()/etc but not in
> > > prep_compound_page_to_free().
> > >
> > > > +}
> > > >
> > > > I tested now the WARNING from include/linux/alloc_tag.h:164 is gone
> > > > for both 2M and 1G pages. BTW we also need to copy_alloc_tag() for
> > > > HWPoison pages before pgalloc_tag_sub().
> > > >
> > > > > > > > + total_freed += nr_pages;
> > > > > > > > + curr = PageHWPoison(next) ? next + 1 : next;
> > > > > > > > + }
> > > > > > > > +
> > > > > > > > + pr_info("Excluded %lu hwpoison pages from folio\n", total_hwp);
> > > > > > > > + pr_info("Freed %#lx pages from folio\n", total_freed);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > void free_frozen_pages(struct page *page, unsigned int order)
> > > > > > > > {
> > > > > > > > + struct folio *folio = page_folio(page);
> > > > > > > > +
> > > > > > > > + if (order > 0 && unlikely(folio_test_has_hwpoisoned(folio))) {
> > > > > > > > + folio_clear_has_hwpoisoned(folio);
> > > > > > > > + free_has_hwpoison_pages(page, order);
> > > > > > > > + return;
> > > > > > > > + }
> > > > > > > > +
> > > > > > >
> > > > > > > It feels like it's a bit random place to do has_hwpoisoned check.
> > > > > > > Can we move this to free_pages_prepare() where we have some
> > > > > > > sanity checks (and also order-0 hwpoison page handling)?
> > > > > >
> > > > > > While free_pages_prepare() seems to be a better place to do the
> > > > > > has_hwpoisoned check, it is not a good place to do
> > > > > > free_has_hwpoison_pages().
> > > > >
> > > > > Why is it not a good place for free_has_hwpoison_pages()?
> > > > >
> > > > > Callers of free_pages_prepare() are supposed to avoid freeing it back to
> > > > > the buddy or using the page when it returns false.
> > > >
> > > > What I mean is, callers of free_pages_prepare() wouldn't know from the
> > > > single false return value whether 1) they should completely bail out
> > > > or 2) they should retry with free_has_hwpoison_pages.
> > >
> > > I was thinking that once free_pages_prepare() returns false, it already
> > > has done all the work to isolate HWPoison pages and freed healthy portions,
> > > so the caller doesn't have to do anything and can bail out completely.
> >
> > Does this change the meaning of free_pages_prepare()'s return value?
> > My first take is it returns if the preparation+check works are good.
> >
> > I don't know if doing the freeing work in free_pages_prepare() is a good
> > idea.
>
> As long as it's documented in a comment, why not.
>
> It doesn't make callers' code less readable. The semantic is the same;
> callers should free the page only if free_pages_prepare() returns true.
>
> They don't have to care why they should not free it.
> (either it should not be freed back to buddy because it's bad page,
> or it had hwpoison page(s) and healthy portions are already freed)
>
> > But if it is fine, we can actually hide the ugly
> > compound_has_hwpoisoned() entirely inside free_pages_prepare():
> > - true means preparation + check work are all good but caller needs to
> > free prepared+checked pages itself
> > - false means one of the two:
> > - some check failed and it is impossible to safely free pages for callers
> > - this is a compound page with some HWPoison pages, and healthy
> > pages are already freed safely
>
> Yes, something like this could be documented in free_pages_prepare().
>
> > Doesn't seem very clean with a boolean return type...
>
> The return type could be an enum type, but that wouldn't affect
> the caller's behavior (bailing out) anyway?
>
> No strong opinion from me on the type.
I just noticed free_pages_prepare() does have an external caller,
compaction_free(), so I want to be cautious about adding new freeing
behavior to it.
Maybe I can wrap compound_has_hwpoisoned(), free_pages_prepare(), and
free_has_hwpoisoned() into free_pages_prepare_has_hwpoisoned() and
make it private to page_alloc.c.
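Roughly something like this (a hypothetical sketch of the wrapper, details
to be settled in v3):

	/*
	 * Returns true if the caller still needs to free the prepared pages;
	 * false if freeing must be skipped (bad page) or the healthy
	 * sub-pages were already freed here.
	 */
	static bool free_pages_prepare_has_hwpoisoned(struct page *page,
						      unsigned int order,
						      fpi_t fpi_flags)
	{
		/*
		 * Must be checked before free_pages_prepare() clears
		 * PG_has_hwpoisoned on the first tail page.
		 */
		bool has_hwpoisoned = compound_has_hwpoisoned(page, order);

		if (!free_pages_prepare(page, order))
			return false;

		if (has_hwpoisoned) {
			free_has_hwpoisoned(page, order, fpi_flags);
			return false;
		}

		return true;
	}

That way compaction_free() keeps calling free_pages_prepare() unchanged.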
Let me put this together in v3. I also want to correct my previous
latency measurement: both the compared group and the baseline take less
time than I previously reported (2ms vs 0.066ms), and the difference is
not ~2x but ~30x.
>
> --
> Cheers,
> Harry / Hyeonggon