free_pages_prepare() only handles poisoned order-0 pages.
In memory_failure() (hard offline), pages
are poisoned before attempting to split huge pages. If the split fails,
the page remains a compound (order > 0) but is already poisoned. However,
Soft-offline pages are always poisoned as order-0 after migration, so
they are unaffected.
The '!order' check causes these poisoned compound pages to skip
poison handling, leaving them in the buddy allocator.
Worst case, a poisoned compound page could be reallocated,
potentially leading to crashes, silent data corruption,
or unwanted memory containment actions before the poison bit is detected.
This patch removes the '&& !order' restriction. Cleanup functions in the
poison-handling block correctly handle non-zero order pages, making
this change safe.
Fixes: 79f5f8fab482 ("mm,hwpoison: rework soft offline for in-use pages")
Signed-off-by: Boudewijn van der Heide <boudewijn@delta-utec.com>
---
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c380f063e8b7..64d15e56706c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1344,7 +1344,7 @@ __always_inline bool free_pages_prepare(struct page *page,
count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages);
}
- if (unlikely(PageHWPoison(page)) && !order) {
+ if (unlikely(PageHWPoison(page))) {
/* Do not let hwpoison pages hit pcplists/buddy */
reset_page_owner(page, order);
page_table_check_free(page, order);
--
2.47.3
Add Miaohe (memory failure maintainer)
On 13 Jan 2026, at 15:54, Boudewijn van der Heide wrote:
> free_pages_prepare() only handles poisoned order-0 pages.
> In memory_failure() (hard offline), pages
> are poisoned before attempting to split huge pages. If the split fails,
> the page remains a compound (order > 0) but is already poisoned. However,
> Soft-offline pages are always poisoned as order-0 after migration, so
> they are unaffected.
>
> The '!order' check causes these poisoned compound pages to skip
> poison handling, leaving them in the buddy allocator.
>
> Worst case, a poisoned compound page could be reallocated,
> potentially leading to crashes, silent data corruption,
> or unwanted memory containment actions before the poison bit is detected.
>
> This patch removes the '&& !order' restriction. Cleanup functions in the
> poison-handling block correctly handle non-zero order pages, making
> this change safe.
This is not a fix. IIUC, for >0 order free pages, memory failure uses
take_page_off_buddy() in a different code path.
Miaohe (cc’d) should be able to elaborate more on it.
>
> Fixes: 79f5f8fab482 ("mm,hwpoison: rework soft offline for in-use pages")
> Signed-off-by: Boudewijn van der Heide <boudewijn@delta-utec.com>
> ---
> mm/page_alloc.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c380f063e8b7..64d15e56706c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1344,7 +1344,7 @@ __always_inline bool free_pages_prepare(struct page *page,
> count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages);
> }
>
> - if (unlikely(PageHWPoison(page)) && !order) {
> + if (unlikely(PageHWPoison(page))) {
> /* Do not let hwpoison pages hit pcplists/buddy */
> reset_page_owner(page, order);
> page_table_check_free(page, order);
> --
> 2.47.3
Best Regards,
Yan, Zi
> > free_pages_prepare() only handles poisoned order-0 pages.
> > In memory_failure() (hard offline), pages
> > are poisoned before attempting to split huge pages. If the split fails,
> > the page remains a compound (order > 0) but is already poisoned. However,
> > Soft-offline pages are always poisoned as order-0 after migration, so
> > they are unaffected.
> >
> > The '!order' check causes these poisoned compound pages to skip
> > poison handling, leaving them in the buddy allocator.
> >
> > Worst case, a poisoned compound page could be reallocated,
> > potentially leading to crashes, silent data corruption,
> > or unwanted memory containment actions before the poison bit is detected.
> >
> > This patch removes the '&& !order' restriction. Cleanup functions in the
> > poison-handling block correctly handle non-zero order pages, making
> > this change safe.
> This is not a fix. IIUC, for >0 order free pages, memory failure uses
> take_page_off_buddy() in a different code path.
>
Thanks again for the quick response and clarification!
From my understanding,
you correctly noted that take_page_off_buddy() handles already-free pages,
removing them from the buddy lists and setting SetPageHWPoisonTakenOff().
This prevents those pages from re-entering the buddy allocator.
My concern is about in-use THP-backed compound pages:
1. A compound page is in use.
2. memory_failure() marks it poisoned (TestSetPageHWPoison).
3. try_to_split_thp_page() fails.
4. The process using the THP may be killed;
the page remains compound and poisoned.
5. Later, when the page is finally freed, it reaches free_pages_prepare();
'take_page_off_buddy()' is not invoked in this path.
At this point, the current check:
'if (unlikely(PageHWPoison(page)) && !order)'
will not trigger, because the order > 0.
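To make the effect of the condition concrete, here is a minimal userspace sketch in C. It is not kernel code: toy_page, poison_branch_old() and poison_branch_new() are hypothetical names that only model the two 'if' conditions from the diff, under the assumption that the poison flag is visible on the page pointer handed to free_pages_prepare().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; it only models the hwpoison flag. */
struct toy_page {
	bool hwpoison;
};

/* Models the current check: PageHWPoison(page) && !order. */
static bool poison_branch_old(const struct toy_page *page, unsigned int order)
{
	assert(page != NULL);
	return page->hwpoison && order == 0;
}

/* Models the check proposed by the patch: PageHWPoison(page). */
static bool poison_branch_new(const struct toy_page *page, unsigned int order)
{
	assert(page != NULL);
	(void)order;	/* order no longer affects the decision */
	return page->hwpoison;
}
```

For a poisoned page freed at order 9, poison_branch_old() returns false and poison_branch_new() returns true; for order 0 both return true.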
> Miaohe (cc’d) should be able to elaborate more on it.
Thanks for Cc'ing Miaohe, hopefully Miaohe can provide some more insights!
Thanks,
Boudewijn
On 2026/1/14 22:48, Boudewijn van der Heide wrote:
>>> free_pages_prepare() only handles poisoned order-0 pages.
>>> In memory_failure() (hard offline), pages
>>> are poisoned before attempting to split huge pages. If the split fails,
>>> the page remains a compound (order > 0) but is already poisoned. However,
>>> Soft-offline pages are always poisoned as order-0 after migration, so
>>> they are unaffected.
>>>
>>> The '!order' check causes these poisoned compound pages to skip
>>> poison handling, leaving them in the buddy allocator.
>>>
>>> Worst case, a poisoned compound page could be reallocated,
>>> potentially leading to crashes, silent data corruption,
>>> or unwanted memory containment actions before the poison bit is detected.
>>>
>>> This patch removes the '&& !order' restriction. Cleanup functions in the
>>> poison-handling block correctly handle non-zero order pages, making
>>> this change safe.
>
>> This is not a fix. IIUC, for >0 order free pages, memory failure uses
>> take_page_off_buddy() in a different code path.
>>
>
> Thanks again for the quick response and clarification!
> From my understanding,
> you correctly noted that take_page_off_buddy() handles already-free pages,
> removing them from the buddy lists and setting SetPageHWPoisonTakenOff().
> This prevents those pages from re-entering the buddy allocator.

Thanks both.

>
> My concern is about in-use THP-backed compound pages:
> 1. A compound page is in use.
> 2. memory_failure() marks it poisoned (TestSetPageHWPoison).
> 3. try_to_split_thp_page() fails.
> 4. The process using the THP may be killed;
> the page remains compound and poisoned.
> 5. Later, when the page is finally freed, it reaches free_pages_prepare();
> 'take_page_off_buddy()' is not invoked in this path.

Yes, this is also a problematic scenario for Hugetlb HugePage. And Jiaqi works on
it now [1]. I think Jiaqi's patches might apply to THP scenario too. Add @Jiaqi to
verify this.

[1]: https://lore.kernel.org/all/20260112004923.888429-1-jiaqiyan@google.com/

Thanks.
.
On Wed, Jan 14, 2026 at 11:55 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2026/1/14 22:48, Boudewijn van der Heide wrote:
> >>> free_pages_prepare() only handles poisoned order-0 pages.
> >>> In memory_failure() (hard offline), pages
> >>> are poisoned before attempting to split huge pages. If the split fails,
> >>> the page remains a compound (order > 0) but is already poisoned. However,
> >>> Soft-offline pages are always poisoned as order-0 after migration, so
> >>> they are unaffected.
> >>>
> >>> The '!order' check causes these poisoned compound pages to skip
> >>> poison handling, leaving them in the buddy allocator.
> >>>
> >>> Worst case, a poisoned compound page could be reallocated,
> >>> potentially leading to crashes, silent data corruption,
> >>> or unwanted memory containment actions before the poison bit is detected.
> >>>
> >>> This patch removes the '&& !order' restriction. Cleanup functions in the
> >>> poison-handling block correctly handle non-zero order pages, making
> >>> this change safe.
> >
> >> This is not a fix. IIUC, for >0 order free pages, memory failure uses
> >> take_page_off_buddy() in a different code path.
> >>
> >
> > Thanks again for the quick response and clarification!
> > From my understanding,
> > you correctly noted that take_page_off_buddy() handles already-free pages,
> > removing them from the buddy lists and setting SetPageHWPoisonTakenOff().
> > This prevents those pages from re-entering the buddy allocator.
>
> Thanks both.
>
> >
> > My concern is about in-use THP-backed compound pages:
> > 1. A compound page is in use.
> > 2. memory_failure() marks it poisoned (TestSetPageHWPoison).
> > 3. try_to_split_thp_page() fails.
> > 4. The process using the THP may be killed;
> > the page remains compound and poisoned.
> > 5. Later, when the page is finally freed, it reaches free_pages_prepare();
> > 'take_page_off_buddy()' is not invoked in this path.

I agree that Boudewijn's concern is valid when try_to_split_thp_page()
fails. However, I don't think the fix here really works. For a compound / THP
page, memory_failure() sets the PG_HWPoison flag on the exact subpage
within the compound page. I believe the page in free_pages_prepare()
is almost always (if not always) going to be the head of the compound page.
So removing "!order" won't really help unless the head of the THP
page happens to be HWPoison.

>
> Yes, this is also a problematic scenario for Hugetlb HugePage. And Jiaqi works on
> it now [1]. I think Jiaqi's patches might apply to THP scenario too. Add @Jiaqi to
> verify this.

Yep, I think my work will also help solve the concern when
try_to_split_thp_page() fails.

>
> [1]: https://lore.kernel.org/all/20260112004923.888429-1-jiaqiyan@google.com/
>
> Thanks.
> .
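The point about the flag sitting on the exact subpage can be pictured with a small userspace model in C. This is a sketch only, not kernel code: model_page, head_is_poisoned() and find_poisoned_subpage() are hypothetical names, and the compound page is assumed to be a plain array of 1 << order entries with the head at index 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; it only models the hwpoison flag. */
struct model_page {
	bool hwpoison;
};

/* What a head-only test sees, with or without the '!order' condition. */
static bool head_is_poisoned(const struct model_page *head)
{
	assert(head != NULL);
	return head->hwpoison;
}

/*
 * Scan all 1 << order subpages of the modelled compound page.
 * Returns the index of the first poisoned subpage, or -1 if none is found.
 */
static long find_poisoned_subpage(const struct model_page *head,
				  unsigned int order)
{
	assert(head != NULL);
	size_t nr_pages = (size_t)1 << order;

	for (size_t i = 0; i < nr_pages; i++) {
		if (head[i].hwpoison)
			return (long)i;
	}
	return -1;
}
```

If only a tail entry is marked, head_is_poisoned() returns false regardless of order, while find_poisoned_subpage() reports the marked index.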
Thanks Jiaqi for the feedback, that is very helpful.
(and thanks Miaohe for connecting the issues.)

After going through the memory_failure(),
I can see it indeed puts the PG_HWPoison flag on the specific subpage pointer,
and therefore my fix won't work as-is.

> >
> > Yes, this is also a problematic scenario for Hugetlb HugePage. And Jiaqi works on
> > it now [1]. I think Jiaqi's patches might apply to THP scenario too. Add @Jiaqi to
> > verify this.
>
> Yep, I think my work will also help solve the concern when
> try_to_split_thp_page() fails.

Your fix makes a lot of sense for hugetlb,
as it linearly scans through all the pages.
From my understanding,
your fix also provides the perfect architecture for also checking THP,
though it doesn't yet cover the in-use THP case outlined.

For THP I would need to trace the failed-split paths more carefully,
to check where the equivalent path for THP would be.

If there is work needed for THP, I'm happy to help.
Would you prefer I work on THP support as a separate follow-up patch,
after yours is merged,
or do you prefer to integrate it in your patch series?

> >
> > [1]: https://lore.kernel.org/all/20260112004923.888429-1-jiaqiyan@google.com/
> >
> > Thanks.
> > .

Thanks,
Boudewijn
On Fri, Jan 16, 2026 at 6:12 AM Boudewijn van der Heide <boudewijn@delta-utec.com> wrote:
>
> Thanks Jiaqi for the feedback, that is very helpful.
> (and thanks Miaohe for connecting the issues.)
>
> After going through the memory_failure(),
> I can see it indeed puts the PG_HWPoison flag on the specific subpage pointer,
> and therefore my fix won't work as-is.
>
> > >
> > > Yes, this is also a problematic scenario for Hugetlb HugePage. And Jiaqi works on
> > > it now [1]. I think Jiaqi's patches might apply to THP scenario too. Add @Jiaqi to
> > > verify this.
> >
> > Yep, I think my work will also help solve the concern when
> > try_to_split_thp_page() fails.
>
> Your fix makes a lot of sense for hugetlb,
> as it linearly scans through all the pages.
> From my understanding,
> your fix also provides the perfect architecture for also checking THP,
> though it doesn't yet cover the in-use THP case outlined.

Oh, sorry I went ahead myself and assumed the split-failed folio would
eventually be released to the buddy allocator at some point when
userspace processes who owns/maps this THP are killed or exited.

Zi and Miaohe, am I right about this? or do we need explicitly handle
in-use and split-failed THP?

>
> For THP I would need to trace the failed-split paths more carefully,
> to check where the equivalent path for THP would be.
>
> If there is work needed for THP, I'm happy to help.
> Would you prefer I work on THP support as a separate follow-up patch,
> after yours is merged,
> or do you prefer to integrate it in your patch series?
>
> > >
> > > [1]: https://lore.kernel.org/all/20260112004923.888429-1-jiaqiyan@google.com/
> > >
> > > Thanks.
> > > .
>
> Thanks,
> Boudewijn
On 2026/1/24 12:42, Jiaqi Yan wrote:
> On Fri, Jan 16, 2026 at 6:12 AM Boudewijn van der Heide
> <boudewijn@delta-utec.com> wrote:
>>
>> Thanks Jiaqi for the feedback, that is very helpful.
>> (and thanks Miaohe for connecting the issues.)
>>
>> After going through the memory_failure(),
>> I can see it indeed puts the PG_HWPoison flag on the specific subpage pointer,
>> and therefore my fix won't work as-is.
>>
>>>>
>>>> Yes, this is also a problematic scenario for Hugetlb HugePage. And Jiaqi works on
>>>> it now [1]. I think Jiaqi's patches might apply to THP scenario too. Add @Jiaqi to
>>>> verify this.
>>>
>>> Yep, I think my work will also help solve the concern when
>>> try_to_split_thp_page() fails.
>>
>> Your fix makes a lot of sense for hugetlb,
>> as it linearly scans through all the pages.
>> From my understanding,
>> your fix also provides the perfect architecture for also checking THP,
>> though it doesn't yet cover the in-use THP case outlined.
>
> Oh, sorry I went ahead myself and assumed the split-failed folio would
> eventually be released to the buddy allocator at some point when
> userspace processes who owns/maps this THP are killed or exited.
>
> Zi and Miaohe, am I right about this? or do we need explicitly handle
> in-use and split-failed THP?

IMHO, it's enough to handle the poisoned sub-pages when an in-use or
split-failed THP is eventually released to the buddy allocator.

Thanks.
.
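One way to picture "handling poisoned sub-pages at release time" is the toy userspace model below. It is an illustration under stated assumptions only: it is not the referenced patch series and not how the buddy allocator is implemented. rel_page, on_free_list and release_healthy_subpages() are hypothetical names, and the compound page is modelled as an array of 1 << order entries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page in this toy model. */
struct rel_page {
	bool hwpoison;		/* models PG_HWPoison on this subpage */
	bool on_free_list;	/* models "handed back for reuse" */
};

/*
 * Release a modelled compound page of the given order, keeping every
 * poisoned subpage off the free list. Returns the number of subpages
 * that were actually released.
 */
static size_t release_healthy_subpages(struct rel_page *head,
				       unsigned int order)
{
	assert(head != NULL);
	size_t nr_pages = (size_t)1 << order;
	size_t released = 0;

	for (size_t i = 0; i < nr_pages; i++) {
		head[i].on_free_list = !head[i].hwpoison;
		if (head[i].on_free_list)
			released++;
	}
	return released;
}
```

In this model an order-2 page with one poisoned entry releases three entries and withholds the poisoned one.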