[PATCH] mm: page_isolation: Avoid hugepage scan step underflow

Kaitao Cheng posted 1 patch 5 days, 12 hours ago
mm/page_isolation.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
[PATCH] mm: page_isolation: Avoid hugepage scan step underflow
Posted by Kaitao Cheng 5 days, 12 hours ago
From: Kaitao Cheng <chengkaitao@kylinos.cn>

page_is_unmovable() checks HugeTLB pages without holding hugetlb_lock and
without pinning the folio. This is intentional for the pageblock scanning
paths, but it means the HugeTLB folio can be freed concurrently after
PageHuge() or folio_test_hugetlb() succeeds.

The existing code avoids folio_hstate() and uses size_to_hstate() because
the HugeTLB flag may already have been cleared. However, if
size_to_hstate() returns NULL, the code still falls through and computes
the scan step from folio_nr_pages(). If the folio has been freed and the
head/large state has been cleared, folio_nr_pages() can return 1. When the
current page is a tail page, subtracting folio_page_idx() from 1 can
underflow and make the scanner skip too far.

Treat a NULL hstate as unmovable so the scanner does not try to skip over
an unstable HugeTLB folio. Once a valid hstate is found, derive the number
of pages from the hstate instead of reading the folio size again. Also
validate the page index before computing the step to avoid underflow if the
page/folio relationship changed concurrently.

Fixes: a0a9f2180b90 ("mm: page_isolation: avoid calling folio_hstate() without hugetlb_lock")
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
---
 mm/page_isolation.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index c48ff5c00244..99f0b06efaf6 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -43,6 +43,7 @@ bool page_is_unmovable(struct zone *zone, struct page *page,
 	 */
 	if (PageHuge(page) || PageCompound(page)) {
 		struct folio *folio = page_folio(page);
+		unsigned long idx, nr_pages;
 
 		if (folio_test_hugetlb(folio)) {
 			struct hstate *h;
@@ -55,14 +56,21 @@ bool page_is_unmovable(struct zone *zone, struct page *page,
 			 * use folio_hstate() directly.
 			 */
 			h = size_to_hstate(folio_size(folio));
-			if (h && !hugepage_migration_supported(h))
+			if (!h || !hugepage_migration_supported(h))
 				return true;
 
+			nr_pages = pages_per_huge_page(h);
 		} else if (!folio_test_lru(folio)) {
 			return true;
+		} else {
+			nr_pages = folio_nr_pages(folio);
 		}
 
-		*step = folio_nr_pages(folio) - folio_page_idx(folio, page);
+		idx = folio_page_idx(folio, page);
+		if (idx >= nr_pages)
+			return true;
+
+		*step = nr_pages - idx;
 		return false;
 	}
 
-- 
2.50.1 (Apple Git-155)
Re: [PATCH] mm: page_isolation: Avoid hugepage scan step underflow
Posted by David Hildenbrand (Arm) 4 days, 16 hours ago
On 5/19/26 14:16, Kaitao Cheng wrote:
> From: Kaitao Cheng <chengkaitao@kylinos.cn>
> 
> page_is_unmovable() checks HugeTLB pages without holding hugetlb_lock and
> without pinning the folio. This is intentional for the pageblock scanning
> paths, but it means the HugeTLB folio can be freed concurrently after
> PageHuge() or folio_test_hugetlb() succeeds.
> 
> The existing code avoids folio_hstate() and uses size_to_hstate() because
> the HugeTLB flag may already have been cleared. However, if
> size_to_hstate() returns NULL, the code still falls through and computes
> the scan step from folio_nr_pages(). If the folio has been freed and the
> head/large state has been cleared, folio_nr_pages() can return 1. When the
> current page is a tail page, subtracting folio_page_idx() from 1 can
> underflow and make the scanner skip too far.
> 
> Treat a NULL hstate as unmovable so the scanner does not try to skip over
> an unstable HugeTLB folio. Once a valid hstate is found, derive the number
> of pages from the hstate instead of reading the folio size again. Also
> validate the page index before computing the step to avoid underflow if the
> page/folio relationship changed concurrently.
> 
> Fixes: a0a9f2180b90 ("mm: page_isolation: avoid calling folio_hstate() without hugetlb_lock")
> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> ---
>  mm/page_isolation.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index c48ff5c00244..99f0b06efaf6 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -43,6 +43,7 @@ bool page_is_unmovable(struct zone *zone, struct page *page,
>  	 */
>  	if (PageHuge(page) || PageCompound(page)) {
>  		struct folio *folio = page_folio(page);
> +		unsigned long idx, nr_pages;
>  
>  		if (folio_test_hugetlb(folio)) {
>  			struct hstate *h;
> @@ -55,14 +56,21 @@ bool page_is_unmovable(struct zone *zone, struct page *page,
>  			 * use folio_hstate() directly.
>  			 */
>  			h = size_to_hstate(folio_size(folio));
> -			if (h && !hugepage_migration_supported(h))
> +			if (!h || !hugepage_migration_supported(h))
>  				return true;
>  
> +			nr_pages = pages_per_huge_page(h);
>  		} else if (!folio_test_lru(folio)) {
>  			return true;
> +		} else {
> +			nr_pages = folio_nr_pages(folio);
>  		}
>  
> -		*step = folio_nr_pages(folio) - folio_page_idx(folio, page);
> +		idx = folio_page_idx(folio, page);
> +		if (idx >= nr_pages)
> +			return true;
> +
> +		*step = nr_pages - idx;
>  		return false;
>  	}
>  

We have very similar code in scan_movable_pages() that we previously fixed.

Seems like we can avoid folio_page_idx() completely by just doing "pfn |=
nr_pages - 1".

If, in corner cases, we skip over some unmovable pages, too bad. The function is
inherently racy.

-- 
Cheers,

David
Re: [PATCH] mm: page_isolation: Avoid hugepage scan step underflow
Posted by Andrew Morton 5 days, 7 hours ago
On Tue, 19 May 2026 20:16:46 +0800 Kaitao Cheng <kaitao.cheng@linux.dev> wrote:

> page_is_unmovable() checks HugeTLB pages without holding hugetlb_lock and
> without pinning the folio. This is intentional for the pageblock scanning
> paths, but it means the HugeTLB folio can be freed concurrently after
> PageHuge() or folio_test_hugetlb() succeeds.

Thanks.  What are the userspace-visible runtime effects of this?

> The existing code avoids folio_hstate() and uses size_to_hstate() because
> the HugeTLB flag may already have been cleared. However, if
> size_to_hstate() returns NULL, the code still falls through and computes
> the scan step from folio_nr_pages(). If the folio has been freed and the
> head/large state has been cleared, folio_nr_pages() can return 1. When the
> current page is a tail page, subtracting folio_page_idx() from 1 can
> underflow and make the scanner skip too far.
> 
> Treat a NULL hstate as unmovable so the scanner does not try to skip over
> an unstable HugeTLB folio. Once a valid hstate is found, derive the number
> of pages from the hstate instead of reading the folio size again. Also
> validate the page index before computing the step to avoid underflow if the
> page/folio relationship changed concurrently.

This code sounds rather sketchy, and it sounds like it will remain
sketchy after the patch.  And AI review says "hey, this code is
sketchy":
https://sashiko.dev/#/patchset/20260519121646.40833-1-kaitao.cheng@linux.dev

> Fixes: a0a9f2180b90 ("mm: page_isolation: avoid calling folio_hstate() without hugetlb_lock")

We might want to cc:stable on this, depends on the answer to my above
question.
Re: [PATCH] mm: page_isolation: Avoid hugepage scan step underflow
Posted by Kaitao Cheng 2 days, 15 hours ago
在 2026/5/20 01:54, Andrew Morton 写道:
> On Tue, 19 May 2026 20:16:46 +0800 Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
> 
>> page_is_unmovable() checks HugeTLB pages without holding hugetlb_lock and
>> without pinning the folio. This is intentional for the pageblock scanning
>> paths, but it means the HugeTLB folio can be freed concurrently after
>> PageHuge() or folio_test_hugetlb() succeeds.
> 
> Thanks.  What are the userspace-visible runtime effects of this?

The direct user-visible effect is not memory corruption, but degraded forward
progress in the page isolation / contiguous allocation path.

If the race makes folio_nr_pages() return 1 while the current page is still
treated as a tail page of the old HugeTLB folio, the computed step can
underflow:

        step = folio_nr_pages(folio) - folio_page_idx(folio, page);

The caller then does:

        start_pfn += step;

With unsigned arithmetic this can wrap and move start_pfn backwards, typically
near the beginning of the old hugepage range rather than advancing past it. In
many cases this only means rescanning part of the same hugepage range, so the
effect may be limited to extra scanning work.

However, it still violates the scanner's forward-progress assumption: step
is expected to advance start_pfn. If the same transient state is observed
repeatedly, the scanner can keep revisiting the same PFNs, causing excessive
latency and, in the worst case, an apparent stall in operations that rely on
page isolation or contiguous allocation.


Here is another point raised by AI, the old code also used folio_test_lru()
on a folio pointer obtained without holding a reference. If the folio is
freed and the old head page is reused or observed as a tail page of another
compound page, folio_test_lru() can reach const_folio_flags(), which asserts
that the passed folio is not a tail page. On DEBUG_VM kernels, that can
trigger a VM_BUG_ON_PGFLAGS() crash.

>> The existing code avoids folio_hstate() and uses size_to_hstate() because
>> the HugeTLB flag may already have been cleared. However, if
>> size_to_hstate() returns NULL, the code still falls through and computes
>> the scan step from folio_nr_pages(). If the folio has been freed and the
>> head/large state has been cleared, folio_nr_pages() can return 1. When the
>> current page is a tail page, subtracting folio_page_idx() from 1 can
>> underflow and make the scanner skip too far.
>>
>> Treat a NULL hstate as unmovable so the scanner does not try to skip over
>> an unstable HugeTLB folio. Once a valid hstate is found, derive the number
>> of pages from the hstate instead of reading the folio size again. Also
>> validate the page index before computing the step to avoid underflow if the
>> page/folio relationship changed concurrently.
> 
> This code sounds rather sketchy, and it sounds like it will remain
> sketchy after the patch.  And AI review says "hey, this code is
> sketchy":
> https://sashiko.dev/#/patchset/20260519121646.40833-1-kaitao.cheng@linux.dev

Following David Hildenbrand's suggestion, I made some changes as shown below.
I'm not sure whether there are still any other issues.

--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -41,8 +41,14 @@ bool page_is_unmovable(struct zone *zone, struct page *page,
         * We need not scan over tail pages because we don't
         * handle each tail page individually in migration.
         */
-       if (PageHuge(page) || PageCompound(page)) {
+       if (PageCompound(page)) {
                struct folio *folio = page_folio(page);
+               unsigned long nr_pages, pfn;
+               unsigned int order;
+
+               order = compound_order(&folio->page);
+               if (order > MAX_FOLIO_ORDER)
+                       return true;

                if (folio_test_hugetlb(folio)) {
                        struct hstate *h;
@@ -54,15 +60,16 @@ bool page_is_unmovable(struct zone *zone, struct page *page,
                         * The huge page may be freed so can not
                         * use folio_hstate() directly.
                         */
-                       h = size_to_hstate(folio_size(folio));
-                       if (h && !hugepage_migration_supported(h))
+                       h = size_to_hstate(PAGE_SIZE << order);
+                       if (!h || !hugepage_migration_supported(h))
                                return true;
-
-               } else if (!folio_test_lru(folio)) {
+               } else if (!PageLRU(page)) {
                        return true;
                }

-               *step = folio_nr_pages(folio) - folio_page_idx(folio, page);
+               nr_pages = 1UL << order;
+               pfn = page_to_pfn(page);
+               *step = (pfn | (nr_pages - 1)) + 1 - pfn;
                return false;
        }

-- 
Thanks
Kaitao Cheng