In khugepaged_do_scan(), two consecutive allocation failures cause
the logic to skip the dedicated 60s throttling sleep
(khugepaged_alloc_sleep_millisecs), forcing a fallback to the
shorter 10s scanning interval via the outer loop.
Since fragmentation is unlikely to resolve in 10s, this results in
wasted CPU cycles on immediate retries.
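For context, the 10s interval comes from the kthread's outer loop, not
from this function; a simplified sketch (details trimmed):

	/* khugepaged() kthread main loop, simplified */
	while (!kthread_should_stop()) {
		khugepaged_do_scan(&khugepaged_collapse_control);
		/* waits khugepaged_scan_sleep_millisecs (10s default) */
		khugepaged_wait_work();
	}

So when khugepaged_do_scan() breaks out on the second failure without
sleeping, the next allocation attempt is gated only by the regular
scan interval.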
Reorder the failure logic to ensure khugepaged_alloc_sleep() is
always called on each allocation failure.
Fixes: c6a7f445a272 ("mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA")
Signed-off-by: Zhiheng Tao <junchuan.tzh@antgroup.com>
---
mm/khugepaged.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index abe54f0..c3f9721 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2562,12 +2562,12 @@ static void khugepaged_do_scan(struct collapse_control *cc)
 		if (result == SCAN_ALLOC_HUGE_PAGE_FAIL) {
 			/*
 			 * If fail to allocate the first time, try to sleep for
-			 * a while. When hit again, cancel the scan.
+			 * a while. When hit again, sleep and cancel the scan.
 			 */
+			khugepaged_alloc_sleep();
 			if (!wait)
 				break;
 			wait = false;
-			khugepaged_alloc_sleep();
 		}
 	}
 }
--
1.8.3.1
On 11/24/25 07:19, Zhiheng Tao wrote:
> In khugepaged_do_scan(), two consecutive allocation failures cause
> the logic to skip the dedicated 60s throttling sleep
> (khugepaged_alloc_sleep_millisecs), forcing a fallback to the
> shorter 10s scanning interval via the outer loop.
>
> Since fragmentation is unlikely to resolve in 10s, this results in
> wasted CPU cycles on immediate retries.
Why shouldn't memory compaction be able to compact a single THP in 10s?
Why should it resolve in 60s?
>
> Reorder the failure logic to ensure khugepaged_alloc_sleep() is
> always called on each allocation failure.
>
> Fixes: c6a7f445a272 ("mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA")
What are we fixing here? This sounds like a change that might be better
on some systems, but worse on others?
We really need more information on when/how an issue was hit, and how
this patch here really moves the needle in any way.
--
Cheers
David
On Mon, Nov 24, 2025 at 10:14:20AM +0100, David Hildenbrand (Red Hat) wrote:
> On 11/24/25 07:19, Zhiheng Tao wrote:
> >In khugepaged_do_scan(), two consecutive allocation failures cause
> >the logic to skip the dedicated 60s throttling sleep
> >(khugepaged_alloc_sleep_millisecs), forcing a fallback to the
> >shorter 10s scanning interval via the outer loop.
> >
> >Since fragmentation is unlikely to resolve in 10s, this results in
> >wasted CPU cycles on immediate retries.
>
> Why shouldn't memory compaction be able to compact a single THP in 10s?
>
> Why should it resolve in 60s?
>
It may resolve in 10s, or it may take 60s. The point is that the sleep
controlled by khugepaged_alloc_sleep_millisecs should not be skipped
when an allocation fails.
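To spell out the difference with the default tunables (simplified,
assuming allocation keeps failing within one khugepaged_do_scan()
pass):

  Before:
    1st failure: khugepaged_alloc_sleep() (60s), keep scanning
    2nd failure: break with no sleep; the outer loop retries after
                 only the ~10s khugepaged_scan_sleep_millisecs wait

  After:
    1st failure: khugepaged_alloc_sleep() (60s), keep scanning
    2nd failure: khugepaged_alloc_sleep() (60s), then end the pass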
> >
> >Reorder the failure logic to ensure khugepaged_alloc_sleep() is
> >always called on each allocation failure.
> >
> >Fixes: c6a7f445a272 ("mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA")
>
> What are we fixing here? This sounds like a change that might be
> better on some systems, but worse on others?
>
> We really need more information on when/how an issue was hit, and
> how this patch here really moves the needle in any way.
>
It works better. The missing khugepaged_alloc_sleep() call was not
introduced by this change. Maybe I should remove the "Fixes" tag.
> --
> Cheers
>
> David
On 2025/11/24 17:14, David Hildenbrand (Red Hat) wrote:
> On 11/24/25 07:19, Zhiheng Tao wrote:
>> In khugepaged_do_scan(), two consecutive allocation failures cause
>> the logic to skip the dedicated 60s throttling sleep
>> (khugepaged_alloc_sleep_millisecs), forcing a fallback to the
>> shorter 10s scanning interval via the outer loop.
>>
>> Since fragmentation is unlikely to resolve in 10s, this results in
>> wasted CPU cycles on immediate retries.
>
> Why shouldn't memory compaction be able to compact a single THP in 10s?
>
> Why should it resolve in 60s?
>
>>
>> Reorder the failure logic to ensure khugepaged_alloc_sleep() is
>> always called on each allocation failure.
>>
>> Fixes: c6a7f445a272 ("mm: khugepaged: don't carry huge page to the
>> next loop for !CONFIG_NUMA")
>
> What are we fixing here? This sounds like a change that might be better
> on some systems, but worse on others?
Seems like we're not honoring khugepaged_alloc_sleep_millisecs on the
second allocation failure... but is that actually a problem?
>
> We really need more information on when/how an issue was hit, and how
> this patch here really moves the needle in any way.
+1
On Mon, Nov 24, 2025 at 05:27:23PM +0800, Lance Yang wrote:
>
>
> On 2025/11/24 17:14, David Hildenbrand (Red Hat) wrote:
> >On 11/24/25 07:19, Zhiheng Tao wrote:
> >>In khugepaged_do_scan(), two consecutive allocation failures cause
> >>the logic to skip the dedicated 60s throttling sleep
> >>(khugepaged_alloc_sleep_millisecs), forcing a fallback to the
> >>shorter 10s scanning interval via the outer loop.
> >>
> >>Since fragmentation is unlikely to resolve in 10s, this results in
> >>wasted CPU cycles on immediate retries.
> >
> >Why shouldn't memory compaction be able to compact a single THP in 10s?
> >
> >Why should it resolve in 60s?
> >
> >>
> >>Reorder the failure logic to ensure khugepaged_alloc_sleep() is
> >>always called on each allocation failure.
> >>
> >>Fixes: c6a7f445a272 ("mm: khugepaged: don't carry huge page to
> >>the next loop for !CONFIG_NUMA")
> >
> >What are we fixing here? This sounds like a change that might be
> >better on some systems, but worse on others?
>
> Seems like we're not honoring khugepaged_alloc_sleep_millisecs on the
> second allocation failure... but is that actually a problem?
>
Is it not more appropriate to honor it on the second allocation
failure as well? This was already a problem before commit c6a7f445a272
when khugepaged_pages_to_scan=512.
> >
> >We really need more information on when/how an issue was hit, and
> >how this patch here really moves the needle in any way.
>
> +1
>