[stable-6.6.y] mm: khugepaged refuses to freeze

Posted by Sergey Senozhatsky 3 days, 22 hours ago

Greetings,

I'm looking at a slightly unusual issue where khugepaged refuses to
freeze during system suspend:

...
 PM: suspend entry (s2idle)
 Filesystems sync: 0.003 seconds
 Freezing user space processes
 Freezing user space processes completed (elapsed 0.003 seconds)
 OOM killer disabled.
 Freezing remaining freezable tasks
 Freezing remaining freezable tasks failed after 20.004 seconds (1 tasks refusing to freeze, wq_busy=0):
 task:khugepaged      state:D stack:0     pid:1345  ppid:2      flags:0x00004000
 Call Trace:
  <TASK>
  schedule+0x523/0x16a0
  ? sysvec_apic_timer_interrupt+0xf/0x90
  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
  ? wait_for_completion_io_timeout+0xc5/0x170
  schedule_timeout+0x23b/0x6e0
  ? __pfx_process_timeout+0x10/0x10
  ? wait_for_completion_io_timeout+0xc5/0x170
  io_schedule_timeout+0x3f/0x80
  wait_for_completion_io_timeout+0xe4/0x170
  submit_bio_wait+0x79/0xc0
  swap_readpage+0x150/0x2d0
  ? __pfx_submit_bio_wait_endio+0x10/0x10
  swap_cluster_readahead+0x3be/0x750
  ? __pfx_workingset_update_node+0x10/0x10
  shmem_swapin+0xa7/0x100
  shmem_swapin_folio+0xcd/0x2e0
  shmem_get_folio+0x237/0x580
  collapse_file+0x247/0x1280
  hpage_collapse_scan_file+0x26e/0x380
  khugepaged+0x43b/0x810
  kthread+0xfb/0x120
  ? __pfx_khugepaged+0x10/0x10
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x38/0x50
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1b/0x30
  </TASK>
...

The system is using zram swap.  I wonder if khugepaged should
be suspend/freeze aware.  Does something like below make sense?
Or is the problem elsewhere?

---
 mm/khugepaged.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index eff9e3061925..fa6a018b20a8 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1894,6 +1894,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 		xas_set(&xas, index);
 		folio = xas_load(&xas);
 
+		if (try_to_freeze())
+			goto xa_unlocked;
+
 		VM_BUG_ON(index != xas.xa_index);
 		if (is_shmem) {
 			if (!folio) {
-- 
2.53.0.rc2.204.g2597b5adb4-goog

Re: [stable-6.6.y] mm: khugepaged refuses to freeze

Posted by Baolin Wang 3 days, 21 hours ago


On 2/6/26 10:47 AM, Sergey Senozhatsky wrote:
> Greetings,
> 
> I'm looking at a slightly unusual issue where khugepaged refuses to
> freeze during system suspend:
> 
> ...
>   PM: suspend entry (s2idle)
>   Filesystems sync: 0.003 seconds
>   Freezing user space processes
>   Freezing user space processes completed (elapsed 0.003 seconds)
>   OOM killer disabled.
>   Freezing remaining freezable tasks
>   Freezing remaining freezable tasks failed after 20.004 seconds (1 tasks refusing to freeze, wq_busy=0):
>   task:khugepaged      state:D stack:0     pid:1345  ppid:2      flags:0x00004000
>   Call Trace:
>    <TASK>
>    schedule+0x523/0x16a0
>    ? sysvec_apic_timer_interrupt+0xf/0x90
>    ? asm_sysvec_apic_timer_interrupt+0x16/0x20
>    ? wait_for_completion_io_timeout+0xc5/0x170
>    schedule_timeout+0x23b/0x6e0
>    ? __pfx_process_timeout+0x10/0x10
>    ? wait_for_completion_io_timeout+0xc5/0x170
>    io_schedule_timeout+0x3f/0x80
>    wait_for_completion_io_timeout+0xe4/0x170
>    submit_bio_wait+0x79/0xc0
>    swap_readpage+0x150/0x2d0
>    ? __pfx_submit_bio_wait_endio+0x10/0x10
>    swap_cluster_readahead+0x3be/0x750
>    ? __pfx_workingset_update_node+0x10/0x10
>    shmem_swapin+0xa7/0x100
>    shmem_swapin_folio+0xcd/0x2e0
>    shmem_get_folio+0x237/0x580
>    collapse_file+0x247/0x1280
>    hpage_collapse_scan_file+0x26e/0x380
>    khugepaged+0x43b/0x810
>    kthread+0xfb/0x120
>    ? __pfx_khugepaged+0x10/0x10
>    ? __pfx_kthread+0x10/0x10
>    ret_from_fork+0x38/0x50
>    ? __pfx_kthread+0x10/0x10
>    ret_from_fork_asm+0x1b/0x30
>    </TASK>
> ...
> 
> The system is using zram swap.  I wonder if khugepaged should
> be suspend/freeze aware.  Does something like below make sense?
> Or is the problem elsewhere?
> 
> ---
>   mm/khugepaged.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index eff9e3061925..fa6a018b20a8 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1894,6 +1894,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>   		xas_set(&xas, index);
>   		folio = xas_load(&xas);
>   
> +		if (try_to_freeze())
> +			goto xa_unlocked;
> +
>   		VM_BUG_ON(index != xas.xa_index);
>   		if (is_shmem) {
>   			if (!folio) {

Your analysis is reasonable. When the system is freezing, khugepaged is
still trying to swap in shmem pages to collapse, which prevents the system
from entering the suspend state. However, it’s not only shmem that will swap
in; collapsing anonymous folios may also trigger swap-in operations.

Therefore, I think we should skip all collapse scans for anonymous and 
file pages in the main scan function khugepaged_do_scan() if the system 
is attempting to freeze.

Some sample code is as follows:

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fa1e57fd2c46..cfa7882585ad 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2560,9 +2560,18 @@ static void khugepaged_do_scan(struct collapse_control *cc)
         lru_add_drain_all();

         while (true) {
+               bool was_frozen;
+
                 cond_resched();

-               if (unlikely(kthread_should_stop()))
+               if (unlikely(kthread_freezable_should_stop(&was_frozen)))
+                       break;
+
+               /*
+                * We can speed up thawing tasks if we don't call khugepaged_scan_mm_slot()
+                * after returning from the refrigerator
+                */
+               if (was_frozen)
                         break;

                 spin_lock(&khugepaged_mm_lock);

Re: [stable-6.6.y] mm: khugepaged refuses to freeze

Posted by Sergey Senozhatsky 3 days, 21 hours ago

On (26/02/06 11:33), Baolin Wang wrote:
> >   Freezing remaining freezable tasks failed after 20.004 seconds (1 tasks refusing to freeze, wq_busy=0):
> >   task:khugepaged      state:D stack:0     pid:1345  ppid:2      flags:0x00004000
> >   Call Trace:
> >    <TASK>
> >    schedule+0x523/0x16a0
> >    ? sysvec_apic_timer_interrupt+0xf/0x90
> >    ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> >    ? wait_for_completion_io_timeout+0xc5/0x170
> >    schedule_timeout+0x23b/0x6e0
> >    ? __pfx_process_timeout+0x10/0x10
> >    ? wait_for_completion_io_timeout+0xc5/0x170
> >    io_schedule_timeout+0x3f/0x80
> >    wait_for_completion_io_timeout+0xe4/0x170
> >    submit_bio_wait+0x79/0xc0
> >    swap_readpage+0x150/0x2d0
> >    ? __pfx_submit_bio_wait_endio+0x10/0x10
> >    swap_cluster_readahead+0x3be/0x750
> >    ? __pfx_workingset_update_node+0x10/0x10
> >    shmem_swapin+0xa7/0x100
> >    shmem_swapin_folio+0xcd/0x2e0
> >    shmem_get_folio+0x237/0x580
> >    collapse_file+0x247/0x1280
> >    hpage_collapse_scan_file+0x26e/0x380
> >    khugepaged+0x43b/0x810
> >    kthread+0xfb/0x120
> >    ? __pfx_khugepaged+0x10/0x10
> >    ? __pfx_kthread+0x10/0x10
> >    ret_from_fork+0x38/0x50
> >    ? __pfx_kthread+0x10/0x10
> >    ret_from_fork_asm+0x1b/0x30
> >    </TASK>
> > ...
> > 
> > The system is using zram swap.  I wonder if khugepaged should
> > be suspend/freeze aware.  Does something like below make sense?
> > Or is the problem elsewhere?
> > 
> > ---
> >   mm/khugepaged.c | 3 +++
> >   1 file changed, 3 insertions(+)
> > 
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index eff9e3061925..fa6a018b20a8 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1894,6 +1894,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> >   		xas_set(&xas, index);
> >   		folio = xas_load(&xas);
> > +		if (try_to_freeze())
> > +			goto xa_unlocked;
> > +
> >   		VM_BUG_ON(index != xas.xa_index);
> >   		if (is_shmem) {
> >   			if (!folio) {
> 
> Your analysis is reasonable. When the system is freezing, khugepaged is
> still trying to swap-in shmem to collapse, which prevents the system from
> entering suspend state. However, it’s not only shmem that will swap in,
> collapsing anonymous folios may also trigger swap-in operations.

Right, I thought about it but wasn't sure.  Could the inner loop (e.g.
collapse_file() in this particular case) loop long enough to fail suspend
w/o ever giving the outer loop (khugepaged_do_scan()) a chance to freeze?

Re: [stable-6.6.y] mm: khugepaged refuses to freeze

Posted by Sergey Senozhatsky 3 days, 20 hours ago

On (26/02/06 12:38), Sergey Senozhatsky wrote:
[..]
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index eff9e3061925..fa6a018b20a8 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1894,6 +1894,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> > >   		xas_set(&xas, index);
> > >   		folio = xas_load(&xas);
> > > +		if (try_to_freeze())
> > > +			goto xa_unlocked;
> > > +
> > >   		VM_BUG_ON(index != xas.xa_index);
> > >   		if (is_shmem) {
> > >   			if (!folio) {
> > 
> > Your analysis is reasonable. When the system is freezing, khugepaged is
> > still trying to swap-in shmem to collapse, which prevents the system from
> > entering suspend state. However, it’s not only shmem that will swap in,
> > collapsing anonymous folios may also trigger swap-in operations.
> 
> Right, I thought about it but wasn't sure.  Could the inner loop (e.g.
> collapse_file() in this particular case) loop long enough to fail suspend
> w/o ever giving the outer loop (khugepaged_do_scan()) a chance to freeze?

For inner loops I wondered if cond_resched() could be an indicator of
where try_to_freeze() should be placed.  Those cond_resched() calls
are there for a reason, after all.   E.g. something like:

---

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fa6a018b20a8..cee08466a069 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2431,6 +2431,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
 		unsigned long hstart, hend;
 
 		cond_resched();
+		if (try_to_freeze())
+			break;
+
 		if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
 			progress++;
 			break;
@@ -2453,6 +2456,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
 			bool mmap_locked = true;
 
 			cond_resched();
+			if (try_to_freeze())
+				goto breakouterloop;
+
 			if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
 				goto breakouterloop;
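
To make the concern about check placement concrete, here is a small userspace model (not kernel code). It counts how much work a thread still performs after a freeze request arrives, depending on whether the freeze check sits only in the outer loop or also in the inner loop. The function name, its parameters, and the notion of a "unit of work" (for example, one swap-in) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model: a freeze request arrives once `freeze_at` units of
 * work have been completed. Returns how many further units of work the
 * thread performs before it notices the request.
 */
static int work_after_freeze(int outer_iters, int inner_iters,
			     int freeze_at, bool check_in_inner_loop)
{
	int done = 0;

	assert(freeze_at >= 0 && freeze_at <= outer_iters * inner_iters);

	for (int o = 0; o < outer_iters; o++) {
		/* Outer-loop check: models a check in khugepaged_do_scan(). */
		if (done >= freeze_at)
			return done - freeze_at;

		for (int i = 0; i < inner_iters; i++) {
			/* Inner-loop check: models a check next to cond_resched(). */
			if (check_in_inner_loop && done >= freeze_at)
				return done - freeze_at;
			done++;	/* one unit of work, e.g. one swap-in */
		}
	}
	return done - freeze_at;
}
```

With 512 inner iterations and a freeze request after 100 units, the outer-only variant performs 412 more units before noticing, while the inner-check variant performs none. Whether that extra work is enough to exceed the 20 second freeze timeout on a real system depends on how slow each swap-in is, which this model does not capture.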

Re: [stable-6.6.y] mm: khugepaged refuses to freeze

Posted by Baolin Wang 3 days, 20 hours ago


On 2/6/26 12:31 PM, Sergey Senozhatsky wrote:
> On (26/02/06 12:38), Sergey Senozhatsky wrote:
> [..]
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index eff9e3061925..fa6a018b20a8 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -1894,6 +1894,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>>>>    		xas_set(&xas, index);
>>>>    		folio = xas_load(&xas);
>>>> +		if (try_to_freeze())
>>>> +			goto xa_unlocked;
>>>> +
>>>>    		VM_BUG_ON(index != xas.xa_index);
>>>>    		if (is_shmem) {
>>>>    			if (!folio) {
>>>
>>> Your analysis is reasonable. When the system is freezing, khugepaged is
>>> still trying to swap-in shmem to collapse, which prevents the system from
>>> entering suspend state. However, it’s not only shmem that will swap in,
>>> collapsing anonymous folios may also trigger swap-in operations.
>>
>> Right, I thought about it but wasn't sure.  Could the inner loop (e.g.
>> collapse_file() in this particular case) loop long enough to fail suspend
>> w/o ever giving the outer loop (khugepaged_do_scan()) a chance to freeze?

Yes, that’s possible. However, if we add a try_to_freeze() check in the 
inner loop, we need to consider various scenarios (such as anonymous 
folio swap-in and other potential cases?), which feels too hacky to me.

> For inner loops I wondered if cond_resched() could be an indicator of
> where try_to_freeze() should be placed.  Those cond_resched() calls
> are there for a reason, after all.   E.g. something like:
> 
> ---
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index fa6a018b20a8..cee08466a069 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2431,6 +2431,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>   		unsigned long hstart, hend;
>   
>   		cond_resched();
> +		if (try_to_freeze())
> +			break;
> +
>   		if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
>   			progress++;
>   			break;
> @@ -2453,6 +2456,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>   			bool mmap_locked = true;
>   
>   			cond_resched();
> +			if (try_to_freeze())
> +				goto breakouterloop;
> +
>   			if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
>   				goto breakouterloop;

This looks better than the previous version. Let’s also wait to see if 
others have any better suggestions.

Re: [stable-6.6.y] mm: khugepaged refuses to freeze

Posted by David Hildenbrand (Arm) 3 days, 16 hours ago

On 2/6/26 06:12, Baolin Wang wrote:
> 
> 
> On 2/6/26 12:31 PM, Sergey Senozhatsky wrote:
>> On (26/02/06 12:38), Sergey Senozhatsky wrote:
>> [..]
>>>
>>> Right, I thought about it but wasn't sure.  Could the inner loop (e.g.
>>> collapse_file() in this particular case) loop long enough to fail 
>>> suspend
>>> w/o ever giving the outer loop (khugepaged_do_scan()) a chance to 
>>> freeze?
> 
> Yes, that’s possible. However, if we add a try_to_freeze() check in the 
> inner loop, we need to consider various scenarios (such as anonymous 
> folio swap-in and other potential cases?), which feels too hacky to me.
> 
>> For inner loops I wondered if cond_resched() could be an indicator of
>> where try_to_freeze() should be placed.  Those cond_resched() calls
>> are there for a reason, after all.   E.g. something like:
>>
>> ---
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index fa6a018b20a8..cee08466a069 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -2431,6 +2431,9 @@ static unsigned int 
>> khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>>           unsigned long hstart, hend;
>>           cond_resched();
>> +        if (try_to_freeze())
>> +            break;
>> +
>>           if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
>>               progress++;
>>               break;
>> @@ -2453,6 +2456,9 @@ static unsigned int 
>> khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>>               bool mmap_locked = true;
>>               cond_resched();
>> +            if (try_to_freeze())
>> +                goto breakouterloop;
>> +
>>               if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
>>                   goto breakouterloop;
> 
> This looks better than the previous version. Let’s also wait to see if 
> others have any better suggestions.

What prevents other callpaths (faults, read(), write(), etc) from 
similarly triggering swapin?

I recall that there is a notifier when the system is preparing to sleep 
(pm notifier or something). Could we simply hook into that to tell 
khugepaged to suspend+resume?

Essentially, making hpage_collapse_test_exit_or_disable() break out for us.
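
The idea can be sketched as a userspace model (not kernel code). In the kernel, the hook would presumably be registered with register_pm_notifier() and react to events such as PM_SUSPEND_PREPARE and PM_POST_SUSPEND. The model below does not call those APIs; the enum, the flag, and all function names are hypothetical stand-ins that only show how a notifier-set flag could make the existing exit-or-disable test end the scan loops early.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's PM notifier events. */
enum pm_event { SUSPEND_PREPARE, POST_SUSPEND };

/* Set while a suspend is in progress; read by the scan loops. */
static atomic_bool scan_paused;

/* Models a PM notifier callback: sets the flag on suspend, clears it on resume. */
static void pm_event_cb(enum pm_event ev)
{
	atomic_store(&scan_paused, ev == SUSPEND_PREPARE);
}

/* Models hpage_collapse_test_exit_or_disable() with the extra condition. */
static bool test_exit_or_disable(bool mm_exiting)
{
	return mm_exiting || atomic_load(&scan_paused);
}

/*
 * Models an inner scan loop. Returns how many pages were scanned before
 * the loop bailed out. `suspend_at` injects a suspend event at that
 * iteration; pass -1 for no event.
 */
static int scan_pages(int nr_pages, int suspend_at)
{
	int scanned = 0;

	assert(nr_pages >= 0);

	for (int i = 0; i < nr_pages; i++) {
		if (i == suspend_at)
			pm_event_cb(SUSPEND_PREPARE);
		if (test_exit_or_disable(false))
			break;
		scanned++;
	}
	return scanned;
}
```

Because the scan loops already call the exit-or-disable test at each iteration, a single extra condition there would cover both the anonymous and the file/shmem paths without adding separate freeze checks to each loop.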

-- 
Cheers,

David

Re: [stable-6.6.y] mm: khugepaged refuses to freeze

Posted by Baolin Wang 3 days, 16 hours ago


On 2/6/26 4:36 PM, David Hildenbrand (Arm) wrote:
> On 2/6/26 06:12, Baolin Wang wrote:
>>
>>
>> On 2/6/26 12:31 PM, Sergey Senozhatsky wrote:
>>> On (26/02/06 12:38), Sergey Senozhatsky wrote:
>>> [..]
>>>>
>>>> Right, I thought about it but wasn't sure.  Could the inner loop (e.g.
>>>> collapse_file() in this particular case) loop long enough to fail 
>>>> suspend
>>>> w/o ever giving the outer loop (khugepaged_do_scan()) a chance to 
>>>> freeze?
>>
>> Yes, that’s possible. However, if we add a try_to_freeze() check in 
>> the inner loop, we need to consider various scenarios (such as 
>> anonymous folio swap-in and other potential cases?), which feels too 
>> hacky to me.
>>
>>> For inner loops I wondered if cond_resched() could be an indicator of
>>> where try_to_freeze() should be placed.  Those cond_resched() calls
>>> are there for a reason, after all.   E.g. something like:
>>>
>>> ---
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index fa6a018b20a8..cee08466a069 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -2431,6 +2431,9 @@ static unsigned int 
>>> khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>>>           unsigned long hstart, hend;
>>>           cond_resched();
>>> +        if (try_to_freeze())
>>> +            break;
>>> +
>>>           if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
>>>               progress++;
>>>               break;
>>> @@ -2453,6 +2456,9 @@ static unsigned int 
>>> khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>>>               bool mmap_locked = true;
>>>               cond_resched();
>>> +            if (try_to_freeze())
>>> +                goto breakouterloop;
>>> +
>>>               if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
>>>                   goto breakouterloop;
>>
>> This looks better than the previous version. Let’s also wait to see if 
>> others have any better suggestions.
> 
> What prevents other callpaths (faults, read(), write(), etc) from 
> similarly triggering swapin?

Usually it’s just a userspace process triggering a single page fault to
swap a page in and then returning to userspace. There aren’t other kernel
threads that, like khugepaged, continuously swap pages in from a loop.

> I recall that there is a notifier when the system is preparing to sleep 
> (pm notifier or something). Could we simply hook into that to tell 
> khugepaged to suspend+resume?

Do you mean “struct dev_pm_ops”, which is used to register PM callbacks 
for devices? However, I don’t know how to use it with a kernel thread.

Also, look at how kswapd does it: kswapd also uses
kthread_freezable_should_stop() to check the freeze state.
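
As a rough illustration of that pattern, the following userspace model (not kernel code) mimics a loop that asks "should I stop?" and "was I just frozen?" in one call, and leaves the scan early in either case. The struct and function names are hypothetical; freezable_should_stop() only imitates the shape of kthread_freezable_should_stop(bool *was_frozen), and the "refrigerator" is reduced to clearing a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state: what the PM core and kthread_stop() would set. */
struct kthread_model {
	bool freeze_requested;
	bool stop_requested;
};

/*
 * Models kthread_freezable_should_stop(): if a freeze is pending, "enter
 * the refrigerator" (here the thread is thawed immediately), report that
 * through was_frozen, and return whether the thread should exit.
 */
static bool freezable_should_stop(struct kthread_model *t, bool *was_frozen)
{
	*was_frozen = t->freeze_requested;
	t->freeze_requested = false;
	return t->stop_requested;
}

/*
 * Models one khugepaged_do_scan()-like call: run up to max_passes scan
 * passes, bailing out on stop or right after a freeze. `freeze_before_pass`
 * injects a freeze request before that pass; pass -1 for none.
 * Returns the number of passes completed in this call.
 */
static int do_scan(struct kthread_model *t, int max_passes,
		   int freeze_before_pass)
{
	int passes = 0;

	assert(max_passes >= 0);

	while (passes < max_passes) {
		bool was_frozen;

		if (passes == freeze_before_pass)
			t->freeze_requested = true;

		if (freezable_should_stop(t, &was_frozen))
			break;
		/* Skip further scanning right after thawing. */
		if (was_frozen)
			break;

		passes++;	/* stands in for khugepaged_scan_mm_slot() */
	}
	return passes;
}
```

Note that this only models a check between scan passes. It says nothing about how long a single pass can run, which is the inner-loop question raised earlier in the thread.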


> Essentially, making hpage_collapse_test_exit_or_disable() break out for us.

Ah, yes, even better:)

Re: [stable-6.6.y] mm: khugepaged refuses to freeze

Posted by David Hildenbrand (Arm) 3 days, 16 hours ago

>> I recall that there is a notifier when the system is preparing to 
>> sleep (pm notifier or something). Could we simply hook into that to 
>> tell khugepaged to suspend+resume?
> 
> Do you mean “struct dev_pm_ops”, which is used to register PM callbacks 
> for devices? However, I don’t know how to use it with a kernel thread.
> 
> Also look at how kswapd does it, kswapd also uses 
> kthread_freezable_should_stop() to check the freeze state.

Right, mimicking what kswapd does sounds reasonable!

-- 
Cheers,

David