Greetings,
I'm looking at a slightly unusual issue where khugepaged refuses to
freeze during system suspend:
...
PM: suspend entry (s2idle)
Filesystems sync: 0.003 seconds
Freezing user space processes
Freezing user space processes completed (elapsed 0.003 seconds)
OOM killer disabled.
Freezing remaining freezable tasks
Freezing remaining freezable tasks failed after 20.004 seconds (1 tasks refusing to freeze, wq_busy=0):
task:khugepaged state:D stack:0 pid:1345 ppid:2 flags:0x00004000
Call Trace:
<TASK>
schedule+0x523/0x16a0
? sysvec_apic_timer_interrupt+0xf/0x90
? asm_sysvec_apic_timer_interrupt+0x16/0x20
? wait_for_completion_io_timeout+0xc5/0x170
schedule_timeout+0x23b/0x6e0
? __pfx_process_timeout+0x10/0x10
? wait_for_completion_io_timeout+0xc5/0x170
io_schedule_timeout+0x3f/0x80
wait_for_completion_io_timeout+0xe4/0x170
submit_bio_wait+0x79/0xc0
swap_readpage+0x150/0x2d0
? __pfx_submit_bio_wait_endio+0x10/0x10
swap_cluster_readahead+0x3be/0x750
? __pfx_workingset_update_node+0x10/0x10
shmem_swapin+0xa7/0x100
shmem_swapin_folio+0xcd/0x2e0
shmem_get_folio+0x237/0x580
collapse_file+0x247/0x1280
hpage_collapse_scan_file+0x26e/0x380
khugepaged+0x43b/0x810
kthread+0xfb/0x120
? __pfx_khugepaged+0x10/0x10
? __pfx_kthread+0x10/0x10
ret_from_fork+0x38/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
...
The system is using zram swap. I wonder if khugepaged should
be suspend/freeze aware. Does something like below make sense?
Or is the problem elsewhere?
---
mm/khugepaged.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index eff9e3061925..fa6a018b20a8 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1894,6 +1894,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
xas_set(&xas, index);
folio = xas_load(&xas);
+ if (try_to_freeze())
+ goto xa_unlocked;
+
VM_BUG_ON(index != xas.xa_index);
if (is_shmem) {
if (!folio) {
--
2.53.0.rc2.204.g2597b5adb4-goog
On 2/6/26 10:47 AM, Sergey Senozhatsky wrote:
> Greetings,
>
> I'm looking at a slightly unusual issue where khugepaged refuses to
> freeze during system suspend:
>
> ...
> PM: suspend entry (s2idle)
> Filesystems sync: 0.003 seconds
> Freezing user space processes
> Freezing user space processes completed (elapsed 0.003 seconds)
> OOM killer disabled.
> Freezing remaining freezable tasks
> Freezing remaining freezable tasks failed after 20.004 seconds (1 tasks refusing to freeze, wq_busy=0):
> task:khugepaged state:D stack:0 pid:1345 ppid:2 flags:0x00004000
> Call Trace:
> <TASK>
> schedule+0x523/0x16a0
> ? sysvec_apic_timer_interrupt+0xf/0x90
> ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> ? wait_for_completion_io_timeout+0xc5/0x170
> schedule_timeout+0x23b/0x6e0
> ? __pfx_process_timeout+0x10/0x10
> ? wait_for_completion_io_timeout+0xc5/0x170
> io_schedule_timeout+0x3f/0x80
> wait_for_completion_io_timeout+0xe4/0x170
> submit_bio_wait+0x79/0xc0
> swap_readpage+0x150/0x2d0
> ? __pfx_submit_bio_wait_endio+0x10/0x10
> swap_cluster_readahead+0x3be/0x750
> ? __pfx_workingset_update_node+0x10/0x10
> shmem_swapin+0xa7/0x100
> shmem_swapin_folio+0xcd/0x2e0
> shmem_get_folio+0x237/0x580
> collapse_file+0x247/0x1280
> hpage_collapse_scan_file+0x26e/0x380
> khugepaged+0x43b/0x810
> kthread+0xfb/0x120
> ? __pfx_khugepaged+0x10/0x10
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x38/0x50
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1b/0x30
> </TASK>
> ...
>
> The system is using zram swap. I wonder if khugepaged should
> be suspend/freeze aware. Does something like below make sense?
> Or is the problem elsewhere?
>
> ---
> mm/khugepaged.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index eff9e3061925..fa6a018b20a8 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1894,6 +1894,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> xas_set(&xas, index);
> folio = xas_load(&xas);
>
> + if (try_to_freeze())
> + goto xa_unlocked;
> +
> VM_BUG_ON(index != xas.xa_index);
> if (is_shmem) {
> if (!folio) {
Your analysis is reasonable. When the system is freezing, khugepaged is
still trying to swap-in shmem to collapse, which prevents the system
from entering suspend state. However, it’s not only shmem that will swap
in; collapsing anonymous folios may also trigger swap-in operations.
Therefore, I think we should skip all collapse scans for anonymous and
file pages in the main scan function khugepaged_do_scan() if the system
is attempting to freeze.
Some sample code is as follows:
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fa1e57fd2c46..cfa7882585ad 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2560,9 +2560,18 @@ static void khugepaged_do_scan(struct collapse_control *cc)
lru_add_drain_all();
while (true) {
+ bool was_frozen;
+
cond_resched();
- if (unlikely(kthread_should_stop()))
+ if (unlikely(kthread_freezable_should_stop(&was_frozen)))
+ break;
+
+ /*
+ * We can speed up thawing tasks if we don't call
+ * khugepaged_scan_mm_slot() after returning from the
+ * refrigerator.
+ */
+ if (was_frozen)
break;
spin_lock(&khugepaged_mm_lock);
On (26/02/06 11:33), Baolin Wang wrote:
> > Freezing remaining freezable tasks failed after 20.004 seconds (1 tasks refusing to freeze, wq_busy=0):
> > task:khugepaged state:D stack:0 pid:1345 ppid:2 flags:0x00004000
> > Call Trace:
> > <TASK>
> > schedule+0x523/0x16a0
> > ? sysvec_apic_timer_interrupt+0xf/0x90
> > ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > ? wait_for_completion_io_timeout+0xc5/0x170
> > schedule_timeout+0x23b/0x6e0
> > ? __pfx_process_timeout+0x10/0x10
> > ? wait_for_completion_io_timeout+0xc5/0x170
> > io_schedule_timeout+0x3f/0x80
> > wait_for_completion_io_timeout+0xe4/0x170
> > submit_bio_wait+0x79/0xc0
> > swap_readpage+0x150/0x2d0
> > ? __pfx_submit_bio_wait_endio+0x10/0x10
> > swap_cluster_readahead+0x3be/0x750
> > ? __pfx_workingset_update_node+0x10/0x10
> > shmem_swapin+0xa7/0x100
> > shmem_swapin_folio+0xcd/0x2e0
> > shmem_get_folio+0x237/0x580
> > collapse_file+0x247/0x1280
> > hpage_collapse_scan_file+0x26e/0x380
> > khugepaged+0x43b/0x810
> > kthread+0xfb/0x120
> > ? __pfx_khugepaged+0x10/0x10
> > ? __pfx_kthread+0x10/0x10
> > ret_from_fork+0x38/0x50
> > ? __pfx_kthread+0x10/0x10
> > ret_from_fork_asm+0x1b/0x30
> > </TASK>
> > ...
> >
> > The system is using zram swap. I wonder if khugepaged should
> > be suspend/freeze aware. Does something like below make sense?
> > Or is the problem elsewhere?
> >
> > ---
> > mm/khugepaged.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index eff9e3061925..fa6a018b20a8 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1894,6 +1894,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> > xas_set(&xas, index);
> > folio = xas_load(&xas);
> > + if (try_to_freeze())
> > + goto xa_unlocked;
> > +
> > VM_BUG_ON(index != xas.xa_index);
> > if (is_shmem) {
> > if (!folio) {
>
> Your analysis is reasonable. When the system is freezing, khugepaged is
> still trying to swap-in shmem to collapse, which prevents the system from
> entering suspend state. However, it’s not only shmem that will swap in,
> collapsing anonymous folios may also trigger swap-in operations.
Right, I thought about it but wasn't sure. Could the inner loop (e.g.
collapse_file() in this particular case) loop long enough to fail suspend
w/o ever giving the outer loop (khugepaged_do_scan()) a chance to freeze?
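To make the latency question concrete, here is a minimal userspace model (not kernel code; every name in it is hypothetical). It counts time in abstract "steps", where one step stands for one blocking unit of work such as a single swap-in, and compares how long a freeze request waits when it is tested only between scans versus before every inner step.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model only. Returns how many steps pass between a freeze
 * request arriving at step `request_at` and the scanning loop noticing it.
 *
 * steps_per_scan: length of one inner scan (one collapse attempt).
 * check_inner:    if true, the request is also tested before every inner
 *                 step; otherwise it is tested only between scans.
 */
static unsigned long freeze_latency(unsigned long steps_per_scan,
                                    unsigned long request_at,
                                    bool check_inner)
{
    unsigned long now = 0;

    assert(steps_per_scan > 0);

    for (;;) {
        /* Outer-loop check, done once per scan. */
        if (now >= request_at)
            return now - request_at;

        for (unsigned long i = 0; i < steps_per_scan; i++) {
            /* Optional inner-loop check, done before every step. */
            if (check_inner && now >= request_at)
                return now - request_at;
            now++; /* one unit of blocking work */
        }
    }
}
```

In this model the outer-only variant can wait up to steps_per_scan - 1 steps, so the worst case grows with the length of a single scan, while the inner check notices the request before the next step starts. Whether a real collapse_file() pass can run long enough to hit the 20 second freeze timeout is exactly the open question above; the model does not answer it.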
On (26/02/06 12:38), Sergey Senozhatsky wrote:
[..]
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index eff9e3061925..fa6a018b20a8 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1894,6 +1894,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> > > xas_set(&xas, index);
> > > folio = xas_load(&xas);
> > > + if (try_to_freeze())
> > > + goto xa_unlocked;
> > > +
> > > VM_BUG_ON(index != xas.xa_index);
> > > if (is_shmem) {
> > > if (!folio) {
> >
> > Your analysis is reasonable. When the system is freezing, khugepaged is
> > still trying to swap-in shmem to collapse, which prevents the system from
> > entering suspend state. However, it’s not only shmem that will swap in,
> > collapsing anonymous folios may also trigger swap-in operations.
>
> Right, I thought about it but wasn't sure. Could the inner loop (e.g.
> collapse_file() in this particular case) loop long enough to fail suspend
> w/o ever giving the outer loop (khugepaged_do_scan()) a chance to freeze?
For inner loops I wondered if cond_resched() could be an indicator of
where try_to_freeze() should be placed. Those cond_resched() calls
are there for a reason, after all. E.g. something like:
---
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fa6a018b20a8..cee08466a069 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2431,6 +2431,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
unsigned long hstart, hend;
cond_resched();
+ if (try_to_freeze())
+ break;
+
if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
progress++;
break;
@@ -2453,6 +2456,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
bool mmap_locked = true;
cond_resched();
+ if (try_to_freeze())
+ goto breakouterloop;
+
if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
goto breakouterloop;
On 2/6/26 12:31 PM, Sergey Senozhatsky wrote:
> On (26/02/06 12:38), Sergey Senozhatsky wrote:
> [..]
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index eff9e3061925..fa6a018b20a8 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -1894,6 +1894,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>>>> xas_set(&xas, index);
>>>> folio = xas_load(&xas);
>>>> + if (try_to_freeze())
>>>> + goto xa_unlocked;
>>>> +
>>>> VM_BUG_ON(index != xas.xa_index);
>>>> if (is_shmem) {
>>>> if (!folio) {
>>>
>>> Your analysis is reasonable. When the system is freezing, khugepaged is
>>> still trying to swap-in shmem to collapse, which prevents the system from
>>> entering suspend state. However, it’s not only shmem that will swap in,
>>> collapsing anonymous folios may also trigger swap-in operations.
>>
>> Right, I thought about it but wasn't sure. Could the inner loop (e.g.
>> collapse_file() in this particular case) loop long enough to fail suspend
>> w/o ever giving the outer loop (khugepaged_do_scan()) a chance to freeze?
Yes, that’s possible. However, if we add a try_to_freeze() check in the
inner loop, we need to consider various scenarios (such as anonymous
folio swap-in and other potential cases?), which feels too hacky to me.
> For inner loops I wondered if cond_resched() could be an indicator of
> where try_to_freeze() should be placed. Those cond_resched() calls
> are there for a reason, after all. E.g. something like:
>
> ---
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index fa6a018b20a8..cee08466a069 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2431,6 +2431,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
> unsigned long hstart, hend;
>
> cond_resched();
> + if (try_to_freeze())
> + break;
> +
> if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
> progress++;
> break;
> @@ -2453,6 +2456,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
> bool mmap_locked = true;
>
> cond_resched();
> + if (try_to_freeze())
> + goto breakouterloop;
> +
> if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> goto breakouterloop;
This looks better than the previous version. Let’s also wait to see if
others have any better suggestions.
On 2/6/26 06:12, Baolin Wang wrote:
>
>
> On 2/6/26 12:31 PM, Sergey Senozhatsky wrote:
>> On (26/02/06 12:38), Sergey Senozhatsky wrote:
>> [..]
>>>
>>> Right, I thought about it but wasn't sure. Could the inner loop (e.g.
>>> collapse_file() in this particular case) loop long enough to fail
>>> suspend
>>> w/o ever giving the outer loop (khugepaged_do_scan()) a chance to
>>> freeze?
>
> Yes, that’s possible. However, if we add a try_to_freeze() check in the
> inner loop, we need to consider various scenarios (such as anonymous
> folio swap-in and other potential cases?), which feels too hacky to me.
>
>> For inner loops I wondered if cond_resched() could be an indicator of
>> where try_to_freeze() should be placed. Those cond_resched() calls
>> are there for a reason, after all. E.g. something like:
>>
>> ---
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index fa6a018b20a8..cee08466a069 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -2431,6 +2431,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>> unsigned long hstart, hend;
>> cond_resched();
>> + if (try_to_freeze())
>> + break;
>> +
>> if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
>> progress++;
>> break;
>> @@ -2453,6 +2456,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>> bool mmap_locked = true;
>> cond_resched();
>> + if (try_to_freeze())
>> + goto breakouterloop;
>> +
>> if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
>> goto breakouterloop;
>
> This looks better than the previous version. Let’s also wait to see if
> others have any better suggestions.
What prevents other callpaths (faults, read(), write(), etc) from
similarly triggering swapin?
I recall that there is a notifier when the system is preparing to sleep
(pm notifier or something). Could we simply hook into that to tell
khugepaged to suspend+resume?
Essentially, making hpage_collapse_test_exit_or_disable() break out for us.
--
Cheers,
David
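The notifier idea can be sketched as a small userspace model. This is only an illustration under assumptions: the event codes are stand-ins modelled on the kernel's PM_SUSPEND_PREPARE and PM_POST_SUSPEND notifications, and the struct, the flag, and all function names are hypothetical rather than anything in mm/khugepaged.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event codes standing in for suspend-prepare/post-suspend. */
enum model_pm_event { MODEL_SUSPEND_PREPARE, MODEL_POST_SUSPEND };

struct model_scanner {
    bool enabled;   /* user-visible on/off knob */
    bool pm_paused; /* set while a suspend is in progress */
};

/* Callback the (modelled) PM core invokes around a suspend cycle. */
static void model_pm_notify(struct model_scanner *s, enum model_pm_event ev)
{
    if (ev == MODEL_SUSPEND_PREPARE)
        s->pm_paused = true;
    else if (ev == MODEL_POST_SUSPEND)
        s->pm_paused = false;
}

/* Stand-in for an exit-or-disable test that every scan loop already calls. */
static bool model_test_exit_or_disable(const struct model_scanner *s)
{
    return !s->enabled || s->pm_paused;
}

/*
 * Scan up to `budget` items and return how many were scanned. The suspend
 * notification is delivered just before item number `suspend_at`.
 */
static unsigned int model_scan(struct model_scanner *s, unsigned int budget,
                               unsigned int suspend_at)
{
    unsigned int done = 0;

    assert(s != NULL);

    while (done < budget) {
        if (done == suspend_at)
            model_pm_notify(s, MODEL_SUSPEND_PREPARE);
        if (model_test_exit_or_disable(s))
            break; /* existing bail-out path, no new check needed */
        done++;
    }
    return done;
}
```

The point of the design is that the scan loops need no new checks: the flag set by the notifier is folded into a test they already perform, so every existing bail-out site stops work once suspend preparation starts and resumes after the post-suspend event.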
On 2/6/26 4:36 PM, David Hildenbrand (Arm) wrote:
> On 2/6/26 06:12, Baolin Wang wrote:
>>
>>
>> On 2/6/26 12:31 PM, Sergey Senozhatsky wrote:
>>> On (26/02/06 12:38), Sergey Senozhatsky wrote:
>>> [..]
>>>>
>>>> Right, I thought about it but wasn't sure. Could the inner loop (e.g.
>>>> collapse_file() in this particular case) loop long enough to fail
>>>> suspend
>>>> w/o ever giving the outer loop (khugepaged_do_scan()) a chance to
>>>> freeze?
>>
>> Yes, that’s possible. However, if we add a try_to_freeze() check in
>> the inner loop, we need to consider various scenarios (such as
>> anonymous folio swap-in and other potential cases?), which feels too
>> hacky to me.
>>
>>> For inner loops I wondered if cond_resched() could be an indicator of
>>> where try_to_freeze() should be placed. Those cond_resched() calls
>>> are there for a reason, after all. E.g. something like:
>>>
>>> ---
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index fa6a018b20a8..cee08466a069 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -2431,6 +2431,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>>> unsigned long hstart, hend;
>>> cond_resched();
>>> + if (try_to_freeze())
>>> + break;
>>> +
>>> if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
>>> progress++;
>>> break;
>>> @@ -2453,6 +2456,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>>> bool mmap_locked = true;
>>> cond_resched();
>>> + if (try_to_freeze())
>>> + goto breakouterloop;
>>> +
>>> if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
>>> goto breakouterloop;
>>
>> This looks better than the previous version. Let’s also wait to see if
>> others have any better suggestions.
>
> What prevents other callpaths (faults, read(), write(), etc) from
> similarly triggering swapin?
Usually it’s just a userspace process triggering one page fault to swap
a page in, after which it returns to userspace. No other kernel thread
continuously swaps pages in within a loop the way khugepaged does.
> I recall that there is a notifier when the system is preparing to sleep
> (pm notifier or something). Could we simply hook into that to tell
> khugepaged to suspend+resume?
Do you mean “struct dev_pm_ops”, which is used to register PM callbacks
for devices? However, I don’t know how to use it with a kernel thread.
Also look at how kswapd does it, kswapd also uses
kthread_freezable_should_stop() to check the freeze state.
> Essentially, making hpage_collapse_test_exit_or_disable() break out for us.
Ah, yes, even better:)
>> I recall that there is a notifier when the system is preparing to
>> sleep (pm notifier or something). Could we simply hook into that to
>> tell khugepaged to suspend+resume?
>
> Do you mean “struct dev_pm_ops”, which is used to register PM callbacks
> for devices? However, I don’t know how to use it with a kernel thread.
>
> Also look at how kswapd does it, kswapd also uses
> kthread_freezable_should_stop() to check the freeze state.
Right, mimicking what kswapd does sounds reasonable!
--
Cheers,
David