[RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues

Zihuan Zhang posted 9 patches 1 month, 4 weeks ago
[RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
Posted by Zihuan Zhang 1 month, 4 weeks ago
The Linux task freezer was designed in a much earlier era, when userspace was relatively simple and flat.
Over the years, as modern desktop and mobile systems have become increasingly complex—with intricate IPC,
asynchronous I/O, and deep event loops—the original freezer model has shown its age.

## Background

Currently, the freezer traverses the task list linearly and attempts to freeze all tasks equally.
It sends a signal and waits for `freezing()` to become true. While this model works well in many cases, it has several inherent limitations:

- Signal-based logic cannot freeze uninterruptible (D-state) tasks
- Dependencies between processes can cause freeze retries 
- Retry-based recovery introduces unpredictable suspend latency

## Real-world problem illustration

Consider the following scenario during suspend:

Freeze Window Begins

    [process A] - epoll_wait()
        │
        ▼
    [process B] - event source (already frozen)

→ A enters D-state because it is waiting for B
→ Cannot respond to freezing signal
→ Freezer retries in a loop
→ Suspend latency spikes

In such cases, we observed that a normal 1–2ms freezer cycle could balloon to **tens of milliseconds**. 
Worse, the kernel has no insight into the root cause and simply retries blindly.

## Proposed solution: Freeze priority model

To address this, we propose a **layered freeze model** based on per-task freeze priorities.

### Design

We introduce 4 levels of freeze priority:


| Priority | Level             | Description                       |
|----------|-------------------|-----------------------------------|
| 0        | HIGH              | D-state tasks                     |
| 1        | NORMAL            | regular user space tasks          |
| 2        | LOW               | not yet used                      |
| 4        | NEVER_FREEZE      | zombie tasks, PF_SUSPEND_TASK     |


The kernel will freeze processes **in priority order**, so that higher-priority tasks are frozen first.
By freezing control or event-source threads first, we prevent dependent tasks from entering D-state prematurely.
This avoids dependency inversion and provides a deterministic path forward for tricky cases.

Although introducing more fine-grained freeze_priority levels improves extensibility and allows better modeling of task dependencies, 
it may also introduce additional overhead during task traversal, potentially affecting freezer performance.

In our test environment, increasing the maximum freeze retries to 16 only added ~4ms of overhead to the total suspend latency,
suggesting the added robustness comes at a relatively low cost. However, for latency-critical systems, this trade-off should be carefully evaluated.

## Benefits

- Solves D-state process freeze stalls caused by premature freezing of dependencies
- Enables more robust and reliable suspend/resume on complex userspace systems
- Introduces extensibility: tasks can be categorized by role, urgency, or dependency
- Reduces race conditions by introducing deterministic freezing order

## Previous Discussion
Link: https://lore.kernel.org/all/20250606062502.19607-1-zhangzihuan@kylinos.cn/
Link: https://lore.kernel.org/all/1ca889fd-6ead-4d4f-a3c7-361ea05bb659@kylinos.cn/

## Future directions

This framework opens up several promising areas for further development:

1. Adaptive behavior based on runtime statistics or retry feedback
The freezer adapts dynamically during suspend/hibernate based on the number of retries and which tasks failed to freeze. 
Tasks that failed in previous rounds will be assigned a higher freeze priority, improving convergence speed and reducing unnecessary retries.

2. cgroup-aware hierarchical freezing for containerized systems
The design supports cgroup-aware task traversal and freezing. 
This ensures compatibility with containerized environments, allowing for better control and visibility when freezing processes in different cgroups.

3. Unified freezing of userspace processes and kernel threads
Based on extensive testing, we found that freezing userspace tasks and kernel threads together works reliably in practice. 
Separating them does not resolve dependency issues between user and kernel context. Moreover, most kernel threads are marked as non-freezable,
so including them in the same freeze pass does not impact correctness and simplifies the logic.

Although the current implementation is relatively simple, it already helps alleviate some suspend failures caused by tasks stuck in D state.
In our testing, we observed that certain D-state tasks are triggered by filesystem sync operations during the freezing phase.
At this stage, we don't yet have a comprehensive solution for that class of problems.
This patchset represents a testable version of our design. We plan to further investigate and address such filesystem-related D-state issues in future revisions.

Patch summary:
 - Patch 1-3: Core infrastructure: field, API, layered freeze logic
 - Patch 4-7: Default priorities and dynamic adjustments
 - Patch 8: Statistics: freeze pass retry count
 - Patch 9: Procfs interface for userspace access

Zihuan Zhang (9):
  freezer: Introduce freeze_priority field in task_struct
  freezer: Introduce API to set per-task freeze priority
  freezer: Add per-priority layered freeze logic
  freezer: Set default freeze priority for userspace tasks
  freezer: set default freeze priority for PF_SUSPEND_TASK processes
  freezer: Set default freeze priority for zombie tasks
  freezer: raise freeze priority of tasks failed to freeze last time
  freezer: Add retry count statistics for freeze pass iterations
  proc: Add /proc/<pid>/freeze_priority interface

 Documentation/filesystems/proc.rst | 14 ++++++-
 fs/proc/base.c                     | 64 ++++++++++++++++++++++++++++++
 include/linux/freezer.h            | 20 ++++++++++
 include/linux/sched.h              |  3 ++
 kernel/fork.c                      |  1 +
 kernel/power/process.c             | 23 ++++++++++-
 kernel/sched/core.c                |  2 +
 7 files changed, 124 insertions(+), 3 deletions(-)

-- 
2.25.1
Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
Posted by Michal Hocko 1 month, 4 weeks ago
On Thu 07-08-25 20:14:09, Zihuan Zhang wrote:
> The Linux task freezer was designed in a much earlier era, when userspace was relatively simple and flat.
> Over the years, as modern desktop and mobile systems have become increasingly complex—with intricate IPC,
> asynchronous I/O, and deep event loops—the original freezer model has shown its age.

A modern userspace might be more complex or convoluted but I do not
think the above statement is accurate or even correct.

> ## Background
> 
> Currently, the freezer traverses the task list linearly and attempts to freeze all tasks equally.
> It sends a signal and waits for `freezing()` to become true. While this model works well in many cases, it has several inherent limitations:
> 
> - Signal-based logic cannot freeze uninterruptible (D-state) tasks
> - Dependencies between processes can cause freeze retries 
> - Retry-based recovery introduces unpredictable suspend latency
> 
> ## Real-world problem illustration
> 
> Consider the following scenario during suspend:
> 
> Freeze Window Begins
> 
>     [process A] - epoll_wait()
>         │
>         ▼
>     [process B] - event source (already frozen)
> 
> → A enters D-state because of waiting for B

I thought epoll_wait was waiting in interruptible sleep.

> → Cannot respond to freezing signal
> → Freezer retries in a loop
> → Suspend latency spikes
> 
> In such cases, we observed that a normal 1–2ms freezer cycle could balloon to **tens of milliseconds**. 
> Worse, the kernel has no insight into the root cause and simply retries blindly.
> 
> ## Proposed solution: Freeze priority model
> 
> To address this, we propose a **layered freeze model** based on per-task freeze priorities.
> 
> ### Design
> 
> We introduce 4 levels of freeze priority:
> 
> 
> | Priority | Level             | Description                       |
> |----------|-------------------|-----------------------------------|
> | 0        | HIGH              | D-state TASKs                     |
> | 1        | NORMAL            | regular  use space TASKS          |
> | 2        | LOW               | not yet used                      |
> | 4        | NEVER_FREEZE      | zombie TASKs , PF_SUSPNED_TASK    |
> 
> 
> The kernel will freeze processes **in priority order**, ensuring that higher-priority tasks are frozen first.
> This avoids dependency inversion scenarios and provides a deterministic path forward for tricky cases.
> By freezing control or event-source threads first, we prevent dependent tasks from entering D-state prematurely — effectively avoiding dependency inversion.

I really fail to see how that is supposed to work to be honest. If a
process is running in the userspace then the priority shouldn't really
matter much. Tasks will get a signal, freeze themselves and you are
done. If they are running in the userspace and e.g. sleeping while not
TASK_FREEZABLE then priority simply makes no difference. And if they are
TASK_FREEZABLE then the priority doesn't matter either.

What am I missing?
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
Posted by Zihuan Zhang 1 month, 4 weeks ago
Hi,

On 2025/8/7 21:25, Michal Hocko wrote:
> On Thu 07-08-25 20:14:09, Zihuan Zhang wrote:
>> The Linux task freezer was designed in a much earlier era, when userspace was relatively simple and flat.
>> Over the years, as modern desktop and mobile systems have become increasingly complex—with intricate IPC,
>> asynchronous I/O, and deep event loops—the original freezer model has shown its age.
> A modern userspace might be more complex or convoluted but I do not
> think the above statement is accurate or even correct.
You’re right — that statement may not be accurate. I’ll be more careful 
with the wording.
>> ## Background
>>
>> Currently, the freezer traverses the task list linearly and attempts to freeze all tasks equally.
>> It sends a signal and waits for `freezing()` to become true. While this model works well in many cases, it has several inherent limitations:
>>
>> - Signal-based logic cannot freeze uninterruptible (D-state) tasks
>> - Dependencies between processes can cause freeze retries
>> - Retry-based recovery introduces unpredictable suspend latency
>>
>> ## Real-world problem illustration
>>
>> Consider the following scenario during suspend:
>>
>> Freeze Window Begins
>>
>>      [process A] - epoll_wait()
>>          │
>>          ▼
>>      [process B] - event source (already frozen)
>>
>> → A enters D-state because of waiting for B
> I thought opoll_wait was waiting in interruptible sleep.

Apologies — my description may not be entirely accurate.

However, here are some relevant dmesg logs:

[   62.880497] PM: suspend entry (deep)
[   63.130639] Filesystems sync: 0.249 seconds
[   63.130643] PM: Preparing system for sleep (deep)
[   63.226398] Freezing user space processes
[   63.227193] freeze round: 0, task to freeze: 681
[   63.228110] freeze round: 1, task to freeze: 1
[   63.230064] task:Xorg            state:D stack:0     pid:1404  tgid:1404  ppid:1348   task_flags:0x400100 flags:0x00004004
[   63.230068] Call Trace:
[   63.230069]  <TASK>
[   63.230071]  __schedule+0x52e/0xea0
[   63.230077]  schedule+0x27/0x80
[   63.230079]  schedule_timeout+0xf2/0x100
[   63.230082]  wait_for_completion+0x85/0x130
[   63.230085]  __flush_work+0x21f/0x310
[   63.230087]  ? __pfx_wq_barrier_func+0x10/0x10
[   63.230091]  drm_mode_rmfb+0x138/0x1b0
[   63.230093]  ? __pfx_drm_mode_rmfb_work_fn+0x10/0x10
[   63.230095]  ? __pfx_drm_mode_rmfb_ioctl+0x10/0x10
[   63.230097]  drm_ioctl_kernel+0xa5/0x100
[   63.230099]  drm_ioctl+0x270/0x4b0
[   63.230101]  ? __pfx_drm_mode_rmfb_ioctl+0x10/0x10
[   63.230104]  ? syscall_exit_work+0x108/0x140
[   63.230107]  radeon_drm_ioctl+0x4a/0x80 [radeon]
[   63.230141]  __x64_sys_ioctl+0x93/0xe0
[   63.230144]  ? syscall_trace_enter+0xfa/0x1c0
[   63.230146]  do_syscall_64+0x7d/0x2c0
[   63.230148]  ? do_syscall_64+0x1f3/0x2c0
[   63.230150]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   63.230153] RIP: 0033:0x7f1aa132550b
[   63.230154] RSP: 002b:00007ffebab69678 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   63.230156] RAX: ffffffffffffffda RBX: 00007ffebab696bc RCX: 00007f1aa132550b
[   63.230158] RDX: 00007ffebab696bc RSI: 00000000c00464af RDI: 000000000000000e
[   63.230159] RBP: 00000000c00464af R08: 00007f1aa0c41220 R09: 000055a71ce32310
[   63.230160] R10: 0000000000000087 R11: 0000000000000246 R12: 000055a71b813660
[   63.230161] R13: 000000000000000e R14: 0000000003a8f5cd R15: 000055a71b6bbfb0
[   63.230164]  </TASK>
[   63.230248] freeze round: 2, task to freeze: 1


You can find it in this patch:

Link: https://lore.kernel.org/all/20250619035355.33402-1-zhangzihuan@kylinos.cn/

>> → Cannot respond to freezing signal
>> → Freezer retries in a loop
>> → Suspend latency spikes
>>
>> In such cases, we observed that a normal 1–2ms freezer cycle could balloon to **tens of milliseconds**.
>> Worse, the kernel has no insight into the root cause and simply retries blindly.
>>
>> ## Proposed solution: Freeze priority model
>>
>> To address this, we propose a **layered freeze model** based on per-task freeze priorities.
>>
>> ### Design
>>
>> We introduce 4 levels of freeze priority:
>>
>>
>> | Priority | Level             | Description                       |
>> |----------|-------------------|-----------------------------------|
>> | 0        | HIGH              | D-state TASKs                     |
>> | 1        | NORMAL            | regular  use space TASKS          |
>> | 2        | LOW               | not yet used                      |
>> | 4        | NEVER_FREEZE      | zombie TASKs , PF_SUSPNED_TASK    |
>>
>>
>> The kernel will freeze processes **in priority order**, ensuring that higher-priority tasks are frozen first.
>> This avoids dependency inversion scenarios and provides a deterministic path forward for tricky cases.
>> By freezing control or event-source threads first, we prevent dependent tasks from entering D-state prematurely — effectively avoiding dependency inversion.
> I really fail to see how that is supposed to work to be honest. If a
> process is running in the userspace then the priority shouldn't really
> matter much. Tasks will get a signal, freeze themselves and you are
> done. If they are running in the userspace and e.g. sleeping while not
> TASK_FREEZABLE then priority simply makes no difference. And if they are
> TASK_FREEZABLE then the priority doens't matter either.
>
> What am I missing?
Under ideal conditions, if a userspace task is TASK_FREEZABLE, receives
the freezing() signal, and enters the refrigerator in a timely manner, 
then freeze priority wouldn’t make a difference.

However, in practice, we’ve observed cases where tasks appear stuck in 
uninterruptible sleep (D state) during the freeze phase, and thus
cannot respond to signals or enter the refrigerator. These tasks are 
technically TASK_FREEZABLE, but due to the nature of their sleep state, 
they don’t freeze promptly, and may require multiple retry rounds, or 
cause the entire suspend to fail.
Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
Posted by Oleg Nesterov 1 month, 3 weeks ago
On 08/08, Zihuan Zhang wrote:
>
> On 2025/8/7 21:25, Michal Hocko wrote:
> >If they are running in the userspace and e.g. sleeping while not
> >TASK_FREEZABLE then priority simply makes no difference. And if they are
> >TASK_FREEZABLE then the priority doens't matter either.
> >
> >What am I missing?

I too do not understand how this series can improve the freezer.

> under ideal conditions, if a userspace task is TASK_FREEZABLE, receives the
> freezing() signal, and enters the refrigerator in a timely manner,

Note that __freeze_task() won't even send a signal to a sleeping
TASK_FREEZABLE task, __freeze_task() will just change its state to
TASK_FROZEN.
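
As a rough illustration of that distinction, here is a simplified userspace model. It is not the kernel code: the toy_* names and states are made up for this sketch, and the real freezer tracks much more state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy task states for illustration; these are NOT the kernel's state bits. */
enum toy_state {
	TOY_RUNNING,			/* executing in userspace */
	TOY_SLEEP_FREEZABLE,		/* sleeping, marked freezable */
	TOY_SLEEP_UNINTERRUPTIBLE,	/* D state, not marked freezable */
	TOY_FROZEN,
};

struct toy_task {
	enum toy_state state;
	bool wake_requested;	/* stands in for the fake-signal wakeup */
};

/* Returns true if the task is frozen by the time the call returns. */
bool toy_freeze_task(struct toy_task *t)
{
	assert(t != NULL);

	if (t->state == TOY_FROZEN)
		return true;

	if (t->state == TOY_SLEEP_FREEZABLE) {
		/* Sleeping and freezable: flip the state, no signal needed. */
		t->state = TOY_FROZEN;
		return true;
	}

	/* Otherwise only ask the task to freeze itself later. */
	t->wake_requested = true;
	return false;
}
```

In this model a task in TOY_SLEEP_UNINTERRUPTIBLE stays where it is until its wait completes, which is the case that forces the freezer to retry.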

Oleg.

Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
Posted by Zihuan Zhang 1 month, 3 weeks ago
Hi,

On 2025/8/8 15:57, Oleg Nesterov wrote:
> On 08/08, Zihuan Zhang wrote:
>> On 2025/8/7 21:25, Michal Hocko wrote:
>>> If they are running in the userspace and e.g. sleeping while not
>>> TASK_FREEZABLE then priority simply makes no difference. And if they are
>>> TASK_FREEZABLE then the priority doens't matter either.
>>>
>>> What am I missing?
> I too do not understand how can this series improve the freezer.

Thanks for your question — actually, I just replied to Michal with a 
similar explanation, but I really appreciate you raising the same point, 
so let me add a bit more context here.

Right now, we're trying to address the case where certain tasks fail to 
freeze (often due to short-lived D-state issues). Our current workaround 
is to increase the number of freeze iterations in the next suspend 
attempt for those tasks.

While this isn't a perfect solution, the overhead of a few extra 
iterations is minimal compared to the cost of retrying the whole suspend 
cycle due to a stuck D-state task. So for now, we believe this is a 
reasonable tradeoff until we find a more deterministic way to 
preemptively detect and prioritize problematic tasks.

Happy to hear your thoughts or suggestions if you think there's a better 
direction to explore.

>> under ideal conditions, if a userspace task is TASK_FREEZABLE, receives the
>> freezing() signal, and enters the refrigerator in a timely manner,
> Note that __freeze_task() won't even send a signal to a sleeping
> TASK_FREEZABLE task, __freeze_task() will just change its state to
> TASK_FROZEN.
>
> Oleg.
>
You are right.
Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
Posted by Michal Hocko 1 month, 3 weeks ago
On Fri 08-08-25 09:13:30, Zihuan Zhang wrote:
[...]
> However, in practice, we’ve observed cases where tasks appear stuck in
> uninterruptible sleep (D state) during the freeze phase  — and thus cannot
> respond to signals or enter the refrigerator. These tasks are technically
> TASK_FREEZABLE, but due to the nature of their sleep state, they don’t
> freeze promptly, and may require multiple retry rounds, or cause the entire
> suspend to fail.

Right, but that is an inherent problem of the freezer implementation.
It is not really clear to me how priorities or layers improve on that.
Could you please elaborate on that?
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
Posted by Zihuan Zhang 1 month, 3 weeks ago
On 2025/8/8 15:00, Michal Hocko wrote:
> On Fri 08-08-25 09:13:30, Zihuan Zhang wrote:
> [...]
>> However, in practice, we’ve observed cases where tasks appear stuck in
>> uninterruptible sleep (D state) during the freeze phase  — and thus cannot
>> respond to signals or enter the refrigerator. These tasks are technically
>> TASK_FREEZABLE, but due to the nature of their sleep state, they don’t
>> freeze promptly, and may require multiple retry rounds, or cause the entire
>> suspend to fail.
> Right, but that is an inherent problem of the freezer implemenatation.
> It is not really clear to me how priorities or layers improve on that.
> Could you please elaborate on that?

Thanks for the follow-up.

From our observations, we’ve seen processes like Xorg that are in a
normal state before freezing begins, but enter D state during the freeze
window. Upon investigation, we found that these processes often depend
on other user processes (e.g., I/O helpers or system services), and when
those dependencies are frozen first, the dependent process (like Xorg)
gets stuck and can’t be frozen itself.

This led us to treat such processes as “hard to freeze” tasks — not 
because they’re inherently unfreezable, but because they are more likely 
to become problematic if not frozen early enough.

So our model works as follows:
     •    By default, the freezer tries to freeze all freezable tasks in
          each round.
     •    With our approach, we only attempt to freeze tasks whose
          freeze_priority is less than or equal to the current round number.
     •    This ensures that higher-priority (i.e., harder-to-freeze)
          tasks are attempted earlier, increasing the chance that they
          freeze before being blocked by others.
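
The round rule above can be sketched in a few lines of C. This is only a sketch under stated assumptions: the enum and function names are hypothetical, the numeric values are taken from the table in the cover letter, and skipping NEVER_FREEZE tasks entirely is an assumption about how that level is meant to behave.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; values follow the cover-letter table. */
enum freeze_priority {
	FREEZE_PRIORITY_HIGH = 0,
	FREEZE_PRIORITY_NORMAL = 1,
	FREEZE_PRIORITY_LOW = 2,
	FREEZE_PRIORITY_NEVER = 4,
};

/*
 * Decide whether a task should be attempted in the given freeze round.
 * A task is eligible once its priority value is <= the round number.
 */
bool should_try_freeze(enum freeze_priority prio, int round)
{
	assert(round >= 0);

	/* Assumption: NEVER_FREEZE tasks are skipped in every round. */
	if (prio == FREEZE_PRIORITY_NEVER)
		return false;

	return (int)prio <= round;
}
```

Under this rule, round 0 attempts only HIGH tasks, round 1 adds NORMAL tasks, and round 2 adds LOW tasks.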

Since we cannot know in advance which tasks will be difficult to freeze,
we use heuristics:
     •    Any task that causes freeze failure or is found in D state
          during the freeze window is treated as hard-to-freeze in the
          next attempt and its priority is increased.
     •    Additionally, users can manually raise/reduce the freeze
          priority of known problematic tasks via the exposed procfs
          interface (/proc/<pid>/freeze_priority), giving them
          fine-grained control.
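
The feedback heuristic could look roughly like the sketch below. The struct, the macro names, and the choice to raise the priority by one level per failed attempt are assumptions made for illustration; the series may well jump straight to the highest level instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants; values follow the cover-letter table. */
#define PRIO_HIGH	0
#define PRIO_NEVER	4

/* Hypothetical per-task bookkeeping used only in this sketch. */
struct task_hint {
	int freeze_priority;
	int failed_last_time;	/* set when the task blocked the last attempt */
};

/*
 * Raise the priority (lower the numeric value) of every task that failed
 * to freeze last time. Returns how many tasks were promoted.
 */
size_t promote_failed_tasks(struct task_hint *tasks, size_t n)
{
	size_t promoted = 0;

	assert(tasks != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		struct task_hint *t = &tasks[i];

		if (!t->failed_last_time)
			continue;
		t->failed_last_time = 0;

		/* Never-freeze and already-highest tasks are left alone. */
		if (t->freeze_priority == PRIO_NEVER ||
		    t->freeze_priority == PRIO_HIGH)
			continue;

		t->freeze_priority--;
		promoted++;
	}
	return promoted;
}
```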

This doesn’t change the fundamental logic of the freezer (it still
retries until all tasks are frozen), but by adjusting the traversal
order, we’ve observed significantly fewer retries and more reliable
success in scenarios where these D state transitions occur.
Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
Posted by Michal Hocko 1 month, 3 weeks ago
On Fri 08-08-25 15:52:31, Zihuan Zhang wrote:
> 
> On 2025/8/8 15:00, Michal Hocko wrote:
> > On Fri 08-08-25 09:13:30, Zihuan Zhang wrote:
> > [...]
> > > However, in practice, we’ve observed cases where tasks appear stuck in
> > > uninterruptible sleep (D state) during the freeze phase  — and thus cannot
> > > respond to signals or enter the refrigerator. These tasks are technically
> > > TASK_FREEZABLE, but due to the nature of their sleep state, they don’t
> > > freeze promptly, and may require multiple retry rounds, or cause the entire
> > > suspend to fail.
> > Right, but that is an inherent problem of the freezer implemenatation.
> > It is not really clear to me how priorities or layers improve on that.
> > Could you please elaborate on that?
> 
> Thanks for the follow-up.
> 
> From our observations, we’ve seen processes like Xorg that are in a normal
> state before freezing begins, but enter D state during the freeze window.
> Upon investigation,
> 
> we found that these processes often depend on other user processes (e.g.,
> I/O helpers or system services), and when those dependencies are frozen
> first, the dependent process (like Xorg) gets stuck and can’t be frozen
> itself.

OK, I see.

> This led us to treat such processes as “hard to freeze” tasks — not because
> they’re inherently unfreezable, but because they are more likely to become
> problematic if not frozen early enough.
> 
> So our model works as follows:
>     •    By default, freezer tries to freeze all freezable tasks in each
> round.
>     •    With our approach, we only attempt to freeze tasks whose
> freeze_priority is less than or equal to the current round number.
>     •    This ensures that higher-priority (i.e., harder-to-freeze) tasks
> are attempted earlier, increasing the chance that they freeze before being
> blocked by others.
> 
> Since we cannot know in advance which tasks will be difficult to freeze, we
> use heuristics:
>     •    Any task that causes freeze failure or is found in D state during
> the freeze window is treated as hard-to-freeze in the next attempt and its
> priority is increased.
>     •    Additionally, users can manually raise/reduce the freeze priority
> of known problematic tasks via an exposed sysfs interface, giving them
> fine-grained control.

This would have been very useful information for the changelog so that
we can understand what you are trying to achieve.

> This doesn’t change the fundamental logic of the freezer — it still retries
> until all tasks are frozen — but by adjusting the traversal order,
> 
>  we’ve observed significantly fewer retries and more reliable success in
> scenarios where these D state transitions occur.
 
OK, I believe I do understand what you are trying to achieve but I am
not convinced this is a robust way to deal with the problem. This all
seems highly timing specific; it might work in a very specific use case,
but you are essentially trying to fight tiny race windows with a very
probabilistic interface.

Also the interface seems to be really coarse grained and it can easily
turn out insufficient for other usecases while it is not entirely clear
to me how this could be extended for those.

I believe it would be more useful to find sources of those freezer
blockers and try to address those. Making more blocked tasks
__set_task_frozen compatible sounds like a general improvement in
itself.

Thanks
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
Posted by Zihuan Zhang 1 month, 3 weeks ago
On 2025/8/8 16:58, Michal Hocko wrote:
> On Fri 08-08-25 15:52:31, Zihuan Zhang wrote:
>> On 2025/8/8 15:00, Michal Hocko wrote:
>>> On Fri 08-08-25 09:13:30, Zihuan Zhang wrote:
>>> [...]
>>>> However, in practice, we’ve observed cases where tasks appear stuck in
>>>> uninterruptible sleep (D state) during the freeze phase  — and thus cannot
>>>> respond to signals or enter the refrigerator. These tasks are technically
>>>> TASK_FREEZABLE, but due to the nature of their sleep state, they don’t
>>>> freeze promptly, and may require multiple retry rounds, or cause the entire
>>>> suspend to fail.
>>> Right, but that is an inherent problem of the freezer implemenatation.
>>> It is not really clear to me how priorities or layers improve on that.
>>> Could you please elaborate on that?
>> Thanks for the follow-up.
>>
>>  From our observations, we’ve seen processes like Xorg that are in a normal
>> state before freezing begins, but enter D state during the freeze window.
>> Upon investigation,
>>
>> we found that these processes often depend on other user processes (e.g.,
>> I/O helpers or system services), and when those dependencies are frozen
>> first, the dependent process (like Xorg) gets stuck and can’t be frozen
>> itself.
> OK, I see.
>
>> This led us to treat such processes as “hard to freeze” tasks — not because
>> they’re inherently unfreezable, but because they are more likely to become
>> problematic if not frozen early enough.
>>
>> So our model works as follows:
>>      •    By default, freezer tries to freeze all freezable tasks in each
>> round.
>>      •    With our approach, we only attempt to freeze tasks whose
>> freeze_priority is less than or equal to the current round number.
>>      •    This ensures that higher-priority (i.e., harder-to-freeze) tasks
>> are attempted earlier, increasing the chance that they freeze before being
>> blocked by others.
>>
>> Since we cannot know in advance which tasks will be difficult to freeze, we
>> use heuristics:
>>      •    Any task that causes freeze failure or is found in D state during
>> the freeze window is treated as hard-to-freeze in the next attempt and its
>> priority is increased.
>>      •    Additionally, users can manually raise/reduce the freeze priority
>> of known problematic tasks via an exposed sysfs interface, giving them
>> fine-grained control.
> This would have been a very useful information for the changelog so that
> we can understand what you are trying to achieve.
>
Got it, I’ll add that info to the changelog. Thanks!
>> This doesn’t change the fundamental logic of the freezer — it still retries
>> until all tasks are frozen — but by adjusting the traversal order,
>>
>>   we’ve observed significantly fewer retries and more reliable success in
>> scenarios where these D state transitions occur.
>   
> OK, I believe I do understand what you are trying to achieve but I am
> not conviced this is a robust way to deal with the problem. This all
> seems highly timing specific that might work in very specific usecase
> but you are essentially trying to fight tiny race windows with a very
> probabilitistic interface.

Actually, our approach does not conflict with solving the problem. We 
plan to keep the freeze priority mechanism disabled by default and only 
enable it when issues arise, so as to maintain the consistency of the 
existing code flow as much as possible. It acts like a fallback mechanism.

We acknowledge that the causes of D-state tasks are complex and require 
high effort to fully resolve, which the current freezer mechanism cannot 
achieve. Our solution is low-cost and able to capture some problematic 
tasks effectively.

> Also the interface seems to be really coarse grained and it can easily
> turn out insufficient for other usecases while it is not entirely clear
> to me how this could be extended for those.
We recognize that the current interface is relatively coarse-grained
and may not be sufficient for all scenarios. The present implementation 
is a basic version.

Our plan is to introduce a classification-based mechanism that assigns 
different freeze priorities according to process categories. For 
example, filesystem and graphics-related processes will be given higher 
default freeze priority, as they are critical in the freezing workflow. 
This classification approach helps target important processes more 
precisely.

However, this requires further testing and refinement before full 
deployment. We believe this incremental, category-based design will make 
the mechanism more effective and adaptable over time while keeping it 
manageable.
> I believe it would be more useful to find sources of those freezer
> blockers and try to address those. Making more blocked tasks
> __set_task_frozen compatible sounds like a general improvement in
> itself.

We have already identified some causes of D-state tasks, many of which
are related to the filesystem. On some systems, certain processes 
frequently execute ext4_sync_file, and under contention this can lead to 
D-state tasks.

[ 6616.650482] task:ThreadPoolForeg state:D stack:0     pid:262026 tgid:4065  ppid:2490   task_flags:0x400040 flags:0x00004004
[ 6616.650485] Call Trace:
[ 6616.650486]  <TASK>
[ 6616.650489]  __schedule+0x532/0xea0
[ 6616.650494]  schedule+0x27/0x80
[ 6616.650496]  jbd2_log_wait_commit+0xa6/0x120
[ 6616.650499]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 6616.650502]  ext4_sync_file+0x1ba/0x380
[ 6616.650505]  do_fsync+0x3b/0x80
[ 6616.650507]  __x64_sys_fdatasync+0x17/0x20
[ 6616.650509]  do_syscall_64+0x7d/0x2c0
[ 6616.650512]  ? syscall_exit_work+0x108/0x140
[ 6616.650515]  ? do_syscall_64+0x1f3/0x2c0
[ 6616.650517]  ? syscall_exit_work+0x108/0x140
[ 6616.650519]  ? do_syscall_64+0x1d5/0x2c0
[ 6616.650522]  ? audit_reset_context.part.0+0x284/0x2f0
[ 6616.650524]  ? syscall_exit_work+0x108/0x140
[ 6616.650527]  ? do_syscall_64+0x1f3/0x2c0
[ 6616.650529]  ? futex_unqueue+0x4e/0x80
[ 6616.650531]  ? __futex_wait+0x9b/0x100
[ 6616.650534]  ? __pfx_futex_wake_mark+0x10/0x10
[ 6616.650536]  ? timerqueue_del+0x2e/0x50
[ 6616.650539]  ? __remove_hrtimer+0x39/0x70
[ 6616.650542]  ? hrtimer_try_to_cancel+0x85/0x100
[ 6616.650544]  ? hrtimer_cancel+0x15/0x30
[ 6616.650546]  ? futex_wait+0x7d/0x110
[ 6616.650549]  ? __pfx_hrtimer_wakeup+0x10/0x10
[ 6616.650552]  ? audit_reset_context.part.0+0x284/0x2f0
[ 6616.650554]  ? syscall_exit_work+0x108/0x140
[ 6616.650556]  ? do_syscall_64+0x1d5/0x2c0
[ 6616.650558]  ? switch_fpu_return+0x4f/0xd0
[ 6616.650560]  ? do_syscall_64+0x1d5/0x2c0
[ 6616.650563]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 6616.650565] RIP: 0033:0x7f095ef8f3eb
[ 6616.650567] RSP: 002b:00007f07409fa360 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 6616.650569] RAX: ffffffffffffffda RBX: 00000d38021f03a0 RCX: 00007f095ef8f3eb
[ 6616.650570] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000009a
[ 6616.650571] RBP: 00007f07409fa410 R08: 0000000000000000 R09: 00007f07409fa570
[ 6616.650572] R10: 00007f0960a60000 R11: 0000000000000293 R12: 00000d38021f0380
[ 6616.650573] R13: 000055c28c70b400 R14: 00007f07409fa3a0 R15: 00007f07409fa380


While the kernel already supports freezing the filesystem, which can 
address this problem, it is quite expensive: in our tests, enabling this 
feature increases the suspend time by about 3 to 4 seconds. We are 
therefore exploring lower-cost approaches that mitigate the issue without 
such a heavy performance impact.

root@zzhwaxy-pc:/sys/power# echo 1 > freeze_filesystems
root@zzhwaxy-pc:/sys/power# sudo dmesg | grep -E 'suspend'
[ 9844.984658] PM: suspend entry (deep)
[ 9850.998197] PM: suspend exit

root@zzhwaxy-pc:/sys/power# echo 0 > freeze_filesystems
root@zzhwaxy-pc:/sys/power# sudo dmesg | grep -E 'suspend'
[ 9893.928486] PM: suspend entry (deep)
[ 9896.239425] PM: suspend exit
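
For reference, the two measurements above can be compared with a small script. This is only a sketch: it assumes the dmesg timestamp format shown in the output above, and `suspend_durations` is a hypothetical helper name, not an existing tool. The entry-to-exit interval covers the whole suspend/resume cycle, so only the difference between two runs under the same test setup is meaningful.

```python
import re

# Matches dmesg lines such as "[ 9844.984658] PM: suspend entry (deep)".
# The pattern is an assumption based on the output quoted above.
PM_LINE = re.compile(r"\[\s*(\d+\.\d+)\]\s+PM: suspend (entry|exit)")


def suspend_durations(dmesg_text):
    """Return the seconds between each 'suspend entry' and the next 'suspend exit'."""
    durations = []
    entry_ts = None
    for line in dmesg_text.splitlines():
        m = PM_LINE.search(line)
        if not m:
            continue
        ts, kind = float(m.group(1)), m.group(2)
        if kind == "entry":
            entry_ts = ts
        elif entry_ts is not None:
            durations.append(ts - entry_ts)
            entry_ts = None
    return durations


# The two runs quoted above, copied verbatim.
WITH_FS_FREEZE = """\
[ 9844.984658] PM: suspend entry (deep)
[ 9850.998197] PM: suspend exit
"""
WITHOUT_FS_FREEZE = """\
[ 9893.928486] PM: suspend entry (deep)
[ 9896.239425] PM: suspend exit
"""

if __name__ == "__main__":
    on = suspend_durations(WITH_FS_FREEZE)[0]
    off = suspend_durations(WITHOUT_FS_FREEZE)[0]
    print(f"freeze_filesystems=1: {on:.3f} s")
    print(f"freeze_filesystems=0: {off:.3f} s")
    print(f"difference:           {on - off:.3f} s")
```

On the two runs above, the difference falls inside the 3 to 4 second range mentioned earlier.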

> Thanks
Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
Posted by Michal Hocko 1 month, 3 weeks ago
On Mon 11-08-25 17:13:43, Zihuan Zhang wrote:
> 
> On 2025/8/8 16:58, Michal Hocko wrote:
[...]
> > Also the interface seems to be really coarse grained and it can easily
> > turn out insufficient for other usecases while it is not entirely clear
> > to me how this could be extended for those.
>  We recognize that the current interface is relatively coarse-grained and
> may not be sufficient for all scenarios. The present implementation is a
> basic version.
> 
> Our plan is to introduce a classification-based mechanism that assigns
> different freeze priorities according to process categories. For example,
> filesystem and graphics-related processes will be given higher default
> freeze priority, as they are critical in the freezing workflow. This
> classification approach helps target important processes more precisely.
> 
> However, this requires further testing and refinement before full
> deployment. We believe this incremental, category-based design will make the
> mechanism more effective and adaptable over time while keeping it
> manageable.

Unless there is a clear path for a more extendable interface then
introducing this one is a no-go. We do not want to grow different ways
to establish freezing policies.

But much more fundamentally. So far I haven't really seen any argument
why different priorities help with the underlying problem other than the
timing might be slightly different if you change the order of freezing.
This to me sounds like the proposed scheme mostly works around the
problem you are seeing and as such is not a really good candidate to be
merged as a long term solution. Not to mention with a user API that
needs to be maintained for ever.

So NAK from me on the interface.

> > I believe it would be more useful to find sources of those freezer
> > blockers and try to address those. Making more blocked tasks
> > __set_task_frozen compatible sounds like a general improvement in
> > itself.
> 
> We have already identified some causes of D-state tasks, many of which are
> related to the filesystem. On some systems, certain processes frequently
> execute ext4_sync_file, and under contention this can lead to D-state tasks.

Please work with maintainers of those subsystems to find proper
solutions.

-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
Posted by Zihuan Zhang 1 month, 3 weeks ago
Hi all,

We encountered an issue where the number of freeze retries increased due 
to processes stuck in D state. The logs point to jbd2-related activity.

log1:

[ 6616.650482] task:ThreadPoolForeg state:D stack:0     pid:262026 tgid:4065  ppid:2490   task_flags:0x400040 flags:0x00004004
[ 6616.650485] Call Trace:
[ 6616.650486]  <TASK>
[ 6616.650489]  __schedule+0x532/0xea0
[ 6616.650494]  schedule+0x27/0x80
[ 6616.650496]  jbd2_log_wait_commit+0xa6/0x120
[ 6616.650499]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 6616.650502]  ext4_sync_file+0x1ba/0x380
[ 6616.650505]  do_fsync+0x3b/0x80

log2:

[  631.206315] jdb2_log_wait_log_commit  completed (elapsed 0.002 seconds)
[  631.215325] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
[  631.240704] jdb2_log_wait_log_commit  completed (elapsed 0.386 seconds)
[  631.262167] Filesystems sync: 0.424 seconds
[  631.262821] Freezing user space processes
[  631.263839] freeze round: 1, task to freeze: 852
[  631.265128] freeze round: 2, task to freeze: 2
[  631.267039] freeze round: 3, task to freeze: 2
[  631.271176] freeze round: 4, task to freeze: 2
[  631.279160] freeze round: 5, task to freeze: 2
[  631.287152] freeze round: 6, task to freeze: 2
[  631.295346] freeze round: 7, task to freeze: 2
[  631.301747] freeze round: 8, task to freeze: 2
[  631.309346] freeze round: 9, task to freeze: 2
[  631.317353] freeze round: 10, task to freeze: 2
[  631.325348] freeze round: 11, task to freeze: 2
[  631.333353] freeze round: 12, task to freeze: 2
[  631.341358] freeze round: 13, task to freeze: 2
[  631.349357] freeze round: 14, task to freeze: 2
[  631.357363] freeze round: 15, task to freeze: 2
[  631.365361] freeze round: 16, task to freeze: 2
[  631.373379] freeze round: 17, task to freeze: 2
[  631.381366] freeze round: 18, task to freeze: 2
[  631.389365] freeze round: 19, task to freeze: 2
[  631.397371] freeze round: 20, task to freeze: 2
[  631.405373] freeze round: 21, task to freeze: 2
[  631.413373] freeze round: 22, task to freeze: 2
[  631.421392] freeze round: 23, task to freeze: 1
[  631.429948] freeze round: 24, task to freeze: 1
[  631.438295] freeze round: 25, task to freeze: 1
[  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
[  631.446387] freeze round: 26, task to freeze: 0
[  631.446390] Freezing user space processes completed (elapsed 0.183 seconds)
[  631.446392] OOM killer disabled.
[  631.446393] Freezing remaining freezable tasks
[  631.446656] freeze round: 1, task to freeze: 4
[  631.447976] freeze round: 2, task to freeze: 0
[  631.447978] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[  631.447980] PM: suspend debug: Waiting for 1 second(s).
[  632.450858] OOM killer enabled.
[  632.450859] Restarting tasks: Starting
[  632.453140] Restarting tasks: Done
[  632.453173] random: crng reseeded on system resumption
[  632.453370] PM: suspend exit
[  632.462799] jdb2_log_wait_log_commit  completed (elapsed 0.000 seconds)
[  632.466114] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)

This is the reason:

[  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)


During freezing, user processes executing jbd2_log_wait_commit enter D 
state because this function calls wait_event and can take tens of 
milliseconds to complete. This long execution time, coupled with 
possible competition with the freezer, causes repeated freeze retries.
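
The retry behaviour in log2 can be summarized with a short script. This is a sketch only: the "freeze round" lines come from our debug output, so the regular expression is an assumption about that format, and `summarize_freeze_rounds` is a hypothetical helper name. The sample is an excerpt of log2 (rounds 1, 2, 25 and 26, plus the first round of the following phase), copied verbatim.

```python
import re

# Matches debug lines such as "[  631.263839] freeze round: 1, task to freeze: 852".
ROUND_LINE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+freeze round:\s*(\d+),\s*task to freeze:\s*(\d+)"
)


def summarize_freeze_rounds(log_text):
    """Summarize the first run of 'freeze round' lines found in log_text."""
    rounds = []
    for line in log_text.splitlines():
        m = ROUND_LINE.search(line)
        if not m:
            continue
        ts, rnd, pending = float(m.group(1)), int(m.group(2)), int(m.group(3))
        if rounds and rnd == 1:
            break  # the round counter restarted, so a new freeze phase began
        rounds.append((ts, rnd, pending))
    if not rounds:
        return None
    return {
        # number of rounds seen in the first phase
        "rounds": len(rounds),
        # retry rounds that still had tasks left to freeze
        "retries_with_stragglers": sum(
            1 for _, rnd, pending in rounds if rnd > 1 and pending > 0
        ),
        # time from the first to the last round, in milliseconds
        "span_ms": (rounds[-1][0] - rounds[0][0]) * 1000.0,
    }


SAMPLE = """\
[  631.263839] freeze round: 1, task to freeze: 852
[  631.265128] freeze round: 2, task to freeze: 2
[  631.438295] freeze round: 25, task to freeze: 1
[  631.446387] freeze round: 26, task to freeze: 0
[  631.446656] freeze round: 1, task to freeze: 4
"""

if __name__ == "__main__":
    print(summarize_freeze_rounds(SAMPLE))
```

On this excerpt the span between round 1 and round 26 is about 182.5 ms, which agrees with the "elapsed 0.183 seconds" reported by the freezer in log2.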

While we understand that jbd2 is a freezable kernel thread, we would 
like to know if there is a way to freeze it earlier or freeze some 
critical processes proactively to reduce this contention.

Thanks for your input and suggestions.

On 2025/8/11 18:58, Michal Hocko wrote:
> On Mon 11-08-25 17:13:43, Zihuan Zhang wrote:
>> On 2025/8/8 16:58, Michal Hocko wrote:
> [...]
>>> Also the interface seems to be really coarse grained and it can easily
>>> turn out insufficient for other usecases while it is not entirely clear
>>> to me how this could be extended for those.
>>   We recognize that the current interface is relatively coarse-grained and
>> may not be sufficient for all scenarios. The present implementation is a
>> basic version.
>>
>> Our plan is to introduce a classification-based mechanism that assigns
>> different freeze priorities according to process categories. For example,
>> filesystem and graphics-related processes will be given higher default
>> freeze priority, as they are critical in the freezing workflow. This
>> classification approach helps target important processes more precisely.
>>
>> However, this requires further testing and refinement before full
>> deployment. We believe this incremental, category-based design will make the
>> mechanism more effective and adaptable over time while keeping it
>> manageable.
> Unless there is a clear path for a more extendable interface then
> introducing this one is a no-go. We do not want to grow different ways
> to establish freezing policies.
>
> But much more fundamentally. So far I haven't really seen any argument
> why different priorities help with the underlying problem other than the
> timing might be slightly different if you change the order of freezing.
> This to me sounds like the proposed scheme mostly works around the
> problem you are seeing and as such is not a really good candidate to be
> merged as a long term solution. Not to mention with a user API that
> needs to be maintained for ever.
>
> So NAK from me on the interface.
>
Thanks for the feedback. I understand your concern that changing the 
freezer priority order looks like working around the symptom rather than 
solving the root cause.

Since the last discussion, we have analyzed the D-state processes 
further and identified that the long wait time is caused by 
jbd2_log_wait_commit. This wait happens because user tasks call into 
this function during fsync/fdatasync and it can take tens of 
milliseconds to complete. When this coincides with the freezer 
operation, the tasks are stuck in D state and retried multiple times, 
increasing the total freeze time.

Although we know that jbd2 is a freezable kernel thread, we are 
exploring whether freezing it earlier — or freezing certain key 
processes first — could reduce this contention and improve freeze 
completion time.


>>> I believe it would be more useful to find sources of those freezer
>>> blockers and try to address those. Making more blocked tasks
>>> __set_task_frozen compatible sounds like a general improvement in
>>> itself.
>> We have already identified some causes of D-state tasks, many of which are
>> related to the filesystem. On some systems, certain processes frequently
>> execute ext4_sync_file, and under contention this can lead to D-state tasks.
> Please work with maintainers of those subsystems to find proper
> solutions.

We’ve pulled in the jbd2 maintainer to get feedback on whether changing 
the freeze ordering for jbd2 is safe or if there’s a better approach to 
avoid the repeated retries caused by this wait.
Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
Posted by Darrick J. Wong 1 month, 3 weeks ago
On Tue, Aug 12, 2025 at 01:57:49PM +0800, Zihuan Zhang wrote:
> Hi all,
> 
> We encountered an issue where the number of freeze retries increased due to
> processes stuck in D state. The logs point to jbd2-related activity.
> 
> log1:
> 
> [ 6616.650482] task:ThreadPoolForeg state:D stack:0     pid:262026 tgid:4065  ppid:2490   task_flags:0x400040 flags:0x00004004
> [ 6616.650485] Call Trace:
> [ 6616.650486]  <TASK>
> [ 6616.650489]  __schedule+0x532/0xea0
> [ 6616.650494]  schedule+0x27/0x80
> [ 6616.650496]  jbd2_log_wait_commit+0xa6/0x120
> [ 6616.650499]  ? __pfx_autoremove_wake_function+0x10/0x10
> [ 6616.650502]  ext4_sync_file+0x1ba/0x380
> [ 6616.650505]  do_fsync+0x3b/0x80
> 
> log2:
> 
> [  631.206315] jdb2_log_wait_log_commit  completed (elapsed 0.002 seconds)
> [  631.215325] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
> [  631.240704] jdb2_log_wait_log_commit  completed (elapsed 0.386 seconds)
> [  631.262167] Filesystems sync: 0.424 seconds
> [  631.262821] Freezing user space processes
> [  631.263839] freeze round: 1, task to freeze: 852
> [  631.265128] freeze round: 2, task to freeze: 2
> [  631.267039] freeze round: 3, task to freeze: 2
> [  631.271176] freeze round: 4, task to freeze: 2
> [  631.279160] freeze round: 5, task to freeze: 2
> [  631.287152] freeze round: 6, task to freeze: 2
> [  631.295346] freeze round: 7, task to freeze: 2
> [  631.301747] freeze round: 8, task to freeze: 2
> [  631.309346] freeze round: 9, task to freeze: 2
> [  631.317353] freeze round: 10, task to freeze: 2
> [  631.325348] freeze round: 11, task to freeze: 2
> [  631.333353] freeze round: 12, task to freeze: 2
> [  631.341358] freeze round: 13, task to freeze: 2
> [  631.349357] freeze round: 14, task to freeze: 2
> [  631.357363] freeze round: 15, task to freeze: 2
> [  631.365361] freeze round: 16, task to freeze: 2
> [  631.373379] freeze round: 17, task to freeze: 2
> [  631.381366] freeze round: 18, task to freeze: 2
> [  631.389365] freeze round: 19, task to freeze: 2
> [  631.397371] freeze round: 20, task to freeze: 2
> [  631.405373] freeze round: 21, task to freeze: 2
> [  631.413373] freeze round: 22, task to freeze: 2
> [  631.421392] freeze round: 23, task to freeze: 1
> [  631.429948] freeze round: 24, task to freeze: 1
> [  631.438295] freeze round: 25, task to freeze: 1
> [  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
> [  631.446387] freeze round: 26, task to freeze: 0
> [  631.446390] Freezing user space processes completed (elapsed 0.183 seconds)
> [  631.446392] OOM killer disabled.
> [  631.446393] Freezing remaining freezable tasks
> [  631.446656] freeze round: 1, task to freeze: 4
> [  631.447976] freeze round: 2, task to freeze: 0
> [  631.447978] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
> [  631.447980] PM: suspend debug: Waiting for 1 second(s).
> [  632.450858] OOM killer enabled.
> [  632.450859] Restarting tasks: Starting
> [  632.453140] Restarting tasks: Done
> [  632.453173] random: crng reseeded on system resumption
> [  632.453370] PM: suspend exit
> [  632.462799] jdb2_log_wait_log_commit  completed (elapsed 0.000 seconds)
> [  632.466114] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
> 
> This is the reason:
> 
> [  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
> 
> 
> During freezing, user processes executing jbd2_log_wait_commit enter D state
> because this function calls wait_event and can take tens of milliseconds to
> complete. This long execution time, coupled with possible competition with
> the freezer, causes repeated freeze retries.
> 
> While we understand that jbd2 is a freezable kernel thread, we would like to
> know if there is a way to freeze it earlier or freeze some critical
> processes proactively to reduce this contention.

Freeze the filesystem before you start freezing kthreads?  That should
quiesce the jbd2 workers and pause anyone trying to write to the fs.
Maybe the missing piece here is the device model not knowing how to call
bdev_freeze prior to a suspend?

That said, I think that doesn't 100% work for XFS because it has
kworkers for metadata buffer read completions, and freezes don't affect
read operations...

(just my clueless 2c)

--D

Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
Posted by Zihuan Zhang 1 month, 3 weeks ago
Hi,

On 2025/8/13 01:26, Darrick J. Wong wrote:
> On Tue, Aug 12, 2025 at 01:57:49PM +0800, Zihuan Zhang wrote:
>> Hi all,
>>
>> We encountered an issue where the number of freeze retries increased due to
>> processes stuck in D state. The logs point to jbd2-related activity.
>>
>> log1:
>>
>> [ 6616.650482] task:ThreadPoolForeg state:D stack:0     pid:262026 tgid:4065  ppid:2490   task_flags:0x400040 flags:0x00004004
>> [ 6616.650485] Call Trace:
>> [ 6616.650486]  <TASK>
>> [ 6616.650489]  __schedule+0x532/0xea0
>> [ 6616.650494]  schedule+0x27/0x80
>> [ 6616.650496]  jbd2_log_wait_commit+0xa6/0x120
>> [ 6616.650499]  ? __pfx_autoremove_wake_function+0x10/0x10
>> [ 6616.650502]  ext4_sync_file+0x1ba/0x380
>> [ 6616.650505]  do_fsync+0x3b/0x80
>>
>> log2:
>>
>> [  631.206315] jdb2_log_wait_log_commit  completed (elapsed 0.002 seconds)
>> [  631.215325] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
>> [  631.240704] jdb2_log_wait_log_commit  completed (elapsed 0.386 seconds)
>> [  631.262167] Filesystems sync: 0.424 seconds
>> [  631.262821] Freezing user space processes
>> [  631.263839] freeze round: 1, task to freeze: 852
>> [  631.265128] freeze round: 2, task to freeze: 2
>> [  631.267039] freeze round: 3, task to freeze: 2
>> [  631.271176] freeze round: 4, task to freeze: 2
>> [  631.279160] freeze round: 5, task to freeze: 2
>> [  631.287152] freeze round: 6, task to freeze: 2
>> [  631.295346] freeze round: 7, task to freeze: 2
>> [  631.301747] freeze round: 8, task to freeze: 2
>> [  631.309346] freeze round: 9, task to freeze: 2
>> [  631.317353] freeze round: 10, task to freeze: 2
>> [  631.325348] freeze round: 11, task to freeze: 2
>> [  631.333353] freeze round: 12, task to freeze: 2
>> [  631.341358] freeze round: 13, task to freeze: 2
>> [  631.349357] freeze round: 14, task to freeze: 2
>> [  631.357363] freeze round: 15, task to freeze: 2
>> [  631.365361] freeze round: 16, task to freeze: 2
>> [  631.373379] freeze round: 17, task to freeze: 2
>> [  631.381366] freeze round: 18, task to freeze: 2
>> [  631.389365] freeze round: 19, task to freeze: 2
>> [  631.397371] freeze round: 20, task to freeze: 2
>> [  631.405373] freeze round: 21, task to freeze: 2
>> [  631.413373] freeze round: 22, task to freeze: 2
>> [  631.421392] freeze round: 23, task to freeze: 1
>> [  631.429948] freeze round: 24, task to freeze: 1
>> [  631.438295] freeze round: 25, task to freeze: 1
>> [  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
>> [  631.446387] freeze round: 26, task to freeze: 0
>> [  631.446390] Freezing user space processes completed (elapsed 0.183 seconds)
>> [  631.446392] OOM killer disabled.
>> [  631.446393] Freezing remaining freezable tasks
>> [  631.446656] freeze round: 1, task to freeze: 4
>> [  631.447976] freeze round: 2, task to freeze: 0
>> [  631.447978] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
>> [  631.447980] PM: suspend debug: Waiting for 1 second(s).
>> [  632.450858] OOM killer enabled.
>> [  632.450859] Restarting tasks: Starting
>> [  632.453140] Restarting tasks: Done
>> [  632.453173] random: crng reseeded on system resumption
>> [  632.453370] PM: suspend exit
>> [  632.462799] jdb2_log_wait_log_commit  completed (elapsed 0.000 seconds)
>> [  632.466114] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
>>
>> This is the reason:
>>
>> [  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
>>
>>
>> During freezing, user processes executing jbd2_log_wait_commit enter D state
>> because this function calls wait_event and can take tens of milliseconds to
>> complete. This long execution time, coupled with possible competition with
>> the freezer, causes repeated freeze retries.
>>
>> While we understand that jbd2 is a freezable kernel thread, we would like to
>> know if there is a way to freeze it earlier or freeze some critical
>> processes proactively to reduce this contention.
> Freeze the filesystem before you start freezing kthreads?  That should
> quiesce the jbd2 workers and pause anyone trying to write to the fs.
Indeed, freezing the filesystem can work.

However, this approach is quite expensive: it increases the total 
suspend time by about 3 to 4 seconds. Because of this overhead, we are 
exploring alternative solutions with lower cost.

We have tested it:

https://lore.kernel.org/all/09df0911-9421-40af-8296-de1383be1c58@kylinos.cn/ 

> Maybe the missing piece here is the device model not knowing how to call
> bdev_freeze prior to a suspend?
Currently, the suspend flow does not seem to invoke bdev_freeze(). Do you 
have any plans or insights on improving or integrating this 
functionality more smoothly into the device model and suspend sequence?
> That said, I think that doesn't 100% work for XFS because it has
> kworkers for metadata buffer read completions, and freezes don't affect
> read operations...

Does read activity also cause processes to enter D (uninterruptible 
sleep) state?

From what I understand, it’s usually writes or synchronous operations 
that do, but I’m curious if reads can also lead to D state under certain 
conditions.

> (just my clueless 2c)
>
> --D
>
Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
Posted by Darrick J. Wong 1 month, 3 weeks ago
On Wed, Aug 13, 2025 at 01:48:37PM +0800, Zihuan Zhang wrote:
> Hi,
> 
> On 2025/8/13 01:26, Darrick J. Wong wrote:
> > On Tue, Aug 12, 2025 at 01:57:49PM +0800, Zihuan Zhang wrote:
> > > Hi all,
> > > 
> > > We encountered an issue where the number of freeze retries increased due to
> > > processes stuck in D state. The logs point to jbd2-related activity.
> > > 
> > > log1:
> > > 
> > > [ 6616.650482] task:ThreadPoolForeg state:D stack:0     pid:262026 tgid:4065  ppid:2490   task_flags:0x400040 flags:0x00004004
> > > [ 6616.650485] Call Trace:
> > > [ 6616.650486]  <TASK>
> > > [ 6616.650489]  __schedule+0x532/0xea0
> > > [ 6616.650494]  schedule+0x27/0x80
> > > [ 6616.650496]  jbd2_log_wait_commit+0xa6/0x120
> > > [ 6616.650499]  ? __pfx_autoremove_wake_function+0x10/0x10
> > > [ 6616.650502]  ext4_sync_file+0x1ba/0x380
> > > [ 6616.650505]  do_fsync+0x3b/0x80
> > > 
> > > log2:
> > > 
> > > [  631.206315] jdb2_log_wait_log_commit  completed (elapsed 0.002 seconds)
> > > [  631.215325] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
> > > [  631.240704] jdb2_log_wait_log_commit  completed (elapsed 0.386 seconds)
> > > [  631.262167] Filesystems sync: 0.424 seconds
> > > [  631.262821] Freezing user space processes
> > > [  631.263839] freeze round: 1, task to freeze: 852
> > > [  631.265128] freeze round: 2, task to freeze: 2
> > > [  631.267039] freeze round: 3, task to freeze: 2
> > > [  631.271176] freeze round: 4, task to freeze: 2
> > > [  631.279160] freeze round: 5, task to freeze: 2
> > > [  631.287152] freeze round: 6, task to freeze: 2
> > > [  631.295346] freeze round: 7, task to freeze: 2
> > > [  631.301747] freeze round: 8, task to freeze: 2
> > > [  631.309346] freeze round: 9, task to freeze: 2
> > > [  631.317353] freeze round: 10, task to freeze: 2
> > > [  631.325348] freeze round: 11, task to freeze: 2
> > > [  631.333353] freeze round: 12, task to freeze: 2
> > > [  631.341358] freeze round: 13, task to freeze: 2
> > > [  631.349357] freeze round: 14, task to freeze: 2
> > > [  631.357363] freeze round: 15, task to freeze: 2
> > > [  631.365361] freeze round: 16, task to freeze: 2
> > > [  631.373379] freeze round: 17, task to freeze: 2
> > > [  631.381366] freeze round: 18, task to freeze: 2
> > > [  631.389365] freeze round: 19, task to freeze: 2
> > > [  631.397371] freeze round: 20, task to freeze: 2
> > > [  631.405373] freeze round: 21, task to freeze: 2
> > > [  631.413373] freeze round: 22, task to freeze: 2
> > > [  631.421392] freeze round: 23, task to freeze: 1
> > > [  631.429948] freeze round: 24, task to freeze: 1
> > > [  631.438295] freeze round: 25, task to freeze: 1
> > > [  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
> > > [  631.446387] freeze round: 26, task to freeze: 0
> > > [  631.446390] Freezing user space processes completed (elapsed 0.183 seconds)
> > > [  631.446392] OOM killer disabled.
> > > [  631.446393] Freezing remaining freezable tasks
> > > [  631.446656] freeze round: 1, task to freeze: 4
> > > [  631.447976] freeze round: 2, task to freeze: 0
> > > [  631.447978] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
> > > [  631.447980] PM: suspend debug: Waiting for 1 second(s).
> > > [  632.450858] OOM killer enabled.
> > > [  632.450859] Restarting tasks: Starting
> > > [  632.453140] Restarting tasks: Done
> > > [  632.453173] random: crng reseeded on system resumption
> > > [  632.453370] PM: suspend exit
> > > [  632.462799] jdb2_log_wait_log_commit  completed (elapsed 0.000 seconds)
> > > [  632.466114] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
> > > 
> > > This is the reason:
> > > 
> > > [  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
> > > 
> > > 
> > > During freezing, user processes executing jbd2_log_wait_commit enter D state
> > > because this function calls wait_event and can take tens of milliseconds to
> > > complete. This long execution time, coupled with possible competition with
> > > the freezer, causes repeated freeze retries.
> > > 
> > > While we understand that jbd2 is a freezable kernel thread, we would like to
> > > know if there is a way to freeze it earlier or freeze some critical
> > > processes proactively to reduce this contention.
> > Freeze the filesystem before you start freezing kthreads?  That should
> > quiesce the jbd2 workers and pause anyone trying to write to the fs.
> Indeed, freezing the filesystem can work.
> 
> However, this approach is quite expensive: it increases the total suspend
> time by about 3 to 4 seconds. Because of this overhead, we are exploring
> alternative solutions with lower cost.

Indeed it does, because now XFS and friends will actually shut down
their background workers and flush all the dirty data and metadata to
disk.  On the other hand, if the system crashes while suspended, there's
a lot less recovery work to be done.

Granted the kernel (or userspace) will usually sync() before suspending
so that's not been a huge problem in production afaict.

> We have tested it:
> 
> https://lore.kernel.org/all/09df0911-9421-40af-8296-de1383be1c58@kylinos.cn/
> 
> > Maybe the missing piece here is the device model not knowing how to call
> > bdev_freeze prior to a suspend?
> Currently, the suspend flow does not seem to invoke bdev_freeze(). Do you have
> any plans or insights on improving or integrating this functionality more
> smoothly into the device model and suspend sequence?
> > That said, I think that doesn't 100% work for XFS because it has
> > kworkers for metadata buffer read completions, and freezes don't affect
> > read operations...
> 
> Does read activity also cause processes to enter D (uninterruptible sleep)
> state?

Usually.

> From what I understand, it’s usually writes or synchronous operations that
> do, but I’m curious if reads can also lead to D state under certain
> conditions.

Anything that sets the task state to uninterruptible.
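
A rough way to check this from userspace, as a sketch only: the state letter is the field right after the parenthesised comm in /proc/<pid>/stat, and 'D' is what procfs reports for uninterruptible sleep. The helper names stat_state() and pid_state() below are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the state letter from one /proc/<pid>/stat line.
 * The line looks like "pid (comm) state ppid ...". comm may itself
 * contain spaces or parentheses, so look for the LAST ')' in the line.
 * Returns the state character, or 0 if the line is malformed.
 */
char stat_state(const char *line)
{
	const char *p = strrchr(line, ')');

	if (!p || p[1] != ' ' || p[2] == '\0')
		return 0;
	return p[2];
}

/*
 * Read the state letter of a live task (hypothetical helper).
 * Returns 0 if the task is gone or its stat file cannot be read.
 */
char pid_state(int pid)
{
	char path[64];
	char buf[512];
	FILE *f;
	char state = 0;

	snprintf(path, sizeof(path), "/proc/%d/stat", pid);
	f = fopen(path, "r");
	if (!f)
		return 0;
	/* The state field sits near the start, so a truncated read is fine. */
	if (fgets(buf, sizeof(buf), f))
		state = stat_state(buf);
	fclose(f);
	return state;
}
```

Calling pid_state() on a task between freeze rounds and comparing the result against 'D' would show whether it is in uninterruptible sleep at that instant. It is only a snapshot and says nothing about which wait put the task there.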

--D

> > (just my clueless 2c)
> > 
> > --D
> > 
> > > Thanks for your input and suggestions.
> > > 
> > > On 2025/8/11 18:58, Michal Hocko wrote:
> > > > On Mon 11-08-25 17:13:43, Zihuan Zhang wrote:
> > > > > On 2025/8/8 16:58, Michal Hocko wrote:
> > > > [...]
> > > > > > Also the interface seems to be really coarse grained and it can easily
> > > > > > turn out insufficient for other usecases while it is not entirely clear
> > > > > > to me how this could be extended for those.
> > > > >    We recognize that the current interface is relatively coarse-grained and
> > > > > may not be sufficient for all scenarios. The present implementation is a
> > > > > basic version.
> > > > > 
> > > > > Our plan is to introduce a classification-based mechanism that assigns
> > > > > different freeze priorities according to process categories. For example,
> > > > > filesystem and graphics-related processes will be given higher default
> > > > > freeze priority, as they are critical in the freezing workflow. This
> > > > > classification approach helps target important processes more precisely.
> > > > > 
> > > > > However, this requires further testing and refinement before full
> > > > > deployment. We believe this incremental, category-based design will make the
> > > > > mechanism more effective and adaptable over time while keeping it
> > > > > manageable.
> > > > Unless there is a clear path for a more extendable interface then
> > > > introducing this one is a no-go. We do not want to grow different ways
> > > > to establish freezing policies.
> > > > 
> > > > But much more fundamentally. So far I haven't really seen any argument
> > > > why different priorities help with the underlying problem other than the
> > > > timing might be slightly different if you change the order of freezing.
> > > > This to me sounds like the proposed scheme mostly works around the
> > > > problem you are seeing and as such is not a really good candidate to be
> > > > merged as a long term solution. Not to mention with a user API that
> > > > needs to be maintained for ever.
> > > > 
> > > > So NAK from me on the interface.
> > > > 
> > > Thanks for the feedback. I understand your concern that changing the freezer
> > > priority order looks like working around the symptom rather than solving the
> > > root cause.
> > > 
> > > Since the last discussion, we have analyzed the D-state processes further
> > > and identified that the long wait time is caused by jbd2_log_wait_commit.
> > > This wait happens because user tasks call into this function during
> > > fsync/fdatasync and it can take tens of milliseconds to complete. When this
> > > coincides with the freezer operation, the tasks are stuck in D state and
> > > retried multiple times, increasing the total freeze time.
> > > 
> > > Although we know that jbd2 is a freezable kernel thread, we are exploring
> > > whether freezing it earlier — or freezing certain key processes first —
> > > could reduce this contention and improve freeze completion time.
> > > 
> > > 
> > > > > > I believe it would be more useful to find sources of those freezer
> > > > > > blockers and try to address those. Making more blocked tasks
> > > > > > __set_task_frozen compatible sounds like a general improvement in
> > > > > > itself.
> > > > > > We have already identified some causes of D-state tasks, many of which are
> > > > > related to the filesystem. On some systems, certain processes frequently
> > > > > execute ext4_sync_file, and under contention this can lead to D-state tasks.
> > > > Please work with maintainers of those subsystems to find proper
> > > > solutions.
> > > We’ve pulled in the jbd2 maintainer to get feedback on whether changing the
> > > freeze ordering for jbd2 is safe or if there’s a better approach to avoid
> > > the repeated retries caused by this wait.
> > > 
> 
Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
Posted by Zihuan Zhang 1 month, 2 weeks ago
On 2025/8/15 00:43, Darrick J. Wong wrote:
> On Wed, Aug 13, 2025 at 01:48:37PM +0800, Zihuan Zhang wrote:
>> Hi,
>>
>> On 2025/8/13 01:26, Darrick J. Wong wrote:
>>> On Tue, Aug 12, 2025 at 01:57:49PM +0800, Zihuan Zhang wrote:
>>>> Hi all,
>>>>
>>>> We encountered an issue where the number of freeze retries increased due to
>>>> processes stuck in D state. The logs point to jbd2-related activity.
>>>>
>>>> log1:
>>>>
>>>> [ 6616.650482] task:ThreadPoolForeg state:D stack:0     pid:262026 tgid:4065  ppid:2490   task_flags:0x400040 flags:0x00004004
>>>> [ 6616.650485] Call Trace:
>>>> [ 6616.650486]  <TASK>
>>>> [ 6616.650489]  __schedule+0x532/0xea0
>>>> [ 6616.650494]  schedule+0x27/0x80
>>>> [ 6616.650496]  jbd2_log_wait_commit+0xa6/0x120
>>>> [ 6616.650499]  ? __pfx_autoremove_wake_function+0x10/0x10
>>>> [ 6616.650502]  ext4_sync_file+0x1ba/0x380
>>>> [ 6616.650505]  do_fsync+0x3b/0x80
>>>>
>>>> log2:
>>>>
>>>> [  631.206315] jdb2_log_wait_log_commit  completed (elapsed 0.002 seconds)
>>>> [  631.215325] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
>>>> [  631.240704] jdb2_log_wait_log_commit  completed (elapsed 0.386 seconds)
>>>> [  631.262167] Filesystems sync: 0.424 seconds
>>>> [  631.262821] Freezing user space processes
>>>> [  631.263839] freeze round: 1, task to freeze: 852
>>>> [  631.265128] freeze round: 2, task to freeze: 2
>>>> [  631.267039] freeze round: 3, task to freeze: 2
>>>> [  631.271176] freeze round: 4, task to freeze: 2
>>>> [  631.279160] freeze round: 5, task to freeze: 2
>>>> [  631.287152] freeze round: 6, task to freeze: 2
>>>> [  631.295346] freeze round: 7, task to freeze: 2
>>>> [  631.301747] freeze round: 8, task to freeze: 2
>>>> [  631.309346] freeze round: 9, task to freeze: 2
>>>> [  631.317353] freeze round: 10, task to freeze: 2
>>>> [  631.325348] freeze round: 11, task to freeze: 2
>>>> [  631.333353] freeze round: 12, task to freeze: 2
>>>> [  631.341358] freeze round: 13, task to freeze: 2
>>>> [  631.349357] freeze round: 14, task to freeze: 2
>>>> [  631.357363] freeze round: 15, task to freeze: 2
>>>> [  631.365361] freeze round: 16, task to freeze: 2
>>>> [  631.373379] freeze round: 17, task to freeze: 2
>>>> [  631.381366] freeze round: 18, task to freeze: 2
>>>> [  631.389365] freeze round: 19, task to freeze: 2
>>>> [  631.397371] freeze round: 20, task to freeze: 2
>>>> [  631.405373] freeze round: 21, task to freeze: 2
>>>> [  631.413373] freeze round: 22, task to freeze: 2
>>>> [  631.421392] freeze round: 23, task to freeze: 1
>>>> [  631.429948] freeze round: 24, task to freeze: 1
>>>> [  631.438295] freeze round: 25, task to freeze: 1
>>>> [  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
>>>> [  631.446387] freeze round: 26, task to freeze: 0
>>>> [  631.446390] Freezing user space processes completed (elapsed 0.183 seconds)
>>>> [  631.446392] OOM killer disabled.
>>>> [  631.446393] Freezing remaining freezable tasks
>>>> [  631.446656] freeze round: 1, task to freeze: 4
>>>> [  631.447976] freeze round: 2, task to freeze: 0
>>>> [  631.447978] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
>>>> [  631.447980] PM: suspend debug: Waiting for 1 second(s).
>>>> [  632.450858] OOM killer enabled.
>>>> [  632.450859] Restarting tasks: Starting
>>>> [  632.453140] Restarting tasks: Done
>>>> [  632.453173] random: crng reseeded on system resumption
>>>> [  632.453370] PM: suspend exit
>>>> [  632.462799] jdb2_log_wait_log_commit  completed (elapsed 0.000 seconds)
>>>> [  632.466114] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
>>>>
>>>> This is the reason:
>>>>
>>>> [  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
>>>>
>>>>
>>>> During freezing, user processes executing jbd2_log_wait_commit enter D state
>>>> because this function calls wait_event and can take tens of milliseconds to
>>>> complete. This long execution time, coupled with possible competition with
>>>> the freezer, causes repeated freeze retries.
>>>>
>>>> While we understand that jbd2 is a freezable kernel thread, we would like to
>>>> know if there is a way to freeze it earlier or freeze some critical
>>>> processes proactively to reduce this contention.
>>> Freeze the filesystem before you start freezing kthreads?  That should
>>> quiesce the jbd2 workers and pause anyone trying to write to the fs.
>> Indeed, freezing the filesystem can work.
>>
>> However, this approach is quite expensive: it increases the total suspend
>> time by about 3 to 4 seconds. Because of this overhead, we are exploring
>> alternative solutions with lower cost.
> Indeed it does, because now XFS and friends will actually shut down
> their background workers and flush all the dirty data and metadata to
> disk.  On the other hand, if the system crashes while suspended, there's
> a lot less recovery work to be done.
>
> Granted the kernel (or userspace) will usually sync() before suspending
> so that's not been a huge problem in production afaict.


Thank you for your explanation!

>> We have tested it:
>>
>> https://lore.kernel.org/all/09df0911-9421-40af-8296-de1383be1c58@kylinos.cn/
>>
>>> Maybe the missing piece here is the device model not knowing how to call
>>> bdev_freeze prior to a suspend?
>> Currently, the suspend flow does not seem to invoke bdev_freeze(). Do you have
>> any plans or insights on improving or integrating this functionality more
>> smoothly into the device model and suspend sequence?
>>> That said, I think that doesn't 100% work for XFS because it has
>>> kworkers for metadata buffer read completions, and freezes don't affect
>>> read operations...
>> Does read activity also cause processes to enter D (uninterruptible sleep)
>> state?
> Usually.

I think you are right.

Read operations such as vfs_read() can also cause it.

[   79.179682] PM: suspend entry (deep)
[   79.302703] Filesystems sync: 0.123 seconds
[   79.385416] Freezing user space processes
[   79.386223] round:0 todo:673
[   79.387025] currnet process has not been frozen :Xorg pid:1588
[   79.387026] task:Xorg            state:D stack:0     pid:1588 tgid:1588  ppid:1471   flags:0x00000004
[   79.387030] Call Trace:
[   79.387031]  <TASK>
[   79.387032]  __schedule+0x46c/0xe40
[   79.387038]  schedule+0x32/0xb0
[   79.387040]  schedule_timeout+0x23d/0x2a0
[   79.387043]  ? pollwake+0x78/0xa0
[   79.387046]  wait_for_completion+0x8c/0x180
[   79.387048]  __flush_work+0x204/0x2d0
[   79.387051]  ? __pfx_wq_barrier_func+0x10/0x10
[   79.387054]  drm_mode_rmfb+0x1a0/0x200
[   79.387057]  ? __pfx_drm_mode_rmfb_work_fn+0x10/0x10
[   79.387058]  ? __pfx_drm_mode_rmfb_ioctl+0x10/0x10
[   79.387060]  drm_ioctl_kernel+0xbc/0x150
[   79.387062]  ? __stack_depot_save+0x38/0x4c0
[   79.387066]  drm_ioctl+0x270/0x470
[   79.387068]  ? __pfx_drm_mode_rmfb_ioctl+0x10/0x10
[   79.387072]  radeon_drm_ioctl+0x4a/0x80 [radeon]
[   79.387108]  __x64_sys_ioctl+0x8c/0xc0
[   79.387110]  do_syscall_64+0x7e/0x270
[   79.387112]  ? __fsnotify_parent+0x113/0x370
[   79.387114]  ? drm_read+0x284/0x320
[   79.387117]  ? syscall_exit_work+0x110/0x140
[   79.387120]  ? vfs_read+0x220/0x2f0
[   79.387122]  ? vfs_read+0x220/0x2f0
[   79.387123]  ? audit_reset_context.part.0+0x27a/0x2f0
[   79.387126]  ? audit_reset_context.part.0+0x27a/0x2f0
[   79.387128]  ? syscall_exit_work+0x110/0x140
[   79.387130]  ? do_syscall_64+0x10f/0x270
[   79.387131]  ? audit_reset_context.part.0+0x27a/0x2f0
[   79.387133]  ? syscall_exit_work+0x110/0x140
[   79.387135]  ? do_syscall_64+0x10f/0x270
[   79.387137]  ? audit_reset_context.part.0+0x27a/0x2f0
[   79.387139]  ? syscall_exit_work+0x110/0x140
[   79.387141]  ? do_syscall_64+0x10f/0x270
[   79.387142]  ? syscall_exit_work+0x110/0x140
[   79.387144]  ? do_syscall_64+0x10f/0x270
[   79.387145]  ? irqtime_account_irq+0x40/0xc0
[   79.387148]  ? irqentry_exit_to_user_mode+0x74/0x1e0
[   79.387150]  entry_SYSCALL_64_after_hwframe+0x76/0xe0
[   79.387153] RIP: 0033:0x7f91baf2550b
[   79.387155] RSP: 002b:00007ffc673d5668 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   79.387157] RAX: ffffffffffffffda RBX: 00007ffc673d56ac RCX: 00007f91baf2550b
[   79.387158] RDX: 00007ffc673d56ac RSI: 00000000c00464af RDI: 000000000000000e
[   79.387159] RBP: 00000000c00464af R08: 00007f91ba860220 R09: 000056429d1d9fa0
[   79.387160] R10: 0000000000000103 R11: 0000000000000246 R12: 000056429ba931e0
[   79.387161] R13: 000000000000000e R14: 00000000049f0b22 R15: 000056429b93bfb0
[   79.387164]  </TASK>
[   79.387255] round:1 todo:1
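
As a side note, the cadence of the "freeze round" lines in log2 quoted earlier in this mail is consistent with the retry backoff in try_to_freeze_tasks() (kernel/power/process.c). As far as I can tell, that loop sleeps with usleep_range(sleep_usecs / 2, sleep_usecs), starting at 1 ms and doubling until it reaches 8 ms. Below is a minimal userspace sketch of that arithmetic. freeze_backoff_usecs() is a hypothetical helper, not a kernel function, and it sums only the upper bound of each sleep.

```c
#include <stdio.h>

#define USEC_PER_MSEC 1000U

/*
 * Upper bound, in microseconds, on the time spent sleeping between
 * `rounds` freeze attempts, assuming a 1 ms initial sleep that doubles
 * after every retry until it reaches 8 ms.
 */
unsigned int freeze_backoff_usecs(unsigned int rounds)
{
	unsigned int sleep_usecs = USEC_PER_MSEC;
	unsigned int total = 0;
	unsigned int i;

	/* N rounds are separated by N - 1 sleeps. */
	for (i = 1; i < rounds; i++) {
		total += sleep_usecs;
		if (sleep_usecs < 8 * USEC_PER_MSEC)
			sleep_usecs *= 2;
	}
	return total;
}
```

freeze_backoff_usecs(26) gives 183000, i.e. 1 + 2 + 4 + 22 * 8 = 183 ms, which lines up with the "elapsed 0.183 seconds" reported for the 26-round run. If that reading is right, almost all of the freeze time in that log is the freezer sleeping between retries while one or two tasks sit in D state.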

>>  From what I understand, it’s usually writes or synchronous operations that
>> do, but I’m curious if reads can also lead to D state under certain
>> conditions.
> Anything that sets the task state to uninterruptible.
>
> --D
>
>>> (just my clueless 2c)
>>>
>>> --D
>>>
>>>> Thanks for your input and suggestions.
>>>>
>>>> On 2025/8/11 18:58, Michal Hocko wrote:
>>>>> On Mon 11-08-25 17:13:43, Zihuan Zhang wrote:
>>>>>> On 2025/8/8 16:58, Michal Hocko wrote:
>>>>> [...]
>>>>>>> Also the interface seems to be really coarse grained and it can easily
>>>>>>> turn out insufficient for other usecases while it is not entirely clear
>>>>>>> to me how this could be extended for those.
>>>>>>     We recognize that the current interface is relatively coarse-grained and
>>>>>> may not be sufficient for all scenarios. The present implementation is a
>>>>>> basic version.
>>>>>>
>>>>>> Our plan is to introduce a classification-based mechanism that assigns
>>>>>> different freeze priorities according to process categories. For example,
>>>>>> filesystem and graphics-related processes will be given higher default
>>>>>> freeze priority, as they are critical in the freezing workflow. This
>>>>>> classification approach helps target important processes more precisely.
>>>>>>
>>>>>> However, this requires further testing and refinement before full
>>>>>> deployment. We believe this incremental, category-based design will make the
>>>>>> mechanism more effective and adaptable over time while keeping it
>>>>>> manageable.
>>>>> Unless there is a clear path for a more extendable interface then
>>>>> introducing this one is a no-go. We do not want to grow different ways
>>>>> to establish freezing policies.
>>>>>
>>>>> But much more fundamentally. So far I haven't really seen any argument
>>>>> why different priorities help with the underlying problem other than the
>>>>> timing might be slightly different if you change the order of freezing.
>>>>> This to me sounds like the proposed scheme mostly works around the
>>>>> problem you are seeing and as such is not a really good candidate to be
>>>>> merged as a long term solution. Not to mention with a user API that
>>>>> needs to be maintained for ever.
>>>>>
>>>>> So NAK from me on the interface.
>>>>>
>>>> Thanks for the feedback. I understand your concern that changing the freezer
>>>> priority order looks like working around the symptom rather than solving the
>>>> root cause.
>>>>
>>>> Since the last discussion, we have analyzed the D-state processes further
>>>> and identified that the long wait time is caused by jbd2_log_wait_commit.
>>>> This wait happens because user tasks call into this function during
>>>> fsync/fdatasync and it can take tens of milliseconds to complete. When this
>>>> coincides with the freezer operation, the tasks are stuck in D state and
>>>> retried multiple times, increasing the total freeze time.
>>>>
>>>> Although we know that jbd2 is a freezable kernel thread, we are exploring
>>>> whether freezing it earlier — or freezing certain key processes first —
>>>> could reduce this contention and improve freeze completion time.
>>>>
>>>>
>>>>>>> I believe it would be more useful to find sources of those freezer
>>>>>>> blockers and try to address those. Making more blocked tasks
>>>>>>> __set_task_frozen compatible sounds like a general improvement in
>>>>>>> itself.
>>>>>> We have already identified some causes of D-state tasks, many of which are
>>>>>> related to the filesystem. On some systems, certain processes frequently
>>>>>> execute ext4_sync_file, and under contention this can lead to D-state tasks.
>>>>> Please work with maintainers of those subsystems to find proper
>>>>> solutions.
>>>> We’ve pulled in the jbd2 maintainer to get feedback on whether changing the
>>>> freeze ordering for jbd2 is safe or if there’s a better approach to avoid
>>>> the repeated retries caused by this wait.
>>>>
Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
Posted by Peter Zijlstra 1 month, 3 weeks ago
On Thu, Aug 07, 2025 at 08:14:09PM +0800, Zihuan Zhang wrote:

> Freeze Window Begins
> 
>     [process A] - epoll_wait()
>         │
>         ▼
>     [process B] - event source (already frozen)
> 

Can we make epoll_wait() TASK_FREEZABLE? AFAICT it doesn't hold any
resources, it just sits there waiting for stuff.
Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
Posted by Zihuan Zhang 1 month, 2 weeks ago
On 2025/8/14 22:37, Peter Zijlstra wrote:
> On Thu, Aug 07, 2025 at 08:14:09PM +0800, Zihuan Zhang wrote:
>
>> Freeze Window Begins
>>
>>      [process A] - epoll_wait()
>>          │
>>          ▼
>>      [process B] - event source (already frozen)
>>
> Can we make epoll_wait() TASK_FREEZABLE? AFAICT it doesn't hold any
> resources, it just sits there waiting for stuff.

Based on the code, it’s ep_poll() that puts the task into the D state, 
most likely due to I/O or lower-level driver behavior. In fs/eventpoll.c:

Line 2097: __set_current_state(TASK_INTERRUPTIBLE);

Simply changing the task state may not actually address the root cause. 
Currently, our approach is to identify tasks that are more likely to 
cause such issues and freeze them earlier or later in the process to 
avoid conflicts.