The Linux task freezer was designed in a much earlier era, when userspace was relatively simple and flat. Over the years, as modern desktop and mobile systems have become increasingly complex, with intricate IPC, asynchronous I/O, and deep event loops, the original freezer model has shown its age.

## Background

Currently, the freezer traverses the task list linearly and attempts to freeze all tasks equally. It sends a signal and waits for `freezing()` to become true. While this model works well in many cases, it has several inherent limitations:

- Signal-based logic cannot freeze uninterruptible (D-state) tasks
- Dependencies between processes can cause freeze retries
- Retry-based recovery introduces unpredictable suspend latency

## Real-world problem illustration

Consider the following scenario during suspend:

Freeze Window Begins

    [process A] - epoll_wait()
        │
        ▼
    [process B] - event source (already frozen)

→ A enters D-state because it is waiting for B
→ Cannot respond to freezing signal
→ Freezer retries in a loop
→ Suspend latency spikes

In such cases, we observed that a normal 1–2ms freezer cycle could balloon to **tens of milliseconds**. Worse, the kernel has no insight into the root cause and simply retries blindly.

## Proposed solution: Freeze priority model

To address this, we propose a **layered freeze model** based on per-task freeze priorities.

### Design

We introduce 4 levels of freeze priority:

| Priority | Level        | Description                     |
|----------|--------------|---------------------------------|
| 0        | HIGH         | D-state TASKs                   |
| 1        | NORMAL       | regular user space TASKs        |
| 2        | LOW          | not yet used                    |
| 4        | NEVER_FREEZE | zombie TASKs, PF_SUSPEND_TASK   |

The kernel will freeze processes **in priority order**, ensuring that higher-priority tasks are frozen first. This avoids dependency inversion scenarios and provides a deterministic path forward for tricky cases. By freezing control or event-source threads first, we prevent dependent tasks from entering D-state prematurely, effectively avoiding dependency inversion.

Although introducing more fine-grained freeze_priority levels improves extensibility and allows better modeling of task dependencies, it may also introduce additional overhead during task traversal, potentially affecting freezer performance.

In our test environment, increasing the maximum freeze retries to 16 only added ~4ms of overhead to the total suspend latency, suggesting the added robustness comes at a relatively low cost. However, for latency-critical systems, this trade-off should be carefully evaluated.

## Benefits

- Solves D-state process freeze stalls caused by premature freezing of dependencies
- Enables more robust and reliable suspend/resume on complex userspace systems
- Introduces extensibility: tasks can be categorized by role, urgency, or dependency
- Reduces race conditions by introducing deterministic freezing order

## Previous Discussion

Link: https://lore.kernel.org/all/20250606062502.19607-1-zhangzihuan@kylinos.cn/
Link: https://lore.kernel.org/all/1ca889fd-6ead-4d4f-a3c7-361ea05bb659@kylinos.cn/

## Future directions

This framework opens up several promising areas for further development:

1. Adaptive behavior based on runtime statistics or retry feedback

   The freezer adapts dynamically during suspend/hibernate based on the number of retries and which tasks failed to freeze. Tasks that failed in previous rounds will be assigned a higher freeze priority, improving convergence speed and reducing unnecessary retries.

2. cgroup-aware hierarchical freezing for containerized systems

   The design supports cgroup-aware task traversal and freezing. This ensures compatibility with containerized environments, allowing for better control and visibility when freezing processes in different cgroups.

3. Unified freezing of userspace processes and kernel threads

   Based on extensive testing, we found that freezing userspace tasks and kernel threads together works reliably in practice. Separating them does not resolve dependency issues between user and kernel context. Moreover, most kernel threads are marked as non-freezable, so including them in the same freeze pass does not impact correctness and simplifies the logic.

Although the current implementation is relatively simple, it already helps alleviate some suspend failures caused by tasks stuck in D state. In our testing, we observed that certain D-state tasks are triggered by filesystem sync operations during the freezing phase. At this stage, we don't yet have a comprehensive solution for that class of problems.

This patchset represents a testable version of our design. We plan to further investigate and address such filesystem-related D-state issues in future revisions.

Patch summary:
- Patch 1-3: Core infrastructure: field, API, layered freeze logic
- Patch 4-7: Default priorities and dynamic adjustments
- Patch 8: Statistics: freeze pass retry count
- Patch 9: Procfs interface for userspace access

Zihuan Zhang (9):
  freezer: Introduce freeze_priority field in task_struct
  freezer: Introduce API to set per-task freeze priority
  freezer: Add per-priority layered freeze logic
  freezer: Set default freeze priority for userspace tasks
  freezer: set default freeze priority for PF_SUSPEND_TASK processes
  freezer: Set default freeze priority for zombie tasks
  freezer: raise freeze priority of tasks failed to freeze last time
  freezer: Add retry count statistics for freeze pass iterations
  proc: Add /proc/<pid>/freeze_priority interface

 Documentation/filesystems/proc.rst | 14 ++++++-
 fs/proc/base.c                     | 64 ++++++++++++++++++++++++++++++
 include/linux/freezer.h            | 20 ++++++++++
 include/linux/sched.h              |  3 ++
 kernel/fork.c                      |  1 +
 kernel/power/process.c             | 23 ++++++++++-
 kernel/sched/core.c                |  2 +
 7 files changed, 124 insertions(+), 3 deletions(-)

-- 
2.25.1
On Thu 07-08-25 20:14:09, Zihuan Zhang wrote:
> The Linux task freezer was designed in a much earlier era, when userspace was relatively simple and flat.
> Over the years, as modern desktop and mobile systems have become increasingly complex, with intricate IPC,
> asynchronous I/O, and deep event loops, the original freezer model has shown its age.

A modern userspace might be more complex or convoluted but I do not
think the above statement is accurate or even correct.

> ## Background
>
> Currently, the freezer traverses the task list linearly and attempts to freeze all tasks equally.
> It sends a signal and waits for `freezing()` to become true. While this model works well in many cases, it has several inherent limitations:
>
> - Signal-based logic cannot freeze uninterruptible (D-state) tasks
> - Dependencies between processes can cause freeze retries
> - Retry-based recovery introduces unpredictable suspend latency
>
> ## Real-world problem illustration
>
> Consider the following scenario during suspend:
>
> Freeze Window Begins
>
>     [process A] - epoll_wait()
>         │
>         ▼
>     [process B] - event source (already frozen)
>
> → A enters D-state because it is waiting for B

I thought epoll_wait was waiting in interruptible sleep.

> → Cannot respond to freezing signal
> → Freezer retries in a loop
> → Suspend latency spikes
>
> In such cases, we observed that a normal 1–2ms freezer cycle could balloon to **tens of milliseconds**.
> Worse, the kernel has no insight into the root cause and simply retries blindly.
>
> ## Proposed solution: Freeze priority model
>
> To address this, we propose a **layered freeze model** based on per-task freeze priorities.
>
> ### Design
>
> We introduce 4 levels of freeze priority:
>
>
> | Priority | Level        | Description                     |
> |----------|--------------|---------------------------------|
> | 0        | HIGH         | D-state TASKs                   |
> | 1        | NORMAL       | regular user space TASKs        |
> | 2        | LOW          | not yet used                    |
> | 4        | NEVER_FREEZE | zombie TASKs, PF_SUSPEND_TASK   |
>
>
> The kernel will freeze processes **in priority order**, ensuring that higher-priority tasks are frozen first.
> This avoids dependency inversion scenarios and provides a deterministic path forward for tricky cases.
> By freezing control or event-source threads first, we prevent dependent tasks from entering D-state prematurely, effectively avoiding dependency inversion.

I really fail to see how that is supposed to work to be honest. If a
process is running in the userspace then the priority shouldn't really
matter much. Tasks will get a signal, freeze themselves and you are
done. If they are running in the userspace and e.g. sleeping while not
TASK_FREEZABLE then priority simply makes no difference. And if they are
TASK_FREEZABLE then the priority doesn't matter either.

What am I missing?

-- 
Michal Hocko
SUSE Labs
Hi,

On 2025/8/7 21:25, Michal Hocko wrote:
> On Thu 07-08-25 20:14:09, Zihuan Zhang wrote:
>> The Linux task freezer was designed in a much earlier era, when userspace was relatively simple and flat.
>> Over the years, as modern desktop and mobile systems have become increasingly complex, with intricate IPC,
>> asynchronous I/O, and deep event loops, the original freezer model has shown its age.
> A modern userspace might be more complex or convoluted but I do not
> think the above statement is accurate or even correct.

You're right, that statement may not be accurate. I'll be more careful with the wording.

>> ## Background
>>
>> Currently, the freezer traverses the task list linearly and attempts to freeze all tasks equally.
>> It sends a signal and waits for `freezing()` to become true. While this model works well in many cases, it has several inherent limitations:
>>
>> - Signal-based logic cannot freeze uninterruptible (D-state) tasks
>> - Dependencies between processes can cause freeze retries
>> - Retry-based recovery introduces unpredictable suspend latency
>>
>> ## Real-world problem illustration
>>
>> Consider the following scenario during suspend:
>>
>> Freeze Window Begins
>>
>>     [process A] - epoll_wait()
>>         │
>>         ▼
>>     [process B] - event source (already frozen)
>>
>> → A enters D-state because it is waiting for B
> I thought epoll_wait was waiting in interruptible sleep.

Apologies, my description may not be entirely accurate.
But there are some dmesg logs: [ 62.880497] PM: suspend entry (deep) [ 63.130639] Filesystems sync: 0.249 seconds [ 63.130643] PM: Preparing system for sleep (deep) [ 63.226398] Freezing user space processes [ 63.227193] freeze round: 0, task to freeze: 681 [ 63.228110] freeze round: 1, task to freeze: 1 [ 63.230064] task:Xorg state:D stack:0 pid:1404 tgid:1404 ppid:1348 task_flags:0x400100 flags:0x00004004 [ 63.230068] Call Trace: [ 63.230069] <TASK> [ 63.230071] __schedule+0x52e/0xea0 [ 63.230077] schedule+0x27/0x80 [ 63.230079] schedule_timeout+0xf2/0x100 [ 63.230082] wait_for_completion+0x85/0x130 [ 63.230085] __flush_work+0x21f/0x310 [ 63.230087] ? __pfx_wq_barrier_func+0x10/0x10 [ 63.230091] drm_mode_rmfb+0x138/0x1b0 [ 63.230093] ? __pfx_drm_mode_rmfb_work_fn+0x10/0x10 [ 63.230095] ? __pfx_drm_mode_rmfb_ioctl+0x10/0x10 [ 63.230097] drm_ioctl_kernel+0xa5/0x100 [ 63.230099] drm_ioctl+0x270/0x4b0 [ 63.230101] ? __pfx_drm_mode_rmfb_ioctl+0x10/0x10 [ 63.230104] ? syscall_exit_work+0x108/0x140 [ 63.230107] radeon_drm_ioctl+0x4a/0x80 [radeon] [ 63.230141] __x64_sys_ioctl+0x93/0xe0 [ 63.230144] ? syscall_trace_enter+0xfa/0x1c0 [ 63.230146] do_syscall_64+0x7d/0x2c0 [ 63.230148] ? 
do_syscall_64+0x1f3/0x2c0 [ 63.230150] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 63.230153] RIP: 0033:0x7f1aa132550b [ 63.230154] RSP: 002b:00007ffebab69678 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 63.230156] RAX: ffffffffffffffda RBX: 00007ffebab696bc RCX: 00007f1aa132550b [ 63.230158] RDX: 00007ffebab696bc RSI: 00000000c00464af RDI: 000000000000000e [ 63.230159] RBP: 00000000c00464af R08: 00007f1aa0c41220 R09: 000055a71ce32310 [ 63.230160] R10: 0000000000000087 R11: 0000000000000246 R12: 000055a71b813660 [ 63.230161] R13: 000000000000000e R14: 0000000003a8f5cd R15: 000055a71b6bbfb0 [ 63.230164] </TASK> [ 63.230248] freeze round: 2, task to freeze: 1 You can find it in this patch link: https://lore.kernel.org/all/20250619035355.33402-1-zhangzihuan@kylinos.cn/ >> → Cannot respond to freezing signal >> → Freezer retries in a loop >> → Suspend latency spikes >> >> In such cases, we observed that a normal 1–2ms freezer cycle could balloon to **tens of milliseconds**. >> Worse, the kernel has no insight into the root cause and simply retries blindly. >> >> ## Proposed solution: Freeze priority model >> >> To address this, we propose a **layered freeze model** based on per-task freeze priorities. >> >> ### Design >> >> We introduce 4 levels of freeze priority: >> >> >> | Priority | Level | Description | >> |----------|-------------------|-----------------------------------| >> | 0 | HIGH | D-state TASKs | >> | 1 | NORMAL | regular use space TASKS | >> | 2 | LOW | not yet used | >> | 4 | NEVER_FREEZE | zombie TASKs , PF_SUSPNED_TASK | >> >> >> The kernel will freeze processes **in priority order**, ensuring that higher-priority tasks are frozen first. >> This avoids dependency inversion scenarios and provides a deterministic path forward for tricky cases. >> By freezing control or event-source threads first, we prevent dependent tasks from entering D-state prematurely — effectively avoiding dependency inversion. 
> I really fail to see how that is supposed to work to be honest. If a
> process is running in the userspace then the priority shouldn't really
> matter much. Tasks will get a signal, freeze themselves and you are
> done. If they are running in the userspace and e.g. sleeping while not
> TASK_FREEZABLE then priority simply makes no difference. And if they are
> TASK_FREEZABLE then the priority doesn't matter either.
>
> What am I missing?

Under ideal conditions, if a userspace task is TASK_FREEZABLE, receives the freezing() signal, and enters the refrigerator in a timely manner, then freeze priority wouldn't make a difference.

However, in practice, we've observed cases where tasks appear stuck in uninterruptible sleep (D state) during the freeze phase and thus cannot respond to signals or enter the refrigerator. These tasks are technically TASK_FREEZABLE, but due to the nature of their sleep state, they don't freeze promptly, and may require multiple retry rounds, or cause the entire suspend to fail.
On 08/08, Zihuan Zhang wrote:
>
> On 2025/8/7 21:25, Michal Hocko wrote:
> >If they are running in the userspace and e.g. sleeping while not
> >TASK_FREEZABLE then priority simply makes no difference. And if they are
> >TASK_FREEZABLE then the priority doesn't matter either.
> >
> >What am I missing?

I too do not understand how this series can improve the freezer.

> under ideal conditions, if a userspace task is TASK_FREEZABLE, receives the
> freezing() signal, and enters the refrigerator in a timely manner,

Note that __freeze_task() won't even send a signal to a sleeping
TASK_FREEZABLE task, __freeze_task() will just change its state to
TASK_FROZEN.

Oleg.
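The point Oleg and Michal are making can be pictured with a small userspace model. This is only a sketch of the behaviour as described in this thread, not the kernel's code: the `Sleep` enum, the `freeze_attempt()` helper and its return strings are hypothetical names invented for illustration, and the real logic in the kernel checks more conditions than this.

```python
from enum import Enum, auto


class Sleep(Enum):
    """Hypothetical, simplified view of what a userspace task is doing."""
    RUNNING = auto()          # on a CPU or runnable
    INTERRUPTIBLE = auto()    # S state, can be woken by a signal
    UNINTERRUPTIBLE = auto()  # D state, not woken by signals


def freeze_attempt(sleep, freezable):
    """Outcome of one freeze attempt on a userspace task (simplified model)."""
    if sleep is not Sleep.RUNNING and freezable:
        # Sleeping with TASK_FREEZABLE set: per Oleg's note, the state is
        # simply switched to TASK_FROZEN, with no signal involved.
        return "frozen in place"
    if sleep is Sleep.UNINTERRUPTIBLE:
        # A D-state sleeper that is not freezable cannot be kicked, so the
        # freezer can only retry until that sleep ends on its own.
        return "retry later"
    # Running or interruptibly sleeping: the task is kicked and freezes
    # itself the next time it reaches a freeze point.
    return "freezes itself"
```

Nothing in this model depends on the order in which tasks are visited, which is the gist of the "what am I missing?" question: ordering can only matter indirectly, by changing which task happens to be in the "retry later" case when the freezer looks at it.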
Hi,

On 2025/8/8 15:57, Oleg Nesterov wrote:
> On 08/08, Zihuan Zhang wrote:
>> On 2025/8/7 21:25, Michal Hocko wrote:
>>> If they are running in the userspace and e.g. sleeping while not
>>> TASK_FREEZABLE then priority simply makes no difference. And if they are
>>> TASK_FREEZABLE then the priority doesn't matter either.
>>>
>>> What am I missing?
> I too do not understand how this series can improve the freezer.

Thanks for your question. I just replied to Michal with a similar explanation, but I really appreciate you raising the same point, so let me add a bit more context here.

Right now, we're trying to address the case where certain tasks fail to freeze (often due to short-lived D-state issues). Our current workaround is to increase the number of freeze iterations in the next suspend attempt for those tasks. While this isn't a perfect solution, the overhead of a few extra iterations is minimal compared to the cost of retrying the whole suspend cycle due to a stuck D-state task.

So for now, we believe this is a reasonable tradeoff until we find a more deterministic way to preemptively detect and prioritize problematic tasks.

Happy to hear your thoughts or suggestions if you think there's a better direction to explore.

>> under ideal conditions, if a userspace task is TASK_FREEZABLE, receives the
>> freezing() signal, and enters the refrigerator in a timely manner,
> Note that __freeze_task() won't even send a signal to a sleeping
> TASK_FREEZABLE task, __freeze_task() will just change its state to
> TASK_FROZEN.
>
> Oleg.
>

You are right.
On Fri 08-08-25 09:13:30, Zihuan Zhang wrote:
[...]
> However, in practice, we've observed cases where tasks appear stuck in
> uninterruptible sleep (D state) during the freeze phase and thus cannot
> respond to signals or enter the refrigerator. These tasks are technically
> TASK_FREEZABLE, but due to the nature of their sleep state, they don't
> freeze promptly, and may require multiple retry rounds, or cause the entire
> suspend to fail.

Right, but that is an inherent problem of the freezer implementation.
It is not really clear to me how priorities or layers improve on that.
Could you please elaborate on that?

-- 
Michal Hocko
SUSE Labs
On 2025/8/8 15:00, Michal Hocko wrote:
> On Fri 08-08-25 09:13:30, Zihuan Zhang wrote:
> [...]
>> However, in practice, we've observed cases where tasks appear stuck in
>> uninterruptible sleep (D state) during the freeze phase and thus cannot
>> respond to signals or enter the refrigerator. These tasks are technically
>> TASK_FREEZABLE, but due to the nature of their sleep state, they don't
>> freeze promptly, and may require multiple retry rounds, or cause the entire
>> suspend to fail.
> Right, but that is an inherent problem of the freezer implementation.
> It is not really clear to me how priorities or layers improve on that.
> Could you please elaborate on that?

Thanks for the follow-up.

From our observations, we've seen processes like Xorg that are in a normal state before freezing begins, but enter D state during the freeze window. Upon investigation, we found that these processes often depend on other user processes (e.g., I/O helpers or system services), and when those dependencies are frozen first, the dependent process (like Xorg) gets stuck and can't be frozen itself.

This led us to treat such processes as "hard to freeze" tasks: not because they're inherently unfreezable, but because they are more likely to become problematic if not frozen early enough.

So our model works as follows:
    • By default, the freezer tries to freeze all freezable tasks in each round.
    • With our approach, we only attempt to freeze tasks whose freeze_priority value is less than or equal to the current round number.
    • This ensures that higher-priority (i.e., harder-to-freeze, numerically lower) tasks are attempted earlier, increasing the chance that they freeze before being blocked by others.

Since we cannot know in advance which tasks will be difficult to freeze, we use heuristics:
    • Any task that causes freeze failure or is found in D state during the freeze window is treated as hard-to-freeze in the next attempt and its priority is raised.
    • Additionally, users can manually raise or lower the freeze priority of known problematic tasks via the exposed procfs interface (/proc/<pid>/freeze_priority), giving them fine-grained control.

This doesn't change the fundamental logic of the freezer, which still retries until all tasks are frozen, but by adjusting the traversal order we've observed significantly fewer retries and more reliable success in scenarios where these D state transitions occur.
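The round-based selection described above can be sketched as a small userspace model. This is a hedged illustration of the rule "attempt a task only once its freeze_priority is less than or equal to the round number", not the code from the patch series: the `Task` class, `layered_freeze()`, `promote_failed()` and the constant names are hypothetical, the numeric values are taken from the table in the cover letter, and promoting a failed task straight to HIGH is an assumption (the thread only says the priority is raised).

```python
from dataclasses import dataclass

# Values mirror the cover-letter table; the names are hypothetical.
FREEZE_PRIORITY_HIGH = 0
FREEZE_PRIORITY_NORMAL = 1
FREEZE_PRIORITY_LOW = 2
FREEZE_PRIORITY_NEVER_FREEZE = 4


@dataclass
class Task:
    name: str
    freeze_priority: int = FREEZE_PRIORITY_NORMAL
    frozen: bool = False


def layered_freeze(tasks, max_rounds=4):
    """Return, per round, the names of the tasks frozen in that round."""
    per_round = []
    for round_no in range(max_rounds):
        frozen_now = []
        for task in tasks:
            if task.frozen or task.freeze_priority == FREEZE_PRIORITY_NEVER_FREEZE:
                continue
            # Core rule from the thread: only tasks whose priority value
            # is <= the current round number are attempted in this round.
            if task.freeze_priority <= round_no:
                task.frozen = True  # the model assumes every attempt succeeds
                frozen_now.append(task.name)
        per_round.append(frozen_now)
        if all(t.frozen or t.freeze_priority == FREEZE_PRIORITY_NEVER_FREEZE
               for t in tasks):
            break
    return per_round


def promote_failed(tasks, failed_names):
    """Heuristic from the thread: tasks that failed last time go first next time."""
    for task in tasks:
        if task.name in failed_names and task.freeze_priority != FREEZE_PRIORITY_NEVER_FREEZE:
            task.freeze_priority = FREEZE_PRIORITY_HIGH  # assumed target level
```

With a HIGH-priority `Xorg` and a default-priority helper, the model freezes `Xorg` in round 0 and the helper in round 1. It deliberately ignores the timing effects Michal raises later in the thread, since every attempt in the model succeeds.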
On Fri 08-08-25 15:52:31, Zihuan Zhang wrote:
>
> On 2025/8/8 15:00, Michal Hocko wrote:
> > On Fri 08-08-25 09:13:30, Zihuan Zhang wrote:
> > [...]
> > > However, in practice, we've observed cases where tasks appear stuck in
> > > uninterruptible sleep (D state) during the freeze phase and thus cannot
> > > respond to signals or enter the refrigerator. These tasks are technically
> > > TASK_FREEZABLE, but due to the nature of their sleep state, they don't
> > > freeze promptly, and may require multiple retry rounds, or cause the entire
> > > suspend to fail.
> > Right, but that is an inherent problem of the freezer implementation.
> > It is not really clear to me how priorities or layers improve on that.
> > Could you please elaborate on that?
>
> Thanks for the follow-up.
>
> From our observations, we've seen processes like Xorg that are in a normal
> state before freezing begins, but enter D state during the freeze window.
> Upon investigation,
>
> we found that these processes often depend on other user processes (e.g.,
> I/O helpers or system services), and when those dependencies are frozen
> first, the dependent process (like Xorg) gets stuck and can't be frozen
> itself.

OK, I see.

> This led us to treat such processes as "hard to freeze" tasks: not because
> they're inherently unfreezable, but because they are more likely to become
> problematic if not frozen early enough.
>
> So our model works as follows:
>     • By default, freezer tries to freeze all freezable tasks in each
> round.
>     • With our approach, we only attempt to freeze tasks whose
> freeze_priority is less than or equal to the current round number.
>     • This ensures that higher-priority (i.e., harder-to-freeze) tasks
> are attempted earlier, increasing the chance that they freeze before being
> blocked by others.
>
> Since we cannot know in advance which tasks will be difficult to freeze, we
> use heuristics:
>     • Any task that causes freeze failure or is found in D state during
> the freeze window is treated as hard-to-freeze in the next attempt and its
> priority is increased.
>     • Additionally, users can manually raise/reduce the freeze priority
> of known problematic tasks via an exposed procfs interface, giving them
> fine-grained control.

This would have been very useful information for the changelog so that
we can understand what you are trying to achieve.

> This doesn't change the fundamental logic of the freezer (it still retries
> until all tasks are frozen) but by adjusting the traversal order,
>
> we've observed significantly fewer retries and more reliable success in
> scenarios where these D state transitions occur.

OK, I believe I do understand what you are trying to achieve but I am
not convinced this is a robust way to deal with the problem. This all
seems highly timing specific that might work in very specific usecase
but you are essentially trying to fight tiny race windows with a very
probabilistic interface.

Also the interface seems to be really coarse grained and it can easily
turn out insufficient for other usecases while it is not entirely clear
to me how this could be extended for those.

I believe it would be more useful to find sources of those freezer
blockers and try to address those. Making more blocked tasks
__set_task_frozen compatible sounds like a general improvement in
itself.

Thanks
-- 
Michal Hocko
SUSE Labs
On 2025/8/8 16:58, Michal Hocko wrote:
> On Fri 08-08-25 15:52:31, Zihuan Zhang wrote:
>> On 2025/8/8 15:00, Michal Hocko wrote:
>>> On Fri 08-08-25 09:13:30, Zihuan Zhang wrote:
>>> [...]
>>>> However, in practice, we've observed cases where tasks appear stuck in
>>>> uninterruptible sleep (D state) during the freeze phase and thus cannot
>>>> respond to signals or enter the refrigerator. These tasks are technically
>>>> TASK_FREEZABLE, but due to the nature of their sleep state, they don't
>>>> freeze promptly, and may require multiple retry rounds, or cause the entire
>>>> suspend to fail.
>>> Right, but that is an inherent problem of the freezer implementation.
>>> It is not really clear to me how priorities or layers improve on that.
>>> Could you please elaborate on that?
>> Thanks for the follow-up.
>>
>> From our observations, we've seen processes like Xorg that are in a normal
>> state before freezing begins, but enter D state during the freeze window.
>> Upon investigation,
>>
>> we found that these processes often depend on other user processes (e.g.,
>> I/O helpers or system services), and when those dependencies are frozen
>> first, the dependent process (like Xorg) gets stuck and can't be frozen
>> itself.
> OK, I see.
>
>> This led us to treat such processes as "hard to freeze" tasks: not because
>> they're inherently unfreezable, but because they are more likely to become
>> problematic if not frozen early enough.
>>
>> So our model works as follows:
>>     • By default, freezer tries to freeze all freezable tasks in each
>> round.
>>     • With our approach, we only attempt to freeze tasks whose
>> freeze_priority is less than or equal to the current round number.
>>     • This ensures that higher-priority (i.e., harder-to-freeze) tasks
>> are attempted earlier, increasing the chance that they freeze before being
>> blocked by others.
>>
>> Since we cannot know in advance which tasks will be difficult to freeze, we
>> use heuristics:
>>     • Any task that causes freeze failure or is found in D state during
>> the freeze window is treated as hard-to-freeze in the next attempt and its
>> priority is increased.
>>     • Additionally, users can manually raise/reduce the freeze priority
>> of known problematic tasks via an exposed procfs interface, giving them
>> fine-grained control.
> This would have been very useful information for the changelog so that
> we can understand what you are trying to achieve.
>

Got it, I'll add that info to the changelog. Thanks!

>> This doesn't change the fundamental logic of the freezer (it still retries
>> until all tasks are frozen) but by adjusting the traversal order,
>>
>> we've observed significantly fewer retries and more reliable success in
>> scenarios where these D state transitions occur.
>
> OK, I believe I do understand what you are trying to achieve but I am
> not convinced this is a robust way to deal with the problem. This all
> seems highly timing specific that might work in very specific usecase
> but you are essentially trying to fight tiny race windows with a very
> probabilistic interface.

Actually, our approach does not conflict with solving the underlying problem. We plan to keep the freeze priority mechanism disabled by default and only enable it when issues arise, so as to keep the existing code flow as consistent as possible. It acts as a fallback mechanism.

We acknowledge that the causes of D-state tasks are complex and take considerable effort to resolve fully, which the current freezer mechanism cannot do. Our solution is low-cost and able to catch some problematic tasks effectively.

> Also the interface seems to be really coarse grained and it can easily
> turn out insufficient for other usecases while it is not entirely clear
> to me how this could be extended for those.
We recognize that the current interface is relatively coarse-grained and may not be sufficient for all scenarios. The present implementation is a basic version. Our plan is to introduce a classification-based mechanism that assigns different freeze priorities according to process categories. For example, filesystem and graphics-related processes will be given higher default freeze priority, as they are critical in the freezing workflow. This classification approach helps target important processes more precisely. However, this requires further testing and refinement before full deployment. We believe this incremental, category-based design will make the mechanism more effective and adaptable over time while keeping it manageable. > I believe it would be more useful to find sources of those freezer > blockers and try to address those. Making more blocked tasks > __set_task_frozen compatible sounds like a general improvement in > itself. we have already identified some causes of D-state tasks, many of which are related to the filesystem. On some systems, certain processes frequently execute ext4_sync_file, and under contention this can lead to D-state tasks. 6616.650482] task:ThreadPoolForeg state:D stack:0 pid:262026 tgid:4065 ppid:2490 task_flags:0x400040 flags:0x00004004 [ 6616.650485] Call Trace: [ 6616.650486] <TASK> [ 6616.650489] __schedule+0x532/0xea0 [ 6616.650494] schedule+0x27/0x80 [ 6616.650496] jbd2_log_wait_commit+0xa6/0x120 [ 6616.650499] ? __pfx_autoremove_wake_function+0x10/0x10 [ 6616.650502] ext4_sync_file+0x1ba/0x380 [ 6616.650505] do_fsync+0x3b/0x80 [ 6616.650507] __x64_sys_fdatasync+0x17/0x20 [ 6616.650509] do_syscall_64+0x7d/0x2c0 [ 6616.650512] ? syscall_exit_work+0x108/0x140 [ 6616.650515] ? do_syscall_64+0x1f3/0x2c0 [ 6616.650517] ? syscall_exit_work+0x108/0x140 [ 6616.650519] ? do_syscall_64+0x1d5/0x2c0 [ 6616.650522] ? audit_reset_context.part.0+0x284/0x2f0 [ 6616.650524] ? syscall_exit_work+0x108/0x140 [ 6616.650527] ? 
do_syscall_64+0x1f3/0x2c0 [ 6616.650529] ? futex_unqueue+0x4e/0x80 [ 6616.650531] ? __futex_wait+0x9b/0x100 [ 6616.650534] ? __pfx_futex_wake_mark+0x10/0x10 [ 6616.650536] ? timerqueue_del+0x2e/0x50 [ 6616.650539] ? __remove_hrtimer+0x39/0x70 [ 6616.650542] ? hrtimer_try_to_cancel+0x85/0x100 [ 6616.650544] ? hrtimer_cancel+0x15/0x30 [ 6616.650546] ? futex_wait+0x7d/0x110 [ 6616.650549] ? __pfx_hrtimer_wakeup+0x10/0x10 [ 6616.650552] ? audit_reset_context.part.0+0x284/0x2f0 [ 6616.650554] ? syscall_exit_work+0x108/0x140 [ 6616.650556] ? do_syscall_64+0x1d5/0x2c0 [ 6616.650558] ? switch_fpu_return+0x4f/0xd0 [ 6616.650560] ? do_syscall_64+0x1d5/0x2c0 [ 6616.650563] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 6616.650565] RIP: 0033:0x7f095ef8f3eb [ 6616.650567] RSP: 002b:00007f07409fa360 EFLAGS: 00000293 ORIG_RAX: 000000000000004b [ 6616.650569] RAX: ffffffffffffffda RBX: 00000d38021f03a0 RCX: 00007f095ef8f3eb [ 6616.650570] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000009a [ 6616.650571] RBP: 00007f07409fa410 R08: 0000000000000000 R09: 00007f07409fa570 [ 6616.650572] R10: 00007f0960a60000 R11: 0000000000000293 R12: 00000d38021f0380 [ 6616.650573] R13: 000055c28c70b400 R14: 00007f07409fa3a0 R15: 00007f07409fa380 While the kernel already supports freezing the filesystem, which can address this problem, it is quite expensive — enabling this feature increases the suspend time by about 3~4 seconds in our tests. We are therefore exploring lower-cost approaches to mitigate the issue without such a heavy performance impact. root@zzhwaxy-pc:/sys/power# echo 1 > freeze_filesystems root@zzhwaxy-pc:/sys/power# sudo dmesg | grep -E 'suspend' [ 9844.984658] PM: suspend entry (deep) [ 9850.998197] PM: suspend exit root@zzhwaxy-pc:/sys/power# echo 0 > freeze_filesystems root@zzhwaxy-pc:/sys/power# sudo dmesg | grep -E 'suspend' [ 9893.928486] PM: suspend entry (deep) [ 9896.239425] PM: suspend exit > Thanks
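The "about 3~4 seconds" figure can be checked directly against the two dmesg excerpts above. The snippet below is a small sketch that parses the `PM: suspend entry` / `PM: suspend exit` timestamps and subtracts them; the helper name `suspend_seconds()` and the two string constants are hypothetical, and the log lines are copied verbatim from the message.

```python
import re

# dmesg excerpts copied from the message above.
WITH_FREEZE_FILESYSTEMS = """\
[ 9844.984658] PM: suspend entry (deep)
[ 9850.998197] PM: suspend exit
"""
WITHOUT_FREEZE_FILESYSTEMS = """\
[ 9893.928486] PM: suspend entry (deep)
[ 9896.239425] PM: suspend exit
"""


def suspend_seconds(dmesg_text):
    """Seconds between the 'suspend entry' and 'suspend exit' lines."""
    stamps = {}
    pattern = r"\[\s*(\d+\.\d+)\]\s+PM: suspend (entry|exit)"
    for match in re.finditer(pattern, dmesg_text):
        stamps[match.group(2)] = float(match.group(1))
    return stamps["exit"] - stamps["entry"]


with_fs = suspend_seconds(WITH_FREEZE_FILESYSTEMS)        # roughly 6.01 s
without_fs = suspend_seconds(WITHOUT_FREEZE_FILESYSTEMS)  # roughly 2.31 s
extra = with_fs - without_fs                              # roughly 3.70 s
```

Note that these intervals span the whole entry-to-exit window as logged, so they include time the machine spent suspended and resuming; only the difference between the two runs is attributed to `freeze_filesystems`, and it rests on a single sample of each.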
On Mon 11-08-25 17:13:43, Zihuan Zhang wrote:
>
> On 2025/8/8 16:58, Michal Hocko wrote:
[...]
> > Also the interface seems to be really coarse grained and it can easily
> > turn out insufficient for other usecases while it is not entirely clear
> > to me how this could be extended for those.
> We recognize that the current interface is relatively coarse-grained and
> may not be sufficient for all scenarios. The present implementation is a
> basic version.
>
> Our plan is to introduce a classification-based mechanism that assigns
> different freeze priorities according to process categories. For example,
> filesystem and graphics-related processes will be given higher default
> freeze priority, as they are critical in the freezing workflow. This
> classification approach helps target important processes more precisely.
>
> However, this requires further testing and refinement before full
> deployment. We believe this incremental, category-based design will make the
> mechanism more effective and adaptable over time while keeping it
> manageable.

Unless there is a clear path for a more extendable interface then
introducing this one is a no-go. We do not want to grow different ways
to establish freezing policies.

But much more fundamentally. So far I haven't really seen any argument
why different priorities help with the underlying problem other than the
timing might be slightly different if you change the order of freezing.
This to me sounds like the proposed scheme mostly works around the
problem you are seeing and as such is not a really good candidate to be
merged as a long term solution. Not to mention with a user API that
needs to be maintained forever.

So NAK from me on the interface.

> > I believe it would be more useful to find sources of those freezer
> > blockers and try to address those. Making more blocked tasks
> > __set_task_frozen compatible sounds like a general improvement in
> > itself.
>
> we have already identified some causes of D-state tasks, many of which are
> related to the filesystem. On some systems, certain processes frequently
> execute ext4_sync_file, and under contention this can lead to D-state tasks.

Please work with maintainers of those subsystems to find proper
solutions.

-- 
Michal Hocko
SUSE Labs
Hi all,

We encountered an issue where the number of freeze retries increased due to processes stuck in D state. The logs point to jbd2-related activity.

log1:

[ 6616.650482] task:ThreadPoolForeg state:D stack:0 pid:262026 tgid:4065 ppid:2490 task_flags:0x400040 flags:0x00004004
[ 6616.650485] Call Trace:
[ 6616.650486] <TASK>
[ 6616.650489] __schedule+0x532/0xea0
[ 6616.650494] schedule+0x27/0x80
[ 6616.650496] jbd2_log_wait_commit+0xa6/0x120
[ 6616.650499] ? __pfx_autoremove_wake_function+0x10/0x10
[ 6616.650502] ext4_sync_file+0x1ba/0x380
[ 6616.650505] do_fsync+0x3b/0x80

log2:

[ 631.206315] jdb2_log_wait_log_commit completed (elapsed 0.002 seconds)
[ 631.215325] jdb2_log_wait_log_commit completed (elapsed 0.001 seconds)
[ 631.240704] jdb2_log_wait_log_commit completed (elapsed 0.386 seconds)
[ 631.262167] Filesystems sync: 0.424 seconds
[ 631.262821] Freezing user space processes
[ 631.263839] freeze round: 1, task to freeze: 852
[ 631.265128] freeze round: 2, task to freeze: 2
[ 631.267039] freeze round: 3, task to freeze: 2
[ 631.271176] freeze round: 4, task to freeze: 2
[ 631.279160] freeze round: 5, task to freeze: 2
[ 631.287152] freeze round: 6, task to freeze: 2
[ 631.295346] freeze round: 7, task to freeze: 2
[ 631.301747] freeze round: 8, task to freeze: 2
[ 631.309346] freeze round: 9, task to freeze: 2
[ 631.317353] freeze round: 10, task to freeze: 2
[ 631.325348] freeze round: 11, task to freeze: 2
[ 631.333353] freeze round: 12, task to freeze: 2
[ 631.341358] freeze round: 13, task to freeze: 2
[ 631.349357] freeze round: 14, task to freeze: 2
[ 631.357363] freeze round: 15, task to freeze: 2
[ 631.365361] freeze round: 16, task to freeze: 2
[ 631.373379] freeze round: 17, task to freeze: 2
[ 631.381366] freeze round: 18, task to freeze: 2
[ 631.389365] freeze round: 19, task to freeze: 2
[ 631.397371] freeze round: 20, task to freeze: 2
[ 631.405373] freeze round: 21, task to freeze: 2
[ 631.413373] freeze round: 22, task to freeze: 2
[ 631.421392] freeze round: 23, task to freeze: 1
[ 631.429948] freeze round: 24, task to freeze: 1
[ 631.438295] freeze round: 25, task to freeze: 1
[ 631.444546] jdb2_log_wait_log_commit completed (elapsed 0.249 seconds)
[ 631.446387] freeze round: 26, task to freeze: 0
[ 631.446390] Freezing user space processes completed (elapsed 0.183 seconds)
[ 631.446392] OOM killer disabled.
[ 631.446393] Freezing remaining freezable tasks
[ 631.446656] freeze round: 1, task to freeze: 4
[ 631.447976] freeze round: 2, task to freeze: 0
[ 631.447978] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[ 631.447980] PM: suspend debug: Waiting for 1 second(s).
[ 632.450858] OOM killer enabled.
[ 632.450859] Restarting tasks: Starting
[ 632.453140] Restarting tasks: Done
[ 632.453173] random: crng reseeded on system resumption
[ 632.453370] PM: suspend exit
[ 632.462799] jdb2_log_wait_log_commit completed (elapsed 0.000 seconds)
[ 632.466114] jdb2_log_wait_log_commit completed (elapsed 0.001 seconds)

This is the reason:

[ 631.444546] jdb2_log_wait_log_commit completed (elapsed 0.249 seconds)

During freezing, user processes executing jbd2_log_wait_commit enter D state because this function calls wait_event and can take tens of milliseconds to complete. This long execution time, coupled with possible competition with the freezer, causes repeated freeze retries.

While we understand that jbd2 is a freezable kernel thread, we would like to know if there is a way to freeze it earlier or freeze some critical processes proactively to reduce this contention.

Thanks for your input and suggestions.

On 2025/8/11 18:58, Michal Hocko wrote:
> On Mon 11-08-25 17:13:43, Zihuan Zhang wrote:
>> On 2025/8/8 16:58, Michal Hocko wrote:
> [...]
> Unless there is a clear path for a more extendable interface then
> introducing this one is a no-go. We do not want to grow different ways
> to establish freezing policies.
>
> But much more fundamentally. So far I haven't really seen any argument
> why different priorities help with the underlying problem other than the
> timing might be slightly different if you change the order of freezing.
> This to me sounds like the proposed scheme mostly works around the
> problem you are seeing and as such is not a really good candidate to be
> merged as a long term solution. Not to mention with a user API that
> needs to be maintained for ever.
>
> So NAK from me on the interface.
>

Thanks for the feedback. I understand your concern that changing the freezer priority order looks like working around the symptom rather than solving the root cause.

Since the last discussion, we have analyzed the D-state processes further and identified that the long wait time is caused by jbd2_log_wait_commit. This wait happens because user tasks call into this function during fsync/fdatasync and it can take tens of milliseconds to complete. When this coincides with the freezer operation, the tasks are stuck in D state and the freezer has to retry multiple times, increasing the total freeze time.

Although we know that jbd2 is a freezable kernel thread, we are exploring whether freezing it earlier, or freezing certain key processes first, could reduce this contention and improve freeze completion time.

>>> I believe it would be more useful to find sources of those freezer
>>> blockers and try to address those. Making more blocked tasks
>>> __set_task_frozen compatible sounds like a general improvement in
>>> itself.
>> we have already identified some causes of D-state tasks, many of which are
>> related to the filesystem. On some systems, certain processes frequently
>> execute ext4_sync_file, and under contention this can lead to D-state tasks.
> Please work with maintainers of those subsystems to find proper
> solutions.

We've pulled in the jbd2 maintainer to get feedback on whether changing the freeze ordering for jbd2 is safe or if there's a better approach to avoid the repeated retries caused by this wait.
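To make the retry cost in log2 easier to read off, here is a rough Python sketch that summarizes the first freeze phase of such a log. It assumes the "freeze round: N, task to freeze: M" lines keep exactly the format shown in this message (they look like local debug prints, not something to rely on elsewhere), and `first_freeze_phase` is a name invented for this example.

```python
import re

# Format assumed from the debug output quoted in this message.
_ROUND = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+freeze round:\s*(\d+), task to freeze:\s*(\d+)"
)


def first_freeze_phase(log_text):
    """Return (rounds_needed, elapsed_seconds) for the first freeze phase.

    The phase starts at the first "freeze round" line and ends at the first
    line that reports zero tasks left. Returns None if it never completes.
    """
    first_ts = None
    for match in _ROUND.finditer(log_text):
        ts = float(match.group(1))
        round_no = int(match.group(2))
        remaining = int(match.group(3))
        if first_ts is None:
            first_ts = ts
        if remaining == 0:
            return round_no, ts - first_ts
    return None


# A shortened excerpt of log2; the timestamps are copied from the log above.
SAMPLE = """\
[  631.263839] freeze round: 1, task to freeze: 852
[  631.265128] freeze round: 2, task to freeze: 2
[  631.421392] freeze round: 23, task to freeze: 1
[  631.438295] freeze round: 25, task to freeze: 1
[  631.446387] freeze round: 26, task to freeze: 0
"""

rounds, span = first_freeze_phase(SAMPLE)
print(f"{rounds} rounds over {span * 1000:.1f} ms")
```

For this excerpt the span is about 183 ms over 26 rounds, which agrees with the "elapsed 0.183 seconds" line the kernel printed. Nearly all of it is spent waiting on the last one or two tasks.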
On Tue, Aug 12, 2025 at 01:57:49PM +0800, Zihuan Zhang wrote:
> Hi all,
>
> We encountered an issue where the number of freeze retries increased due to
> processes stuck in D state. The logs point to jbd2-related activity.
[...]
> During freezing, user processes executing jbd2_log_wait_commit enter D state
> because this function calls wait_event and can take tens of milliseconds to
> complete. This long execution time, coupled with possible competition with
> the freezer, causes repeated freeze retries.
>
> While we understand that jbd2 is a freezable kernel thread, we would like to
> know if there is a way to freeze it earlier or freeze some critical
> processes proactively to reduce this contention.

Freeze the filesystem before you start freezing kthreads? That should quiesce the jbd2 workers and pause anyone trying to write to the fs. Maybe the missing piece here is the device model not knowing how to call bdev_freeze prior to a suspend?

That said, I think that doesn't 100% work for XFS because it has kworkers for metadata buffer read completions, and freezes don't affect read operations...

(just my clueless 2c)

--D

> Thanks for your input and suggestions.
[...]
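As a point of comparison for the suggestion above, a filesystem can also be frozen and thawed from userspace with the FIFREEZE and FITHAW ioctls, which is what the fsfreeze(8) utility uses. The Python sketch below is illustrative only and is not what the suspend path does. It assumes the asm-generic ioctl number layout (as on x86 and arm64), and `freeze_then_thaw` is a hypothetical helper. Calling it needs root, and freezing a filesystem that the caller itself writes to can hang the caller.

```python
import fcntl
import os


def _iowr(type_char, nr, size):
    """Encode an _IOWR ioctl number using the asm-generic bit layout."""
    ioc_write, ioc_read = 1, 2
    return ((ioc_read | ioc_write) << 30) | (size << 16) | (ord(type_char) << 8) | nr


# linux/fs.h: FIFREEZE is _IOWR('X', 119, int), FITHAW is _IOWR('X', 120, int).
FIFREEZE = _iowr("X", 119, 4)
FITHAW = _iowr("X", 120, 4)


def freeze_then_thaw(mountpoint, work):
    """Freeze the filesystem at `mountpoint`, run `work()`, then always thaw.

    Hypothetical helper for illustration. `work` must not write to the
    frozen filesystem, or it will block until the thaw.
    """
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, FIFREEZE, 0)
        try:
            work()
        finally:
            fcntl.ioctl(fd, FITHAW, 0)
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Only print the request numbers here; nothing is frozen by running this file.
    print(hex(FIFREEZE), hex(FITHAW))
```

The nested try/finally is deliberate: a filesystem left frozen after an exception blocks every later writer, so the thaw has to run on all paths.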
Hi,

On 2025/8/13 01:26, Darrick J. Wong wrote:
> On Tue, Aug 12, 2025 at 01:57:49PM +0800, Zihuan Zhang wrote:
[...]
>> While we understand that jbd2 is a freezable kernel thread, we would like to
>> know if there is a way to freeze it earlier or freeze some critical
>> processes proactively to reduce this contention.
> Freeze the filesystem before you start freezing kthreads? That should
> quiesce the jbd2 workers and pause anyone trying to write to the fs.

Indeed, freezing the filesystem can work.

However, this approach is quite expensive: it increases the total suspend time by about 3 to 4 seconds. Because of this overhead, we are exploring alternative solutions with lower cost.

We have tested it:

https://lore.kernel.org/all/09df0911-9421-40af-8296-de1383be1c58@kylinos.cn/

> Maybe the missing piece here is the device model not knowing how to call
> bdev_freeze prior to a suspend?

Currently, the suspend flow does not seem to invoke bdev_freeze(). Do you have any plans or insights on improving or integrating this functionality more smoothly into the device model and suspend sequence?

> That said, I think that doesn't 100% work for XFS because it has
> kworkers for metadata buffer read completions, and freezes don't affect
> read operations...

Does read activity also cause processes to enter D (uninterruptible sleep) state? From what I understand, it's usually writes or synchronous operations that do, but I'm curious if reads can also lead to D state under certain conditions.

> (just my clueless 2c)
>
> --D
[...]
On Wed, Aug 13, 2025 at 01:48:37PM +0800, Zihuan Zhang wrote:
> Hi,
>
> On 2025/8/13 01:26, Darrick J. Wong wrote:
[...]
> > Freeze the filesystem before you start freezing kthreads? That should
> > quiesce the jbd2 workers and pause anyone trying to write to the fs.
> Indeed, freezing the filesystem can work.
>
> However, this approach is quite expensive: it increases the total suspend
> time by about 3 to 4 seconds. Because of this overhead, we are exploring
> alternative solutions with lower cost.

Indeed it does, because now XFS and friends will actually shut down their background workers and flush all the dirty data and metadata to disk. On the other hand, if the system crashes while suspended, there's a lot less recovery work to be done.

Granted the kernel (or userspace) will usually sync() before suspending so that's not been a huge problem in production afaict.

> We have tested it:
>
> https://lore.kernel.org/all/09df0911-9421-40af-8296-de1383be1c58@kylinos.cn/
>
> > Maybe the missing piece here is the device model not knowing how to call
> > bdev_freeze prior to a suspend?
> Currently, the suspend flow does not seem to invoke bdev_freeze(). Do you have
> any plans or insights on improving or integrating this functionality more
> smoothly into the device model and suspend sequence?
> > That said, I think that doesn't 100% work for XFS because it has
> > kworkers for metadata buffer read completions, and freezes don't affect
> > read operations...
>
> Does read activity also cause processes to enter D (uninterruptible sleep)
> state?

Usually.

> From what I understand, it's usually writes or synchronous operations that
> do, but I'm curious if reads can also lead to D state under certain
> conditions.

Anything that sets the task state to uninterruptible.

--D

> > (just my clueless 2c)
> >
> > --D
[...]
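Since the answer above is that any uninterruptible sleep counts, reads included, one low-tech way to see which tasks are in D state at a given moment is to scan /proc. The following is a hedged Python sketch with made-up helper names. It reads the state field of /proc/<pid>/task/<tid>/stat as described in proc(5), and it is inherently racy because tasks come and go while it runs.

```python
import os


def parse_stat_state(stat_line):
    """Return (tid, comm, state) from the contents of a /proc/.../stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    tid = int(stat_line[:open_paren])
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return tid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (tid, comm) for every thread currently reported in state 'D'."""
    found = []
    for pid_dir in os.listdir(proc_root):
        if not pid_dir.isdigit():
            continue
        task_dir = os.path.join(proc_root, pid_dir, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # the process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as stat_file:
                    task_id, comm, state = parse_stat_state(stat_file.read())
            except (OSError, ValueError):
                continue  # the thread exited, or the line was unreadable
            if state == "D":
                found.append((task_id, comm))
    return found


if __name__ == "__main__":
    for task_id, comm in uninterruptible_tasks():
        print(task_id, comm)
```

This only shows that a task is blocked, not why. The call trace earlier in the thread (or /proc/<pid>/stack, where available) is still needed to tell an fsync wait from a blocked read.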
On 2025/8/15 00:43, Darrick J. Wong wrote:
> On Wed, Aug 13, 2025 at 01:48:37PM +0800, Zihuan Zhang wrote:
>> Hi,
>>
>> On 2025/8/13 01:26, Darrick J. Wong wrote:
>>> On Tue, Aug 12, 2025 at 01:57:49PM +0800, Zihuan Zhang wrote:
>>>> Hi all,
>>>>
>>>> We encountered an issue where the number of freeze retries increased due to
>>>> processes stuck in D state. The logs point to jbd2-related activity.
>>>>
>>>> log1:
>>>>
>>>> 6616.650482] task:ThreadPoolForeg state:D stack:0 pid:262026 tgid:4065 ppid:2490 task_flags:0x400040 flags:0x00004004
>>>> [ 6616.650485] Call Trace:
>>>> [ 6616.650486] <TASK>
>>>> [ 6616.650489] __schedule+0x532/0xea0
>>>> [ 6616.650494] schedule+0x27/0x80
>>>> [ 6616.650496] jbd2_log_wait_commit+0xa6/0x120
>>>> [ 6616.650499] ? __pfx_autoremove_wake_function+0x10/0x10
>>>> [ 6616.650502] ext4_sync_file+0x1ba/0x380
>>>> [ 6616.650505] do_fsync+0x3b/0x80
>>>>
>>>> log2:
>>>>
>>>> [ 631.206315] jdb2_log_wait_log_commit completed (elapsed 0.002 seconds)
>>>> [ 631.215325] jdb2_log_wait_log_commit completed (elapsed 0.001 seconds)
>>>> [ 631.240704] jdb2_log_wait_log_commit completed (elapsed 0.386 seconds)
>>>> [ 631.262167] Filesystems sync: 0.424 seconds
>>>> [ 631.262821] Freezing user space processes
>>>> [ 631.263839] freeze round: 1, task to freeze: 852
>>>> [ 631.265128] freeze round: 2, task to freeze: 2
>>>> [ 631.267039] freeze round: 3, task to freeze: 2
>>>> [ 631.271176] freeze round: 4, task to freeze: 2
>>>> [ 631.279160] freeze round: 5, task to freeze: 2
>>>> [ 631.287152] freeze round: 6, task to freeze: 2
>>>> [ 631.295346] freeze round: 7, task to freeze: 2
>>>> [ 631.301747] freeze round: 8, task to freeze: 2
>>>> [ 631.309346] freeze round: 9, task to freeze: 2
>>>> [ 631.317353] freeze round: 10, task to freeze: 2
>>>> [ 631.325348] freeze round: 11, task to freeze: 2
>>>> [ 631.333353] freeze round: 12, task to freeze: 2
>>>> [ 631.341358] freeze round: 13, task to freeze: 2
>>>> [ 631.349357] freeze round: 14, task to freeze: 2
>>>> [ 631.357363] freeze round: 15, task to freeze: 2
>>>> [ 631.365361] freeze round: 16, task to freeze: 2
>>>> [ 631.373379] freeze round: 17, task to freeze: 2
>>>> [ 631.381366] freeze round: 18, task to freeze: 2
>>>> [ 631.389365] freeze round: 19, task to freeze: 2
>>>> [ 631.397371] freeze round: 20, task to freeze: 2
>>>> [ 631.405373] freeze round: 21, task to freeze: 2
>>>> [ 631.413373] freeze round: 22, task to freeze: 2
>>>> [ 631.421392] freeze round: 23, task to freeze: 1
>>>> [ 631.429948] freeze round: 24, task to freeze: 1
>>>> [ 631.438295] freeze round: 25, task to freeze: 1
>>>> [ 631.444546] jdb2_log_wait_log_commit completed (elapsed 0.249 seconds)
>>>> [ 631.446387] freeze round: 26, task to freeze: 0
>>>> [ 631.446390] Freezing user space processes completed (elapsed 0.183 seconds)
>>>> [ 631.446392] OOM killer disabled.
>>>> [ 631.446393] Freezing remaining freezable tasks
>>>> [ 631.446656] freeze round: 1, task to freeze: 4
>>>> [ 631.447976] freeze round: 2, task to freeze: 0
>>>> [ 631.447978] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
>>>> [ 631.447980] PM: suspend debug: Waiting for 1 second(s).
>>>> [ 632.450858] OOM killer enabled.
>>>> [ 632.450859] Restarting tasks: Starting
>>>> [ 632.453140] Restarting tasks: Done
>>>> [ 632.453173] random: crng reseeded on system resumption
>>>> [ 632.453370] PM: suspend exit
>>>> [ 632.462799] jdb2_log_wait_log_commit completed (elapsed 0.000 seconds)
>>>> [ 632.466114] jdb2_log_wait_log_commit completed (elapsed 0.001 seconds)
>>>>
>>>> This is the reason:
>>>>
>>>> [ 631.444546] jdb2_log_wait_log_commit completed (elapsed 0.249 seconds)
>>>>
>>>>
>>>> During freezing, user processes executing jbd2_log_wait_commit enter D state
>>>> because this function calls wait_event and can take tens of milliseconds to
>>>> complete. This long execution time, coupled with possible competition with
>>>> the freezer, causes repeated freeze retries.
>>>>
>>>> While we understand that jbd2 is a freezable kernel thread, we would like to
>>>> know if there is a way to freeze it earlier or freeze some critical
>>>> processes proactively to reduce this contention.
>>> Freeze the filesystem before you start freezing kthreads? That should
>>> quiesce the jbd2 workers and pause anyone trying to write to the fs.
>> Indeed, freezing the filesystem can work.
>>
>> However, this approach is quite expensive: it increases the total suspend
>> time by about 3 to 4 seconds. Because of this overhead, we are exploring
>> alternative solutions with lower cost.
> Indeed it does, because now XFS and friends will actually shut down
> their background workers and flush all the dirty data and metadata to
> disk. On the other hand, if the system crashes while suspended, there's
> a lot less recovery work to be done.
>
> Granted the kernel (or userspace) will usually sync() before suspending
> so that's not been a huge problem in production afaict.

Thank you for your explanation!

>> We have tested it:
>>
>> https://lore.kernel.org/all/09df0911-9421-40af-8296-de1383be1c58@kylinos.cn/
>>
>>> Maybe the missing piece here is the device model not knowing how to call
>>> bdev_freeze prior to a suspend?
>> Currently, the suspend flow does not seem to invoke bdev_freeze(). Do you have
>> any plans or insights on improving or integrating this functionality more
>> smoothly into the device model and suspend sequence?
>>> That said, I think that doesn't 100% work for XFS because it has
>>> kworkers for metadata buffer read completions, and freezes don't affect
>>> read operations...
>> Does read activity also cause processes to enter D (uninterruptible sleep)
>> state?
> Usually.

I think you are right. Read operations like vfs_read() can also cause it.

[ 79.179682] PM: suspend entry (deep)
[ 79.302703] Filesystems sync: 0.123 seconds
[ 79.385416] Freezing user space processes
[ 79.386223] round:0 todo:673
[ 79.387025] currnet process has not been frozen :Xorg pid:1588
[ 79.387026] task:Xorg state:D stack:0 pid:1588 tgid:1588 ppid:1471 flags:0x00000004
[ 79.387030] Call Trace:
[ 79.387031] <TASK>
[ 79.387032] __schedule+0x46c/0xe40
[ 79.387038] schedule+0x32/0xb0
[ 79.387040] schedule_timeout+0x23d/0x2a0
[ 79.387043] ? pollwake+0x78/0xa0
[ 79.387046] wait_for_completion+0x8c/0x180
[ 79.387048] __flush_work+0x204/0x2d0
[ 79.387051] ? __pfx_wq_barrier_func+0x10/0x10
[ 79.387054] drm_mode_rmfb+0x1a0/0x200
[ 79.387057] ? __pfx_drm_mode_rmfb_work_fn+0x10/0x10
[ 79.387058] ? __pfx_drm_mode_rmfb_ioctl+0x10/0x10
[ 79.387060] drm_ioctl_kernel+0xbc/0x150
[ 79.387062] ? __stack_depot_save+0x38/0x4c0
[ 79.387066] drm_ioctl+0x270/0x470
[ 79.387068] ? __pfx_drm_mode_rmfb_ioctl+0x10/0x10
[ 79.387072] radeon_drm_ioctl+0x4a/0x80 [radeon]
[ 79.387108] __x64_sys_ioctl+0x8c/0xc0
[ 79.387110] do_syscall_64+0x7e/0x270
[ 79.387112] ? __fsnotify_parent+0x113/0x370
[ 79.387114] ? drm_read+0x284/0x320
[ 79.387117] ? syscall_exit_work+0x110/0x140
[ 79.387120] ? vfs_read+0x220/0x2f0
[ 79.387122] ? vfs_read+0x220/0x2f0
[ 79.387123] ? audit_reset_context.part.0+0x27a/0x2f0
[ 79.387126] ? audit_reset_context.part.0+0x27a/0x2f0
[ 79.387128] ? syscall_exit_work+0x110/0x140
[ 79.387130] ? do_syscall_64+0x10f/0x270
[ 79.387131] ? audit_reset_context.part.0+0x27a/0x2f0
[ 79.387133] ? syscall_exit_work+0x110/0x140
[ 79.387135] ? do_syscall_64+0x10f/0x270
[ 79.387137] ? audit_reset_context.part.0+0x27a/0x2f0
[ 79.387139] ? syscall_exit_work+0x110/0x140
[ 79.387141] ? do_syscall_64+0x10f/0x270
[ 79.387142] ? syscall_exit_work+0x110/0x140
[ 79.387144] ? do_syscall_64+0x10f/0x270
[ 79.387145] ? irqtime_account_irq+0x40/0xc0
[ 79.387148] ? irqentry_exit_to_user_mode+0x74/0x1e0
[ 79.387150] entry_SYSCALL_64_after_hwframe+0x76/0xe0
[ 79.387153] RIP: 0033:0x7f91baf2550b
[ 79.387155] RSP: 002b:00007ffc673d5668 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 79.387157] RAX: ffffffffffffffda RBX: 00007ffc673d56ac RCX: 00007f91baf2550b
[ 79.387158] RDX: 00007ffc673d56ac RSI: 00000000c00464af RDI: 000000000000000e
[ 79.387159] RBP: 00000000c00464af R08: 00007f91ba860220 R09: 000056429d1d9fa0
[ 79.387160] R10: 0000000000000103 R11: 0000000000000246 R12: 000056429ba931e0
[ 79.387161] R13: 000000000000000e R14: 00000000049f0b22 R15: 000056429b93bfb0
[ 79.387164] </TASK>
[ 79.387255] round:1 todo:1

>> From what I understand, it’s usually writes or synchronous operations that
>> do, but I’m curious if reads can also lead to D state under certain
>> conditions.
> Anything that sets the task state to uninterruptible.
>
> --D
>
>>> (just my clueless 2c)
>>>
>>> --D
>>>
>>>> Thanks for your input and suggestions.
>>>>
>>>> On 2025/8/11 18:58, Michal Hocko wrote:
>>>>> On Mon 11-08-25 17:13:43, Zihuan Zhang wrote:
>>>>>> On 2025/8/8 16:58, Michal Hocko wrote:
>>>>> [...]
>>>>>>> Also the interface seems to be really coarse grained and it can easily
>>>>>>> turn out insufficient for other usecases while it is not entirely clear
>>>>>>> to me how this could be extended for those.
>>>>>> We recognize that the current interface is relatively coarse-grained and
>>>>>> may not be sufficient for all scenarios. The present implementation is a
>>>>>> basic version.
>>>>>>
>>>>>> Our plan is to introduce a classification-based mechanism that assigns
>>>>>> different freeze priorities according to process categories. For example,
>>>>>> filesystem and graphics-related processes will be given higher default
>>>>>> freeze priority, as they are critical in the freezing workflow. This
>>>>>> classification approach helps target important processes more precisely.
>>>>>>
>>>>>> However, this requires further testing and refinement before full
>>>>>> deployment. We believe this incremental, category-based design will make the
>>>>>> mechanism more effective and adaptable over time while keeping it
>>>>>> manageable.
>>>>> Unless there is a clear path for a more extendable interface then
>>>>> introducing this one is a no-go. We do not want to grow different ways
>>>>> to establish freezing policies.
>>>>>
>>>>> But much more fundamentally. So far I haven't really seen any argument
>>>>> why different priorities help with the underlying problem other than the
>>>>> timing might be slightly different if you change the order of freezing.
>>>>> This to me sounds like the proposed scheme mostly works around the
>>>>> problem you are seeing and as such is not a really good candidate to be
>>>>> merged as a long term solution. Not to mention with a user API that
>>>>> needs to be maintained for ever.
>>>>>
>>>>> So NAK from me on the interface.
>>>>>
>>>> Thanks for the feedback. I understand your concern that changing the freezer
>>>> priority order looks like working around the symptom rather than solving the
>>>> root cause.
>>>>
>>>> Since the last discussion, we have analyzed the D-state processes further
>>>> and identified that the long wait time is caused by jbd2_log_wait_commit.
>>>> This wait happens because user tasks call into this function during
>>>> fsync/fdatasync and it can take tens of milliseconds to complete. When this
>>>> coincides with the freezer operation, the tasks are stuck in D state and
>>>> retried multiple times, increasing the total freeze time.
>>>>
>>>> Although we know that jbd2 is a freezable kernel thread, we are exploring
>>>> whether freezing it earlier, or freezing certain key processes first,
>>>> could reduce this contention and improve freeze completion time.
>>>>
>>>>
>>>>>>> I believe it would be more useful to find sources of those freezer
>>>>>>> blockers and try to address those. Making more blocked tasks
>>>>>>> __set_task_frozen compatible sounds like a general improvement in
>>>>>>> itself.
>>>>>> we have already identified some causes of D-state tasks, many of which are
>>>>>> related to the filesystem. On some systems, certain processes frequently
>>>>>> execute ext4_sync_file, and under contention this can lead to D-state tasks.
>>>>> Please work with maintainers of those subsystems to find proper
>>>>> solutions.
>>>> We’ve pulled in the jbd2 maintainer to get feedback on whether changing the
>>>> freeze ordering for jbd2 is safe or if there’s a better approach to avoid
>>>> the repeated retries caused by this wait.
>>>>
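
To make the retry cost in log2 concrete, here is a minimal userspace sketch (not kernel code) of a check-sleep-check freezer loop. The function name freeze_rounds() is hypothetical, and the backoff parameters used in the note below (a 1 ms first sleep that doubles up to an 8 ms cap) are an assumption read off the round timestamps in log2, with every sleep taken at its full length.

```c
#include <assert.h>

/*
 * Hypothetical model of a freezer retry loop: check all tasks, sleep,
 * check again. Returns how many rounds run before a task that stays in
 * uninterruptible sleep for blocked_usecs can finally be frozen.
 * The backoff values passed in are assumptions, not kernel constants.
 */
static unsigned int freeze_rounds(unsigned long blocked_usecs,
                                  unsigned long first_sleep_usecs,
                                  unsigned long cap_usecs)
{
    unsigned long elapsed = 0;
    unsigned long delay = first_sleep_usecs;
    unsigned int rounds = 1;    /* round 1 runs at t = 0 */

    assert(first_sleep_usecs > 0 && cap_usecs >= first_sleep_usecs);

    /* Keep retrying while the last task is still blocked. */
    while (elapsed < blocked_usecs) {
        elapsed += delay;       /* sleep before the next round */
        delay *= 2;             /* exponential backoff ... */
        if (delay > cap_usecs)
            delay = cap_usecs;  /* ... clamped at the cap */
        rounds++;
    }
    return rounds;
}
```

Under those assumptions, freeze_rounds(183000, 1000, 8000) gives 26 rounds, which lines up with the 26 rounds and 0.183 seconds reported in log2: once the backoff reaches its cap, each further 8 ms the task spends in jbd2_log_wait_commit costs one more round.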
On Thu, Aug 07, 2025 at 08:14:09PM +0800, Zihuan Zhang wrote:

> Freeze Window Begins
>
>     [process A] - epoll_wait()
>         │
>         ▼
>     [process B] - event source (already frozen)
>

Can we make epoll_wait() TASK_FREEZABLE? AFAICT it doesn't hold any
resources, it just sits there waiting for stuff.

On 2025/8/14 22:37, Peter Zijlstra wrote:
> On Thu, Aug 07, 2025 at 08:14:09PM +0800, Zihuan Zhang wrote:
>
>> Freeze Window Begins
>>
>>     [process A] - epoll_wait()
>>         │
>>         ▼
>>     [process B] - event source (already frozen)
>>
> Can we make epoll_wait() TASK_FREEZABLE? AFAICT it doesn't hold any
> resources, it just sits there waiting for stuff.

Based on the code, it’s ep_poll() that puts the task into the D state,
most likely due to I/O or lower-level driver behavior.

In fs/eventpoll.c (v6.16), line 2097:

    __set_current_state(TASK_INTERRUPTIBLE);

Simply changing the task state may not actually address the root cause.
Currently, our approach is to identify tasks that are more likely to
cause such issues and freeze them earlier or later in the process to
avoid conflicts.
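
For reference, the distinction this exchange turns on can be sketched as a small userspace model (not kernel code). The ST_* names, their bit values, and classify() are all hypothetical stand-ins for the kernel's task-state flags. The three outcomes are a simplification of how the freezer treats a user task: it can mark a sleeper frozen in place only if the sleeper declared itself freezable, it can kick a running or interruptible task so that the task freezes itself, and it can do nothing for a plain uninterruptible (D-state) sleeper except retry.

```c
/* Illustrative state bits; names and values are assumptions for this sketch. */
#define ST_RUNNING          0x0000u
#define ST_INTERRUPTIBLE    0x0001u  /* "S": woken by signals */
#define ST_UNINTERRUPTIBLE  0x0002u  /* "D": not woken by signals */
#define ST_FREEZABLE        0x2000u  /* sleeper says it is safe to freeze */

enum freeze_action {
    FREEZE_IN_PLACE,  /* freezer marks the sleeping task frozen directly */
    SELF_FREEZE,      /* task is kicked and freezes itself on its way out */
    RETRY_LATER       /* freezer can only wait for the sleep to end */
};

/* Simplified decision for one user task, given its current state bits. */
static enum freeze_action classify(unsigned int state)
{
    if (state & ST_FREEZABLE)
        return FREEZE_IN_PLACE;
    if (state == ST_RUNNING || (state & ST_INTERRUPTIBLE))
        return SELF_FREEZE;
    return RETRY_LATER;
}
```

In this model, a sleep in ST_INTERRUPTIBLE alone is classified SELF_FREEZE, and adding ST_FREEZABLE to it, as suggested for epoll_wait(), moves it to FREEZE_IN_PLACE. Only ST_UNINTERRUPTIBLE without the freezable bit lands in RETRY_LATER, which is the case the jbd2_log_wait_commit trace shows.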