From: zhongjinji <zhongjinji@honor.com>

The OOM reaper quickly reclaims a process's memory when the system hits OOM,
helping the system recover. Without the OOM reaper, if a process frozen by
cgroup v1 is OOM killed, the victim's memory cannot be freed, leaving the
system in a poor state. Even if the process is not frozen by cgroup v1,
reclaiming victims' memory remains important, as having one more process
working speeds up memory release.

When processes holding robust futexes are OOM killed but waiters on those
futexes remain alive, the robust futexes might be reaped before
futex_cleanup() runs. This can cause the waiters to block indefinitely [1].

To prevent this issue, the OOM reaper's work is delayed by 2 seconds [1].
Since many killed processes exit within 2 seconds, the OOM reaper rarely
runs after this delay. However, robust futex users are few, so delaying OOM
reap for all victims is unnecessary.

If each thread's robust_list in a process is NULL, the process holds no
robust futexes. For such processes, the OOM reaper should not be delayed.
For processes holding robust futexes, to avoid issue [1], the OOM reaper
must still be delayed.

Patch 1 introduces process_has_robust_futex() to detect whether a process
uses robust futexes. Patch 2 delays the OOM reaper only for processes
holding robust futexes, improving OOM reaper performance. Patch 3 makes the
OOM reaper and exit_mmap() traverse the maple tree in opposite orders to
reduce PTE lock contention caused by unmapping the same vma.

Link: https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u [1]

---
v3 -> v4:
1. Rename check_robust_futex() to process_has_robust_futex() for clearer intent.
2. Because the delay_reap parameter was added to task_will_free_mem(), the
   function is renamed to should_reap_task() to better clarify its purpose.
3. Add should_delay_oom_reap() to decide whether to delay OOM reap.
4. Modify the OOM reaper to traverse the maple tree in reverse order; see
   patch 3 for details.
These changes improve code readability and enhance OOM reaper behavior.

zhongjinji (3):
  futex: Introduce function process_has_robust_futex()
  mm/oom_kill: Only delay OOM reaper for processes using robust futexes
  mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree
    in opposite orders

 include/linux/futex.h |  5 ++++
 include/linux/mm.h    |  3 +++
 kernel/futex/core.c   | 30 +++++++++++++++++++++++
 mm/oom_kill.c         | 55 +++++++++++++++++++++++++++++++------------
 4 files changed, 78 insertions(+), 15 deletions(-)

--
2.17.1
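[Editor's note: for context, a minimal sketch of what patch 1's
process_has_robust_futex() could look like, inferred only from the cover
letter's description (walk every thread and check whether any robust_list is
registered). The actual patch may differ in naming, placement, and the
CONFIG_COMPAT handling shown here.]

    #include <linux/sched.h>
    #include <linux/sched/signal.h>
    #include <linux/compat.h>

    /*
     * Sketch: return true if any thread of @p has registered a robust
     * futex list, i.e. the process may hold robust futexes and reaping
     * its memory before futex_cleanup() would be unsafe for waiters.
     */
    static bool process_has_robust_futex(struct task_struct *p)
    {
            struct task_struct *t;
            bool ret = false;

            rcu_read_lock();
            for_each_thread(p, t) {
                    if (unlikely(t->robust_list)) {
                            ret = true;
                            break;
                    }
    #ifdef CONFIG_COMPAT
                    if (unlikely(t->compat_robust_list)) {
                            ret = true;
                            break;
                    }
    #endif
            }
            rcu_read_unlock();

            return ret;
    }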
On Thu, 14 Aug 2025 21:55:52 +0800 <zhongjinji@honor.com> wrote:

> The OOM reaper quickly reclaims a process's memory when the system hits OOM,
> helping the system recover. Without the OOM reaper, if a process frozen by
> cgroup v1 is OOM killed, the victim's memory cannot be freed, leaving the
> system in a poor state. Even if the process is not frozen by cgroup v1,
> reclaiming victims' memory remains important, as having one more process
> working speeds up memory release.
>
> When processes holding robust futexes are OOM killed but waiters on those
> futexes remain alive, the robust futexes might be reaped before
> futex_cleanup() runs. This can cause the waiters to block indefinitely [1].
>
> To prevent this issue, the OOM reaper's work is delayed by 2 seconds [1].
> Since many killed processes exit within 2 seconds, the OOM reaper rarely
> runs after this delay. However, robust futex users are few, so delaying OOM
> reap for all victims is unnecessary.
>
> If each thread's robust_list in a process is NULL, the process holds no
> robust futexes. For such processes, the OOM reaper should not be delayed.
> For processes holding robust futexes, to avoid issue [1], the OOM reaper
> must still be delayed.
>
> Patch 1 introduces process_has_robust_futex() to detect whether a process
> uses robust futexes. Patch 2 delays the OOM reaper only for processes
> holding robust futexes, improving OOM reaper performance. Patch 3 makes the
> OOM reaper and exit_mmap() traverse the maple tree in opposite orders to
> reduce PTE lock contention caused by unmapping the same vma.

This all sounds sensible, given that we appear to be stuck with the
2-second hack.

What prevents one of the process's threads from creating a robust mutex
after we've inspected it with process_has_robust_futex()?
On Thu, 14 Aug 2025 21:55:52 +0800 <zhongjinji@honor.com> wrote:

> > The OOM reaper quickly reclaims a process's memory when the system hits OOM,
> > helping the system recover. Without the OOM reaper, if a process frozen by
> > cgroup v1 is OOM killed, the victim's memory cannot be freed, leaving the
> > system in a poor state. Even if the process is not frozen by cgroup v1,
> > reclaiming victims' memory remains important, as having one more process
> > working speeds up memory release.
> >
> > When processes holding robust futexes are OOM killed but waiters on those
> > futexes remain alive, the robust futexes might be reaped before
> > futex_cleanup() runs. This can cause the waiters to block indefinitely [1].
> >
> > To prevent this issue, the OOM reaper's work is delayed by 2 seconds [1].
> > Since many killed processes exit within 2 seconds, the OOM reaper rarely
> > runs after this delay. However, robust futex users are few, so delaying OOM
> > reap for all victims is unnecessary.
> >
> > If each thread's robust_list in a process is NULL, the process holds no
> > robust futexes. For such processes, the OOM reaper should not be delayed.
> > For processes holding robust futexes, to avoid issue [1], the OOM reaper
> > must still be delayed.
> >
> > Patch 1 introduces process_has_robust_futex() to detect whether a process
> > uses robust futexes. Patch 2 delays the OOM reaper only for processes
> > holding robust futexes, improving OOM reaper performance. Patch 3 makes the
> > OOM reaper and exit_mmap() traverse the maple tree in opposite orders to
> > reduce PTE lock contention caused by unmapping the same vma.
>
> This all sounds sensible, given that we appear to be stuck with the
> 2-second hack.
>
> What prevents one of the process's threads from creating a robust mutex
> after we've inspected it with process_has_robust_futex()?

Thank you, I didn't consider this situation.

process_has_robust_futex() is only called after the kill signal has been
sent, so the process already has SIGNAL_GROUP_EXIT set by the time it is
inspected. We could therefore check task->signal->flags for
SIGNAL_GROUP_EXIT in set_robust_list() and refuse to register a robust
futex list for a process that is already being killed, closing that window.
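[Editor's note: a rough sketch of the guard being proposed above, shown
against the existing set_robust_list() syscall. This is not part of the
posted series; the choice of error code and whether an unlocked read of
signal->flags is acceptable here would still need discussion.]

    SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head,
                    size_t, len)
    {
            /* The only accepted value of len is sizeof(*head). */
            if (unlikely(len != sizeof(*head)))
                    return -EINVAL;

            /*
             * Proposed addition (sketch): once the whole thread group is
             * being killed, SIGNAL_GROUP_EXIT is set, so refuse to register
             * a robust list. This prevents a thread from starting to use
             * robust futexes after process_has_robust_futex() has already
             * decided the OOM reaper need not be delayed for this process.
             */
            if (unlikely(current->signal->flags & SIGNAL_GROUP_EXIT))
                    return -EINVAL;

            current->robust_list = head;

            return 0;
    }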