mm/oom_kill: Only delay OOM reaper for processes using robust futexes

[PATCH v4 0/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes

Posted by zhongjinji@honor.com 1 month, 3 weeks ago

From: zhongjinji <zhongjinji@honor.com>

The OOM reaper quickly reclaims a process's memory when the system hits OOM,
helping the system recover. Without the OOM reaper, if a process frozen by
cgroup v1 is OOM killed, the victim's memory cannot be freed, leaving the
system in a poor state. Even if the process is not frozen by cgroup v1,
reclaiming victims' memory remains important, as having one more process
working speeds up memory release.

When processes holding robust futexes are OOM killed but waiters on those
futexes remain alive, the robust futexes might be reaped before
futex_cleanup() runs. This can cause the waiters to block indefinitely [1].

To prevent this issue, the OOM reaper's work is delayed by 2 seconds [1]. Since
many killed processes exit within 2 seconds, the OOM reaper rarely runs after
this delay. However, robust futex users are few, so delaying OOM reap for all
victims is unnecessary.

If each thread's robust_list in a process is NULL, the process holds no robust
futexes. For such processes, the OOM reaper should not be delayed. For
processes holding robust futexes, to avoid issue [1], the OOM reaper must
still be delayed.

Patch 1 introduces process_has_robust_futex() to detect whether a process uses
robust futexes. Patch 2 delays the OOM reaper only for processes holding robust
futexes, improving OOM reaper performance. Patch 3 makes the OOM reaper and
exit_mmap() traverse the maple tree in opposite orders to reduce PTE lock
contention caused by unmapping the same vma.

Link: https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u [1]

---

v3 -> v4:

1. Rename check_robust_futex() to process_has_robust_futex() for clearer
   intent.
2. Because the delay_reap parameter was added to task_will_free_mem(),
   the function is renamed to should_reap_task() to better clarify
   its purpose.
3. Add should_delay_oom_reap() to decide whether to delay OOM reap.
4. Modify the OOM reaper to traverse the maple tree in reverse order; see patch
   3 for details.
These changes improve code readability and enhance OOM reaper behavior.

zhongjinji (3):
  futex: Introduce function process_has_robust_futex()
  mm/oom_kill: Only delay OOM reaper for processes using robust futexes
  mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple
    tree in opposite orders

 include/linux/futex.h |  5 ++++
 include/linux/mm.h    |  3 +++
 kernel/futex/core.c   | 30 +++++++++++++++++++++++
 mm/oom_kill.c         | 55 +++++++++++++++++++++++++++++++------------
 4 files changed, 78 insertions(+), 15 deletions(-)

-- 
2.17.1

Re: [PATCH v4 0/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes

Posted by Andrew Morton 1 month, 2 weeks ago

On Thu, 14 Aug 2025 21:55:52 +0800 <zhongjinji@honor.com> wrote:

> The OOM reaper quickly reclaims a process's memory when the system hits OOM,
> helping the system recover. Without the OOM reaper, if a process frozen by
> cgroup v1 is OOM killed, the victim's memory cannot be freed, leaving the
> system in a poor state. Even if the process is not frozen by cgroup v1,
> reclaiming victims' memory remains important, as having one more process
> working speeds up memory release.
> 
> When processes holding robust futexes are OOM killed but waiters on those
> futexes remain alive, the robust futexes might be reaped before
> futex_cleanup() runs. This can cause the waiters to block indefinitely [1].
> 
> To prevent this issue, the OOM reaper's work is delayed by 2 seconds [1]. Since
> many killed processes exit within 2 seconds, the OOM reaper rarely runs after
> this delay. However, robust futex users are few, so delaying OOM reap for all
> victims is unnecessary.
> 
> If each thread's robust_list in a process is NULL, the process holds no robust
> futexes. For such processes, the OOM reaper should not be delayed. For
> processes holding robust futexes, to avoid issue [1], the OOM reaper must
> still be delayed.
> 
> Patch 1 introduces process_has_robust_futex() to detect whether a process uses
> robust futexes. Patch 2 delays the OOM reaper only for processes holding robust
> futexes, improving OOM reaper performance. Patch 3 makes the OOM reaper and
> exit_mmap() traverse the maple tree in opposite orders to reduce PTE lock
> contention caused by unmapping the same vma.

This all sounds sensible, given that we appear to be stuck with the
2-second hack.

What prevents one of the process's threads from creating a robust mutex
after we've inspected it with process_has_robust_futex()?

Re: [PATCH v4 0/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes

Posted by zhongjinji 1 month, 2 weeks ago

On Thu, 14 Aug 2025 21:55:52 +0800 <zhongjinji@honor.com> wrote:

> > The OOM reaper quickly reclaims a process's memory when the system hits OOM,
> > helping the system recover. Without the OOM reaper, if a process frozen by
> > cgroup v1 is OOM killed, the victim's memory cannot be freed, leaving the
> > system in a poor state. Even if the process is not frozen by cgroup v1,
> > reclaiming victims' memory remains important, as having one more process
> > working speeds up memory release.
> > 
> > When processes holding robust futexes are OOM killed but waiters on those
> > futexes remain alive, the robust futexes might be reaped before
> > futex_cleanup() runs. This can cause the waiters to block indefinitely [1].
> > 
> > To prevent this issue, the OOM reaper's work is delayed by 2 seconds [1]. Since
> > many killed processes exit within 2 seconds, the OOM reaper rarely runs after
> > this delay. However, robust futex users are few, so delaying OOM reap for all
> > victims is unnecessary.
> > 
> > If each thread's robust_list in a process is NULL, the process holds no robust
> > futexes. For such processes, the OOM reaper should not be delayed. For
> > processes holding robust futexes, to avoid issue [1], the OOM reaper must
> > still be delayed.
> > 
> > Patch 1 introduces process_has_robust_futex() to detect whether a process uses
> > robust futexes. Patch 2 delays the OOM reaper only for processes holding robust
> > futexes, improving OOM reaper performance. Patch 3 makes the OOM reaper and
> > exit_mmap() traverse the maple tree in opposite orders to reduce PTE lock
> > contention caused by unmapping the same vma.
> 
> This all sounds sensible, given that we appear to be stuck with the
> 2-second hack.
> 
> What prevents one of the process's threads from creating a robust mutex
> after we've inspected it with process_has_robust_futex()?

Thank you, I didn't consider this situation.
Since process_has_robust_futex() is called after the kill signal is sent,
this means the process will have the SIGNAL_GROUP_EXIT flag when calling
process_has_robust_futex().
We can check whether task->signal->flags contains the SIGNAL_GROUP_EXIT
flag in set_robust_list() to ensure that the process is not being killed
before creating the robust mutex.