Do not delay oom reaper when the victim is frozen

[PATCH v5 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen

Posted by zhongjinji 5 months, 2 weeks ago

The OOM reaper can quickly reap a process's memory when the system
encounters OOM, helping the system recover. If the victim process is
frozen and cannot be unfrozen in time, the reaper delayed by two seconds
will cause the system to fail to recover quickly from the OOM state.

When an OOM occurs, if the victim is not unfrozen, delaying the OOM reaper
will keep the system in a bad state for two seconds. Before scheduling the
oom_reaper task, check whether the victim is in a frozen state. If the
victim is frozen, do not delay the OOM reaper.

Signed-off-by: zhongjinji <zhongjinji@honor.com>
---
 mm/oom_kill.c | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 25923cfec9c6..4b4d73b1e00d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -683,6 +683,41 @@ static void wake_oom_reaper(struct timer_list *timer)
 	wake_up(&oom_reaper_wait);
 }
 
+/*
+ * When the victim is frozen, the OOM reaper should not be delayed, because
+ * if the victim cannot be unfrozen promptly, it may block the system from
+ * quickly recovering from the OOM state.
+ */
+static bool should_delay_oom_reap(struct task_struct *tsk)
+{
+	struct mm_struct *mm = tsk->mm;
+	struct task_struct *p;
+	bool ret;
+
+	if (!mm)
+		return true;
+
+	if (!frozen(tsk))
+		return true;
+
+	if (atomic_read(&mm->mm_users) <= 1)
+		return false;
+
+	rcu_read_lock();
+	for_each_process(p) {
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (same_thread_group(tsk, p))
+			continue;
+		ret = !frozen(p);
+		if (ret)
+			break;
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+
 /*
  * Give the OOM victim time to exit naturally before invoking the oom_reaping.
  * The timers timeout is arbitrary... the longer it is, the longer the worst
@@ -694,13 +729,16 @@ static void wake_oom_reaper(struct timer_list *timer)
 #define OOM_REAPER_DELAY (2*HZ)
 static void queue_oom_reaper(struct task_struct *tsk)
 {
+	bool delay;
+
 	/* mm is already queued? */
 	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
 		return;
 
 	get_task_struct(tsk);
+	delay = should_delay_oom_reap(tsk);
 	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
-	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
+	tsk->oom_reaper_timer.expires = jiffies + (delay ? OOM_REAPER_DELAY : 0);
 	add_timer(&tsk->oom_reaper_timer);
 }
 
-- 
2.17.1

Re: [PATCH v5 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen

Posted by Michal Hocko 5 months, 2 weeks ago

On Mon 25-08-25 21:38:54, zhongjinji wrote:
> The OOM reaper can quickly reap a process's memory when the system
> encounters OOM, helping the system recover. If the victim process is
> frozen and cannot be unfrozen in time, the reaper delayed by two seconds
> will cause the system to fail to recover quickly from the OOM state.
> 
> When an OOM occurs, if the victim is not unfrozen, delaying the OOM reaper
> will keep the system in a bad state for two seconds. Before scheduling the
> oom_reaper task, check whether the victim is in a frozen state. If the
> victim is frozen, do not delay the OOM reaper.

I do not think this changelog captures the essence of the change really
well and it suggests that this might be a performance optimization. As I
have explained on several occasions the oom reaper is not meant to be a
performance optimization but rather a forward progress guarantee. I
would suggest this wording instead.

"
The oom reaper is a mechanism to guarantee a forward process during OOM
situation when the oom victim cannot terminate on its own (e.g. being
blocked in uninterruptible state or frozen by cgroup freezer). In order
to give the victim some time to terminate properly the oom reaper is
delayed in its invocation. This is particularly beneficial when the oom
victim is holding robust futex resources as the anonymous memory tear
down can break those.

On the other hand deliberately frozen tasks by the freezer cgroup will
not wake up until they are thawed in the userspace and delay is
effectively pointless. Therefore opt out from the delay for cgroup
frozen oom victims.
"

Thanks!
-- 
Michal Hocko
SUSE Labs

Re: [PATCH v5 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen

Posted by zhongjinji 5 months, 2 weeks ago

> On Mon 25-08-25 21:38:54, zhongjinji wrote:
> > The OOM reaper can quickly reap a process's memory when the system
> > encounters OOM, helping the system recover. If the victim process is
> > frozen and cannot be unfrozen in time, the reaper delayed by two seconds
> > will cause the system to fail to recover quickly from the OOM state.
> > 
> > When an OOM occurs, if the victim is not unfrozen, delaying the OOM reaper
> > will keep the system in a bad state for two seconds. Before scheduling the
> > oom_reaper task, check whether the victim is in a frozen state. If the
> > victim is frozen, do not delay the OOM reaper.
> 
> I do not think this changelog captures the essence of the change really
> well and it suggests that this might be a performance optimization. As I
> have explained on several occasions the oom reaper is not meant to be a
> performance optimization but rather a forward progress guarantee. I
> would suggest this wording instead.
> 
> "
> The oom reaper is a mechanism to guarantee a forward process during OOM
> situation when the oom victim cannot terminate on its own (e.g. being
> blocked in uninterruptible state or frozen by cgroup freezer). In order
> to give the victim some time to terminate properly the oom reaper is
> delayed in its invocation. This is particularly beneficial when the oom
> victim is holding robust futex resources as the anonymous memory tear
> down can break those.
> 
> On the other hand deliberately frozen tasks by the freezer cgroup will
> not wake up until they are thawed in the userspace and delay is
> effectively pointless. Therefore opt out from the delay for cgroup
> frozen oom victims.
> "

Thank you, I will update it.

> 
> Thanks!
> -- 
> Michal Hocko
> SUSE Labs

Re: [PATCH v5 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen

Posted by Lorenzo Stoakes 5 months, 2 weeks ago

On Mon, Aug 25, 2025 at 09:38:54PM +0800, zhongjinji wrote:
> The OOM reaper can quickly reap a process's memory when the system
> encounters OOM, helping the system recover. If the victim process is
> frozen and cannot be unfrozen in time, the reaper delayed by two seconds
> will cause the system to fail to recover quickly from the OOM state.

Be good to reference the commit where this was introduced.

>
> When an OOM occurs, if the victim is not unfrozen, delaying the OOM reaper
> will keep the system in a bad state for two seconds. Before scheduling the
> oom_reaper task, check whether the victim is in a frozen state. If the
> victim is frozen, do not delay the OOM reaper.
>
> Signed-off-by: zhongjinji <zhongjinji@honor.com>

This is a lot better than the previous version, thanks! :)

> ---
>  mm/oom_kill.c | 40 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 39 insertions(+), 1 deletion(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 25923cfec9c6..4b4d73b1e00d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -683,6 +683,41 @@ static void wake_oom_reaper(struct timer_list *timer)
>  	wake_up(&oom_reaper_wait);
>  }
>
> +/*
> + * When the victim is frozen, the OOM reaper should not be delayed, because
> + * if the victim cannot be unfrozen promptly, it may block the system from
> + * quickly recovering from the OOM state.
> + */

You should put comments like this with each of the predicates, so e.g. this
comment should be above the frozen check, and then you should write equivalent
ones for the rest.

However, if Shakeel's correct, you can vastly simplify this further, so
obviously in that instance you can reduce to the single comment.

> +static bool should_delay_oom_reap(struct task_struct *tsk)
> +{
> +	struct mm_struct *mm = tsk->mm;
> +	struct task_struct *p;
> +	bool ret;
> +
> +	if (!mm)
> +		return true;
> +
> +	if (!frozen(tsk))
> +		return true;
> +
> +	if (atomic_read(&mm->mm_users) <= 1)
> +		return false;
> +
> +	rcu_read_lock();
> +	for_each_process(p) {
> +		if (!process_shares_mm(p, mm))
> +			continue;
> +		if (same_thread_group(tsk, p))
> +			continue;
> +		ret = !frozen(p);
> +		if (ret)
> +			break;
> +	}
> +	rcu_read_unlock();

This surely in any case must exist as a helper somehwere (bieng lazy + not
checking), seems a prime candidate for that if not.

> +
> +	return ret;
> +}
> +
>  /*
>   * Give the OOM victim time to exit naturally before invoking the oom_reaping.
>   * The timers timeout is arbitrary... the longer it is, the longer the worst
> @@ -694,13 +729,16 @@ static void wake_oom_reaper(struct timer_list *timer)
>  #define OOM_REAPER_DELAY (2*HZ)
>  static void queue_oom_reaper(struct task_struct *tsk)
>  {
> +	bool delay;
> +
>  	/* mm is already queued? */
>  	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
>  		return;
>
>  	get_task_struct(tsk);
> +	delay = should_delay_oom_reap(tsk);
>  	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
> -	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
> +	tsk->oom_reaper_timer.expires = jiffies + (delay ? OOM_REAPER_DELAY : 0);

I mean, unless there's some reason not to, why not simplify to:

	task->oom_reaper_timer.expires = jiffies;
	if (should_delay_oom_reap(tsk))
		task->oom_reaper_timer.expires += OOM_REAPER_DELAY;

While super spells things out and avoids the other noise.

>  	add_timer(&tsk->oom_reaper_timer);
>  }
>
> --
> 2.17.1
>

Re: [PATCH v5 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen

Posted by Shakeel Butt 5 months, 2 weeks ago

+cgroups

On Mon, Aug 25, 2025 at 09:38:54PM +0800, zhongjinji wrote:
> The OOM reaper can quickly reap a process's memory when the system
> encounters OOM, helping the system recover. If the victim process is
> frozen and cannot be unfrozen in time, the reaper delayed by two seconds
> will cause the system to fail to recover quickly from the OOM state.
> 
> When an OOM occurs, if the victim is not unfrozen, delaying the OOM reaper
> will keep the system in a bad state for two seconds. Before scheduling the
> oom_reaper task, check whether the victim is in a frozen state. If the
> victim is frozen, do not delay the OOM reaper.
> 
> Signed-off-by: zhongjinji <zhongjinji@honor.com>
> ---
>  mm/oom_kill.c | 40 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 25923cfec9c6..4b4d73b1e00d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -683,6 +683,41 @@ static void wake_oom_reaper(struct timer_list *timer)
>  	wake_up(&oom_reaper_wait);
>  }
>  
> +/*
> + * When the victim is frozen, the OOM reaper should not be delayed, because
> + * if the victim cannot be unfrozen promptly, it may block the system from
> + * quickly recovering from the OOM state.
> + */
> +static bool should_delay_oom_reap(struct task_struct *tsk)
> +{
> +	struct mm_struct *mm = tsk->mm;
> +	struct task_struct *p;
> +	bool ret;
> +

On v2, shouldn't READ_ONCE(tsk->frozen) be enough instead of mm check
and checks insode for_each_process()?

> +	if (!mm)
> +		return true;
> +
> +	if (!frozen(tsk))
> +		return true;
> +
> +	if (atomic_read(&mm->mm_users) <= 1)
> +		return false;
> +
> +	rcu_read_lock();
> +	for_each_process(p) {
> +		if (!process_shares_mm(p, mm))
> +			continue;
> +		if (same_thread_group(tsk, p))
> +			continue;
> +		ret = !frozen(p);
> +		if (ret)
> +			break;
> +	}
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
>  /*
>   * Give the OOM victim time to exit naturally before invoking the oom_reaping.
>   * The timers timeout is arbitrary... the longer it is, the longer the worst
> @@ -694,13 +729,16 @@ static void wake_oom_reaper(struct timer_list *timer)
>  #define OOM_REAPER_DELAY (2*HZ)
>  static void queue_oom_reaper(struct task_struct *tsk)
>  {
> +	bool delay;
> +
>  	/* mm is already queued? */
>  	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
>  		return;
>  
>  	get_task_struct(tsk);
> +	delay = should_delay_oom_reap(tsk);
>  	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
> -	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
> +	tsk->oom_reaper_timer.expires = jiffies + (delay ? OOM_REAPER_DELAY : 0);
>  	add_timer(&tsk->oom_reaper_timer);
>  }
>  
> -- 
> 2.17.1
>

Re: [PATCH v5 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen

Posted by zhongjinji 5 months, 2 weeks ago

> +cgroups
> 
> On Mon, Aug 25, 2025 at 09:38:54PM +0800, zhongjinji wrote:
> > The OOM reaper can quickly reap a process's memory when the system
> > encounters OOM, helping the system recover. If the victim process is
> > frozen and cannot be unfrozen in time, the reaper delayed by two seconds
> > will cause the system to fail to recover quickly from the OOM state.
> > 
> > When an OOM occurs, if the victim is not unfrozen, delaying the OOM reaper
> > will keep the system in a bad state for two seconds. Before scheduling the
> > oom_reaper task, check whether the victim is in a frozen state. If the
> > victim is frozen, do not delay the OOM reaper.
> > 
> > Signed-off-by: zhongjinji <zhongjinji@honor.com>
> > ---
> >  mm/oom_kill.c | 40 +++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 39 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 25923cfec9c6..4b4d73b1e00d 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -683,6 +683,41 @@ static void wake_oom_reaper(struct timer_list *timer)
> >  	wake_up(&oom_reaper_wait);
> >  }
> >  
> > +/*
> > + * When the victim is frozen, the OOM reaper should not be delayed, because
> > + * if the victim cannot be unfrozen promptly, it may block the system from
> > + * quickly recovering from the OOM state.
> > + */
> > +static bool should_delay_oom_reap(struct task_struct *tsk)
> > +{
> > +	struct mm_struct *mm = tsk->mm;
> > +	struct task_struct *p;
> > +	bool ret;
> > +
> 
> On v2, shouldn't READ_ONCE(tsk->frozen) be enough instead of mm check
> and checks insode for_each_process()?

Thank you.
It is mainly to check the processes created by vfork. I believe no one would
put them into different cgroups, so I also think that READ_ONCE(tsk->frozen)
is enough. I will update it.

> > +	if (!mm)
> > +		return true;
> > +
> > +	if (!frozen(tsk))
> > +		return true;
> > +
> > +	if (atomic_read(&mm->mm_users) <= 1)
> > +		return false;
> > +
> > +	rcu_read_lock();
> > +	for_each_process(p) {