From nobody Sat Oct 4 15:54:56 2025
Subject: [PATCH v4 1/3] futex: Introduce function process_has_robust_futex()
Date: Thu, 14 Aug 2025 21:55:53 +0800
Message-ID: <20250814135555.17493-2-zhongjinji@honor.com>
In-Reply-To: <20250814135555.17493-1-zhongjinji@honor.com>
References: <20250814135555.17493-1-zhongjinji@honor.com>
X-Mailing-List: linux-kernel@vger.kernel.org

From: zhongjinji

When the holders of robust futexes are OOM killed but the waiters on
robust futexes are still alive, the robust futexes might be reaped before
futex_cleanup() runs. This can cause the waiters to block indefinitely [1].

To prevent this issue, the OOM reaper's work is delayed by 2 seconds [1].
As a result, the OOM reaper now rarely runs, since many killed processes
exit within 2 seconds. Because robust futex users are few, delay the
reaper's execution only for processes holding robust futexes; this
improves the performance of the OOM reaper.

Introduce process_has_robust_futex() to detect whether a process uses
robust futexes: if every thread's robust_list in the process is NULL, the
process holds no robust futexes; otherwise, it does.
Link: https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u [1]
Signed-off-by: zhongjinji
---
 include/linux/futex.h |  5 +++++
 kernel/futex/core.c   | 30 ++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/include/linux/futex.h b/include/linux/futex.h
index 9e9750f04980..39540b7ae2a1 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -81,6 +81,7 @@ void futex_exec_release(struct task_struct *tsk);
 long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
	      u32 __user *uaddr2, u32 val2, u32 val3);
 int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4);
+bool process_has_robust_futex(struct task_struct *tsk);
 
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
 int futex_hash_allocate_default(void);
@@ -108,6 +109,10 @@ static inline int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsig
 {
	return -EINVAL;
 }
+static inline bool process_has_robust_futex(struct task_struct *tsk)
+{
+	return false;
+}
 static inline int futex_hash_allocate_default(void)
 {
	return 0;
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index d9bb5567af0c..01b6561ab4f6 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1961,6 +1961,36 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
	return ret;
 }
 
+/*
+ * process_has_robust_futex() - check whether the given task holds robust futexes.
+ * @tsk: task struct of the task to consider
+ *
+ * If any thread in the task has a non-NULL robust_list or compat_robust_list,
+ * the task holds robust futexes.
+ */
+bool process_has_robust_futex(struct task_struct *tsk)
+{
+	struct task_struct *t;
+	bool ret = false;
+
+	rcu_read_lock();
+	for_each_thread(tsk, t) {
+		if (unlikely(t->robust_list)) {
+			ret = true;
+			break;
+		}
+#ifdef CONFIG_COMPAT
+		if (unlikely(t->compat_robust_list)) {
+			ret = true;
+			break;
+		}
+#endif
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+
 static int __init futex_init(void)
 {
	unsigned long hashsize, i;
-- 
2.17.1
Subject: [PATCH v4 2/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes
Date: Thu, 14 Aug 2025 21:55:54 +0800
Message-ID: <20250814135555.17493-3-zhongjinji@honor.com>
In-Reply-To: <20250814135555.17493-1-zhongjinji@honor.com>
References: <20250814135555.17493-1-zhongjinji@honor.com>

From: zhongjinji

The OOM reaper can quickly reap a process's memory when the system
encounters OOM, helping the system recover.
Without the OOM reaper, if a process frozen by cgroup v1 is OOM killed,
the victim's memory cannot be freed and the system stays in a poor state.
Even if the process is not frozen by cgroup v1, reaping the victim's
memory is still worthwhile, because one more process doing the work speeds
up memory release.

When processes holding robust futexes are OOM killed but waiters on those
futexes remain alive, the robust futexes might be reaped before
futex_cleanup() runs, which would cause the waiters to block indefinitely.
To prevent this, the OOM reaper's work is delayed by 2 seconds [1]. As a
result, the OOM reaper now rarely runs, since many killed processes exit
within 2 seconds. Because robust futex users are few, it is unreasonable
to delay the OOM reap for all victims: for processes that do not hold
robust futexes the OOM reaper should not be delayed, while for processes
holding robust futexes it must still be delayed to prevent the waiters
from blocking indefinitely [1].

Link: https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u [1]
Signed-off-by: zhongjinji
---
 mm/oom_kill.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 38 insertions(+), 13 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 25923cfec9c6..7ae4001e47c1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -39,6 +39,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -692,7 +693,7 @@ static void wake_oom_reaper(struct timer_list *timer)
  * before the exit path is able to wake the futex waiters.
  */
 #define OOM_REAPER_DELAY (2*HZ)
-static void queue_oom_reaper(struct task_struct *tsk)
+static void queue_oom_reaper(struct task_struct *tsk, bool delay)
 {
	/* mm is already queued? */
	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
@@ -700,7 +701,7 @@ static void queue_oom_reaper(struct task_struct *tsk)
 
	get_task_struct(tsk);
	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
-	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
+	tsk->oom_reaper_timer.expires = jiffies + (delay ? OOM_REAPER_DELAY : 0);
	add_timer(&tsk->oom_reaper_timer);
 }
 
@@ -742,7 +743,7 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static inline void queue_oom_reaper(struct task_struct *tsk)
+static inline void queue_oom_reaper(struct task_struct *tsk, bool delay)
 {
 }
 #endif /* CONFIG_MMU */
@@ -843,6 +844,16 @@ bool oom_killer_disable(signed long timeout)
	return true;
 }
 
+/*
+ * If the owner of robust futexes is OOM killed, the robust futexes might be
+ * freed by the OOM reaper before futex_cleanup() runs, which could cause
+ * the waiters to block indefinitely. So when a task holds robust futexes,
+ * delay the OOM reaper.
+ */
+static inline bool should_delay_oom_reap(struct task_struct *task)
+{
+	return process_has_robust_futex(task);
+}
+
 static inline bool __task_will_free_mem(struct task_struct *task)
 {
	struct signal_struct *sig = task->signal;
@@ -865,17 +876,19 @@ static inline bool __task_will_free_mem(struct task_struct *task)
 }
 
 /*
- * Checks whether the given task is dying or exiting and likely to
- * release its address space. This means that all threads and processes
+ * Determine whether the given task should be reaped based on
+ * whether it is dying or exiting and likely to release its
+ * address space. This means that all threads and processes
  * sharing the same mm have to be killed or exiting.
  * Caller has to make sure that task->mm is stable (hold task_lock or
  * it operates on the current).
  */
-static bool task_will_free_mem(struct task_struct *task)
+static bool should_reap_task(struct task_struct *task, bool *delay_reap)
 {
	struct mm_struct *mm = task->mm;
	struct task_struct *p;
	bool ret = true;
+	bool delay;
 
	/*
	 * Skip tasks without mm because it might have passed its exit_mm and
@@ -888,6 +901,8 @@ static bool task_will_free_mem(struct task_struct *task)
	if (!__task_will_free_mem(task))
		return false;
 
+	delay = should_delay_oom_reap(task);
+
	/*
	 * This task has already been drained by the oom reaper so there are
	 * only small chances it will free some more
@@ -912,8 +927,11 @@ static bool task_will_free_mem(struct task_struct *task)
		ret = __task_will_free_mem(p);
		if (!ret)
			break;
+		if (!delay)
+			delay = should_delay_oom_reap(p);
	}
	rcu_read_unlock();
+	*delay_reap = delay;
 
	return ret;
 }
@@ -923,6 +941,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
	struct task_struct *p;
	struct mm_struct *mm;
	bool can_oom_reap = true;
+	bool delay_reap;
 
	p = find_lock_task_mm(victim);
	if (!p) {
@@ -959,6 +978,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
		from_kuid(&init_user_ns, task_uid(victim)),
		mm_pgtables_bytes(mm) >> 10, victim->signal->oom_score_adj);
	task_unlock(victim);
+	delay_reap = should_delay_oom_reap(victim);
 
	/*
	 * Kill all user processes sharing victim->mm in other thread groups, if
@@ -990,11 +1010,13 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
		if (unlikely(p->flags & PF_KTHREAD))
			continue;
		do_send_sig_info(SIGKILL, SEND_SIG_PRIV, p, PIDTYPE_TGID);
+		if (!delay_reap)
+			delay_reap = should_delay_oom_reap(p);
	}
	rcu_read_unlock();
 
	if (can_oom_reap)
-		queue_oom_reaper(victim);
+		queue_oom_reaper(victim, delay_reap);
 
	mmdrop(mm);
	put_task_struct(victim);
@@ -1020,6 +1042,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
	struct mem_cgroup *oom_group;
	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
				      DEFAULT_RATELIMIT_BURST);
+	bool delay_reap = false;
 
	/*
	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -1027,9 +1050,9 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
	 * so it can die quickly
	 */
	task_lock(victim);
-	if (task_will_free_mem(victim)) {
+	if (should_reap_task(victim, &delay_reap)) {
		mark_oom_victim(victim);
-		queue_oom_reaper(victim);
+		queue_oom_reaper(victim, delay_reap);
		task_unlock(victim);
		put_task_struct(victim);
		return;
@@ -1112,6 +1135,7 @@ EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 bool out_of_memory(struct oom_control *oc)
 {
	unsigned long freed = 0;
+	bool delay_reap = false;
 
	if (oom_killer_disabled)
		return false;
@@ -1128,9 +1152,9 @@ bool out_of_memory(struct oom_control *oc)
	 * select it. The goal is to allow it to allocate so that it may
	 * quickly exit and free its memory.
	 */
-	if (task_will_free_mem(current)) {
+	if (should_reap_task(current, &delay_reap)) {
		mark_oom_victim(current);
-		queue_oom_reaper(current);
+		queue_oom_reaper(current, delay_reap);
		return true;
	}
 
@@ -1209,6 +1233,7 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
	struct task_struct *p;
	unsigned int f_flags;
	bool reap = false;
+	bool delay_reap = false;
	long ret = 0;
 
	if (flags)
@@ -1231,7 +1256,7 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
	mm = p->mm;
	mmgrab(mm);
 
-	if (task_will_free_mem(p))
+	if (should_reap_task(p, &delay_reap))
		reap = true;
	else {
		/* Error only if the work has not been done already */
@@ -1240,7 +1265,7 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
	}
	task_unlock(p);
 
-	if (!reap)
+	if (!reap || delay_reap)
		goto drop_mm;
 
	if (mmap_read_lock_killable(mm)) {
-- 
2.17.1
Subject: [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite orders
Date: Thu, 14 Aug 2025 21:55:55 +0800
Message-ID: <20250814135555.17493-4-zhongjinji@honor.com>
In-Reply-To: <20250814135555.17493-1-zhongjinji@honor.com>
References: <20250814135555.17493-1-zhongjinji@honor.com>

From: zhongjinji

When a process is OOM killed and the OOM reaper runs at the same time as
the thread executing exit_mmap(), both traverse the vma maple tree along
the same path and can easily unmap the same vma, competing for the pte
spinlock. This adds unnecessary load and lengthens the execution time of
both the OOM reaper and the exiting thread.

When a process exits, exit_mmap() traverses the maple tree from low to
high address. To reduce the chance of unmapping the same vma
simultaneously, have the OOM reaper traverse the tree from high to low
address. This reduces lock contention when both unmap the same vma.
Signed-off-by: zhongjinji
---
 include/linux/mm.h | 3 +++
 mm/oom_kill.c      | 9 +++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0c44bb8ce544..b665ea3c30eb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
 #define for_each_vma_range(__vmi, __vma, __end)	\
	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
 
+#define for_each_vma_reverse(__vmi, __vma)	\
+	while (((__vma) = vma_prev(&(__vmi))) != NULL)
+
 #ifdef CONFIG_SHMEM
 /*
  * The vma_is_shmem is not inline because it is used only by slow
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7ae4001e47c1..602d6836098a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 {
	struct vm_area_struct *vma;
	bool ret = true;
-	VMA_ITERATOR(vmi, mm, 0);
+	VMA_ITERATOR(vmi, mm, ULONG_MAX);
 
	/*
	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
	 */
	set_bit(MMF_UNSTABLE, &mm->flags);
 
-	for_each_vma(vmi, vma) {
+	/*
+	 * When two tasks unmap the same vma at the same time, they may contend
+	 * for the pte spinlock. To avoid traversing the same vmas that
+	 * exit_mmap() is unmapping, traverse the vma maple tree in reverse
+	 * order.
+	 */
+	for_each_vma_reverse(vmi, vma) {
		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
			continue;
 
-- 
2.17.1