Date: Wed, 26 Nov 2025 04:36:00 -0000
From: "tip-bot2 for Thomas Gleixner"
Sender: tip-bot2@linutronix.de
Reply-To: linux-kernel@vger.kernel.org
To: linux-tip-commits@vger.kernel.org
Cc: Thomas Gleixner, "Peter Zijlstra (Intel)", Mathieu Desnoyers,
    x86@kernel.org, linux-kernel@vger.kernel.org
Subject: [tip: core/rseq] sched/mmcid: Switch over to the new mechanism
In-Reply-To: <20251119172550.280380631@linutronix.de>
References: <20251119172550.280380631@linutronix.de>
Message-ID: <176413176134.498.16658485684652504729.tip-bot2@tip-bot2>

The following commit has been merged into the core/rseq branch of tip:

Commit-ID:     653fda7ae73d8033dedb65537acac0c2c287dc3f
Gitweb:        https://git.kernel.org/tip/653fda7ae73d8033dedb65537acac0c2c287dc3f
Author:        Thomas Gleixner
AuthorDate:    Wed, 19 Nov 2025 18:27:22 +01:00
Committer:     Thomas Gleixner
CommitterDate: Tue, 25 Nov 2025 19:45:42 +01:00

sched/mmcid: Switch over to the new mechanism

Now that all pieces are in place, change the implementations of
sched_mm_cid_fork() and sched_mm_cid_exit() to adhere to the new strict
ownership scheme and switch context_switch() over to the new
mm_cid_schedin() functionality.

The common case is that no mode change is required, so fork() and exit()
just update the user count and the constraints.

If a new user would exceed the CID space limit, the fork() context handles
the transition to per CPU mode with mm::mm_cid::mutex held.

exit() handles the transition back to per task mode when the user count
drops below the switch back threshold.

fork() might also be forced to handle a deferred switch back to per task
mode when an affinity change increased the number of allowed CPUs
sufficiently.

Signed-off-by: Thomas Gleixner
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Thomas Gleixner
Reviewed-by: Mathieu Desnoyers
Link: https://patch.msgid.link/20251119172550.280380631@linutronix.de
---
 include/linux/rseq.h       |  19 +------
 include/linux/rseq_types.h |   8 +--
 kernel/fork.c              |   1 +-
 kernel/sched/core.c        | 115 ++++++++++++++++++++++++++++++------
 kernel/sched/sched.h       |  76 +------------------------
 5 files changed, 103 insertions(+), 116 deletions(-)
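The mode policy described in the changelog can be condensed into a small
stand-alone model. This is only an illustration of the decision flow, not
code from the series: the structure, the helpers and the SWITCH_BACK_SLACK
hysteresis constant below are invented for the example, and the real
implementation additionally derives the CID space from the number of
allowed CPUs and makes the decision under mm::mm_cid::mutex.

  /* Build with: cc -o mmcid-model mmcid-model.c */
  #include <stdbool.h>
  #include <stdio.h>

  struct mm_cid_model {
          unsigned int users;     /* tasks sharing this MM */
          unsigned int cid_space; /* CIDs available to the MM */
          bool percpu;            /* current ownership mode */
  };

  /* Hypothetical hysteresis so fork()/exit() do not flip modes repeatedly */
  #define SWITCH_BACK_SLACK 2

  static void model_fork(struct mm_cid_model *mm)
  {
          mm->users++;
          /* A new user exceeds the CID space: switch to per CPU ownership */
          if (!mm->percpu && mm->users > mm->cid_space)
                  mm->percpu = true;
  }

  static void model_exit(struct mm_cid_model *mm)
  {
          mm->users--;
          /* Enough headroom again: switch back to per task ownership */
          if (mm->percpu && mm->users + SWITCH_BACK_SLACK <= mm->cid_space)
                  mm->percpu = false;
  }

  int main(void)
  {
          struct mm_cid_model mm = { .users = 1, .cid_space = 4 };

          for (int i = 0; i < 6; i++)
                  model_fork(&mm);
          printf("after forks: users=%u percpu=%d\n", mm.users, mm.percpu);
          for (int i = 0; i < 5; i++)
                  model_exit(&mm);
          printf("after exits: users=%u percpu=%d\n", mm.users, mm.percpu);
          return 0;
  }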
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 4c0e8bd..2266f4d 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -84,24 +84,6 @@ static __always_inline void rseq_sched_set_ids_changed(struct task_struct *t)
 	t->rseq.event.ids_changed = true;
 }
 
-/*
- * Invoked from switch_mm_cid() in context switch when the task gets a MM
- * CID assigned.
- *
- * This does not raise TIF_NOTIFY_RESUME as that happens in
- * rseq_sched_switch_event().
- */
-static __always_inline void rseq_sched_set_task_mm_cid(struct task_struct *t, unsigned int cid)
-{
-	/*
-	 * Requires a comparison as the switch_mm_cid() code does not
-	 * provide a conditional for it readily. So avoid excessive updates
-	 * when nothing changes.
-	 */
-	if (t->rseq.ids.mm_cid != cid)
-		t->rseq.event.ids_changed = true;
-}
-
 /* Enforce a full update after RSEQ registration and when execve() failed */
 static inline void rseq_force_update(void)
 {
@@ -169,7 +151,6 @@ static inline void rseq_handle_slowpath(struct pt_regs *regs) { }
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
 static inline void rseq_sched_switch_event(struct task_struct *t) { }
 static inline void rseq_sched_set_ids_changed(struct task_struct *t) { }
-static inline void rseq_sched_set_task_mm_cid(struct task_struct *t, unsigned int cid) { }
 static inline void rseq_force_update(void) { }
 static inline void rseq_virt_userspace_exit(void) { }
 static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
index 81fbb88..332dc14 100644
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -101,18 +101,18 @@ struct rseq_data { };
 /**
  * struct sched_mm_cid - Storage for per task MM CID data
  * @active:	MM CID is active for the task
- * @cid:	The CID associated to the task
- * @last_cid:	The last CID associated to the task
+ * @cid:	The CID associated to the task either permanently or
+ *		borrowed from the CPU
  */
 struct sched_mm_cid {
 	unsigned int	active;
 	unsigned int	cid;
-	unsigned int	last_cid;
 };
 
 /**
  * struct mm_cid_pcpu - Storage for per CPU MM_CID data
- * @cid:	The CID associated to the CPU
+ * @cid:	The CID associated to the CPU either permanently or
+ *		while a task with a CID is running
  */
 struct mm_cid_pcpu {
 	unsigned int	cid;
diff --git a/kernel/fork.c b/kernel/fork.c
index 6c23219..8475958 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -956,7 +956,6 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 #ifdef CONFIG_SCHED_MM_CID
 	tsk->mm_cid.cid = MM_CID_UNSET;
-	tsk->mm_cid.last_cid = MM_CID_UNSET;
 	tsk->mm_cid.active = 0;
 #endif
 	return tsk;
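The rewritten kernel/sched/core.c paths below lean heavily on the scope
based lock guards from <linux/cleanup.h>: guard(mutex)(&m) acquires the
mutex and releases it automatically at the end of the enclosing scope,
while scoped_guard(raw_spinlock_irq, &l) { ... } holds the lock, with
interrupts disabled, only for the braced block. For readers who have not
seen that pattern, the same idea can be approximated in user space with
the compiler's cleanup attribute; the snippet is only an analogy, not the
kernel macros:

  #include <pthread.h>
  #include <stdio.h>

  static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
  static int shared_counter;

  /* Runs automatically when the guard variable goes out of scope */
  static void unlock_cleanup(pthread_mutex_t **mp)
  {
          pthread_mutex_unlock(*mp);
  }

  #define SCOPED_MUTEX(mutex)                                     \
          __attribute__((cleanup(unlock_cleanup)))                \
          pthread_mutex_t *guard_ = (mutex);                      \
          pthread_mutex_lock(guard_)

  static void bump(void)
  {
          SCOPED_MUTEX(&m);
          shared_counter++;
          /* every return path drops the mutex, like guard(mutex)(...) */
  }

  int main(void)
  {
          bump();
          printf("counter=%d\n", shared_counter);
          return 0;
  }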
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cbb543a..62235f1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5307,7 +5307,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 		}
 	}
 
-	switch_mm_cid(prev, next);
+	mm_cid_switch_to(prev, next);
 
 	/*
 	 * Tell rseq that the task was scheduled in. Must be after
@@ -10624,7 +10624,7 @@ static bool mm_cid_fixup_task_to_cpu(struct task_struct *t, struct mm_struct *mm
 	return true;
 }
 
-static void __maybe_unused mm_cid_fixup_tasks_to_cpus(void)
+static void mm_cid_fixup_tasks_to_cpus(void)
 {
 	struct mm_struct *mm = current->mm;
 	struct task_struct *p, *t;
@@ -10674,25 +10674,81 @@ static bool sched_mm_cid_add_user(struct task_struct *t, struct mm_struct *mm)
 void sched_mm_cid_fork(struct task_struct *t)
 {
 	struct mm_struct *mm = t->mm;
+	bool percpu;
 
 	WARN_ON_ONCE(!mm || t->mm_cid.cid != MM_CID_UNSET);
 
 	guard(mutex)(&mm->mm_cid.mutex);
-	scoped_guard(raw_spinlock, &mm->mm_cid.lock) {
-		sched_mm_cid_add_user(t, mm);
-		/* Preset last_cid for mm_cid_select() */
-		t->mm_cid.last_cid = mm->mm_cid.max_cids - 1;
+	scoped_guard(raw_spinlock_irq, &mm->mm_cid.lock) {
+		struct mm_cid_pcpu *pcp = this_cpu_ptr(mm->mm_cid.pcpu);
+
+		/* First user ? */
+		if (!mm->mm_cid.users) {
+			sched_mm_cid_add_user(t, mm);
+			t->mm_cid.cid = mm_get_cid(mm);
+			/* Required for execve() */
+			pcp->cid = t->mm_cid.cid;
+			return;
+		}
+
+		if (!sched_mm_cid_add_user(t, mm)) {
+			if (!mm->mm_cid.percpu)
+				t->mm_cid.cid = mm_get_cid(mm);
+			return;
+		}
+
+		/* Handle the mode change and transfer current's CID */
+		percpu = !!mm->mm_cid.percpu;
+		if (!percpu)
+			mm_cid_transit_to_task(current, pcp);
+		else
+			mm_cid_transfer_to_cpu(current, pcp);
+	}
+
+	if (percpu) {
+		mm_cid_fixup_tasks_to_cpus();
+	} else {
+		mm_cid_fixup_cpus_to_tasks(mm);
+		t->mm_cid.cid = mm_get_cid(mm);
 	}
 }
 
 static bool sched_mm_cid_remove_user(struct task_struct *t)
 {
 	t->mm_cid.active = 0;
-	mm_unset_cid_on_task(t);
+	scoped_guard(preempt) {
+		/* Clear the transition bit */
+		t->mm_cid.cid = cid_from_transit_cid(t->mm_cid.cid);
+		mm_unset_cid_on_task(t);
+	}
 	t->mm->mm_cid.users--;
 	return mm_update_max_cids(t->mm);
 }
 
+static bool __sched_mm_cid_exit(struct task_struct *t)
+{
+	struct mm_struct *mm = t->mm;
+
+	if (!sched_mm_cid_remove_user(t))
+		return false;
+	/*
+	 * Contrary to fork() this only deals with a switch back to per
+	 * task mode either because the above decreased users or an
+	 * affinity change increased the number of allowed CPUs and the
+	 * deferred fixup did not run yet.
+	 */
+	if (WARN_ON_ONCE(mm->mm_cid.percpu))
+		return false;
+	/*
+	 * A failed fork(2) cleanup never gets here, so @current must have
+	 * the same MM as @t. That's true for exit() and the failed
+	 * pthread_create() cleanup case.
+	 */
+	if (WARN_ON_ONCE(current->mm != mm))
+		return false;
+	return true;
+}
+
 /*
  * When a task exits, the MM CID held by the task is not longer required as
  * the task cannot return to user space.
@@ -10703,10 +10759,43 @@ void sched_mm_cid_exit(struct task_struct *t)
 
 	if (!mm || !t->mm_cid.active)
 		return;
+	/*
+	 * Ensure that only one instance is doing MM CID operations within
+	 * a MM. The common case is uncontended. The rare fixup case adds
+	 * some overhead.
+	 */
+	scoped_guard(mutex, &mm->mm_cid.mutex) {
+		/* mm_cid::mutex is sufficient to protect mm_cid::users */
+		if (likely(mm->mm_cid.users > 1)) {
+			scoped_guard(raw_spinlock_irq, &mm->mm_cid.lock) {
+				if (!__sched_mm_cid_exit(t))
+					return;
+				/* Mode change required. Transfer currents CID */
+				mm_cid_transit_to_task(current, this_cpu_ptr(mm->mm_cid.pcpu));
+			}
+			mm_cid_fixup_cpus_to_tasks(mm);
+			return;
+		}
+		/* Last user */
+		scoped_guard(raw_spinlock_irq, &mm->mm_cid.lock) {
+			/* Required across execve() */
+			if (t == current)
+				mm_cid_transit_to_task(t, this_cpu_ptr(mm->mm_cid.pcpu));
+			/* Ignore mode change. There is nothing to do. */
+			sched_mm_cid_remove_user(t);
+		}
+	}
 
-	guard(mutex)(&mm->mm_cid.mutex);
-	scoped_guard(raw_spinlock, &mm->mm_cid.lock)
-		sched_mm_cid_remove_user(t);
+	/*
+	 * As this is the last user (execve(), process exit or failed
+	 * fork(2)) there is no concurrency anymore.
+	 *
+	 * Synchronize eventually pending work to ensure that there are no
+	 * dangling references left. @t->mm_cid.users is zero so nothing
+	 * can queue this work anymore.
+	 */
+	irq_work_sync(&mm->mm_cid.irq_work);
+	cancel_work_sync(&mm->mm_cid.work);
 }
 
 /* Deactivate MM CID allocation across execve() */
@@ -10719,18 +10808,12 @@ void sched_mm_cid_before_execve(struct task_struct *t)
 void sched_mm_cid_after_execve(struct task_struct *t)
 {
 	sched_mm_cid_fork(t);
-	guard(preempt)();
-	mm_cid_select(t);
 }
 
 static void mm_cid_work_fn(struct work_struct *work)
 {
 	struct mm_struct *mm = container_of(work, struct mm_struct, mm_cid.work);
 
-	/* Make it compile, but not functional yet */
-	if (!IS_ENABLED(CONFIG_NEW_MM_CID))
-		return;
-
 	guard(mutex)(&mm->mm_cid.mutex);
 	/* Did the last user task exit already? */
 	if (!mm->mm_cid.users)
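sched_mm_cid_remove_user() above strips "the transition bit" with
cid_from_transit_cid() before dropping the CID. That helper comes from the
earlier patches in this series; conceptually it treats a CID in transition
as the plain CID value with a flag OR-ed in. The sketch below illustrates
that encoding only; the flag position and helper bodies are assumptions
for illustration, not the kernel's actual definitions:

  #include <assert.h>

  /* Hypothetical flag position marking a CID that is in transition */
  #define MM_CID_TRANSIT (1U << 31)

  static inline unsigned int cid_to_transit_cid(unsigned int cid)
  {
          return cid | MM_CID_TRANSIT;
  }

  static inline unsigned int cid_from_transit_cid(unsigned int cid)
  {
          return cid & ~MM_CID_TRANSIT;
  }

  int main(void)
  {
          unsigned int cid = 3;

          /* Marking and clearing the transition state is a pure bit flip */
          assert(cid_from_transit_cid(cid_to_transit_cid(cid)) == cid);
          assert(cid_from_transit_cid(cid) == cid);
          return 0;
  }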
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 82c7978..f9d0515 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3745,83 +3745,7 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct
 	mm_cid_schedin(next);
 }
 
-/* Active implementation */
-static inline void init_sched_mm_cid(struct task_struct *t)
-{
-	struct mm_struct *mm = t->mm;
-	unsigned int max_cid;
-
-	if (!mm)
-		return;
-
-	/* Preset last_mm_cid */
-	max_cid = min_t(int, READ_ONCE(mm->mm_cid.nr_cpus_allowed), atomic_read(&mm->mm_users));
-	t->mm_cid.last_cid = max_cid - 1;
-}
-
-static inline bool __mm_cid_get(struct task_struct *t, unsigned int cid, unsigned int max_cids)
-{
-	struct mm_struct *mm = t->mm;
-
-	if (cid >= max_cids)
-		return false;
-	if (test_and_set_bit(cid, mm_cidmask(mm)))
-		return false;
-	t->mm_cid.cid = t->mm_cid.last_cid = cid;
-	__this_cpu_write(mm->mm_cid.pcpu->cid, cid);
-	return true;
-}
-
-static inline bool mm_cid_get(struct task_struct *t)
-{
-	struct mm_struct *mm = t->mm;
-	unsigned int max_cids;
-
-	max_cids = READ_ONCE(mm->mm_cid.max_cids);
-
-	/* Try to reuse the last CID of this task */
-	if (__mm_cid_get(t, t->mm_cid.last_cid, max_cids))
-		return true;
-
-	/* Try to reuse the last CID of this mm on this CPU */
-	if (__mm_cid_get(t, __this_cpu_read(mm->mm_cid.pcpu->cid), max_cids))
-		return true;
-
-	/* Try the first zero bit in the cidmask. */
-	return __mm_cid_get(t, find_first_zero_bit(mm_cidmask(mm), num_possible_cpus()), max_cids);
-}
-
-static inline void mm_cid_select(struct task_struct *t)
-{
-	/*
-	 * mm_cid_get() can fail when the maximum CID, which is determined
-	 * by min(mm->nr_cpus_allowed, mm->mm_users) changes concurrently.
-	 * That's a transient failure as there cannot be more tasks
-	 * concurrently on a CPU (or about to be scheduled in) than that.
-	 */
-	for (;;) {
-		if (mm_cid_get(t))
-			break;
-	}
-}
-
-static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next)
-{
-	if (prev->mm_cid.active) {
-		if (prev->mm_cid.cid != MM_CID_UNSET)
-			clear_bit(prev->mm_cid.cid, mm_cidmask(prev->mm));
-		prev->mm_cid.cid = MM_CID_UNSET;
-	}
-
-	if (next->mm_cid.active) {
-		mm_cid_select(next);
-		rseq_sched_set_task_mm_cid(next, next->mm_cid.cid);
-	}
-}
-
 #else /* !CONFIG_SCHED_MM_CID: */
-static inline void mm_cid_select(struct task_struct *t) { }
-static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next) { }
 static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct *next) { }
 #endif /* !CONFIG_SCHED_MM_CID */