Date: Mon, 02 Feb 2026 10:39:35 +0100
Message-ID: <20260201192234.380608594@kernel.org>
User-Agent: quilt/0.68
From: Thomas Gleixner
To: LKML
Cc: Ihor Solodrai, Shrikanth Hegde, Peter Zijlstra, Mathieu Desnoyers,
 Michael Jeanson
Subject: [patch V2 0/4] sched/mmcid: Cure mode transition woes

This is a follow-up to the V1 submission:

  https://lore.kernel.org/20260129210219.452851594@kernel.org

Ihor and Shrikanth reported hard lockups which can be traced back to the
recent rewrite of the MM_CID management code:

 1) The task to CPU ownership transition lacks the intermediate
    transition mode, which can lead to CID pool exhaustion and a
    subsequent live lock.

    That intermediate mode was already implemented for the reverse
    operation, but was omitted for this transition because the original
    analysis missed a few possible scheduling scenarios.

 2) Weakly ordered architectures can observe inconsistent state, which
    causes them to make the wrong decision. That leads to the same
    problem as #1.

The following series addresses these issues and fixes another, albeit
harmless, inconsistent state hiccup which was found while analysing the
above issues. With these issues addressed, the last change optimizes the
bitmap utilization in the transition modes.
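For illustration, the resulting scheme condenses into the following
userspace sketch. This is not the kernel implementation: the bit values,
the helper names and the single mode variable are simplified stand-ins
for the mm_mm_cid machinery, and atomic_thread_fence() stands in for
smp_mb(). It merely shows how the TRANSIT bit bridges the two ownership
modes and where the barriers added by this series sit:

#include <stdatomic.h>
#include <stdio.h>

/* Bit values made up for this sketch */
#define MM_CID_ONCPU	0x1	/* CIDs are owned by CPUs */
#define MM_CID_TRANSIT	0x2	/* Ownership transfer in progress */

static _Atomic unsigned int mode;	/* Stand-in for mm->mm_cid.mode */

/*
 * Stand-in for the mode flip in mm_update_max_cids(): XORing with
 * (TRANSIT | ONCPU) turns task mode (0) into "in transit to CPU
 * ownership" and CPU mode (ONCPU) into "in transit to task ownership".
 */
static void start_transition(void)
{
	unsigned int m = atomic_load_explicit(&mode, memory_order_relaxed);

	atomic_store_explicit(&mode, m ^ (MM_CID_TRANSIT | MM_CID_ONCPU),
			      memory_order_relaxed);
	/*
	 * Full fence, mirroring the added smp_mb(): the mode store must
	 * be globally visible before the fixup walk starts acquiring
	 * runqueue locks, since a lock acquire does not order an earlier
	 * plain store on weakly ordered CPUs.
	 */
	atomic_thread_fence(memory_order_seq_cst);
}

/* Stand-in for mm_cid_complete_transit() */
static void complete_transition(unsigned int final_mode)
{
	/*
	 * Full fence before dropping the TRANSIT bit so that no CPU can
	 * observe the final mode while fixups are still in flight.
	 */
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&mode, final_mode, memory_order_relaxed);
}

/*
 * Stand-in for the cid_in_transit() check at schedule in: while the
 * TRANSIT bit is set, ownership must not be transferred; the task
 * keeps allocating/converging instead.
 */
static int transfer_allowed(void)
{
	return !(atomic_load_explicit(&mode, memory_order_relaxed) &
		 MM_CID_TRANSIT);
}

int main(void)
{
	start_transition();		/* task mode -> transit to CPU mode */
	printf("transfer allowed: %d\n", transfer_allowed());	/* 0 */
	/* ... fixup walk transferring CID ownership runs here ... */
	complete_transition(MM_CID_ONCPU);
	printf("transfer allowed: %d\n", transfer_allowed());	/* 1 */
	return 0;
}

Note that a store-release on the mode word would not be sufficient for
the first barrier: a release orders the accesses preceding the store,
but not the rq::lock acquires in the subsequent fixup walk, which is why
a full barrier is required there.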
The series applies on top of Linus' tree and passes the selftests and a
thread pool emulator which stress-tests the ownership transitions.

Changes vs. V1:

  - Move the mm_cid_fixup_tasks_to_cpus() wrapping where it belongs
    (patch 1)

  - Add barriers before and after the fixup functions to prevent CPU
    reordering of the mode stores - Mathieu

  - Update change logs - Mathieu

The delta patch against V1 is below.

Thanks,

	tglx
---
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -133,7 +133,6 @@ struct mm_cid_pcpu {
  * as that is modified by mmget()/mm_put() by other entities which
  * do not actually share the MM.
  * @pcpu_thrs: Threshold for switching back from per CPU mode
- * @mode_change: Mode change in progress
  * @update_deferred: A deferred switch back to per task mode is pending.
  */
 struct mm_mm_cid {
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10445,6 +10445,12 @@ static bool mm_update_max_cids(struct mm
 
 	/* Flip the mode and set the transition flag to bridge the transfer */
 	WRITE_ONCE(mc->mode, mc->mode ^ (MM_CID_TRANSIT | MM_CID_ONCPU));
+	/*
+	 * Order the store against the subsequent fixups so that
+	 * acquire(rq::lock) cannot be reordered by the CPU before the
+	 * store.
+	 */
+	smp_mb();
 	return true;
 }
 
@@ -10487,6 +10493,16 @@ static inline void mm_update_cpus_allowe
 		irq_work_queue(&mc->irq_work);
 }
 
+static inline void mm_cid_complete_transit(struct mm_struct *mm, unsigned int mode)
+{
+	/*
+	 * Ensure that the store removing the TRANSIT bit cannot be
+	 * reordered by the CPU before the fixups have been completed.
+	 */
+	smp_mb();
+	WRITE_ONCE(mm->mm_cid.mode, mode);
+}
+
 static inline void mm_cid_transit_to_task(struct task_struct *t, struct mm_cid_pcpu *pcp)
 {
 	if (cid_on_cpu(t->mm_cid.cid)) {
@@ -10530,8 +10546,7 @@ static void mm_cid_fixup_cpus_to_tasks(s
 			}
 		}
 	}
-	/* Clear the transition bit in the mode */
-	WRITE_ONCE(mm->mm_cid.mode, 0);
+	mm_cid_complete_transit(mm, 0);
 }
 
 static inline void mm_cid_transit_to_cpu(struct task_struct *t, struct mm_cid_pcpu *pcp)
@@ -10603,8 +10618,7 @@ static void mm_cid_fixup_tasks_to_cpus(v
 	struct mm_struct *mm = current->mm;
 
 	mm_cid_do_fixup_tasks_to_cpus(mm);
-	/* Clear the transition bit in the mode */
-	WRITE_ONCE(mm->mm_cid.mode, MM_CID_ONCPU);
+	mm_cid_complete_transit(mm, MM_CID_ONCPU);
 }
 
 static bool sched_mm_cid_add_user(struct task_struct *t, struct mm_struct *mm)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3914,8 +3914,7 @@ static __always_inline void mm_cid_sched
 
 	/*
 	 * If transition mode is done, transfer ownership when the CID is
-	 * within the convergion range. Otherwise the next schedule in will
-	 * have to allocate or converge
+	 * within the convergence range to optimize the next schedule in.
 	 */
 	if (!cid_in_transit(mode) && cid < READ_ONCE(mm->mm_cid.max_cids)) {
 		if (cid_on_cpu(mode))