Date: Mon, 02 Feb 2026 10:39:35 +0100
Message-ID: <20260201192234.380608594@kernel.org>
User-Agent: quilt/0.68
From: Thomas Gleixner
To: LKML
Cc: Ihor Solodrai, Shrikanth Hegde, Peter Zijlstra, Mathieu Desnoyers,
 Michael Jeanson
Subject: [patch V2 0/4] sched/mmcid: Cure mode transition woes

This is a follow-up to the V1 submission:

  https://lore.kernel.org/20260129210219.452851594@kernel.org

Ihor and Shrikanth reported hard lockups which can be traced back to the
recent rewrite of the MM_CID management code:

 1) The task to CPU ownership transition lacks the intermediate
    transition mode, which can lead to CID pool exhaustion and a
    subsequent live lock.

    That intermediate mode was already implemented for the reverse
    operation, but was omitted for this transition because the original
    analysis missed a few possible scheduling scenarios.

 2) Weakly ordered architectures can observe inconsistent state, which
    causes them to make the wrong decision. That leads to the same
    problem as #1.

The following series addresses these issues and fixes another, albeit
harmless, inconsistent state hiccup which was found while analysing the
above issues. With these issues addressed, the last change optimizes the
bitmap utilization in the transition modes.
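For illustration, the resulting scheme condenses into the following
userspace sketch. This is not the kernel implementation: the bit values,
the helper names and the single mode variable are simplified stand-ins
for the mm_mm_cid machinery, and atomic_thread_fence() stands in for
smp_mb(). It merely shows how the TRANSIT bit bridges the two ownership
modes and where the barriers added by this series sit:

#include <stdatomic.h>
#include <stdio.h>

/* Bit values made up for this sketch */
#define MM_CID_ONCPU	0x1	/* CIDs are owned by CPUs */
#define MM_CID_TRANSIT	0x2	/* Ownership transfer in progress */

static _Atomic unsigned int mode;	/* Stand-in for mm->mm_cid.mode */

/*
 * Stand-in for the mode flip in mm_update_max_cids(): XORing with
 * (TRANSIT | ONCPU) turns task mode (0) into "in transit to CPU
 * ownership" and CPU mode (ONCPU) into "in transit to task ownership".
 */
static void start_transition(void)
{
	unsigned int m = atomic_load_explicit(&mode, memory_order_relaxed);

	atomic_store_explicit(&mode, m ^ (MM_CID_TRANSIT | MM_CID_ONCPU),
			      memory_order_relaxed);
	/*
	 * Full fence, mirroring the added smp_mb(): the mode store must
	 * be globally visible before the fixup walk starts acquiring
	 * runqueue locks, since a lock acquire does not order an earlier
	 * plain store on weakly ordered CPUs.
	 */
	atomic_thread_fence(memory_order_seq_cst);
}

/* Stand-in for mm_cid_complete_transit() */
static void complete_transition(unsigned int final_mode)
{
	/*
	 * Full fence before dropping the TRANSIT bit so that no CPU can
	 * observe the final mode while fixups are still in flight.
	 */
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&mode, final_mode, memory_order_relaxed);
}

/*
 * Stand-in for the cid_in_transit() check at schedule in: while the
 * TRANSIT bit is set, ownership must not be transferred; the task
 * keeps allocating/converging instead.
 */
static int transfer_allowed(void)
{
	return !(atomic_load_explicit(&mode, memory_order_relaxed) &
		 MM_CID_TRANSIT);
}

int main(void)
{
	start_transition();		/* task mode -> transit to CPU mode */
	printf("transfer allowed: %d\n", transfer_allowed());	/* 0 */
	/* ... fixup walk transferring CID ownership runs here ... */
	complete_transition(MM_CID_ONCPU);
	printf("transfer allowed: %d\n", transfer_allowed());	/* 1 */
	return 0;
}

Note that a store-release on the mode word would not be sufficient for
the first barrier: a release orders the accesses preceding the store,
but not the rq::lock acquires in the subsequent fixup walk, which is why
a full barrier is required there.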
The series applies on top of Linus' tree and passes the selftests and a
thread pool emulator which stress-tests the ownership transitions.

Changes vs. V1:

  - Move the mm_cid_fixup_tasks_to_cpus() wrapping where it belongs
    (patch 1)

  - Add barriers before and after the fixup functions to prevent CPU
    reordering of the mode stores - Mathieu

  - Update change logs - Mathieu

The delta patch against V1 is below.

Thanks,

	tglx
---
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -133,7 +133,6 @@ struct mm_cid_pcpu {
  * as that is modified by mmget()/mm_put() by other entities which
  * do not actually share the MM.
  * @pcpu_thrs: Threshold for switching back from per CPU mode
- * @mode_change: Mode change in progress
  * @update_deferred: A deferred switch back to per task mode is pending.
  */
 struct mm_mm_cid {
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10445,6 +10445,12 @@ static bool mm_update_max_cids(struct mm
 
 	/* Flip the mode and set the transition flag to bridge the transfer */
 	WRITE_ONCE(mc->mode, mc->mode ^ (MM_CID_TRANSIT | MM_CID_ONCPU));
+	/*
+	 * Order the store against the subsequent fixups so that
+	 * acquire(rq::lock) cannot be reordered by the CPU before the
+	 * store.
+	 */
+	smp_mb();
 	return true;
 }
 
@@ -10487,6 +10493,16 @@ static inline void mm_update_cpus_allowe
 		irq_work_queue(&mc->irq_work);
 }
 
+static inline void mm_cid_complete_transit(struct mm_struct *mm, unsigned int mode)
+{
+	/*
+	 * Ensure that the store removing the TRANSIT bit cannot be
+	 * reordered by the CPU before the fixups have been completed.
+	 */
+	smp_mb();
+	WRITE_ONCE(mm->mm_cid.mode, mode);
+}
+
 static inline void mm_cid_transit_to_task(struct task_struct *t, struct mm_cid_pcpu *pcp)
 {
 	if (cid_on_cpu(t->mm_cid.cid)) {
@@ -10530,8 +10546,7 @@ static void mm_cid_fixup_cpus_to_tasks(s
 			}
 		}
 	}
-	/* Clear the transition bit in the mode */
-	WRITE_ONCE(mm->mm_cid.mode, 0);
+	mm_cid_complete_transit(mm, 0);
 }
 
 static inline void mm_cid_transit_to_cpu(struct task_struct *t, struct mm_cid_pcpu *pcp)
@@ -10603,8 +10618,7 @@ static void mm_cid_fixup_tasks_to_cpus(v
 	struct mm_struct *mm = current->mm;
 
 	mm_cid_do_fixup_tasks_to_cpus(mm);
-	/* Clear the transition bit in the mode */
-	WRITE_ONCE(mm->mm_cid.mode, MM_CID_ONCPU);
+	mm_cid_complete_transit(mm, MM_CID_ONCPU);
 }
 
 static bool sched_mm_cid_add_user(struct task_struct *t, struct mm_struct *mm)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3914,8 +3914,7 @@ static __always_inline void mm_cid_sched
 
 	/*
 	 * If transition mode is done, transfer ownership when the CID is
-	 * within the convergion range. Otherwise the next schedule in will
-	 * have to allocate or converge
+	 * within the convergence range to optimize the next schedule in.
 	 */
 	if (!cid_in_transit(mode) && cid < READ_ONCE(mm->mm_cid.max_cids)) {
 		if (cid_on_cpu(mode))