Date: Mon, 23 Feb 2026 17:38:43 +0100
From: Peter Zijlstra
To: linux-kernel@vger.kernel.org, Thomas Gleixner, mathieu.desnoyers@efficios.com, Mark Rutland, cmarinas@kernel.org, maddy@linux.ibm.com, hca@linux.ibm.com
Cc: ryan.roberts@arm.com
Subject: [RFC] in-kernel rseq
Message-ID: <20260223163843.GR1282955@noisy.programming.kicks-ass.net>

Hi,

It has come to my attention that various people are struggling with
preempt_disable()+preempt_enable() costs on various architectures,
mostly in relation to things like this_cpu_* and/or local_*.

The below is a very crude (and broken, more on that below) POC.

The 'main' advantage of this over preempt_disable()/preempt_enable() is
on the preempt_enable() side: it elides the whole conditional and call
to schedule() nonsense.

Now, on to the broken part: the 'commit' address below should be the
address of the store instruction. In the LL/SC case it should be the
SC; in the LSE case it should be the LSE instruction.
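For the LL/SC flavour, a rough sketch of where that commit address would have to land (illustrative, untested arm64 asm; the register choices and the tpidr_el1 per-cpu offset scheme are assumptions for illustration, not part of the patch):

```
__rseq_begin:
	mrs	x1, tpidr_el1		// this CPU's per-cpu offset
	add	x0, x0, x1		// compute the per-cpu address
	ldxr	x2, [x0]		// load-exclusive
	add	x2, x2, x3		// $OP
__rseq_commit:
	stxr	w4, x2, [x0]		// the SC: this is the 'commit' address
	cbnz	w4, __rseq_begin	// exclusive lost, retry
```

Being preempted anywhere before the stxr leaves the instruction pointer inside [__rseq_begin, __rseq_commit), so the restart logic rewinds it; once the stxr has executed, the operation is committed and must not be replayed.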
This means it needs to be woven into the asm... and I'm not that handy
with arm64 asm. The pseudo-code would be something like:

	current->sched_rseq = &_R;
	...
_start:
	compute per-cpu addr
	load addr
	$OP
_commit:
	store addr
	...
	current->sched_rseq = NULL;

Then, when preemption happens (from an interrupt), the instruction
pointer is 'simply' reset to _start and it tries again.

Anyway, this was aimed at arm64, which chose to use atomics for
this_cpu. But if we move sched_rseq() from the schedule tail into
interrupt entry, then this would also work for things like Power.

Anyway, just throwing ideas out there.

diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
index b57b2bb00967..080a868391b7 100644
--- a/arch/arm64/include/asm/percpu.h
+++ b/arch/arm64/include/asm/percpu.h
@@ -11,6 +11,7 @@
 #include
 #include
 #include
+#include

 static inline void set_my_cpu_offset(unsigned long off)
 {
@@ -155,9 +156,23 @@ PERCPU_RET_OP(add, add, ldadd)

 #define _pcp_protect(op, pcp, ...)					\
 ({									\
-	preempt_disable_notrace();					\
+	__label__ __rseq_begin;						\
+	__label__ __rseq_end;						\
+	static struct sched_rseq _R = {					\
+		.begin   = (unsigned long)&&__rseq_begin,		\
+		.commit  = (unsigned long)&&__rseq_end,			\
+		.restart = (unsigned long)&&__rseq_begin,		\
+	};								\
+	struct sched_rseq **this_rseq;					\
+	asm ("mrs %0, sp_el0; add %0, %0, %1;" : "=r" (this_rseq) : "i" (TSK_rseq)); \
+	*this_rseq = &_R;						\
+__rseq_begin:								\
+	barrier();							\
 	op(raw_cpu_ptr(&(pcp)), __VA_ARGS__);				\
-	preempt_enable_notrace();					\
+	/* XXX broken */						\
+	barrier();							\
+__rseq_end:								\
+	*this_rseq = NULL;						\
 })

 #define _pcp_protect_return(op, pcp, args...)				\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a7b4a980eb2f..7960f3e21104 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -817,6 +817,12 @@ struct kmap_ctrl {
 #endif
 };

+struct sched_rseq {
+	unsigned long begin;
+	unsigned long commit;
+	unsigned long restart;
+};
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/*
@@ -827,6 +833,8 @@ struct task_struct {
 #endif
 	unsigned int			__state;

+	struct sched_rseq		*sched_rseq;
+
 	/* saved state for "spinlock sleepers" */
 	unsigned int			saved_state;

diff --git a/include/linux/sched/rseq.h b/include/linux/sched/rseq.h
deleted file mode 100644
index e69de29bb2d1..000000000000
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bfd280ec0f97..d4702f8590f2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5087,6 +5087,23 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 	prepare_arch_switch(next);
 }

+static inline void sched_rseq(struct task_struct *prev)
+{
+	struct sched_rseq *rseq = prev->sched_rseq;
+	struct pt_regs *regs;
+	unsigned long ip;
+
+	if (likely(!rseq))
+		return;
+
+	regs = task_pt_regs(prev);
+	ip = instruction_pointer(regs);
+	if ((ip - rseq->begin) >= (rseq->commit - rseq->begin))
+		return;
+
+	instruction_pointer_set(regs, rseq->restart);
+}
+
 /**
  * finish_task_switch - clean up after a task-switch
  * @prev: the thread we just switched away from.
@@ -5145,6 +5162,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	prev_state = READ_ONCE(prev->__state);
 	vtime_task_switch(prev);
 	perf_event_task_sched_in(prev, current);
+	sched_rseq(prev);
 	finish_task(prev);
 	tick_nohz_task_switch();
 	finish_lock_switch(rq);
diff --git a/kernel/sched/rq-offsets.c b/kernel/sched/rq-offsets.c
index a23747bbe25b..629989a89395 100644
--- a/kernel/sched/rq-offsets.c
+++ b/kernel/sched/rq-offsets.c
@@ -6,7 +6,8 @@

 int main(void)
 {
-	DEFINE(RQ_nr_pinned, offsetof(struct rq, nr_pinned));
+	DEFINE(RQ_nr_pinned, offsetof(struct rq, nr_pinned));
+	DEFINE(TSK_rseq, offsetof(struct task_struct, sched_rseq));

 	return 0;
 }