From nobody Wed Sep 10 01:55:31 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 64E212EE616; Mon, 8 Sep 2025 22:59:53 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372395; cv=none; b=IFZl6GlBd6Seov2j3uzwOC2yoNeuchUakvOXeSZRjLsLoz4pVjnQWPs3rKZgz0ARytVU3i4dN3q40aLa7p+ybLD7RlLkkCDD/AZoEuCl50yJ+qAnH/cKbKf5xT7ULux4yENZ4PnaxQvp3kJMutsU6WeHdTbZPAfuk0p9+4pr+gY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372395; c=relaxed/simple; bh=nuEd8q2e/2UTJeCidCdWb3B3VX+fi9Ma57SLNZGlXGA=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=iQ29LI6VhHeRgMRHTGRX+Rzm6kDjYCJbI3OvfODaMG6f8H1uv5uCf0R5Xp/b2ZVDIoX6ScXMX1NcrJMyXnNFgaEoZrI76u1KEmHZ8AkozdPiUh9c6IGbpwWbqWpxyAT4dTJYJTqh9KjNuFLHIsBFYpVl0CFYLnltXFEuM8rw4n8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=AEX2IPU3; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=5QomZ9HI; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="AEX2IPU3"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="5QomZ9HI" Message-ID: <20250908225752.614755671@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; 
d=linutronix.de; s=2020; t=1757372390; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=4yBlIy9tMVyvyFOf1vnGwV93nPQRJZQKPkBTjqocvGE=; b=AEX2IPU3GjqdPrR4D8HabfX0XgWLI4iSyULB9MI0F2HpuUFe1ckBnjAIQg2ln6mhP1vTnJ BhAIwcnRUiUfKHcAytPjr4IEIqPewcGh2oNV7EUzNKP8cS/rNYuJYIu+WECTneBYvxqF21 YzSiLdbukZXnnj3TsX+18Hi70/44z8zc7heiAYKdggIWNIMfLNKLspeQysOxwhhWvzjMsa x2fwUocJHGvlUL2MDGydDssf3Qm70631hHFhYljkp3OtWjlXgHq6T4dt+T4/ON9SCWC+Zs 2cb5e2AoVY33knp66KohWkCt7P2USQSdS313u+Ov6kKJkdhunnLHmnPFgWkrLQ== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1757372390; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=4yBlIy9tMVyvyFOf1vnGwV93nPQRJZQKPkBTjqocvGE=; b=5QomZ9HISLWwXXH9kA8VUhoA6pLcGT5KK5vWukPDEIi50dwBmB67MQ7KmZyPmuTqypvDDx CN7MzAwMCGHaHfCA== From: Thomas Gleixner To: LKML Cc: Peter Zilstra , Peter Zijlstra , Mathieu Desnoyers , "Paul E. McKenney" , Boqun Feng , Jonathan Corbet , Prakash Sangappa , Madadi Vineeth Reddy , K Prateek Nayak , Steven Rostedt , Sebastian Andrzej Siewior , Arnd Bergmann , linux-arch@vger.kernel.org Subject: [patch 01/12] sched: Provide and use set_need_resched_current() References: <20250908225709.144709889@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Tue, 9 Sep 2025 00:59:49 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" set_tsk_need_resched(current) requires set_preempt_need_resched(current) to work correctly outside of the scheduler. Provide set_need_resched_current() which wraps this correctly and replace all the open coded instances. 
Signed-off-by: Peter Zilstra Signed-off-by: Thomas Gleixner --- arch/s390/mm/pfault.c | 3 +-- include/linux/sched.h | 7 +++++++ kernel/rcu/tiny.c | 8 +++----- kernel/rcu/tree.c | 14 +++++--------- kernel/rcu/tree_exp.h | 3 +-- kernel/rcu/tree_plugin.h | 9 +++------ kernel/rcu/tree_stall.h | 3 +-- 7 files changed, 21 insertions(+), 26 deletions(-) --- a/arch/s390/mm/pfault.c +++ b/arch/s390/mm/pfault.c @@ -199,8 +199,7 @@ static void pfault_interrupt(struct ext_ * return to userspace schedule() to block. */ __set_current_state(TASK_UNINTERRUPTIBLE); - set_tsk_need_resched(tsk); - set_preempt_need_resched(); + set_need_resched_current(); } } out: --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2034,6 +2034,13 @@ static inline int test_tsk_need_resched( return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED)); } =20 +static inline void set_need_resched_current(void) +{ + lockdep_assert_irqs_disabled(); + set_tsk_need_resched(current); + set_preempt_need_resched(); +} + /* * cond_resched() and cond_resched_lock(): latency reduction via * explicit rescheduling in places that are safe. The return --- a/kernel/rcu/tiny.c +++ b/kernel/rcu/tiny.c @@ -70,12 +70,10 @@ void rcu_qs(void) */ void rcu_sched_clock_irq(int user) { - if (user) { + if (user) rcu_qs(); - } else if (rcu_ctrlblk.donetail !=3D rcu_ctrlblk.curtail) { - set_tsk_need_resched(current); - set_preempt_need_resched(); - } + else if (rcu_ctrlblk.donetail !=3D rcu_ctrlblk.curtail) + set_need_resched_current(); } =20 /* --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2696,10 +2696,8 @@ void rcu_sched_clock_irq(int user) /* The load-acquire pairs with the store-release setting to true. */ if (smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs))) { /* Idle and userspace execution already are quiescent states. 
*/ - if (!rcu_is_cpu_rrupt_from_idle() && !user) { - set_tsk_need_resched(current); - set_preempt_need_resched(); - } + if (!rcu_is_cpu_rrupt_from_idle() && !user) + set_need_resched_current(); __this_cpu_write(rcu_data.rcu_urgent_qs, false); } rcu_flavor_sched_clock_irq(user); @@ -2824,7 +2822,6 @@ static void strict_work_handler(struct w /* Perform RCU core processing work for the current CPU. */ static __latent_entropy void rcu_core(void) { - unsigned long flags; struct rcu_data *rdp =3D raw_cpu_ptr(&rcu_data); struct rcu_node *rnp =3D rdp->mynode; =20 @@ -2837,8 +2834,8 @@ static __latent_entropy void rcu_core(vo if (IS_ENABLED(CONFIG_PREEMPT_COUNT) && (!(preempt_count() & PREEMPT_MASK= ))) { rcu_preempt_deferred_qs(current); } else if (rcu_preempt_need_deferred_qs(current)) { - set_tsk_need_resched(current); - set_preempt_need_resched(); + guard(irqsave)(); + set_need_resched_current(); } =20 /* Update RCU state based on any recent quiescent states. */ @@ -2847,10 +2844,9 @@ static __latent_entropy void rcu_core(vo /* No grace period and unregistered callbacks? */ if (!rcu_gp_in_progress() && rcu_segcblist_is_enabled(&rdp->cblist) && !rcu_rdp_is_offloaded(rdp))= { - local_irq_save(flags); + guard(irqsave)(); if (!rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL)) rcu_accelerate_cbs_unlocked(rnp, rdp); - local_irq_restore(flags); } =20 rcu_check_gp_start_stall(rnp, rdp, rcu_jiffies_till_stall_check()); --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -729,8 +729,7 @@ static void rcu_exp_need_qs(void) __this_cpu_write(rcu_data.cpu_no_qs.b.exp, true); /* Store .exp before .rcu_urgent_qs. 
*/ smp_store_release(this_cpu_ptr(&rcu_data.rcu_urgent_qs), true); - set_tsk_need_resched(current); - set_preempt_need_resched(); + set_need_resched_current(); } =20 #ifdef CONFIG_PREEMPT_RCU --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -756,8 +756,7 @@ static void rcu_read_unlock_special(stru // Also if no expediting and no possible deboosting, // slow is OK. Plus nohz_full CPUs eventually get // tick enabled. - set_tsk_need_resched(current); - set_preempt_need_resched(); + set_need_resched_current(); if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled && needs_exp && rdp->defer_qs_iw_pending !=3D DEFER_QS_PENDING && cpu_online(rdp->cpu)) { @@ -818,10 +817,8 @@ static void rcu_flavor_sched_clock_irq(i if (rcu_preempt_depth() > 0 || (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) { /* No QS, force context switch if deferred. */ - if (rcu_preempt_need_deferred_qs(t)) { - set_tsk_need_resched(t); - set_preempt_need_resched(); - } + if (rcu_preempt_need_deferred_qs(t)) + set_need_resched_current(); } else if (rcu_preempt_need_deferred_qs(t)) { rcu_preempt_deferred_qs(t); /* Report deferred QS. */ return; --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -763,8 +763,7 @@ static void print_cpu_stall(unsigned lon * progress and it could be we're stuck in kernel space without context * switches for an entirely unreasonable amount of time. 
*/ - set_tsk_need_resched(current); - set_preempt_need_resched(); + set_need_resched_current(); } =20 static bool csd_lock_suppress_rcu_stall; From nobody Wed Sep 10 01:55:31 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 963A82F28E3; Mon, 8 Sep 2025 22:59:54 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372397; cv=none; b=B4YGYObXyUIKDuXTStzxCwV7Bs5Q/yO/h//u/3SNXFJBN5l2GrRbB8cIWhd5115XRPJiNtahopupG4JVhVrhyIDp5b2ssrsyWfXy5XKIDMK58uuF9+YY7r0ONzl092nG/B/Ry3VuLSx7fYjDK3jOaTv4bjiCgkWUNW4/rr0nfeU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372397; c=relaxed/simple; bh=sJUvIkOk7xRsZ8N1md77VtwlcW+xrqBYELwzAD9L3kc=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=RhZkxJ95j5Fi0LBy3dJaxQnTMTI/vTACppeExLM8YxhnqMmDLby+jkULrp0qlm3y/g7MZCvFa7P4nTxWdygAHd9Xc7SqsmM1RNz2NZhYQOSyOQy/i0rt19zhnrRDL1CaENFIOGPwv3fTvQNXtEJEr78ZES/ZUqM1gRd80ucS8mY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=tRUuKTO4; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=DYZJ+vQK; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="tRUuKTO4"; dkim=permerror (0-bit key) header.d=linutronix.de 
header.i=@linutronix.de header.b="DYZJ+vQK" Message-ID: <20250908225752.679815003@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1757372392; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=uJbTU9ygKfI6/zvyWU+Vow5hjTzjort0MMfsywsVYMk=; b=tRUuKTO4XbAvXw+t88yT6uoUK8blCFEDcriT2OWhG3/G3zQ/Dw5CDcNNBJJ/kjQdCwS1T+ 0Sza16lnAZWRl1HodpDm9sMKYRbDMRIAdUAHM92Uq9I0J99n8AgfksKLvyjrKoXyQXF/Em sv3gYiet7TiwvEGnMPkivyaHmPzaOPbhLqysau7af2Icahzn+e/vO8xG8OwFzBP1nrsobG KtfjZm+eN8cxx9YOYKthicJ+bKClpVVHHiSmzYP4UeAs9KVPM3+Zi+GRkAxUvikUwWhnIx zpB/559RDzx6p8I5aghRtGQ2YerU0IpsFqM+rMD0q0olPJ5yFPbC2scV2e0ytw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1757372392; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=uJbTU9ygKfI6/zvyWU+Vow5hjTzjort0MMfsywsVYMk=; b=DYZJ+vQKS75bOERhbPJx6LFHgjRAmaPzCDluZ5ObcXRVy7/8hgJhCTMll1mh47pEsnKwCb 5Js/8CGfj5FCDNBw== From: Thomas Gleixner To: LKML Cc: Peter Zijlstra , Mathieu Desnoyers , "Paul E. McKenney" , Boqun Feng , Jonathan Corbet , Prakash Sangappa , Madadi Vineeth Reddy , K Prateek Nayak , Steven Rostedt , Sebastian Andrzej Siewior , Peter Zilstra , Arnd Bergmann , linux-arch@vger.kernel.org Subject: [patch 02/12] rseq: Add fields and constants for time slice extension References: <20250908225709.144709889@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Tue, 9 Sep 2025 00:59:51 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Aside of a Kconfig knob add the following items: - Two flag bits for the rseq user space ABI, which allow user space to query the availability and enablement without a syscall. 
- A new member to the user space ABI struct rseq, which is going to be used to communicate request and grant between kernel and user space. - A rseq state struct to hold the kernel state of this mechanism - Documentation of the new mechanism Signed-off-by: Thomas Gleixner Cc: Peter Zijlstra Cc: Mathieu Desnoyers Cc: "Paul E. McKenney" Cc: Boqun Feng Cc: Jonathan Corbet Cc: Prakash Sangappa Cc: Madadi Vineeth Reddy Cc: K Prateek Nayak Cc: Steven Rostedt Cc: Sebastian Andrzej Siewior --- Documentation/userspace-api/index.rst | 1=20 Documentation/userspace-api/rseq.rst | 129 +++++++++++++++++++++++++++++= +++++ include/linux/rseq_types.h | 26 ++++++ include/uapi/linux/rseq.h | 28 +++++++ init/Kconfig | 12 +++ kernel/rseq.c | 8 ++ 6 files changed, 204 insertions(+) --- a/Documentation/userspace-api/index.rst +++ b/Documentation/userspace-api/index.rst @@ -21,6 +21,7 @@ System calls ebpf/index ioctl/index mseal + rseq =20 Security-related interfaces =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D --- /dev/null +++ b/Documentation/userspace-api/rseq.rst @@ -0,0 +1,129 @@ +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D +Restartable Sequences +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D + +Restartable Sequences allow registering a per thread userspace memory area +to be used as an ABI between kernel and user-space for three purposes: + + * user-space restartable sequences + + * quick access to read the current CPU number, node ID from user-space + + * scheduler time slice extensions + +Restartable sequences (per-cpu atomics) +--------------------------------------- + +Restartable sequences allow user-space to perform update operations on +per-cpu data without requiring heavy-weight atomic operations. The actual +ABI is unfortunately only available in the code and selftests. + +Quick access to CPU number, node ID +----------------------------------- + +Allows implementing per CPU data efficiently. 
Documentation is in code and +selftests. :( + +Scheduler time slice extensions +------------------------------- + +This allows a thread to request a time slice extension when it enters a +critical section to avoid contention on a resource when the thread is +scheduled out inside of the critical section. + +The prerequisites for this functionality are: + + * Enabled in Kconfig + + * Enabled at boot time (default is enabled) + + * A rseq user space pointer has been registered for the thread + +The thread has to enable the functionality via prctl(2):: + + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET, + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0); + +prctl() returns 0 on success and otherwise with the following error codes: + +=3D=3D=3D=3D=3D=3D=3D=3D=3D =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D +Errorcode Meaning +=3D=3D=3D=3D=3D=3D=3D=3D=3D =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D +EINVAL Functionality not available or invalid function arguments. + Note: arg4 and arg5 must be zero +ENOTSUPP Functionality was disabled on the kernel command line +ENXIO Available, but no rseq user struct registered +=3D=3D=3D=3D=3D=3D=3D=3D=3D =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D + +The state can be also queried via prctl(2):: + + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0); + +prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if +disabled. 
Otherwise it returns with the following error codes: + +=3D=3D=3D=3D=3D=3D=3D=3D=3D =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D +Errorcode Meaning +=3D=3D=3D=3D=3D=3D=3D=3D=3D =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D +EINVAL Functionality not available or invalid function arguments. + Note: arg3, arg4 and arg5 must be zero +=3D=3D=3D=3D=3D=3D=3D=3D=3D =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D + +The availability and status are also exposed via the rseq ABI struct flags +field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the +``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read only for user +space and only for informational purposes. + +If the mechanism was enabled via prctl(), the thread can request a time +slice extension by setting the ``RSEQ_SLICE_EXT_REQUEST_BIT`` in the struct +rseq slice_ctrl field. If the thread is interrupted and the interrupt +results in a reschedule request in the kernel, then the kernel can grant a +time slice extension and return to user space instead of scheduling +out. + +The kernel indicates the grant by clearing ``RSEQ_SLICE_EXT_REQUEST_BIT`` +and setting ``RSEQ_SLICE_EXT_GRANTED_BIT`` in the rseq::slice_ctrl +field. If there is a reschedule of the thread after granting the extension, +the kernel clears the granted bit to indicate that to user space. + +If the request bit is still set when leaving the critical section, user +space can clear it and continue. 
+ +If the granted bit is set, then user space has to invoke rseq_slice_yield() +when leaving the critical section to relinquish the CPU. The kernel +enforces this by arming a timer to prevent misbehaving user space from +abusing this mechanism. + +If both the request bit and the granted bit are clear when leaving the +critical section, then this indicates that a grant was revoked and no +further action is required by user space. + +The required code flow is as follows:: + + rseq->slice_ctrl =3D REQUEST; + critical_section(); + if (!local_test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) { + if (rseq->slice_ctrl & GRANTED) + rseq_slice_yield(); + } + +local_test_and_clear_bit() has to be local CPU atomic to prevent the +obvious RMW race versus an interrupt. On X86 this can be achieved with BTRL +without LOCK prefix. On architectures which do not provide lightweight CPU +local atomics, this needs to be implemented with regular atomic operations. + +Setting REQUEST has no atomicity requirements as there is no concurrency +vs. the GRANTED bit. + +Checking the GRANTED bit has no atomicity requirements as there is obviously a +race which cannot be avoided at all:: + + if (rseq->slice_ctrl & GRANTED) + -> Interrupt results in schedule and grant revocation + rseq_slice_yield(); + +So there is no point in pretending that this might be solved by an atomic +operation. + +The kernel enforces flag consistency and terminates the thread with SIGSEGV +if it detects a violation. 
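A minimal user-space sketch of the code flow documented in rseq.rst above, for illustration only and not part of the patch. It assumes a plain variable as a stand-in for rseq::slice_ctrl, a hypothetical stub rseq_slice_yield_stub() in place of the real yield, and a hypothetical helper simulate_kernel_grant() that mimics the grant as the documentation describes it. The test-and-clear uses a regular C11 atomic, which is the fallback the documentation names for architectures without lightweight CPU local atomics:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit masks as defined in include/uapi/linux/rseq.h by this patch. */
#define RSEQ_SLICE_EXT_REQUEST	(1U << 0)
#define RSEQ_SLICE_EXT_GRANTED	(1U << 1)

/* Assumption: stand-in for rseq::slice_ctrl in the registered rseq area. */
static _Atomic uint32_t slice_ctrl;

/* Hypothetical stub: counts invocations instead of relinquishing the CPU. */
static unsigned int yield_calls;

static void rseq_slice_yield_stub(void)
{
	yield_calls++;
}

/*
 * Portable fallback for local_test_and_clear_bit(): a regular atomic RMW.
 * Returns true if the REQUEST bit was still set before it was cleared.
 */
static bool test_and_clear_request(_Atomic uint32_t *ctrl)
{
	uint32_t old = atomic_fetch_and(ctrl, ~RSEQ_SLICE_EXT_REQUEST);

	return (old & RSEQ_SLICE_EXT_REQUEST) != 0;
}

/* Setting REQUEST has no atomicity requirement, a relaxed store suffices. */
static void slice_enter(void)
{
	atomic_store_explicit(&slice_ctrl, RSEQ_SLICE_EXT_REQUEST,
			      memory_order_relaxed);
}

/* Leave the critical section: yield only if the extension was granted. */
static void slice_leave(void)
{
	if (!test_and_clear_request(&slice_ctrl)) {
		if (atomic_load_explicit(&slice_ctrl, memory_order_relaxed) &
		    RSEQ_SLICE_EXT_GRANTED)
			rseq_slice_yield_stub();
	}
}

/* Hypothetical: mimic the documented grant (REQUEST cleared, GRANTED set). */
static void simulate_kernel_grant(void)
{
	atomic_store(&slice_ctrl, RSEQ_SLICE_EXT_GRANTED);
}
```

Calling slice_enter() followed by slice_leave() without a grant leaves the control word at zero and never yields. With simulate_kernel_grant() in between, slice_leave() invokes the stub exactly once.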
--- a/include/linux/rseq_types.h +++ b/include/linux/rseq_types.h @@ -71,12 +71,35 @@ struct rseq_ids { }; =20 /** + * union rseq_slice_state - Status information for rseq time slice extensi= on + * @state: Compound to access the overall state + * @enabled: Time slice extension is enabled for the task + * @granted: Time slice extension was granted to the task + */ +union rseq_slice_state { + u16 state; + struct { + u8 enabled; + u8 granted; + }; +}; + +/** + * struct rseq_slice - Status information for rseq time slice extension + * @state: Time slice extension state + */ +struct rseq_slice { + union rseq_slice_state state; +}; + +/** * struct rseq_data - Storage for all rseq related data * @usrptr: Pointer to the registered user space RSEQ memory * @len: Length of the RSEQ region * @sig: Signature of critial section abort IPs * @event: Storage for event management * @ids: Storage for cached CPU ID and MM CID + * @slice: Storage for time slice extension data */ struct rseq_data { struct rseq __user *usrptr; @@ -84,6 +107,9 @@ struct rseq_data { u32 sig; struct rseq_event event; struct rseq_ids ids; +#ifdef CONFIG_RSEQ_SLICE_EXTENSION + struct rseq_slice slice; +#endif }; =20 #else /* CONFIG_RSEQ */ --- a/include/uapi/linux/rseq.h +++ b/include/uapi/linux/rseq.h @@ -23,9 +23,15 @@ enum rseq_flags { }; =20 enum rseq_cs_flags_bit { + /* Historical and unsupported bits */ RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT =3D 0, RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT =3D 1, RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT =3D 2, + /* (3) Intentional gap to put new bits into a separate byte */ + + /* User read only feature flags */ + RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT =3D 4, + RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT =3D 5, }; =20 enum rseq_cs_flags { @@ -35,6 +41,22 @@ enum rseq_cs_flags { (1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT), RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =3D (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT), + + RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE =3D + (1U << 
RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT), + RSEQ_CS_FLAG_SLICE_EXT_ENABLED =3D + (1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT), +}; + +enum rseq_slice_bits { + /* Time slice extension ABI bits */ + RSEQ_SLICE_EXT_REQUEST_BIT =3D 0, + RSEQ_SLICE_EXT_GRANTED_BIT =3D 1, +}; + +enum rseq_slice_masks { + RSEQ_SLICE_EXT_REQUEST =3D (1U << RSEQ_SLICE_EXT_REQUEST_BIT), + RSEQ_SLICE_EXT_GRANTED =3D (1U << RSEQ_SLICE_EXT_GRANTED_BIT), }; =20 /* @@ -142,6 +164,12 @@ struct rseq { __u32 mm_cid; =20 /* + * Time slice extension control word. CPU local atomic updates from + * kernel and user space. + */ + __u32 slice_ctrl; + + /* * Flexible array member at end of structure, after last feature field. */ char end[]; --- a/init/Kconfig +++ b/init/Kconfig @@ -1908,6 +1908,18 @@ config RSEQ_DEBUG_DEFAULT_ENABLE =20 If unsure, say N. =20 +config RSEQ_SLICE_EXTENSION + bool "Enable rseq based time slice extension mechanism" + depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_B= ITS + help + Allows userspace to request a limited time slice extension when + returning from an interrupt to user space via the RSEQ shared + data ABI. If granted, that allows to complete a critical section, + so that other threads are not stuck on a conflicted resource, + while the task is scheduled out. + + If unsure, say N. 
+ config DEBUG_RSEQ default n bool "Enable debugging of rseq() system call" if EXPERT --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -387,6 +387,8 @@ static bool rseq_reset_ids(void) */ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flag= s, u32, sig) { + u32 rseqfl =3D 0; + if (flags & RSEQ_FLAG_UNREGISTER) { if (flags & ~RSEQ_FLAG_UNREGISTER) return -EINVAL; @@ -448,6 +450,12 @@ SYSCALL_DEFINE4(rseq, struct rseq __user if (put_user_masked_u64(0UL, &rseq->rseq_cs)) return -EFAULT; =20 + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) + rseqfl |=3D RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE; + + if (put_user_masked_u32(rseqfl, &rseq->flags)) + return -EFAULT; + /* * Activate the registration by setting the rseq area address, length * and signature in the task struct. From nobody Wed Sep 10 01:55:31 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 754E930ACF9; Mon, 8 Sep 2025 22:59:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372399; cv=none; b=oen3iYbUI1mVM/H8OhxytwRkPF9NRUFE3A5reBUkTChUbhgsyzhBkgYPUttLCZBiM4fDBwzkOOvsIbnfV0AnL9SI0j7M7cW0SVvxqjxzD9hufOLjS5yT2TdqYeo1yB8XHBXzTdwduR/YbrHfMMebNTOAxP9NFqkrRDNCGvTBOQM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372399; c=relaxed/simple; bh=Zk+7GUelBL5L11XtNETdSZvQBonABXAXwyKF4YXotDo=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=H48eZTIenrtWgC/6JEfR8b196oE2heH6mHvIpm8QzV2KSTGsB0IRvfedPCq8o2ZCb89hdHHeoJxVZPpuxEN6bFUy5oLX/9mKNNPAB57gF5a9P6/1++0tnV+P6JkK1ZXmpdZkxG+7hyXqtgOLQ3lU5SGLH+2ciWUhcQDwgiuLCOM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass 
smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=QUaSMlFE; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=dfCtUSc9; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="QUaSMlFE"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="dfCtUSc9" Message-ID: <20250908225752.744169647@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1757372394; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=yifYKxxoOjDGJ+Kw/k9A+Jz2j8TKRsJ1YOBwxZLrTTg=; b=QUaSMlFE78bzzptgx7Y9ooL+93PMl8tc6JNMaeouChLbv6FPw+8gkncudRhcpZNKgc4+yx VTWljVh1fS9fLQrY/+PKfsFGiU2UUdqw8Ak7+troDV6KY82D89Xbik3Lexd0Q6Oy58oroW 6O5tM4KOjrczjFOAULM+eO8cOBc495ISKgw5P9z2zCPJI9BAMEqbfmIbCpD3KT46ml9Y+L M8as2/R1T9498DXcd76jgoXhYBkqC6YzR1LwfEWlljz8FHP/9f9gLYNp+X9oCS3Jgqcu1f WCnZ93g6LC482x+DymY6utevgstgS5LSEOroOJwJVmVQ8hnpPTcq4gADIiQHUw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1757372394; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=yifYKxxoOjDGJ+Kw/k9A+Jz2j8TKRsJ1YOBwxZLrTTg=; b=dfCtUSc9KqTE6gC6+gPpZlhd7SqTmYztxjDOQyOtSu8MuEuzq16a+5tb7lENP+W256//+z tE/1aPI15YdsOrAg== From: Thomas Gleixner To: LKML Cc: Mathieu Desnoyers , Peter Zijlstra , "Paul E. 
McKenney" , Boqun Feng , Peter Zilstra , Jonathan Corbet , Prakash Sangappa , Madadi Vineeth Reddy , K Prateek Nayak , Steven Rostedt , Sebastian Andrzej Siewior , Arnd Bergmann , linux-arch@vger.kernel.org Subject: [patch 03/12] rseq: Provide static branch for time slice extensions References: <20250908225709.144709889@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Tue, 9 Sep 2025 00:59:53 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Guard the time slice extension functionality with a static key, which can be disabled on the kernel command line. Signed-off-by: Thomas Gleixner Cc: Mathieu Desnoyers Cc: Peter Zijlstra Cc: "Paul E. McKenney" Cc: Boqun Feng --- include/linux/rseq_entry.h | 11 +++++++++++ kernel/rseq.c | 17 +++++++++++++++++ 2 files changed, 28 insertions(+) --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -77,6 +77,17 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEB #define rseq_inline __always_inline #endif =20 +#ifdef CONFIG_RSEQ_SLICE_EXTENSION +DECLARE_STATIC_KEY_TRUE(rseq_slice_extension_key); + +static __always_inline bool rseq_slice_extension_enabled(void) +{ + return static_branch_likely(&rseq_slice_extension_key); +} +#else /* CONFIG_RSEQ_SLICE_EXTENSION */ +static inline bool rseq_slice_extension_enabled(void) { return false; } +#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */ + bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs= , unsigned long csaddr); bool rseq_debug_validate_ids(struct task_struct *t); =20 --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -474,3 +474,20 @@ SYSCALL_DEFINE4(rseq, struct rseq __user =20 return 0; } + +#ifdef CONFIG_RSEQ_SLICE_EXTENSION +DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key); + +static int __init rseq_slice_cmdline(char *str) +{ + bool on; + + if (kstrtobool(str, &on)) + return -EINVAL; + + if (!on) + 
static_branch_disable(&rseq_slice_extension_key); + return 0; +} +__setup("rseq_slice_ext=3D", rseq_slice_cmdline); +#endif /* CONFIG_RSEQ_SLICE_EXTENSION */ From nobody Wed Sep 10 01:55:31 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E887B3112A0; Mon, 8 Sep 2025 23:00:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372403; cv=none; b=Y3riYJBWmY+CtwMFjhblfurEbwK9KvFEzICCwEPKTlNu1vuNttcxsJK6BkaA4l5Qi8zO6GE+BJMS5SYXtWzVN2TGIMQB5d3jQOGtRFi+uHZnTDOr7BFLVCkPrDyCxEef43+nvABrRsUVYtBHAVhv3ejJOHhB5jryisBRrGQsQ4w= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372403; c=relaxed/simple; bh=NqQjpQ6wmtHhaimTEy5dZdITuRSnH9EVOsOJjL3t7HI=; h=Message-ID:From:To:Subject:References:MIME-Version:Content-Type: cc:Date; b=mEbf+5Q0uLIxMpsn+1y5EcBFwtSd/rUHn6Q+UbxKHafcrhsnJKBBk/OdrElscpA53wkaQNxmVmQ6nIAO/eakGkr4IfPu1XOAmVT58Q4TU1Cx+y7OWQcEeOE1whqjcCLKpO9jr97FFyeWJJSyWv/MBwl8sSNzIPzG9tFiqCJfK6k= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=nwFj8V9x; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=VBKj4myO; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="nwFj8V9x"; dkim=permerror (0-bit key) 
header.d=linutronix.de header.i=@linutronix.de header.b="VBKj4myO" Message-ID: <20250908225752.808973324@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1757372398; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=puOLVz/xusJgMHcGpgny0+0j3GVseKRPTpDHpqHQHdc=; b=nwFj8V9xpjJS2xjnYghzTPAyD6gCEZrg2AxvRsBLeDoR2nk/L3vYXTUbc3X+5RthrkvsSm /wAZpJ2nKKTcnTlv9jL3JKKReXvmtzXvNcDOA0OWcm/SxD5JsZi+07Jzt2UoSTIji/380D cjQU6oEn6G9vF/ry0V6bnw4lDL8vHmKKqesivT0DBaaOQLt8R/w7CIm73g0BBg9PdGAkSi m0ZuthDRAX6m1UExAzW3tNLb3nAjkCHwOxzw5YquANpyJwG5JAj2dXsmLnpSaag37i389u xkLz8s4uP9YfNV5dXf4xahqSVFBdqUS+3PTEwIWBALk5krUGQ32KyG2xwxEOGw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1757372398; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=puOLVz/xusJgMHcGpgny0+0j3GVseKRPTpDHpqHQHdc=; b=VBKj4myOJex1TJY340JUP5vfkx/G0eGJvpaP0Z9+PakZRIpGqhqepbTMZTrS9OSLcnudrw CHmJqgfaz6ZnKdCw== From: Thomas Gleixner To: LKML Subject: [patch 04/12] rseq: Add statistics for time slice extensions References: <20250908225709.144709889@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 cc: Peter Zilstra , Peter Zijlstra , Mathieu Desnoyers , "Paul E. McKenney" , Boqun Feng , Jonathan Corbet , Prakash Sangappa , Madadi Vineeth Reddy , K Prateek Nayak , Steven Rostedt , Sebastian Andrzej Siewior , Arnd Bergmann , linux-arch@vger.kernel.org Date: Tue, 9 Sep 2025 00:59:56 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Extend the quick statistics with time slice specific fields. 
Signed-off-by: Thomas Gleixner --- include/linux/rseq_entry.h | 4 ++++ kernel/rseq.c | 12 ++++++++++++ 2 files changed, 16 insertions(+) --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -15,6 +15,10 @@ struct rseq_stats { unsigned long cs; unsigned long clear; unsigned long fixup; + unsigned long s_granted; + unsigned long s_expired; + unsigned long s_revoked; + unsigned long s_yielded; }; =20 DECLARE_PER_CPU(struct rseq_stats, rseq_stats); --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -138,6 +138,12 @@ static int rseq_stats_show(struct seq_fi stats.cs +=3D data_race(per_cpu(rseq_stats.cs, cpu)); stats.clear +=3D data_race(per_cpu(rseq_stats.clear, cpu)); stats.fixup +=3D data_race(per_cpu(rseq_stats.fixup, cpu)); + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) { + stats.s_granted +=3D data_race(per_cpu(rseq_stats.s_granted, cpu)); + stats.s_expired +=3D data_race(per_cpu(rseq_stats.s_expired, cpu)); + stats.s_revoked +=3D data_race(per_cpu(rseq_stats.s_revoked, cpu)); + stats.s_yielded +=3D data_race(per_cpu(rseq_stats.s_yielded, cpu)); + } } =20 seq_printf(m, "exit: %16lu\n", stats.exit); @@ -148,6 +154,12 @@ static int rseq_stats_show(struct seq_fi seq_printf(m, "cs: %16lu\n", stats.cs); seq_printf(m, "clear: %16lu\n", stats.clear); seq_printf(m, "fixup: %16lu\n", stats.fixup); + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) { + seq_printf(m, "sgrant: %16lu\n", stats.s_granted); + seq_printf(m, "sexpir: %16lu\n", stats.s_expired); + seq_printf(m, "srevok: %16lu\n", stats.s_revoked); + seq_printf(m, "syield: %16lu\n", stats.s_yielded); + } return 0; } From nobody Wed Sep 10 01:55:31 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 47B5031197B; Mon, 8 Sep 2025 23:00:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none 
smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372405; cv=none; b=nBH62RByUQx/C9h5wMh5dcn9i9Ywodvbp19vnWWmbvw82poSG7Sd7r5zAJ52hiV3933pBtXtiG+92cFsy2/USBO3s1IhGi7R17Oz8SqOGMWTR2IKCjPtbzH2hxKa5TDI3ooXBfQrThCDlCBbOAFzdM1iacYzkiNrX4OvhJpcFdU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372405; c=relaxed/simple; bh=s5+0dXKO96Z6TE/svvTpMCqloqjUT8B7tIiQkcRJzUI=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=SJ0IFq2BYcao6y6z5DJq0IdA6PkDB0D6iSvlHt6ytOd1HGUPH+5flN9LNBUUhBTHpP4BAARSc7PXTvduW5Dd31Q7xpShCbg5NXMP9qZxoIuWeB6CX6hJqrE3Hmf/igFHgpRf4qx82B6R9xuLA4GMUaEo3fc/BqHzVgvbzCl6Wg4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=MB4gZqJ7; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=Tws6YMGy; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="MB4gZqJ7"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="Tws6YMGy" Message-ID: <20250908225752.872081859@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1757372401; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=q2HI1rZu/rTLqfMxXCHeJvZRAnjrUkclL/FW2QBEeDw=; b=MB4gZqJ7AYPIhpNWNvF3rJ1hQzQ1HN7DUUUqB3junfMwk4xbOflZ47ReyCmlvCBXM9IMXj 
bMCD+ZQyMXQTIU0jtlui0omMJB7W7tdtbZRvsf/pI3Rg71C23hY7K6+wFsM4TAz6TfDzMS JNVwMqBHADA3B6E5LNoybmLrm/AaCVLEde218YrBDBvwhZVpnMY1oDruX1COlE2rwqkxX4 4j8AeKwuFmtpdzspdavrJ1ScSZggEOcDtv8cdVmNY08QcKaAQ3PqQQb0TnsxE/mSSi4IZW yMKooEnaYqOwDSuYzy7X6qKvTvexBqNdvLTQJdnOVvOzMHxzPIGu/Z8nu37lnw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1757372401; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=q2HI1rZu/rTLqfMxXCHeJvZRAnjrUkclL/FW2QBEeDw=; b=Tws6YMGycBqrhvXHfDKm8QmAY1c7fASt6l/C2FD1efOFQOdRaFO2ddBN9Bulvj2KKKrTRV 3NsPgxgGnaE7acCQ== From: Thomas Gleixner To: LKML Cc: Mathieu Desnoyers , Peter Zijlstra , "Paul E. McKenney" , Boqun Feng , Peter Zilstra , Jonathan Corbet , Prakash Sangappa , Madadi Vineeth Reddy , K Prateek Nayak , Steven Rostedt , Sebastian Andrzej Siewior , Arnd Bergmann , linux-arch@vger.kernel.org Subject: [patch 05/12] rseq: Add prctl() to enable time slice extensions References: <20250908225709.144709889@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Tue, 9 Sep 2025 00:59:59 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Implement a prctl() so that tasks can enable the time slice extension mechanism. This fails when time slice extensions are disabled at compile time or on the kernel command line, or when no rseq pointer is registered in the kernel. That allows a single trivial check in the exit to user mode hotpath to decide whether the whole mechanism needs to be invoked. Signed-off-by: Thomas Gleixner Cc: Mathieu Desnoyers Cc: Peter Zijlstra Cc: "Paul E. 
McKenney" Cc: Boqun Feng --- include/linux/rseq.h | 9 +++++++ include/uapi/linux/prctl.h | 10 ++++++++ kernel/rseq.c | 52 ++++++++++++++++++++++++++++++++++++++++= +++++ kernel/sys.c | 6 +++++ 4 files changed, 77 insertions(+) --- a/include/linux/rseq.h +++ b/include/linux/rseq.h @@ -190,4 +190,13 @@ void rseq_syscall(struct pt_regs *regs); static inline void rseq_syscall(struct pt_regs *regs) { } #endif /* !CONFIG_DEBUG_RSEQ */ =20 +#ifdef CONFIG_RSEQ_SLICE_EXTENSION +int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3); +#else /* CONFIG_RSEQ_SLICE_EXTENSION */ +static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned = long arg3) +{ + return -EINVAL; +} +#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */ + #endif /* _LINUX_RSEQ_H */ --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -376,4 +376,14 @@ struct prctl_mm_map { # define PR_FUTEX_HASH_SET_SLOTS 1 # define PR_FUTEX_HASH_GET_SLOTS 2 =20 +/* RSEQ time slice extensions */ +#define PR_RSEQ_SLICE_EXTENSION 79 +# define PR_RSEQ_SLICE_EXTENSION_GET 1 +# define PR_RSEQ_SLICE_EXTENSION_SET 2 +/* + * Bits for RSEQ_SLICE_EXTENSION_GET/SET + * PR_RSEQ_SLICE_EXT_ENABLE: Enable + */ +# define PR_RSEQ_SLICE_EXT_ENABLE 0x01 + #endif /* _LINUX_PRCTL_H */ --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -71,6 +71,7 @@ #define RSEQ_BUILD_SLOW_PATH =20 #include +#include #include #include #include @@ -490,6 +491,57 @@ SYSCALL_DEFINE4(rseq, struct rseq __user #ifdef CONFIG_RSEQ_SLICE_EXTENSION DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key); =20 +int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3) +{ + switch (arg2) { + case PR_RSEQ_SLICE_EXTENSION_GET: + if (arg3) + return -EINVAL; + return current->rseq.slice.state.enabled ? 
PR_RSEQ_SLICE_EXT_ENABLE : 0; + + case PR_RSEQ_SLICE_EXTENSION_SET: { + u32 rflags, valid =3D RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE; + bool enable =3D !!(arg3 & PR_RSEQ_SLICE_EXT_ENABLE); + + if (arg3 & ~PR_RSEQ_SLICE_EXT_ENABLE) + return -EINVAL; + if (!rseq_slice_extension_enabled()) + return -ENOTSUPP; + if (!current->rseq.usrptr) + return -ENXIO; + + /* No change? */ + if (enable =3D=3D !!current->rseq.slice.state.enabled) + return 0; + + if (get_user(rflags, &current->rseq.usrptr->flags)) + goto die; + + if (current->rseq.slice.state.enabled) + valid |=3D RSEQ_CS_FLAG_SLICE_EXT_ENABLED; + + if ((rflags & valid) !=3D valid) + goto die; + + rflags &=3D ~RSEQ_CS_FLAG_SLICE_EXT_ENABLED; + rflags |=3D RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE; + if (enable) + rflags |=3D RSEQ_CS_FLAG_SLICE_EXT_ENABLED; + + if (put_user(rflags, &current->rseq.usrptr->flags)) + goto die; + + current->rseq.slice.state.enabled =3D enable; + return 0; + } + default: + return -EINVAL; + } +die: + force_sig(SIGSEGV); + return -EFAULT; +} + static int __init rseq_slice_cmdline(char *str) { bool on; --- a/kernel/sys.c +++ b/kernel/sys.c @@ -53,6 +53,7 @@ #include #include #include +#include =20 #include #include @@ -2805,6 +2806,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsi case PR_FUTEX_HASH: error =3D futex_hash_prctl(arg2, arg3, arg4); break; + case PR_RSEQ_SLICE_EXTENSION: + if (arg4 || arg5) + return -EINVAL; + error =3D rseq_slice_extension_prctl(arg2, arg3); + break; default: trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5); error =3D -EINVAL; From nobody Wed Sep 10 01:55:31 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id EC4F62F3C35; Mon, 8 Sep 2025 23:00:05 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; 
s=arc-20240116; t=1757372407; cv=none; b=rV2uikTLxTlVFpyzLdIiMu7A9CMQslVtr10xYQU709wIjcu2BcXEABft7v1d+ds6wZ0ZscKNNWy/dsk6Ka4N/QQzzDn5HEmRJ3LeVHrvv2ICLGSeJh/hM8gK2hDasE8FqRSUZJpi8IGn3UD+I8uL8gS60qAO+E9IdtV6KCuW6dk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372407; c=relaxed/simple; bh=n2wD579iijEn3Q6LkN/A6qn3bPT+leOOH6JRj3hBSJ8=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=kFmNnqwFS8jZljr7Ab3ORf6FoCRp1D1EDqVDeYyIs8mR+1S4Ns0ZvgGXNLKrw4o7HZDmaf1TWkyVPTlSp11NoCyT7tcppIgo1ls70U4lsLUeAVek0YPUTZ0uKfs/PDQq8Dwrc2KIB/gNDkjBFp98GCS69iZTYKp94JCmB2KVJsU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=38J8ySTm; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=l939WA0/; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="38J8ySTm"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="l939WA0/" Message-ID: <20250908225752.936257349@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1757372404; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=8A1YMgmrPS+Ug02F4sFlVaGHSo1xhuOr+4o1pO4v4PU=; b=38J8ySTmA3REjEPjXeUnUOIUqSuvwUBCRuFMpoTekusPKmfVvC55E5wGaBvG9S2yjhiKS5 APBWZUb5zti/JBS4Y3rJoBjdRq1vZscHUIMatNHGbXq8M32JhGktgQYu/YfVwYwjCWc8RD nnJ0pMHlFmyCTr6Qi7yZfwo/IIxcyOfSaee/AN65o9BwqcwoDFVjMQ9j8x+KS9XyVIWfyK 
H5UjIsd4nQKmCZReen2GtJUlMLUMavyNeqM3wQQBmmIIHuM2+PIvExTR2BwmgV3KxcL6KR kTsfjnq8yTbRRdeIa4ZztmK7h0EzKYYv8vrmDcjXS0hiQHks6n2kpjkxcvhqrw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1757372404; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=8A1YMgmrPS+Ug02F4sFlVaGHSo1xhuOr+4o1pO4v4PU=; b=l939WA0/hJ94zbpG5Hm31JFDy2DA1JdGe+m3QRcQlKKhrRzIWPiSInhEk5M8Unxygm3BQL OF5bUgtgk5cwXbCw== From: Thomas Gleixner To: LKML Cc: Arnd Bergmann , linux-arch@vger.kernel.org, Peter Zilstra , Peter Zijlstra , Mathieu Desnoyers , "Paul E. McKenney" , Boqun Feng , Jonathan Corbet , Prakash Sangappa , Madadi Vineeth Reddy , K Prateek Nayak , Steven Rostedt , Sebastian Andrzej Siewior Subject: [patch 06/12] rseq: Implement sys_rseq_slice_yield() References: <20250908225709.144709889@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Tue, 9 Sep 2025 01:00:02 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Provide a new syscall whose only purpose is to yield the CPU after the kernel granted a time slice extension. sched_yield() is not suitable for that because it unconditionally schedules, but the end of the time slice extension is not required to schedule when the task was already preempted. This also allows a strict check for termination, which catches user space invoking random syscalls, including sched_yield(), from a time slice extension region. 
Signed-off-by: Thomas Gleixner Cc: Arnd Bergmann Cc: linux-arch@vger.kernel.org --- arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/arm64/tools/syscall_32.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + include/linux/syscalls.h | 1 + include/uapi/asm-generic/unistd.h | 5 ++++- kernel/rseq.c | 9 +++++++++ kernel/sys_ni.c | 1 + scripts/syscall.tbl | 1 + 21 files changed, 32 insertions(+), 1 deletion(-) --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -509,3 +509,4 @@ 577 common open_tree_attr sys_open_tree_attr 578 common file_getattr sys_file_getattr 579 common file_setattr sys_file_setattr +580 common rseq_slice_yield sys_rseq_slice_yield --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -484,3 +484,4 @@ 467 common open_tree_attr sys_open_tree_attr 468 common file_getattr sys_file_getattr 469 common file_setattr sys_file_setattr +470 common rseq_slice_yield sys_rseq_slice_yield --- a/arch/arm64/tools/syscall_32.tbl +++ b/arch/arm64/tools/syscall_32.tbl @@ -481,3 +481,4 @@ 467 common open_tree_attr sys_open_tree_attr 468 common file_getattr sys_file_getattr 469 common file_setattr sys_file_setattr +470 common rseq_slice_yield sys_rseq_slice_yield --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -469,3 +469,4 @@ 467 common open_tree_attr sys_open_tree_attr 468 common file_getattr 
sys_file_getattr 469 common file_setattr sys_file_setattr +470 common rseq_slice_yield sys_rseq_slice_yield --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -475,3 +475,4 @@ 467 common open_tree_attr sys_open_tree_attr 468 common file_getattr sys_file_getattr 469 common file_setattr sys_file_setattr +470 common rseq_slice_yield sys_rseq_slice_yield --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -408,3 +408,4 @@ 467 n32 open_tree_attr sys_open_tree_attr 468 n32 file_getattr sys_file_getattr 469 n32 file_setattr sys_file_setattr +470 common rseq_slice_yield sys_rseq_slice_yield --- a/arch/mips/kernel/syscalls/syscall_n64.tbl +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl @@ -384,3 +384,4 @@ 467 n64 open_tree_attr sys_open_tree_attr 468 n64 file_getattr sys_file_getattr 469 n64 file_setattr sys_file_setattr +470 common rseq_slice_yield sys_rseq_slice_yield --- a/arch/mips/kernel/syscalls/syscall_o32.tbl +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl @@ -457,3 +457,4 @@ 467 o32 open_tree_attr sys_open_tree_attr 468 o32 file_getattr sys_file_getattr 469 o32 file_setattr sys_file_setattr +470 common rseq_slice_yield sys_rseq_slice_yield --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -468,3 +468,4 @@ 467 common open_tree_attr sys_open_tree_attr 468 common file_getattr sys_file_getattr 469 common file_setattr sys_file_setattr +470 common rseq_slice_yield sys_rseq_slice_yield --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -560,3 +560,4 @@ 467 common open_tree_attr sys_open_tree_attr 468 common file_getattr sys_file_getattr 469 common file_setattr sys_file_setattr +470 nospu rseq_slice_yield sys_rseq_slice_yield --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -472,3 +472,4 @@ 467 common open_tree_attr sys_open_tree_attr 
sys_open_tree_attr 468 common file_getattr sys_file_getattr sys_file_getattr 469 common file_setattr sys_file_setattr sys_file_setattr +470 common rseq_slice_yield sys_rseq_slice_yield sys_rseq_slice_yield --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -473,3 +473,4 @@ 467 common open_tree_attr sys_open_tree_attr 468 common file_getattr sys_file_getattr 469 common file_setattr sys_file_setattr +470 common rseq_slice_yield sys_rseq_slice_yield --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -515,3 +515,4 @@ 467 common open_tree_attr sys_open_tree_attr 468 common file_getattr sys_file_getattr 469 common file_setattr sys_file_setattr +470 common rseq_slice_yield sys_rseq_slice_yield --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -475,3 +475,4 @@ 467 i386 open_tree_attr sys_open_tree_attr 468 i386 file_getattr sys_file_getattr 469 i386 file_setattr sys_file_setattr +470 i386 rseq_slice_yield sys_rseq_slice_yield --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -393,6 +393,7 @@ 467 common open_tree_attr sys_open_tree_attr 468 common file_getattr sys_file_getattr 469 common file_setattr sys_file_setattr +470 common rseq_slice_yield sys_rseq_slice_yield =20 # # Due to a historical design error, certain syscalls are numbered differen= tly --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -440,3 +440,4 @@ 467 common open_tree_attr sys_open_tree_attr 468 common file_getattr sys_file_getattr 469 common file_setattr sys_file_setattr +470 common rseq_slice_yield sys_rseq_slice_yield --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -957,6 +957,7 @@ asmlinkage long sys_statx(int dfd, const unsigned mask, struct statx __user *buffer); asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len, int flags, uint32_t sig); +asmlinkage long 
sys_rseq_slice_yield(void); asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned f= lags); asmlinkage long sys_open_tree_attr(int dfd, const char __user *path, unsigned flags, --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -858,8 +858,11 @@ #define __NR_file_setattr 469 __SYSCALL(__NR_file_setattr, sys_file_setattr) =20 +#define __NR_rseq_slice_yield 470 +__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield) + #undef __NR_syscalls -#define __NR_syscalls 470 +#define __NR_syscalls 471 =20 /* * 32 bit systems traditionally used different --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -542,6 +542,15 @@ int rseq_slice_extension_prctl(unsigned return -EFAULT; } =20 +SYSCALL_DEFINE0(rseq_slice_yield) +{ + if (need_resched()) { + schedule(); + return 1; + } + return 0; +} + static int __init rseq_slice_cmdline(char *str) { bool on; --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -390,5 +390,6 @@ COND_SYSCALL(setuid16); =20 /* restartable sequence */ COND_SYSCALL(rseq); +COND_SYSCALL(rseq_slice_yield); =20 COND_SYSCALL(uretprobe); --- a/scripts/syscall.tbl +++ b/scripts/syscall.tbl @@ -410,3 +410,4 @@ 467 common open_tree_attr sys_open_tree_attr 468 common file_getattr sys_file_getattr 469 common file_setattr sys_file_setattr +470 common rseq_slice_yield sys_rseq_slice_yield From nobody Wed Sep 10 01:55:31 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6767F314A9F; Mon, 8 Sep 2025 23:00:09 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372411; cv=none; 
b=Op11WbnxSvN1/L2Ndj4ROqWAryLo8H+KEyuauQZ1jtITb2zGjdD2kLwEdYo3I7DkCVqLveM9al8GeLxGLdSI+LwJdA60H70oRPmr3FGnqQy8/UYSamZkNWXHUm+bywHoa9Rtvashm7FSP2/A45s26Yc0L0bVbxD4sM7ovJs1kdo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372411; c=relaxed/simple; bh=9Jqv89KzGDy69OgJe10oMaluDUKtLEMBR5+Hj+dq4Y8=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=WV7U9N7Lrti0SaNpi2p0dnlp3Vde2JJUXa46yOccqfRhMOx80QSACvsRABNa6f7CA9gPCBtIvE01XNEDfPEkPdNqTzv5Air41qNoIJstF7FpAdKAluosWXjRjJCPkcnm4tsAfKzB6CHVsvzujGqUf9dkwOyiDofNiKvRmXUQ2sE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=C5L4aQbC; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=J192pnlA; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="C5L4aQbC"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="J192pnlA" Message-ID: <20250908225753.012514970@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1757372407; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=MEyxQCLOxK4AfO/sdIWSTudPuIIwQf/LLssO80sEA+s=; b=C5L4aQbCZa18Ds4M5hRfbKlAkKcYbJd6P6+r+fvOEPfZSXE7VcNDnvSRow7DiEHda+FUFF ltnNclCy9HughxlDYfeHOC8SpwYA2XNEN7zILo3iKdW2/GjZutogBODEpbeTAzzxTGsN8k 1TW6zKgTVlBVtXxpEMpc8mrxn/0spRoMcSRo0Tm/36H/4p9Mlzv3tiWNLFAZx8oBIFouVM 
renWiB0SEzIeBhWaaKTeen19MWBqgTR0sELm9ckSbfxYRa43W32GrG7v44WcPGZpkfzR69 eUuu1LGqUsirC0R9DxnArubhupbhLRjE+pvOaIUSo7MQF9Bk7N3Jp23Z/aGjWw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1757372407; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=MEyxQCLOxK4AfO/sdIWSTudPuIIwQf/LLssO80sEA+s=; b=J192pnlAaaRycv4n0+aigO3EQ7O9i9d2nRWAI7Yst1BGC0elkCD3z6sbf7tc50sR74JpCo l7ZZ2lWlCYn+pNDg== From: Thomas Gleixner To: LKML Cc: Peter Zijlstra , Mathieu Desnoyers , "Paul E. McKenney" , Boqun Feng , Peter Zilstra , Jonathan Corbet , Prakash Sangappa , Madadi Vineeth Reddy , K Prateek Nayak , Steven Rostedt , Sebastian Andrzej Siewior , Arnd Bergmann , linux-arch@vger.kernel.org Subject: [patch 07/12] rseq: Implement syscall entry work for time slice extensions References: <20250908225709.144709889@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Tue, 9 Sep 2025 01:00:05 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" The kernel sets SYSCALL_WORK_SYSCALL_RSEQ_SLICE when it grants a time slice extension. This allows the kernel to handle the rseq_slice_yield() syscall, which is used by user space to relinquish the CPU after finishing the critical section for which it requested an extension. In case the kernel state is still GRANTED, the kernel resets both kernel and user space state with a set of sanity checks. If the kernel state is already cleared, then this raced against the timer or some other interrupt and the kernel just clears the work bit. Doing it in syscall entry work allows catching misbehaving user space which issues a syscall from the critical section. A wrong syscall or inconsistent user space state results in a SIGSEGV. 
Signed-off-by: Thomas Gleixner Cc: Peter Zijlstra Cc: Mathieu Desnoyers Cc: "Paul E. 
McKenney" Cc: Boqun Feng --- include/linux/entry-common.h | 2 - include/linux/rseq.h | 2 + include/linux/thread_info.h | 16 ++++---- kernel/entry/syscall-common.c | 11 ++++- kernel/rseq.c | 80 +++++++++++++++++++++++++++++++++++++= +++++ 5 files changed, 101 insertions(+), 10 deletions(-) --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -36,8 +36,8 @@ SYSCALL_WORK_SYSCALL_EMU | \ SYSCALL_WORK_SYSCALL_AUDIT | \ SYSCALL_WORK_SYSCALL_USER_DISPATCH | \ + SYSCALL_WORK_SYSCALL_RSEQ_SLICE | \ ARCH_SYSCALL_WORK_ENTER) - #define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \ SYSCALL_WORK_SYSCALL_TRACE | \ SYSCALL_WORK_SYSCALL_AUDIT | \ --- a/include/linux/rseq.h +++ b/include/linux/rseq.h @@ -191,8 +191,10 @@ static inline void rseq_syscall(struct p #endif /* !CONFIG_DEBUG_RSEQ */ =20 #ifdef CONFIG_RSEQ_SLICE_EXTENSION +void rseq_syscall_enter_work(long syscall); int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3); #else /* CONFIG_RSEQ_SLICE_EXTENSION */ +static inline void rseq_syscall_enter_work(long syscall) { } static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned = long arg3) { return -EINVAL; --- a/include/linux/thread_info.h +++ b/include/linux/thread_info.h @@ -46,15 +46,17 @@ enum syscall_work_bit { SYSCALL_WORK_BIT_SYSCALL_AUDIT, SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH, SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP, + SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE, }; =20 -#define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP) -#define SYSCALL_WORK_SYSCALL_TRACEPOINT BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE= POINT) -#define SYSCALL_WORK_SYSCALL_TRACE BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE) -#define SYSCALL_WORK_SYSCALL_EMU BIT(SYSCALL_WORK_BIT_SYSCALL_EMU) -#define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT) -#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_US= ER_DISPATCH) -#define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_T= RAP) +#define 
SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP) +#define SYSCALL_WORK_SYSCALL_TRACEPOINT BIT(SYSCALL_WORK_BIT_SYSCALL_TRAC= EPOINT) +#define SYSCALL_WORK_SYSCALL_TRACE BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE) +#define SYSCALL_WORK_SYSCALL_EMU BIT(SYSCALL_WORK_BIT_SYSCALL_EMU) +#define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT) +#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_US= ER_DISPATCH) +#define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_= TRAP) +#define SYSCALL_WORK_SYSCALL_RSEQ_SLICE BIT(SYSCALL_WORK_BIT_SYSCALL_RSEQ= _SLICE) #endif =20 #include --- a/kernel/entry/syscall-common.c +++ b/kernel/entry/syscall-common.c @@ -17,8 +17,7 @@ static inline void syscall_enter_audit(s } } =20 -long syscall_trace_enter(struct pt_regs *regs, long syscall, - unsigned long work) +long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long= work) { long ret =3D 0; =20 @@ -32,6 +31,14 @@ long syscall_trace_enter(struct pt_regs return -1L; } =20 + /* + * User space got a time slice extension granted and relinquishes + * the CPU. The work stops the slice timer to avoid an extra round + * through hrtimer_interrupt(). 
+ */ + if (work & SYSCALL_WORK_SYSCALL_RSEQ_SLICE) + rseq_syscall_enter_work(syscall); + /* Handle ptrace */ if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) { ret =3D ptrace_report_syscall_entry(regs); --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -491,6 +491,86 @@ SYSCALL_DEFINE4(rseq, struct rseq __user #ifdef CONFIG_RSEQ_SLICE_EXTENSION DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key); =20 +static inline void rseq_slice_set_need_resched(struct task_struct *curr) +{ + /* + * The interrupt guard is required to prevent inconsistent state in + * this case: + * + * set_tsk_need_resched() + * --> Interrupt + * wakeup() + * set_tsk_need_resched() + * set_preempt_need_resched() + * schedule_on_return() + * clear_tsk_need_resched() + * clear_preempt_need_resched() + * set_preempt_need_resched() <- Inconsistent state + * + * This is safe vs. a remote set of TIF_NEED_RESCHED because that + * only sets the already set bit and does not create inconsistent + * state. + */ + scoped_guard(irq) + set_need_resched_current(); +} + +static void rseq_slice_validate_ctrl(u32 expected) +{ + u32 __user *sctrl =3D &current->rseq.usrptr->slice_ctrl; + u32 uval; + + if (get_user_masked_u32(&uval, sctrl) || uval !=3D expected) + force_sig(SIGSEGV); +} + +/* + * Invoked from syscall entry if a time slice extension was granted and the + * kernel did not clear it before user space left the critical section. + */ +void rseq_syscall_enter_work(long syscall) +{ + struct task_struct *curr =3D current; + bool granted =3D curr->rseq.slice.state.granted; + + clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE); + + if (static_branch_unlikely(&rseq_debug_enabled)) + rseq_slice_validate_ctrl(granted ? RSEQ_SLICE_EXT_GRANTED : 0); + + /* + * The kernel might have raced, revoked the grant and updated + * userspace, but kept the SLICE work set. 
+ */ + if (!granted) + return; + + rseq_stat_inc(rseq_stats.s_yielded); + + /* + * Required to make set_tsk_need_resched() correct on PREEMPT[RT] + * kernels. + */ + scoped_guard(preempt) { + /* + * Now that preemption is disabled, quickly check whether + * the task was already rescheduled before arriving here. + */ + if (!curr->rseq.event.sched_switch) + rseq_slice_set_need_resched(curr); + } + + curr->rseq.slice.state.granted =3D false; + /* + * Clear the grant in user space and check whether this was the + * correct syscall to yield. If the user access fails or the task + * used an arbitrary syscall, terminate it. + */ + if (put_user_masked_u32(0U, &curr->rseq.usrptr->slice_ctrl) || + syscall !=3D __NR_rseq_slice_yield) + force_sig(SIGSEGV); +} + int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3) { switch (arg2) { From nobody Wed Sep 10 01:55:31 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 059013164C0; Mon, 8 Sep 2025 23:00:11 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372413; cv=none; b=Gz3xKLhfQFFpaDWm+vC45BKLCWwxoKi6QFbjD6p9EDft2FsRdW2FrAFrb6DCf0OR3sNbho+XkhSiTHB2OcSmCNs596kONMab2u5gtXaOHz14Adl9QRPIzCrSxzUYsOcWbC5pd2nfsRpT031TfyOBw0ifaxzEIBJIS9Q7yxHuRDs= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372413; c=relaxed/simple; bh=LCzu9gsThD++K9Z4jiB9FCpsAPny/CGoSg1pucBo7VU=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=E8jKiFV9iPio6/rUR2IBkbZpj3wioZmKd0PUD1Gwv8qMEaGbB/QgybuIhvC8m9GU7i4rlEQ0/98hRiTEXYCNm/eLppWHO1MbynEGEkrco8DBCqp/4+fQW2qD3Il46qUXkCSin0oMZG0Qu+7iziTm2R2/wCvRrG6k5LYHnmkoQnU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; 
dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=QOpIu/H6; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=LiNJzjRC; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="QOpIu/H6"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="LiNJzjRC" Message-ID: <20250908225753.077967162@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1757372408; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=a/EPeIza24SnyAPRjSd15cR2KiBIkb5IAQQ8qLaLMZE=; b=QOpIu/H6K6kGbNEdtM3K/n+rAnb+kjHeJiwZAkXQFJdOuQn9BHx4zPgjRxIpJAott/QubE JSWsrDsqu8YC8aI1X0sOhcZwwHxeQEc9fulOsHdI7xov1RDi2k145X3YnIA3GuwzmgQ2KK uf5L+Lksfp9Xhv3xgB1Z/CUsV8gKVG09eywAVGFWSl0oeOBUSI/nJijXItYMg8/W7w2fvR Y6lyY0o2ifVvuOAvV7iqULl4J+n7K3BF2p48B9ZgM3amYRZLHwQAO4yOA9/ZkZfER2F02W NoHwVx4s64Qi843wiTszZ2PJ+tPpcu9mOsAQNdLZJNn2MusT6MZbtqPjY2Rijw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1757372408; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=a/EPeIza24SnyAPRjSd15cR2KiBIkb5IAQQ8qLaLMZE=; b=LiNJzjRCpQXHpas7uMEg88ZbDAN3IC/YH5r99an+663yzDFalIWBWIamfCn0c6Za2i/DUK LG9Qt1vmi/gI22CQ== From: Thomas Gleixner To: LKML Cc: Mathieu Desnoyers , Peter Zijlstra , "Paul E. 
McKenney" , Boqun Feng , Peter Zilstra , Jonathan Corbet , Prakash Sangappa , Madadi Vineeth Reddy , K Prateek Nayak , Steven Rostedt , Sebastian Andrzej Siewior , Arnd Bergmann , linux-arch@vger.kernel.org Subject: [patch 08/12] rseq: Implement time slice extension enforcement timer References: <20250908225709.144709889@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Tue, 9 Sep 2025 01:00:07 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" If a time slice extension is granted and the reschedule delayed, the kernel has to ensure that user space cannot abuse the extension and exceed the maximum granted time. It was suggested to implement this via the existing hrtick() timer in the scheduler, but that turned out to be problematic for several reasons: 1) It creates a dependency on CONFIG_SCHED_HRTICK, which can be disabled independently of CONFIG_HIGH_RES_TIMERS. 2) HRTICK usage in the scheduler can be runtime disabled or is only used for certain aspects of scheduling. 3) The function is calling into the scheduler code and that might have unexpected consequences when this is invoked due to a time slice enforcement expiry. Especially when the task managed to clear the grant via sched_yield(0). It would be possible to address #2 and #3 by storing state in the scheduler, but that is extra complexity and fragility for no value. Implement a dedicated per CPU hrtimer instead, which is solely used for the purpose of time slice enforcement. The timer is armed when an extension was granted right before actually returning to user mode in rseq_exit_to_user_mode_restart(). It is disarmed when the task relinquishes the CPU. Disarming is expensive as the timer is probably the first expiring timer on the CPU, which means it has to reprogram the hardware. But that's less expensive than going through a full hrtimer interrupt cycle for nothing. 
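The arm/expire/cancel rules described above can be sketched as a small user space model. This is only an illustration under stated assumptions, not the kernel implementation: struct task, struct slice_timer_model and the three helper functions are hypothetical names, time is passed in as a plain nanosecond value, and the hrtimer itself is reduced to an "armed" flag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user space model of the enforcement rules; not kernel code. */
struct task {
	bool     granted;      /* models rseq.slice.state.granted */
	uint64_t expires;      /* models rseq.slice.expires in ns */
	bool     need_resched; /* models TIF_NEED_RESCHED         */
};

struct slice_timer_model {
	const struct task *cookie;  /* task the timer was armed for */
	uint64_t           expires; /* absolute expiry time in ns   */
	bool               armed;   /* stands in for the hrtimer    */
};

/*
 * Exit to user mode: returns true when the grant already expired and
 * the task has to reschedule instead of returning to user space.
 */
static bool arm_slice_timer(struct slice_timer_model *st, struct task *curr,
			    uint64_t now)
{
	if (!curr->granted)
		return false;

	if (curr->expires < now) {
		curr->need_resched = true;
		return true;
	}

	/* Remember which task owns the timer for the expiry check */
	st->cookie = curr;
	st->expires = curr->expires;
	st->armed = true;
	return false;
}

/* Expiry: act only if the owning task is still running with a grant */
static void slice_timer_expired(struct slice_timer_model *st,
				struct task *running)
{
	st->armed = false;
	if (st->cookie == running && running->granted)
		running->need_resched = true;
}

/* The task relinquishes the CPU: disarm only a timer it owns */
static void cancel_slice_timer(struct slice_timer_model *st,
			       const struct task *running)
{
	if (st->cookie == running && st->armed)
		st->armed = false;
}
```

The cookie comparison is what keeps a late expiry from penalizing an unrelated task which happens to run on the same CPU when the timer fires.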
Signed-off-by: Thomas Gleixner Cc: Mathieu Desnoyers Cc: Peter Zijlstra Cc: "Paul E. McKenney" Cc: Boqun Feng --- include/linux/rseq_entry.h | 22 +++++++- include/linux/rseq_types.h | 2=20 kernel/rseq.c | 119 ++++++++++++++++++++++++++++++++++++++++= ++++- 3 files changed, 140 insertions(+), 3 deletions(-) --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -88,8 +88,24 @@ static __always_inline bool rseq_slice_e { return static_branch_likely(&rseq_slice_extension_key); } + +extern unsigned int rseq_slice_ext_nsecs; +bool __rseq_arm_slice_extension_timer(void); + +static __always_inline bool rseq_arm_slice_extension_timer(void) +{ + if (!rseq_slice_extension_enabled()) + return false; + + if (likely(!current->rseq.slice.state.granted)) + return false; + + return __rseq_arm_slice_extension_timer(); +} + #else /* CONFIG_RSEQ_SLICE_EXTENSION */ static inline bool rseq_slice_extension_enabled(void) { return false; } +static inline bool rseq_arm_slice_extension_timer(void) { return false; } #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */ =20 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs= , unsigned long csaddr); @@ -560,8 +576,12 @@ static __always_inline void clear_tif_rs static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work) { + /* + * Arm the slice extension timer if nothing to do anymore and the + * task really goes out to user space. 
+ */ if (likely(!test_tif_rseq(ti_work))) - return false; + return rseq_arm_slice_extension_timer(); =20 if (unlikely(__rseq_exit_to_user_mode_restart(regs))) return true; --- a/include/linux/rseq_types.h +++ b/include/linux/rseq_types.h @@ -87,9 +87,11 @@ union rseq_slice_state { /** * struct rseq_slice - Status information for rseq time slice extension * @state: Time slice extension state + * @expires: The time when a grant expires */ struct rseq_slice { union rseq_slice_state state; + u64 expires; }; =20 /** --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -71,6 +71,8 @@ #define RSEQ_BUILD_SLOW_PATH =20 #include +#include +#include #include #include #include @@ -489,8 +491,82 @@ SYSCALL_DEFINE4(rseq, struct rseq __user } =20 #ifdef CONFIG_RSEQ_SLICE_EXTENSION +struct slice_timer { + struct hrtimer timer; + void *cookie; +}; + +unsigned int rseq_slice_ext_nsecs __read_mostly =3D 30 * NSEC_PER_USEC; +static DEFINE_PER_CPU(struct slice_timer, slice_timer); DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key); =20 +static enum hrtimer_restart rseq_slice_expired(struct hrtimer *tmr) +{ + struct slice_timer *st =3D container_of(tmr, struct slice_timer, timer); + + if (st->cookie =3D=3D current && current->rseq.slice.state.granted) { + rseq_stat_inc(rseq_stats.s_expired); + set_need_resched_current(); + } + return HRTIMER_NORESTART; +} + +bool __rseq_arm_slice_extension_timer(void) +{ + struct slice_timer *st =3D this_cpu_ptr(&slice_timer); + struct task_struct *curr =3D current; + + lockdep_assert_irqs_disabled(); + + /* + * This check prevents that a granted time slice extension exceeds + * the maximum scheduling latency when the grant expired before + * going out to user space. Don't bother to clear the grant here, + * it will be cleaned up automatically before going out to user + * space. 
+ */ + if ((unlikely(curr->rseq.slice.expires < ktime_get_mono_fast_ns()))) { + set_need_resched_current(); + return true; + } + + /* + * Store the task pointer as a cookie for comparison in the timer + * function. This is safe as the timer is CPU local and cannot be + * in the expiry function at this point. + */ + st->cookie =3D curr; + hrtimer_start(&st->timer, curr->rseq.slice.expires, HRTIMER_MODE_ABS_PINN= ED_HARD); + /* Arm the syscall entry work */ + set_task_syscall_work(curr, SYSCALL_RSEQ_SLICE); + return false; +} + +static void rseq_cancel_slice_extension_timer(void) +{ + struct slice_timer *st =3D this_cpu_ptr(&slice_timer); + + /* + * st->cookie can be safely read as preemption is disabled and the + * timer is CPU local. The active check can obviously race with the + * hrtimer interrupt, but that's better than disabling interrupts + * unconditionally right away. + * + * As this is most probably the first expiring timer, the cancel is + * expensive as it has to reprogram the hardware, but that's less + * expensive than going through a full hrtimer_interrupt() cycle + * for nothing. + * + * hrtimer_try_to_cancel() is sufficient here as with interrupts + * disabled the timer callback cannot be running and the timer base + * is well determined as the timer is pinned on the local CPU. + */ + if (st->cookie =3D=3D current && hrtimer_active(&st->timer)) { + scoped_guard(irq) + hrtimer_try_to_cancel(&st->timer); + } +} + static inline void rseq_slice_set_need_resched(struct task_struct *curr) { /* @@ -548,10 +624,11 @@ void rseq_syscall_enter_work(long syscal rseq_stat_inc(rseq_stats.s_yielded); =20 /* - * Required to make set_tsk_need_resched() correct on PREEMPT[RT] - * kernels. + * Required to stabilize the per CPU timer pointer and to make + * set_tsk_need_resched() correct on PREEMPT[RT] kernels. 
*/ scoped_guard(preempt) { + rseq_cancel_slice_extension_timer(); /* * Now that preemption is disabled, quickly check whether * the task was already rescheduled before arriving here. @@ -631,6 +708,31 @@ SYSCALL_DEFINE0(rseq_slice_yield) return 0; } =20 +#ifdef CONFIG_SYSCTL +static const unsigned int rseq_slice_ext_nsecs_min =3D 10 * NSEC_PER_USEC; +static const unsigned int rseq_slice_ext_nsecs_max =3D 50 * NSEC_PER_USEC; + +static const struct ctl_table rseq_slice_ext_sysctl[] =3D { + { + .procname =3D "rseq_slice_extension_nsec", + .data =3D &rseq_slice_ext_nsecs, + .maxlen =3D sizeof(unsigned int), + .mode =3D 0644, + .proc_handler =3D proc_douintvec_minmax, + .extra1 =3D (unsigned int *)&rseq_slice_ext_nsecs_min, + .extra2 =3D (unsigned int *)&rseq_slice_ext_nsecs_max, + }, +}; + +static void rseq_slice_sysctl_init(void) +{ + if (rseq_slice_extension_enabled()) + register_sysctl_init("kernel", rseq_slice_ext_sysctl); +} +#else /* CONFIG_SYSCTL */ +static inline void rseq_slice_sysctl_init(void) { } +#endif /* !CONFIG_SYSCTL */ + static int __init rseq_slice_cmdline(char *str) { bool on; @@ -643,4 +745,17 @@ static int __init rseq_slice_cmdline(cha return 0; } __setup("rseq_slice_ext=3D", rseq_slice_cmdline); + +static int __init rseq_slice_init(void) +{ + unsigned int cpu; + + for_each_possible_cpu(cpu) { + hrtimer_setup(per_cpu_ptr(&slice_timer.timer, cpu), rseq_slice_expired, + CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD); + } + rseq_slice_sysctl_init(); + return 0; +} +device_initcall(rseq_slice_init); #endif /* CONFIG_RSEQ_SLICE_EXTENSION */ From nobody Wed Sep 10 01:55:31 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 491A231B10C; Mon, 8 Sep 2025 23:00:12 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: 
i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372414; cv=none; b=qf2IogJFPY7NoFiI7Uk1m0aPD6LNH/T4gldXt6nIjeuTNn48WtMmaTLhapk/AYrcKL6ZgsVpb0w3dx84+74A+6KMrgwfpbYlASy8IH4TSai+4wK/aUCViF+ae7nks4u24ZUAa2/6q3ZQk2A73OFpJWDpXoiwvmwOLGvffox88RA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372414; c=relaxed/simple; bh=4RaYFSfzpQYjNPy+xqWBOA0DfR0lPTjH2Zofu9UBIfo=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=h5Cjo3iBp2RZfS7qNs9zYuOuY5X3n0bx9/QJlMBYLPBLPwdu4OXs+KvHuKAI75HcVYmJ5U++s1JSTR6uUM5GQxiLct89Lp4VQnHkWs1KgZsWKWb+lcHCgRzvwJ+CCxkbjmv8qoWg1koMsPqxTGsO9jh1XFp6stU9Bd5z85SJx+Y= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=t212k+6B; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=0UaHn4iZ; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="t212k+6B"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="0UaHn4iZ" Message-ID: <20250908225753.142571755@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1757372411; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=HDN6yarSBUrggJYbJPpwYDv0yRMy1C4SQrJn4Pj85Sc=; b=t212k+6BzNQJrCQbXcnYNzePsEP4yKFt5/AQWbBNPIYBnCRT9tUiniBBNN779kf3dZosjY gfcjGZWwFin9dYjQfGu9RnjXnaOPozz/Hq++IJOKfBMWuhG1Wr/Q+EAg6SNyXnV7b3XZTv 
DLptFmspp7xtS0kZVfZkBpm7MFjfl5BNSbupAn1tM5Pu3uncRrEweJv0kscOLXt/GDYpU1 T9Klmo5KYDyQyr8SMndQ1Lydw76warv9ulHBdglRmCJKIyPZeIok381ahyyvOB/ajphwcw EFi6jRBY4tXA2xyFuimAry3rCbUUv5ed//p3GlTLaLKs/W8q0rAeil4l1Rg8qQ== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1757372411; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=HDN6yarSBUrggJYbJPpwYDv0yRMy1C4SQrJn4Pj85Sc=; b=0UaHn4iZ519YS4Q1IVZfZoz5Ekvr35WnsII30rtOG4belD/xYWFfkUcK+Bcl+0NvHH1PDM I4gzyrbJYVFck7Dg== From: Thomas Gleixner To: LKML Cc: Mathieu Desnoyers , Peter Zijlstra , "Paul E. McKenney" , Boqun Feng , Peter Zilstra , Jonathan Corbet , Prakash Sangappa , Madadi Vineeth Reddy , K Prateek Nayak , Steven Rostedt , Sebastian Andrzej Siewior , Arnd Bergmann , linux-arch@vger.kernel.org Subject: [patch 09/12] rseq: Reset slice extension when scheduled References: <20250908225709.144709889@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Tue, 9 Sep 2025 01:00:09 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" When a time slice extension was granted in the need_resched() check on exit to user space, the task can still be scheduled out in one of the other pending work items. When it gets scheduled back in, and need_resched() is not set, then the stale grant would be preserved, which is just wrong. RSEQ already keeps track of that and sets TIF_RSEQ, which invokes the critical section and ID update mechanisms. Utilize them and clear the user space slice control member of struct rseq unconditionally within the existing user access sections. That's just an unconditional store more in that path. Signed-off-by: Thomas Gleixner Cc: Mathieu Desnoyers Cc: Peter Zijlstra Cc: "Paul E. 
McKenney" Cc: Boqun Feng --- include/linux/rseq_entry.h | 28 +++++++++++++++++++++++++++- 1 file changed, 27 insertions(+), 1 deletion(-) --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -103,9 +103,17 @@ static __always_inline bool rseq_arm_sli return __rseq_arm_slice_extension_timer(); } =20 +static __always_inline void rseq_slice_clear_grant(struct task_struct *t) +{ + if (IS_ENABLED(CONFIG_RSEQ_STATS) && t->rseq.slice.state.granted) + rseq_stat_inc(rseq_stats.s_revoked); + t->rseq.slice.state.granted =3D false; +} + #else /* CONFIG_RSEQ_SLICE_EXTENSION */ static inline bool rseq_slice_extension_enabled(void) { return false; } static inline bool rseq_arm_slice_extension_timer(void) { return false; } +static inline void rseq_slice_clear_grant(struct task_struct *t) { } #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */ =20 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs= , unsigned long csaddr); @@ -404,6 +412,13 @@ bool rseq_set_ids_get_csaddr(struct task unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault); if (csaddr) unsafe_get_user(*csaddr, &rseq->rseq_cs, efault); + + /* Open coded, so it's in the same user access region */ + if (rseq_slice_extension_enabled()) { + /* Unconditionally clear it, no point in conditionals */ + unsafe_put_user(0U, &rseq->slice_ctrl, efault); + rseq_slice_clear_grant(t); + } user_access_end(); =20 /* Cache the new values */ @@ -518,10 +533,19 @@ static __always_inline bool __rseq_exit_ * If IDs have not changed rseq_event::user_irq must be true * See rseq_sched_switch_event(). 
*/ + struct rseq __user *rseq =3D t->rseq.usrptr; u64 csaddr; =20 - if (unlikely(get_user_masked_u64(&csaddr, &t->rseq.usrptr->rseq_cs))) + if (!user_rw_masked_begin(rseq)) goto fail; + unsafe_get_user(csaddr, &rseq->rseq_cs, fault); + /* Open coded, so it's in the same user access region */ + if (rseq_slice_extension_enabled()) { + /* Unconditionally clear it, no point in conditionals */ + unsafe_put_user(0U, &rseq->slice_ctrl, fault); + rseq_slice_clear_grant(t); + } + user_access_end(); =20 if (static_branch_unlikely(&rseq_debug_enabled) || unlikely(csaddr)) { if (unlikely(!rseq_update_user_cs(t, regs, csaddr))) @@ -545,6 +569,8 @@ static __always_inline bool __rseq_exit_ t->rseq.event.events =3D 0; return false; =20 +fault: + user_access_end(); fail: pagefault_enable(); /* Force it into the slow path. Don't clear the state! */ From nobody Wed Sep 10 01:55:31 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 555DD31C583; Mon, 8 Sep 2025 23:00:15 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372417; cv=none; b=ObXmFMnKc6t8/klsYCE05fkSlJH24jaH4b+nwufdrVPbXqTM56+dGXQxKD/qquWy6Auzxof8hgvattALE485MBchwsbDaKkkZ8PoyEjDhEbkOq5zdD3UeXQ6UjyWGX3APs9IZmw7luGeuUTwYpb2q1lRfE+nFzXZ56etFVShOaM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372417; c=relaxed/simple; bh=SEywHGYDZB/vWqQs5f8TkQkX7TpCbv5rbeoX0Dxswtk=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=ZSXOkKkf/KRaM8+yxc2pqPsMAeDVF03kpHvSmrIEoZdm2OE+aSmyvFScqtqGtkIy2rAItAJsVN/d9etzRpHN/1skeaRkfh6RTtpX4p/aXOy7R+PPhuau3lTLPW4FtXxoq834Yy3/zr79ffiMRS8KUMKzvYCA2l3Wmx+qA3Ds2Uo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass 
(p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=MdoXWnI/; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=AcVER9Sj; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="MdoXWnI/"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="AcVER9Sj" Message-ID: <20250908225753.205700259@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1757372413; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=cVw3C3mDyDXs1g4Ff1jO8tDLgnqTZ1fhyroC/ENYiNA=; b=MdoXWnI/7k3DaxRWw4l4xnin20D0FjnMOCER5uvlBf820w7UBK3r/SmysF7IGE8n5NSgSD JqWJ2vpypYlIIUPk4uWigJqHPCAiuSKk+AsWjKf0raQFh/k1M+1OBoNfXdq0qbvSi70b6W D5P18GNjN7xmTeTSvHSpgpLFD1NHeeSMRhplzoGqpwvFYh993ucQYiGSOlKRmsprNUlqlX g1zpb53DLin4zO9kNkBC4vJEfqLeCHH/l3U9Bi3VIrZ5tKAnLfZIkF30/Mv24OwEWtxQFu 5SRbN1vhMy+fvwYBrhKPuAY9hBYjFxeFOkJ9tlle3X5jJ4gfWoTozh+gOvq2Aw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1757372413; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=cVw3C3mDyDXs1g4Ff1jO8tDLgnqTZ1fhyroC/ENYiNA=; b=AcVER9SjyDp1B7DVY8SuLcM2LVjiAzx0qVHPIaDp8UtpF/YiIIivNZChsuVCdaDKADj8UA Sz792mq5QYzRztAA== From: Thomas Gleixner To: LKML Cc: Mathieu Desnoyers , Peter Zijlstra , "Paul E. 
McKenney" , Boqun Feng , Peter Zilstra , Jonathan Corbet , Prakash Sangappa , Madadi Vineeth Reddy , K Prateek Nayak , Steven Rostedt , Sebastian Andrzej Siewior , Arnd Bergmann , linux-arch@vger.kernel.org Subject: [patch 10/12] rseq: Implement rseq_grant_slice_extension() References: <20250908225709.144709889@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Tue, 9 Sep 2025 01:00:12 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Provide the actual decision function, which decides whether a time slice extension is granted in the exit to user mode path when NEED_RESCHED is evaluated. The decision is made in two stages. First an inline quick check to avoid going into the actual decision function. This checks whether: #1 the functionality is enabled #2 the exit is a return from interrupt to user mode #3 any TIF bit, which causes extra work is set. That includes TIF_RSEQ, which means the task was already scheduled out. =20 The slow path, which implements the actual user space ABI, is invoked when: A) #1 is true, #2 is true and #3 is false It checks whether user space requested a slice extension by setting the request bit in the rseq slice_ctrl field. If so, it grants the extension and stores the slice expiry time, so that the actual exit code can double check whether the slice is already exhausted before going back. B) #1 - #3 are true _and_ a slice extension was granted in a previous loop iteration In this case the grant is revoked. In case that the user space access faults or invalid state is detected, the task is terminated with SIGSEGV. Signed-off-by: Thomas Gleixner Cc: Mathieu Desnoyers Cc: Peter Zijlstra Cc: "Paul E. 
McKenney" Cc: Boqun Feng --- include/linux/rseq_entry.h | 111 ++++++++++++++++++++++++++++++++++++++++= +++++ 1 file changed, 111 insertions(+) --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -41,6 +41,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_ #ifdef CONFIG_RSEQ #include #include +#include #include =20 #include @@ -110,10 +111,120 @@ static __always_inline void rseq_slice_c t->rseq.slice.state.granted =3D false; } =20 +static __always_inline bool rseq_grant_slice_extension(bool work_pending) +{ + struct task_struct *curr =3D current; + union rseq_slice_state state; + struct rseq __user *rseq; + u32 usr_ctrl; + + if (!rseq_slice_extension_enabled()) + return false; + + /* If not enabled or not a return from interrupt, nothing to do. */ + state =3D curr->rseq.slice.state; + state.enabled &=3D curr->rseq.event.user_irq; + if (likely(!state.state)) + return false; + + rseq =3D curr->rseq.usrptr; + if (!user_rw_masked_begin(rseq)) + goto die; + + /* + * Quick check conditions where a grant is not possible or + * needs to be revoked. + * + * 1) Any TIF bit which needs to do extra work aside of + * rescheduling prevents a grant. + * + * 2) A previous rescheduling request resulted in a slice + * extension grant. + */ + if (unlikely(work_pending || state.granted)) { + /* Clear user control unconditionally. 
No point in checking */ + unsafe_put_user(0U, &rseq->slice_ctrl, fail); + user_access_end(); + rseq_slice_clear_grant(curr); + return false; + } + + unsafe_get_user(usr_ctrl, &rseq->slice_ctrl, fail); + if (likely(!(usr_ctrl & RSEQ_SLICE_EXT_REQUEST))) { + user_access_end(); + return false; + } + + /* Grant the slice extension */ + unsafe_put_user(RSEQ_SLICE_EXT_GRANTED, &rseq->slice_ctrl, fail); + user_access_end(); + + rseq_stat_inc(rseq_stats.s_granted); + + curr->rseq.slice.state.granted =3D true; + /* Store expiry time for arming the timer on the way out */ + curr->rseq.slice.expires =3D data_race(rseq_slice_ext_nsecs) + ktime_get_= mono_fast_ns(); + /* + * This is racy against a remote CPU setting TIF_NEED_RESCHED in + * several ways: + * + * 1) + * CPU0 CPU1 + * clear_tsk() + * set_tsk() + * clear_preempt() + * Raise scheduler IPI on CPU0 + * --> IPI + * fold_need_resched() -> Folds correctly + * 2) + * CPU0 CPU1 + * set_tsk() + * clear_tsk() + * clear_preempt() + * Raise scheduler IPI on CPU0 + * --> IPI + * fold_need_resched() <- NOOP as TIF_NEED_RESCHED is false + * + * #1 is not any different from a regular remote reschedule as it + * sets the previously not set bit and then raises the IPI which + * folds it into the preempt counter + * + * #2 is obviously incorrect from a scheduler POV, but it's not + * differently incorrect than the code below clearing the + * reschedule request with the safety net of the timer. + * + * The important part is that the clearing is protected against the + * scheduler IPI and also against any other interrupt which might + * end up waking up a task and setting the bits in the middle of + * the operation: + * + * clear_tsk() + * ---> Interrupt + * wakeup_on_this_cpu() + * set_tsk() + * set_preempt() + * clear_preempt() + * + * which would be inconsistent state. 
+ */ + scoped_guard(irq) { + clear_tsk_need_resched(curr); + clear_preempt_need_resched(); + } + return true; + +fail: + user_access_end(); +die: + force_sig(SIGSEGV); + return false; +} + #else /* CONFIG_RSEQ_SLICE_EXTENSION */ static inline bool rseq_slice_extension_enabled(void) { return false; } static inline bool rseq_arm_slice_extension_timer(void) { return false; } static inline void rseq_slice_clear_grant(struct task_struct *t) { } +static inline bool rseq_grant_slice_extension(bool work_pending) { return = false; } #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */ =20 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs= , unsigned long csaddr); From nobody Wed Sep 10 01:55:31 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 02BFB31D750; Mon, 8 Sep 2025 23:00:17 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372419; cv=none; b=FNo/QlJ6KiNB+g+m87PCiCxHikN4be/YsvEWcjpmQ8Wyd+offpcgv6wN8b+vw2BJVAaU+h4KUFrlW0PGlKNMkiPQ0uBsngQyv+HC+JqIpvyGNh9rFAbipzA3ZQRIwwuwe1bhO17dtCXf9FKafGxzapoQDyzytidFg+jz3DQ5GDk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372419; c=relaxed/simple; bh=RhXdOrpmvWhEXZAJBOlVp00/Ibtc92pO9VfibfnzKL4=; h=Message-ID:From:To:Cc:Subject:References:MIME-Version: Content-Type:Date; b=KUugInjhnz43cy3go5JNMSP5SvKf5gat3t05eo2CLdwBnalJNuKA+bUrG1yxeLpXSEmR1pvUcCA4hv/Tx6MyQWi+y1CYjLWdjIYwMAl63O+YMFNR6Wz/AamSOHXLQ26NMloH5wNMbT/0qtFbljwWhBgRRU6ftwhwjYoF0eh3Z+o= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de 
header.b=4VHy6Q4Z; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=K5DH7yM5; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="4VHy6Q4Z"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="K5DH7yM5" Message-ID: <20250908225753.269052052@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1757372415; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=gQFvLoA+8aqRp49/IJYirlydqYT+cxE6z9BZ2w2sdfQ=; b=4VHy6Q4Z+X7wotuWb5XH/oWzf2xQ8Gd0gQPu9ssvBPHM4VT7QrOyzhbx/QFgz8+y1QHeEw mSVVzXvAy/FjkHbMaik6gBQGNW5pPxF9+/gUJhWjPU5nF1BHHiUCoFFQsFISIfiqQCqTmG rHS0xtst6Lvm/5H6SOWcFSp1R2yJUgOM4Is+pMwMESCCvBD1/6EA8XYPV9uuSlGhObgW1c NAPjqtd62loFd4ip+aTKUsvCIupJbv9ghvFNCoXmfU8YRIjWt1NSF8BBgpPHLy284UJ2kl NkLBzghgN5rvqfZvuP38Ja/1PaRWtas2T9O9vvHoHFLvYAVHI7GrU/7ykFf/DA== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1757372415; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=gQFvLoA+8aqRp49/IJYirlydqYT+cxE6z9BZ2w2sdfQ=; b=K5DH7yM5WVkWYaguBKpIUFa6o5NWJVq4gdxsJ+b5yQadggpnnNZGAnPlkUFbkqokkTKa6G 75xOkj58B796zzBQ== From: Thomas Gleixner To: LKML Cc: Mathieu Desnoyers , Peter Zijlstra , "Paul E. 
McKenney" , Boqun Feng , Peter Zilstra , Jonathan Corbet , Prakash Sangappa , Madadi Vineeth Reddy , K Prateek Nayak , Steven Rostedt , Sebastian Andrzej Siewior , Arnd Bergmann , linux-arch@vger.kernel.org Subject: [patch 11/12] entry: Hook up rseq time slice extension References: <20250908225709.144709889@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Date: Tue, 9 Sep 2025 01:00:14 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Wire the grant decision function up in exit_to_user_mode_loop() Signed-off-by: Thomas Gleixner Cc: Mathieu Desnoyers Cc: Peter Zijlstra Cc: "Paul E. McKenney" Cc: Boqun Feng --- kernel/entry/common.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -17,6 +17,14 @@ void __weak arch_do_signal_or_restart(st #define EXIT_TO_USER_MODE_WORK_LOOP (EXIT_TO_USER_MODE_WORK) #endif =20 +/* TIF bits, which prevent a time slice extension. 
*/ +#ifdef CONFIG_PREEMPT_RT +# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED_LAZY) +#else +# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY) +#endif +#define TIF_SLICE_EXT_DENY (EXIT_TO_USER_MODE_WORK & ~TIF_SLICE_EXT_SCHED) + static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_re= gs *regs, unsigned long ti_work) { @@ -28,8 +36,10 @@ static __always_inline unsigned long __e =20 local_irq_enable_exit_to_user(ti_work); =20 - if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) - schedule(); + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) { + if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY)) + schedule(); + } =20 if (ti_work & _TIF_UPROBE) uprobe_notify_resume(regs); From nobody Wed Sep 10 01:55:31 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 668453054E4; Mon, 8 Sep 2025 23:00:19 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372421; cv=none; b=JtB/kv+MoSzM0gc+PMRC15KqrXKmZCU9LCDgr6bqdYwlB0C7YR6Y1V3Po7T2WkMEZVLhUPJAk/9BFcIEJFGjII9qdVSTxo/Zi2OUD471CBh1hBp6N4WY47GjJOYQlRaH6LmWGdIwUz1q2Klpvj6+njAP6B0zUh85mQ9LcyJ/YRY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757372421; c=relaxed/simple; bh=TFBnJqoHxYkjFAybe9EyMqjGGU61dfvgNqTEKh7olIo=; h=Message-ID:From:To:Subject:References:MIME-Version:Content-Type: cc:Date; b=bhmW1yfnKroejP6Nw2NRmpcmbq2tvwE5Z6yxU5UO5QS8wCtIv7uPo23s6hoKGL9+8WVLGGgsk4Kq/EAtPmVGs/wtfzaBvb7waUoqTOSVbON9mKmVPGWPCWyIcrRqAUdFJusLxZuiQFmMXC1kWXkaYMta4FkCzCSalfBgWwsb6+s= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; 
dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=eEzwGzN+; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=q+D/xV0G; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="eEzwGzN+"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="q+D/xV0G" Message-ID: <20250908225753.332052396@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1757372417; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=g114yMIHJ/N+lL2K1FcdTSWH6GyatJrycQTW2sArYtE=; b=eEzwGzN+zl7bNel84uq3ip/lLOcZjZ4StNjfhW09yqAVUkUwBfYwL5sr9ydM/NWBtYgOXs xLQPVTRKDvH0JZ2BhuWKy7Kb8jdd9VVN/Ad06OsiQ+9nljNxGGTIYviXsrr23X9vCjqYl/ +/mdTTnpAUAMWx19riUHSXdYFbtm3qJ0G54gEgUIQ8rvhRo47XiMOqSFSY8HXT3Bdt+zRu jLyUhsD55wA6LqkCVzY52TmvLkeVB5uDwlo0tl5Rjs6hFyxkj2FOh6fbiqmtNGgwxUgZ/e qwDbq9/d5ueJ2yCptNdRXCx6EDwWPkHx0dy+FR0Kz3Awm0PiJ3/y1NnUVfFidA== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1757372417; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=g114yMIHJ/N+lL2K1FcdTSWH6GyatJrycQTW2sArYtE=; b=q+D/xV0GrEUUu4dbzRFzl4eR36ktHIU8Fx4UohHebnVQsg/CjNmtOAqeBp+AbDs9+zEbfr eb3rWllkAYxZMRAA== From: Thomas Gleixner To: LKML Subject: [patch 12/12] selftests/rseq: Implement time slice extension test References: <20250908225709.144709889@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: 
List-Unsubscribe: MIME-Version: 1.0 cc: Peter Zilstra , Peter Zijlstra , Mathieu Desnoyers , "Paul E. McKenney" , Boqun Feng , Jonathan Corbet , Prakash Sangappa , Madadi Vineeth Reddy , K Prateek Nayak , Steven Rostedt , Sebastian Andrzej Siewior , Arnd Bergmann , linux-arch@vger.kernel.org Date: Tue, 9 Sep 2025 01:00:16 +0200 (CEST) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Provide an initial test case to evaluate the functionality. This needs to be extended to cover the ABI violations and expose the race condition between observing granted and arriving in rseq_slice_yield(). Signed-off-by: Thomas Gleixner --- tools/testing/selftests/rseq/.gitignore | 1 tools/testing/selftests/rseq/Makefile | 5 tools/testing/selftests/rseq/rseq-abi.h | 2 tools/testing/selftests/rseq/slice_test.c | 217 ++++++++++++++++++++++++++++++ 4 files changed, 224 insertions(+), 1 deletion(-) --- a/tools/testing/selftests/rseq/.gitignore +++ b/tools/testing/selftests/rseq/.gitignore @@ -10,3 +10,4 @@ param_test_mm_cid param_test_mm_cid_benchmark param_test_mm_cid_compare_twice syscall_errors_test +slice_test --- a/tools/testing/selftests/rseq/Makefile +++ b/tools/testing/selftests/rseq/Makefile @@ -17,7 +17,7 @@ OVERRIDE_TARGETS = 1 TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \ param_test_benchmark param_test_compare_twice param_test_mm_cid \ param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \ - syscall_errors_test + syscall_errors_test slice_test TEST_GEN_PROGS_EXTENDED = librseq.so @@ -59,3 +59,6 @@ include ../lib.mk $(OUTPUT)/syscall_errors_test: syscall_errors_test.c $(TEST_GEN_PROGS_EXTENDED) \ rseq.h rseq-*.h $(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@ + +$(OUTPUT)/slice_test: slice_test.c $(TEST_GEN_PROGS_EXTENDED) rseq.h rseq-*.h + $(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@ --- a/tools/testing/selftests/rseq/rseq-abi.h +++
b/tools/testing/selftests/rseq/rseq-abi.h @@ -164,6 +164,8 @@ struct rseq_abi { */ __u32 mm_cid; + __u32 slice_ctrl; + /* * Flexible array member at end of structure, after last feature field. */ --- /dev/null +++ b/tools/testing/selftests/rseq/slice_test.c @@ -0,0 +1,217 @@ +// SPDX-License-Identifier: LGPL-2.1 +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include "rseq.h" + +#include "../kselftest_harness.h" + +#ifndef __NR_rseq_slice_yield +# define __NR_rseq_slice_yield 470 +#endif + +#define BITS_PER_INT 32 +#define BITS_PER_BYTE 8 + +#ifndef PR_RSEQ_SLICE_EXTENSION +# define PR_RSEQ_SLICE_EXTENSION 79 +# define PR_RSEQ_SLICE_EXTENSION_GET 1 +# define PR_RSEQ_SLICE_EXTENSION_SET 2 +# define PR_RSEQ_SLICE_EXT_ENABLE 0x01 +#endif + +#ifndef RSEQ_SLICE_EXT_REQUEST_BIT +# define RSEQ_SLICE_EXT_REQUEST_BIT 0 +# define RSEQ_SLICE_EXT_GRANTED_BIT 1 +#endif + +#ifndef asm_inline +# define asm_inline asm __inline +#endif + +#if defined(__x86_64__) || defined(__i386__) +static __always_inline bool test_and_clear_request(unsigned int *addr) +{ + const unsigned int bit = RSEQ_SLICE_EXT_REQUEST_BIT; + bool res; + + asm inline volatile("btrl %[__bit], %[__addr]\n" + : [__addr] "+m" (*addr), "=@cc" "c" (res) + : [__bit] "Ir" (bit) + : "memory"); + return res; +} +#else +static __always_inline bool test_and_clear_request(unsigned int *addr) +{ + const unsigned int mask = (1U << RSEQ_SLICE_EXT_REQUEST_BIT); + + return __atomic_fetch_and(addr, ~mask, __ATOMIC_RELAXED) & mask; +} +#endif + +static __always_inline void set_request(unsigned int *addr) +{ + *addr = 1U << RSEQ_SLICE_EXT_REQUEST_BIT; +} + +static __always_inline bool test_granted(unsigned int *addr) +{ + return !!(*addr & (1U << RSEQ_SLICE_EXT_GRANTED_BIT)); +} + +#define NSEC_PER_SEC 1000000000L +#define NSEC_PER_USEC 1000L + +struct noise_params { + int noise_nsecs; + int sleep_nsecs; + int run; +}; +
+FIXTURE(slice_ext) +{ + pthread_t noise_thread; + struct noise_params noise_params; +}; + +FIXTURE_VARIANT(slice_ext) +{ + int64_t total_nsecs; + int slice_nsecs; + int noise_nsecs; + int sleep_nsecs; +}; + +FIXTURE_VARIANT_ADD(slice_ext, n2_2_50) +{ + .total_nsecs = 5 * NSEC_PER_SEC, + .slice_nsecs = 2 * NSEC_PER_USEC, + .noise_nsecs = 2 * NSEC_PER_USEC, + .sleep_nsecs = 50 * NSEC_PER_USEC, +}; + +FIXTURE_VARIANT_ADD(slice_ext, n50_2_50) +{ + .total_nsecs = 5 * NSEC_PER_SEC, + .slice_nsecs = 50 * NSEC_PER_USEC, + .noise_nsecs = 2 * NSEC_PER_USEC, + .sleep_nsecs = 50 * NSEC_PER_USEC, +}; + +static inline bool elapsed(struct timespec *start, struct timespec *now, + int64_t span) +{ + int64_t delta = now->tv_sec - start->tv_sec; + + delta *= NSEC_PER_SEC; + delta += now->tv_nsec - start->tv_nsec; + return delta >= span; +} + +static void *noise_thread(void *arg) +{ + struct noise_params *p = arg; + + while (RSEQ_READ_ONCE(p->run)) { + struct timespec ts_start, ts_now; + + clock_gettime(CLOCK_MONOTONIC, &ts_start); + do { + clock_gettime(CLOCK_MONOTONIC, &ts_now); + } while (!elapsed(&ts_start, &ts_now, p->noise_nsecs)); + + ts_start.tv_sec = 0; + ts_start.tv_nsec = p->sleep_nsecs; + clock_nanosleep(CLOCK_MONOTONIC, 0, &ts_start, NULL); + } + return NULL; +} + +FIXTURE_SETUP(slice_ext) +{ + cpu_set_t affinity; + + ASSERT_EQ(sched_getaffinity(0, sizeof(affinity), &affinity), 0); + + /* Pin it on a single CPU.
Avoid CPU 0 */ + for (int i = 1; i < CPU_SETSIZE; i++) { + if (!CPU_ISSET(i, &affinity)) + continue; + + CPU_ZERO(&affinity); + CPU_SET(i, &affinity); + ASSERT_EQ(sched_setaffinity(0, sizeof(affinity), &affinity), 0); + break; + } + + ASSERT_EQ(rseq_register_current_thread(), 0); + + ASSERT_EQ(prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET, + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0), 0); + + self->noise_params.noise_nsecs = variant->noise_nsecs; + self->noise_params.sleep_nsecs = variant->sleep_nsecs; + self->noise_params.run = 1; + + ASSERT_EQ(pthread_create(&self->noise_thread, NULL, noise_thread, &self->noise_params), 0); +} + +FIXTURE_TEARDOWN(slice_ext) +{ + self->noise_params.run = 0; + pthread_join(self->noise_thread, NULL); +} + +TEST_F(slice_ext, slice_test) +{ + unsigned long success = 0, yielded = 0, scheduled = 0, raced = 0; + struct rseq_abi *rs = rseq_get_abi(); + struct timespec ts_start, ts_now; + + ASSERT_NE(rs, NULL); + + clock_gettime(CLOCK_MONOTONIC, &ts_start); + do { + struct timespec ts_cs; + + clock_gettime(CLOCK_MONOTONIC, &ts_cs); + + set_request(&rs->slice_ctrl); + do { + clock_gettime(CLOCK_MONOTONIC, &ts_now); + } while (!elapsed(&ts_cs, &ts_now, variant->slice_nsecs)); + + if (!test_and_clear_request(&rs->slice_ctrl)) { + if (test_granted(&rs->slice_ctrl)) { + yielded++; + if (!syscall(__NR_rseq_slice_yield)) + raced++; + } else { + scheduled++; + } + } else { + success++; + } + + clock_gettime(CLOCK_MONOTONIC, &ts_now); + } while (!elapsed(&ts_start, &ts_now, variant->total_nsecs)); + + printf("# Success %12lu\n", success); + printf("# Yielded %12lu\n", yielded); + printf("# Scheduled %12lu\n", scheduled); + printf("# Raced %12lu\n", raced); +} + +TEST_HARNESS_MAIN