This is the proper implementation of the PoC code, which I posted in reply
to the latest iteration of Prakash's time slice extension patches:
https://lore.kernel.org/all/87o6smb3a0.ffs@tglx
Time slice extensions are an attempt to provide opportunistic priority
ceiling without the overhead of an actual priority ceiling protocol, but
also without the guarantees such a protocol provides.
The intent is to avoid situations where a user space thread is interrupted
in a critical section and scheduled out while holding a resource on which
the preempting thread or other threads in the system might block. In the
worst case that obviously prevents those threads from making progress for
at least a full time slice. This is especially painful in the context of
user space spinlocks, which are a patently bad idea to begin with, but it
affects other mechanisms as well.
Attempts to solve this go back at least a decade, but so far none of them
went anywhere. The recent attempts, which started to integrate with the
already existing RSEQ mechanism, have at least been going in the right
direction. The full history is partially contained in the above mentioned
mail thread and its ancestors, but also in various threads in the LKML
archives, which require archaeological efforts to retrieve.
When trying to morph the PoC into actual mergeable code, I stumbled over
various shortcomings in the RSEQ code, which have been addressed in a
separate effort. The latest iteration can be found here:
https://lore.kernel.org/all/20250908212737.353775467@linutronix.de
That is a prerequisite for this series as it allows a tight integration
into the RSEQ code without inflicting a lot of extra overhead on the hot
paths.
The main change vs. the PoC and the previous attempts is that it utilizes
a new field in the user space ABI rseq struct, which allows reducing the
atomic operations in user space to a bare minimum. If the architecture
supports CPU local atomics, which protect against the obvious RMW race
vs. an interrupt, then no actual overhead, e.g. a LOCK prefix on x86, is
required.
The kernel/user space ABI consists of only two bits in this new field:

    REQUEST and GRANTED
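To illustrate the "no LOCK prefix required" point: the only writer which
can race against the user space RMW is the kernel acting on behalf of the
very same CPU on interrupt return, so a single unlocked BTR is sufficient
on x86 because an interrupt cannot hit in the middle of one instruction.
A minimal sketch; the bit positions, mask values and helper name below
are purely illustrative and not the final uapi definitions:

#include <stdbool.h>

/* Illustrative values only, not the final uapi definitions */
#define SLICE_CTRL_REQUEST_BIT  0
#define SLICE_CTRL_GRANTED_BIT  1
#define SLICE_CTRL_REQUEST      (1U << SLICE_CTRL_REQUEST_BIT)
#define SLICE_CTRL_GRANTED      (1U << SLICE_CTRL_GRANTED_BIT)

/*
 * CPU local test-and-clear of the REQUEST bit. BTR without LOCK prefix
 * cannot be torn by an interrupt on the same CPU, which is the only
 * concurrency this word is exposed to, so no bus lock is required.
 * Requires GCC/Clang flag output operands.
 */
static inline bool slice_test_and_clear_request(volatile unsigned int *ctrl)
{
        bool was_set;

        asm volatile("btrl %2, %1"
                     : "=@ccc" (was_set), "+m" (*ctrl)
                     : "Ir" (SLICE_CTRL_REQUEST_BIT)
                     : "memory");
        return was_set;
}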
User space sets REQUEST at the beginning of the critical section. If it
finishes the critical section without interruption then it can clear the
bit and move on.
If it is interrupted and the interrupt return path in the kernel observes a
rescheduling request, then the kernel can grant a time slice extension. The
kernel clears the REQUEST bit and sets the GRANTED bit with a simple
non-atomic store operation. If it does not grant the extension only the
REQUEST bit is cleared.
If user space observes the REQUEST bit cleared when it finishes the
critical section, then it has to check the GRANTED bit. If that is set,
then it has to invoke the rseq_slice_yield() syscall to terminate the
extension and yield the CPU.
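For illustration, a rough sketch of the kernel side decision described
above. This is not the actual implementation: the real code has to access
the user space rseq area through the proper user access machinery; the
plain pointer dereference merely keeps the sketch readable. The
SLICE_CTRL_* values are the illustrative ones from the sketch above:

#include <stdbool.h>

static void slice_grant_or_deny(unsigned int *slice_ctrl, bool can_grant)
{
        unsigned int ctrl = *slice_ctrl;

        /* Nothing to do if user space did not request an extension */
        if (!(ctrl & SLICE_CTRL_REQUEST))
                return;

        /*
         * A single plain store either converts the request into a grant
         * or denies it by just clearing REQUEST. No atomics required as
         * user space cannot run on this CPU at this point.
         */
        if (can_grant)
                *slice_ctrl = (ctrl & ~SLICE_CTRL_REQUEST) | SLICE_CTRL_GRANTED;
        else
                *slice_ctrl = ctrl & ~SLICE_CTRL_REQUEST;
}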
The code flow in user space is:
// Simple store as there is no concurrency vs. the GRANTED bit
rseq->slice_ctrl = REQUEST;

critical_section();

// CPU local atomic required here:
if (!test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
        // Non-atomic check is sufficient as this can race
        // against an interrupt, which revokes the grant
        //
        // If not set, then the request was either cleared by the kernel
        // without grant or the grant was revoked.
        //
        // If set, tell the kernel that the critical section is done
        // so it can reschedule
        if (rseq->slice_ctrl & GRANTED)
                rseq_slice_yield();
}
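Note that rseq_slice_yield() in the flow above is the new syscall, for
which no C library wrapper exists. User space therefore has to go through
syscall(2). Assuming the syscall number is exported as
__NR_rseq_slice_yield by the uapi headers of this series, a wrapper boils
down to:

#include <sys/syscall.h>
#include <unistd.h>

/* Terminate a granted extension and hand the CPU back to the kernel */
static inline long rseq_slice_yield(void)
{
        return syscall(__NR_rseq_slice_yield);
}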
The other details, which differ from earlier attempts and the PoC, are:
- A separate syscall for terminating the extension to avoid side
effects and overloading of the already ill-defined sched_yield(2)
- A separate per CPU timer, which again does not inflict side effects
on the scheduler internal hrtick timer. The hrtick timer can be
disabled at run-time and an expiry can cause interesting problems in
the scheduler code when it is unexpectedly invoked.
- Tight integration into the rseq exit to user mode code. It utilizes
the path when TIF_RSEQ is not set at the end of exit_to_user_mode()
to arm the timer if an extension was granted. TIF_RSEQ being set
indicates that the task was scheduled and the grant would therefore
be revoked anyway.
- A futile attempt to make this "work" on the PREEMPT_LAZY preemption
model, which is utilized by PREEMPT_RT.
It allows the extension to be granted when TIF_PREEMPT_LAZY is set,
but not when TIF_PREEMPT is set.
Pretending that this can be made to work for TIF_PREEMPT on a fully
preemptible kernel is just wishful thinking, as the chance that
TIF_PREEMPT is set in exit_to_user_mode() is close to zero for
obvious reasons.
This only "works" by some definition of works, i.e. on a best effort
basis, for the PREEMPT_NONE model and nothing else. Though given the
problems PREEMPT_NONE and also PREEMPT_VOLUNTARY have vs. long
running code sections, the days of these models should be hopefully
numbered and everything consolidated on the LAZY model.
That makes this distinction moot and restricts everything to
TIF_PREEMPT_LAZY, unless someone is crazy enough to inflict the slice
extension mechanism on the scheduler hotpath. I'm sure there will be
attempts to do that as there is no lack of crazy folks out there...
- Actual documentation of the user space ABI and an initial selftest.
The RSEQ modifications on which this series is based can be found here:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf
For your convenience all of it is also available as a conglomerate from
git:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
Thanks,
tglx
---
Documentation/userspace-api/index.rst | 1
Documentation/userspace-api/rseq.rst | 129 ++++++++++++
arch/alpha/kernel/syscalls/syscall.tbl | 1
arch/arm/tools/syscall.tbl | 1
arch/arm64/tools/syscall_32.tbl | 1
arch/m68k/kernel/syscalls/syscall.tbl | 1
arch/microblaze/kernel/syscalls/syscall.tbl | 1
arch/mips/kernel/syscalls/syscall_n32.tbl | 1
arch/mips/kernel/syscalls/syscall_n64.tbl | 1
arch/mips/kernel/syscalls/syscall_o32.tbl | 1
arch/parisc/kernel/syscalls/syscall.tbl | 1
arch/powerpc/kernel/syscalls/syscall.tbl | 1
arch/s390/kernel/syscalls/syscall.tbl | 1
arch/s390/mm/pfault.c | 3
arch/sh/kernel/syscalls/syscall.tbl | 1
arch/sparc/kernel/syscalls/syscall.tbl | 1
arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
arch/xtensa/kernel/syscalls/syscall.tbl | 1
include/linux/entry-common.h | 2
include/linux/rseq.h | 11 +
include/linux/rseq_entry.h | 176 ++++++++++++++++
include/linux/rseq_types.h | 28 ++
include/linux/sched.h | 7
include/linux/syscalls.h | 1
include/linux/thread_info.h | 16 -
include/uapi/asm-generic/unistd.h | 5
include/uapi/linux/prctl.h | 10
include/uapi/linux/rseq.h | 28 ++
init/Kconfig | 12 +
kernel/entry/common.c | 14 +
kernel/entry/syscall-common.c | 11 -
kernel/rcu/tiny.c | 8
kernel/rcu/tree.c | 14 -
kernel/rcu/tree_exp.h | 3
kernel/rcu/tree_plugin.h | 9
kernel/rcu/tree_stall.h | 3
kernel/rseq.c | 293 ++++++++++++++++++++++++++++
kernel/sys.c | 6
kernel/sys_ni.c | 1
scripts/syscall.tbl | 1
tools/testing/selftests/rseq/.gitignore | 1
tools/testing/selftests/rseq/Makefile | 5
tools/testing/selftests/rseq/rseq-abi.h | 2
tools/testing/selftests/rseq/slice_test.c | 217 ++++++++++++++++++++
45 files changed, 991 insertions(+), 42 deletions(-)