[patch V2 00/12] rseq: Implement time slice extension mechanism

Thomas Gleixner posted 12 patches 3 months, 2 weeks ago
There is a newer version of this series
[patch V2 00/12] rseq: Implement time slice extension mechanism
Posted by Thomas Gleixner 3 months, 2 weeks ago
This is a follow up on the V1 version:

     https://lore.kernel.org/20250908225709.144709889@linutronix.de

Time slice extensions are an attempt to provide opportunistic priority
ceiling without the overhead of an actual priority ceiling protocol, but
also without the guarantees such a protocol provides.

The intent is to avoid situations where a user space thread is interrupted
in a critical section and scheduled out while holding a resource on which
the preempting thread or other threads in the system might block. In the
worst case that prevents those threads from making progress for at least a
full time slice. This matters especially for user space spinlocks, which
are a patently bad idea to begin with, but it is also true for other
mechanisms.

This series uses the existing RSEQ user memory to implement it.

Changes vs. V1:

   - Rebase on the newest RSEQ and uaccess changes

   - Use separate bytes for request and grant and lift the atomic operation
     requirement for user space - Mathieu

   - Kconfig indentation, fix typos and expressions - Randy

   - Provide an extra stub for the !RSEQ case - Prateek

   - Use the proper name in sys_ni.c and add comment - Prateek

   - Return 1 from __setup() - Prateek
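
As a rough illustration of the split request and grant bytes mentioned in
the list above, user space might bracket a short critical section as
sketched below. This is only a sketch under stated assumptions: struct
slice_ctrl, slice_begin(), slice_end() and mock_slice_yield() are
hypothetical names invented for this example. They do not reflect the
actual field layout in include/uapi/linux/rseq.h or the real
sys_rseq_slice_yield() entry point, and the kernel side is simulated by
setting the grant byte by hand.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the request/grant bytes in the RSEQ area. */
struct slice_ctrl {
	volatile uint8_t request;	/* written by user space */
	volatile uint8_t granted;	/* written by the (simulated) kernel */
};

/* Number of times the mock yield was invoked. */
static int yield_calls;

/* Hypothetical stand-in for the yield syscall; it only counts calls. */
static void mock_slice_yield(void)
{
	yield_calls++;
}

/* Announce the start of a short critical section. Plain store, no atomics. */
static void slice_begin(struct slice_ctrl *sc)
{
	sc->request = 1;
}

/*
 * Leave the critical section. If an extension was granted in the
 * meantime, give the CPU back right away. Returns true if it yielded.
 */
static bool slice_end(struct slice_ctrl *sc)
{
	sc->request = 0;
	if (!sc->granted)
		return false;
	/*
	 * Assumption of this mock only: the grant byte is cleared here.
	 * Who clears it in the real ABI is not something this sketch asserts.
	 */
	sc->granted = 0;
	mock_slice_yield();
	return true;
}
```

With the grant byte left at zero, slice_begin() followed by slice_end()
returns false and issues no yield. Setting the grant byte in between, as
the kernel would when it defers a preemption, makes slice_end() return
true after exactly one mock yield.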


The uaccess and RSEQ modifications on which this series is based can be
found here:

    https://lore.kernel.org/20251022104005.907410538@linutronix.de/

and in git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/cid

For your convenience all of it is also available as a conglomerate from
git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice

Thanks,

	tglx
---
Peter Zijlstra (1):
      sched: Provide and use set_need_resched_current()

Thomas Gleixner (11):
      rseq: Add fields and constants for time slice extension
      rseq: Provide static branch for time slice extensions
      rseq: Add statistics for time slice extensions
      rseq: Add prctl() to enable time slice extensions
      rseq: Implement sys_rseq_slice_yield()
      rseq: Implement syscall entry work for time slice extensions
      rseq: Implement time slice extension enforcement timer
      rseq: Reset slice extension when scheduled
      rseq: Implement rseq_grant_slice_extension()
      entry: Hook up rseq time slice extension
      selftests/rseq: Implement time slice extension test

 Documentation/userspace-api/index.rst       |    1 
 Documentation/userspace-api/rseq.rst        |  118 ++++++++++
 arch/alpha/kernel/syscalls/syscall.tbl      |    1 
 arch/arm/tools/syscall.tbl                  |    1 
 arch/arm64/tools/syscall_32.tbl             |    1 
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 
 arch/s390/kernel/syscalls/syscall.tbl       |    1 
 arch/s390/mm/pfault.c                       |    3 
 arch/sh/kernel/syscalls/syscall.tbl         |    1 
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 
 include/linux/entry-common.h                |    2 
 include/linux/rseq.h                        |   11 +
 include/linux/rseq_entry.h                  |  190 ++++++++++++++++-
 include/linux/rseq_types.h                  |   28 ++
 include/linux/sched.h                       |    7 
 include/linux/syscalls.h                    |    1 
 include/linux/thread_info.h                 |   16 -
 include/uapi/asm-generic/unistd.h           |    5 
 include/uapi/linux/prctl.h                  |   10 
 include/uapi/linux/rseq.h                   |   38 +++
 init/Kconfig                                |   12 +
 kernel/entry/common.c                       |   14 +
 kernel/entry/syscall-common.c               |   11 -
 kernel/rcu/tiny.c                           |    8 
 kernel/rcu/tree.c                           |   14 -
 kernel/rcu/tree_exp.h                       |    3 
 kernel/rcu/tree_plugin.h                    |    9 
 kernel/rcu/tree_stall.h                     |    3 
 kernel/rseq.c                               |  304 ++++++++++++++++++++++++++++
 kernel/sys.c                                |    6 
 kernel/sys_ni.c                             |    1 
 scripts/syscall.tbl                         |    1 
 tools/testing/selftests/rseq/.gitignore     |    1 
 tools/testing/selftests/rseq/Makefile       |    5 
 tools/testing/selftests/rseq/rseq-abi.h     |   27 ++
 tools/testing/selftests/rseq/slice_test.c   |  198 ++++++++++++++++++
 45 files changed, 1011 insertions(+), 52 deletions(-)
Re: [patch V2 00/12] rseq: Implement time slice extension mechanism
Posted by Sebastian Andrzej Siewior 3 months, 1 week ago
On 2025-10-22 14:57:28 [+0200], Thomas Gleixner wrote:
> Time slice extensions are an attempt to provide opportunistic priority
> ceiling without the overhead of an actual priority ceiling protocol, but
> also without the guarantees such a protocol provides.
> 
> The intent is to avoid situations where a user space thread is interrupted
> in a critical section and scheduled out while holding a resource on which
> the preempting thread or other threads in the system might block. In the
> worst case that prevents those threads from making progress for at least a
> full time slice. This matters especially for user space spinlocks, which
> are a patently bad idea to begin with, but it is also true for other
> mechanisms.

I've been playing with it a bit with RT enabled and started to debug
this:

|       slice_test-2903    [001] d.h..  2313.285439: local_timer_entry: vector=236
|       slice_test-2903    [001] d.h1.  2313.285440: hrtimer_cancel: hrtimer=000000000507e6d5
|       slice_test-2903    [001] d.h..  2313.285440: hrtimer_expire_entry: hrtimer=000000000507e6d5 function=tick_nohz_handler now=2313208001152
|       slice_test-2903    [001] d.h1.  2313.285449: sched_stat_runtime: comm=slice_test pid=2903 runtime=3982905 [ns]
|       slice_test-2903    [001] dlh..  2313.285452: softirq_raise: vec=7 [action=SCHED]
|       slice_test-2903    [001] dlh..  2313.285452: hrtimer_expire_exit: hrtimer=000000000507e6d5
|       slice_test-2903    [001] dlh1.  2313.285452: hrtimer_start: hrtimer=000000000507e6d5 function=tick_nohz_handler expires=2313212000000 softexpires=2313212000000 mode=ABS
|       slice_test-2903    [001] dlh..  2313.285453: local_timer_exit: vector=236
|       slice_test-2903    [001] dl.2.  2313.285453: sched_waking: comm=ksoftirqd/1 pid=32 prio=120 target_cpu=001
|       slice_test-2903    [001] dl.3.  2313.285456: sched_wakeup: comm=ksoftirqd/1 pid=32 prio=120 target_cpu=001
|       slice_test-2903    [001] d....  2313.285457: irqentry_exit: rseq_grant_slice_extension(216)

This grants the extension and removes the lazy wakeup. We are still on the
return from IRQ, but the 'h' flag has already been removed…

|       slice_test-2903    [001] d..1.  2313.285458: hrtimer_start: hrtimer=0000000030a688cc function=rseq_slice_expired expires=2313208047790 softexpires=2313208047790 mode=ABS|PINNED|HARD
|       slice_test-2903    [001] d....  2313.285458: __rseq_arm_slice_extension_timer: timer
|       slice_test-2903    [001] d..2.  2313.285484: hrtimer_cancel: hrtimer=0000000030a688cc
Extension granted, timer started, then revoked, and need-resched set.

|       slice_test-2903    [001] dN.2.  2313.285487: sched_stat_runtime: comm=slice_test pid=2903 runtime=36886 [ns]
This is coming from schedule() already. It took me a while since I was
hunting a missing clear of need-resched.

|       slice_test-2903    [001] d..2.  2313.285489: sched_switch: prev_comm=slice_test prev_pid=2903 prev_prio=120 prev_state=R+ ==> next_comm=ksoftirqd/1 next_pid=32 next_prio=120
|      ksoftirqd/1-32      [001] ..s.1  2313.285490: softirq_entry: vec=7 [action=SCHED]
|      ksoftirqd/1-32      [001] ..s.1  2313.285501: softirq_exit: vec=7 [action=SCHED]
|      ksoftirqd/1-32      [001] d..2.  2313.285502: sched_stat_runtime: comm=ksoftirqd/1 pid=32 runtime=16438 [ns]
|      ksoftirqd/1-32      [001] d..2.  2313.285503: sched_switch: prev_comm=ksoftirqd/1 prev_pid=32 prev_prio=120 prev_state=S ==> next_comm=slice_test next_pid=2904 next_prio=120
|       slice_test-2904    [001] .....  2313.285507: sys_enter: NR 230 (1, 0, 7f4692c7baa0, 0, 0, 0)
|       slice_test-2904    [001] .....  2313.285507: hrtimer_setup: hrtimer=00000000f2d53899 clockid=CLOCK_MONOTONIC mode=REL
|       slice_test-2904    [001] d..1.  2313.285507: hrtimer_start: hrtimer=00000000f2d53899 function=hrtimer_wakeup expires=2313208168792 softexpires=2313208118792 mode=REL
|       slice_test-2904    [001] d..2.  2313.285508: sched_stat_runtime: comm=slice_test pid=2904 runtime=6149 [ns]
|       slice_test-2904    [001] d..2.  2313.285510: sched_switch: prev_comm=slice_test prev_pid=2904 prev_prio=120 prev_state=S ==> next_comm=slice_test next_pid=2903 next_prio=120
|       slice_test-2903    [001] .....  2313.285510: sys_enter: NR 470 (7fffc04f1ff0, c350, 11a0e0, 0, 7f4692e99000, 0)

slice_test-2903 enters rseq_slice_yield() _now_, so it must have been in
userland during the suppressed wakeup at 2313.285457.
But a few iterations later it turned out that this trace event is recorded
_after_ the rseq magic happens at sys_enter time. We entered
rseq_slice_yield() a few cycles after the extension was granted. Buh.
So it seems to work as intended, but it is not obvious to tell from tracing
why it appears not to work.

Sebastian
Re: [patch V2 00/12] rseq: Implement time slice extension mechanism
Posted by Thomas Gleixner 3 months, 1 week ago
On Mon, Oct 27 2025 at 18:30, Sebastian Andrzej Siewior wrote:

> |       slice_test-2903    [001] d..2.  2313.285484: hrtimer_cancel: hrtimer=0000000030a688cc
> Extension granted, timer started, then revoked, and need-resched set.
>
> |       slice_test-2903    [001] dN.2.  2313.285487: sched_stat_runtime: comm=slice_test pid=2903 runtime=36886 [ns]
> This is coming from schedule() already. It took me a while since I was
> hunting a missing clear of need-resched.
>
> |       slice_test-2903    [001] d..2.  2313.285489: sched_switch: prev_comm=slice_test prev_pid=2903 prev_prio=120 prev_state=R+ ==> next_comm=ksoftirqd/1 next_pid=32 next_prio=120
> |      ksoftirqd/1-32      [001] ..s.1  2313.285490: softirq_entry: vec=7 [action=SCHED]
> |      ksoftirqd/1-32      [001] ..s.1  2313.285501: softirq_exit: vec=7 [action=SCHED]
> |      ksoftirqd/1-32      [001] d..2.  2313.285502: sched_stat_runtime: comm=ksoftirqd/1 pid=32 runtime=16438 [ns]
> |      ksoftirqd/1-32      [001] d..2.  2313.285503: sched_switch: prev_comm=ksoftirqd/1 prev_pid=32 prev_prio=120 prev_state=S ==> next_comm=slice_test next_pid=2904 next_prio=120
> |       slice_test-2904    [001] .....  2313.285507: sys_enter: NR 230 (1, 0, 7f4692c7baa0, 0, 0, 0)
> |       slice_test-2904    [001] .....  2313.285507: hrtimer_setup: hrtimer=00000000f2d53899 clockid=CLOCK_MONOTONIC mode=REL
> |       slice_test-2904    [001] d..1.  2313.285507: hrtimer_start: hrtimer=00000000f2d53899 function=hrtimer_wakeup expires=2313208168792 softexpires=2313208118792 mode=REL
> |       slice_test-2904    [001] d..2.  2313.285508: sched_stat_runtime: comm=slice_test pid=2904 runtime=6149 [ns]
> |       slice_test-2904    [001] d..2.  2313.285510: sched_switch: prev_comm=slice_test prev_pid=2904 prev_prio=120 prev_state=S ==> next_comm=slice_test next_pid=2903 next_prio=120
> |       slice_test-2903    [001] .....  2313.285510: sys_enter: NR 470 (7fffc04f1ff0, c350, 11a0e0, 0, 7f4692e99000, 0)
>
> slice_test-2903 enters rseq_slice_yield() _now_, so it must have been in
> userland during the suppressed wakeup at 2313.285457.
> But a few iterations later it turned out that this trace event is recorded
> _after_ the rseq magic happens at sys_enter time. We entered
> rseq_slice_yield() a few cycles after the extension was granted. Buh.
> So it seems to work as intended, but it is not obvious to tell from tracing
> why it appears not to work.

Tracing of the syscall happens _after_ syscall_trace_enter() invoked
rseq_syscall_enter_work() which canceled the timer and set
NEED_RESCHED. That immediately rescheduled _after_ the preempt enable:

  syscall()
    do_syscall_64()
      syscall_enter_from_user_mode() {
        syscall_enter_from_user_mode_work()
          syscall_trace_enter()
            rseq_syscall_enter_work()
              preempt_disable()
              hrtimer_try_to_cancel()
                remove_hrtimer()                <- tracepoint
              set_need_resched()
              preempt_enable()
                schedule()
           ...
           trace_sys_enter()                    <- tracepoint

Even if it did not reschedule immediately, the ordering would be
reversed.
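
The ordering in the call chain above can be reproduced with a small user
space mock. This is a sketch, not kernel code: trace(),
mock_rseq_syscall_enter_work(), mock_syscall_entry() and event_index() are
hypothetical helpers made up for this example. They only record event names
in the order in which the corresponding tracepoints would fire, assuming
the immediate reschedule case.

```c
#include <string.h>

#define MAX_EVENTS 8

/* Event names in emission order, standing in for the trace buffer. */
static const char *events[MAX_EVENTS];
static int nr_events;

/* Record one event name; silently drops events once the log is full. */
static void trace(const char *name)
{
	if (nr_events < MAX_EVENTS)
		events[nr_events++] = name;
}

/* Mock of the rseq entry work: cancel the slice timer, then reschedule. */
static void mock_rseq_syscall_enter_work(void)
{
	trace("hrtimer_cancel");	/* emitted while the timer is removed */
	trace("sched_switch");		/* reschedule after preempt enable */
}

/* Mock of the syscall entry path in the order of the call chain above. */
static void mock_syscall_entry(void)
{
	mock_rseq_syscall_enter_work();
	trace("sys_enter");		/* the syscall tracepoint fires last */
}

/* Return the position of an event in the log, or -1 if it is absent. */
static int event_index(const char *name)
{
	for (int i = 0; i < nr_events; i++) {
		if (!strcmp(events[i], name))
			return i;
	}
	return -1;
}
```

After one call to mock_syscall_entry() the log holds hrtimer_cancel,
sched_switch and sys_enter in that order, which is the sequence seen in the
trace: the timer cancellation and the context switch show up before the
sys_enter event of the very syscall that caused them.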

Thanks,

        tglx
Re: [patch V2 00/12] rseq: Implement time slice extension mechanism
Posted by Sebastian Andrzej Siewior 3 months, 1 week ago
On 2025-10-27 19:48:56 [+0100], Thomas Gleixner wrote:
> 
> Tracing of the syscall happens _after_ syscall_trace_enter() invoked
> rseq_syscall_enter_work() which canceled the timer and set
> NEED_RESCHED. That immediately rescheduled _after_ the preempt enable:
> 
>   syscall()
>     do_syscall_64()
>       syscall_enter_from_user_mode() {
>         syscall_enter_from_user_mode_work()
>           syscall_trace_enter()
>             rseq_syscall_enter_work()
>               preempt_disable()
>               hrtimer_try_to_cancel()
>                 remove_hrtimer()                <- tracepoint
>               set_need_resched()
>               preempt_enable()
>                 schedule()
>            ...
>            trace_sys_enter()                    <- tracepoint
> 
> Even if it did not reschedule immediately, the ordering would be
> reversed.

I know that now, after doing the tracing. But with only the sched events
it looked like the slice gets granted and usecs later scheduling
happens. Adding interrupts and syscalls kept pointing in the wrong
direction.
Maybe the lack of events here is okay if you know what you do and what
to expect in terms of available trace events.

In that spirit, I did test it and didn't find anything wrong with it ;)

> Thanks,
> 
>         tglx

Sebastian