[patch 00/12] rseq: Implement time slice extension mechanism

Thomas Gleixner posted 12 patches 5 months ago
There is a newer version of this series
[patch 00/12] rseq: Implement time slice extension mechanism
Posted by Thomas Gleixner 5 months ago
This is the proper implementation of the PoC code, which I posted in reply
to the latest iteration of Prakash's time slice extension patches:

     https://lore.kernel.org/all/87o6smb3a0.ffs@tglx

Time slice extensions are an attempt to provide opportunistic priority
ceiling without the overhead of an actual priority ceiling protocol, but
also without the guarantees such a protocol provides.

The intent is to avoid situations where a user space thread is interrupted
in a critical section and scheduled out while holding a resource on which
the preempting thread or other threads in the system might block. That
obviously prevents those threads from making progress, in the worst case
for at least a full time slice. This is especially problematic in the
context of user space spinlocks, which are a patently bad idea to begin
with, but it also holds for other mechanisms.

Attempts to solve this go back at least a decade, but so far they went
nowhere. The recent attempts, which started to integrate with the already
existing RSEQ mechanism, have at least been going in the right
direction. The full history is partially in the above mentioned mail thread
and its ancestors, but also in various threads in the LKML archives, which
require archaeological efforts to retrieve.

When trying to morph the PoC into actual mergeable code, I stumbled over
various shortcomings in the RSEQ code, which have been addressed in a
separate effort. The latest iteration can be found here:

     https://lore.kernel.org/all/20250908212737.353775467@linutronix.de

That is a prerequisite for this series as it allows a tight integration
into the RSEQ code without inflicting a lot of extra overhead on the hot
paths.

The main change vs. the PoC and the previous attempts is that it utilizes a
new field in the user space ABI rseq struct, which makes it possible to
reduce the atomic operations in user space to a bare minimum. If the
architecture supports CPU local atomics, which protect against the obvious
RMW race vs. an interrupt, then there is no actual overhead required,
e.g. no LOCK prefix on x86.

The kernel user space ABI consists only of two bits in this new field:

	REQUEST and GRANTED

User space sets REQUEST at the beginning of the critical section. If it
finishes the critical section without interruption, then it can clear the
bit and move on.

If it is interrupted and the interrupt return path in the kernel observes a
rescheduling request, then the kernel can grant a time slice extension. The
kernel clears the REQUEST bit and sets the GRANTED bit with a simple
non-atomic store operation. If it does not grant the extension, only the
REQUEST bit is cleared.

If user space observes the REQUEST bit cleared when it finishes the
critical section, then it has to check the GRANTED bit. If that is set,
then it has to invoke the rseq_slice_yield() syscall to terminate the
extension and yield the CPU.

The code flow in user space is:

	// Simple store as there is no concurrency vs. the GRANTED bit
	rseq->slice_ctrl = REQUEST;

	critical_section();

	// CPU local atomic required here:
	if (!test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
		// Non-atomic check is sufficient as this can race
		// against an interrupt, which revokes the grant
		//
		// If not set, then the request was either cleared by the kernel
		// without grant or the grant was revoked.
		//
		// If set, tell the kernel that the critical section is done
		// so it can reschedule
		if (rseq->slice_ctrl & GRANTED)
			rseq_slice_yield();
	}

The other details, which differ from earlier attempts and the PoC, are:

    - A separate syscall for terminating the extension to avoid side
      effects and overloading of the already ill-defined sched_yield(2)

    - A separate per CPU timer, which again does not inflict side effects
      on the scheduler internal hrtick timer. The hrtick timer can be
      disabled at run-time and an expiry can cause interesting problems in
      the scheduler code when it is unexpectedly invoked.

    - Tight integration into the rseq exit to user mode code. It utilizes
      the path when TIF_RSEQ is not set at the end of exit_to_user_mode()
      to arm the timer if an extension was granted. TIF_RSEQ indicates that
      the task was scheduled and therefore would revoke the grant anyway.

    - A futile attempt to make this "work" on the PREEMPT_LAZY preemption
      model which is utilized by PREEMPT_RT.

      It allows the extension to be granted when TIF_PREEMPT_LAZY is set,
      but not TIF_PREEMPT.

      Pretending that this can be made to work for TIF_PREEMPT on a fully
      preemptible kernel is just wishful thinking, as the chance that
      TIF_PREEMPT is set in exit_to_user_mode() is close to zero for
      obvious reasons.

      This only "works" by some definition of works, i.e. on a best effort
      basis, for the PREEMPT_NONE model and nothing else. Though given the
      problems PREEMPT_NONE and also PREEMPT_VOLUNTARY have vs. long
      running code sections, the days of these models are hopefully
      numbered and everything will be consolidated on the LAZY model.

      That makes this distinction moot and restricts everything to
      TIF_PREEMPT_LAZY, unless someone is crazy enough to inflict the slice
      extension mechanism on the scheduler hotpath. I'm sure there will
      be attempts to do that as there is no lack of crazy folks out
      there...

    - Actual documentation of the user space ABI and an initial self test.

The RSEQ modifications on which this series is based can be found here:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf

For your convenience all of it is also available as a conglomerate from
git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice

Thanks,

	tglx
---
 Documentation/userspace-api/index.rst       |    1 
 Documentation/userspace-api/rseq.rst        |  129 ++++++++++++
 arch/alpha/kernel/syscalls/syscall.tbl      |    1 
 arch/arm/tools/syscall.tbl                  |    1 
 arch/arm64/tools/syscall_32.tbl             |    1 
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 
 arch/s390/kernel/syscalls/syscall.tbl       |    1 
 arch/s390/mm/pfault.c                       |    3 
 arch/sh/kernel/syscalls/syscall.tbl         |    1 
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 
 include/linux/entry-common.h                |    2 
 include/linux/rseq.h                        |   11 +
 include/linux/rseq_entry.h                  |  176 ++++++++++++++++
 include/linux/rseq_types.h                  |   28 ++
 include/linux/sched.h                       |    7 
 include/linux/syscalls.h                    |    1 
 include/linux/thread_info.h                 |   16 -
 include/uapi/asm-generic/unistd.h           |    5 
 include/uapi/linux/prctl.h                  |   10 
 include/uapi/linux/rseq.h                   |   28 ++
 init/Kconfig                                |   12 +
 kernel/entry/common.c                       |   14 +
 kernel/entry/syscall-common.c               |   11 -
 kernel/rcu/tiny.c                           |    8 
 kernel/rcu/tree.c                           |   14 -
 kernel/rcu/tree_exp.h                       |    3 
 kernel/rcu/tree_plugin.h                    |    9 
 kernel/rcu/tree_stall.h                     |    3 
 kernel/rseq.c                               |  293 ++++++++++++++++++++++++++++
 kernel/sys.c                                |    6 
 kernel/sys_ni.c                             |    1 
 scripts/syscall.tbl                         |    1 
 tools/testing/selftests/rseq/.gitignore     |    1 
 tools/testing/selftests/rseq/Makefile       |    5 
 tools/testing/selftests/rseq/rseq-abi.h     |    2 
 tools/testing/selftests/rseq/slice_test.c   |  217 ++++++++++++++++++++
 45 files changed, 991 insertions(+), 42 deletions(-)
Re: [patch 00/12] rseq: Implement time slice extension mechanism
Posted by Mathieu Desnoyers 4 months, 4 weeks ago
On 2025-09-08 18:59, Thomas Gleixner wrote:
> This is the proper implementation of the PoC code, which I posted in reply
> to the latest iteration of Prakash's time slice extension patches:
> 
>       https://lore.kernel.org/all/87o6smb3a0.ffs@tglx
> 
> Time slice extensions are an attempt to provide opportunistic priority
> ceiling without the overhead of an actual priority ceiling protocol, but
> also without the guarantees such a protocol provides.
> 
> The intent is to avoid situations where a user space thread is interrupted
> in a critical section and scheduled out, while holding a resource on which
> the preempting thread or other threads in the system might block on. That
> obviously prevents those threads from making progress in the worst case for
> at least a full time slice. Especially in the context of user space
> spinlocks, which are a patently bad idea to begin with, but that's also
> true for other mechanisms.
> 
> This has been attempted to solve at least for a decade, but so far this
> went nowhere.  The recent attempts, which started to integrate with the
> already existing RSEQ mechanism, have been at least going into the right
> direction. The full history is partially in the above mentioned mail thread
> and it's ancestors, but also in various threads in the LKML archives, which

it's -> its

> require archaeological efforts to retrieve.
> 
> When trying to morph the PoC into actual mergeable code, I stumbled over
> various shortcomings in the RSEQ code, which have been addressed in a
> separate effort. The latest iteration can be found here:
> 
>       https://lore.kernel.org/all/20250908212737.353775467@linutronix.de
> 
> That is a prerequisite for this series as it allows a tight integration
> into the RSEQ code without inflicting a lot of extra overhead into the hot
> paths.
> 
> The main change vs. the PoC and the previous attempts is that it utilizes a
> new field in the user space ABI rseq struct, which allows to reduce the
> atomic operations in user space to a bare minimum. If the architecture
> supports CPU local atomics, which protect against the obvious RMW race
> vs. an interrupt, then there is no actual overhead, e.g. LOCK prefix on
> x86, required.

Good!

> 
> The kernel user space ABI consists only of two bits in this new field:
> 
> 	REQUEST and GRANTED
> 
> User space sets REQUEST at the begin of the critical section. If it

beginning

> finishes the critical section without interruption then it can clear the
> bit and move on.
> 
> If it is interrupted and the interrupt return path in the kernel observes a
> rescheduling request, then the kernel can grant a time slice extension. The
> kernel clears the REQUEST bit and sets the GRANTED bit with a simple
> non-atomic store operation. If it does not grant the extension only the
> REQUEST bit is cleared.
> 
> If user space observes the REQUEST bit cleared, when it finished the
> critical section, then it has to check the GRANTED bit. If that is set,
> then it has to invoke the rseq_slice_yield() syscall to terminate the

Does it "have" to ? What is the consequence of misbehaving ?

> extension and yield the CPU.
> 
> The code flow in user space is:
> 
>     	  // Simple store as there is no concurrency vs. the GRANTED bit
>        	  rseq->slice_ctrl = REQUEST;
> 
> 	  critical_section();
> 
> 	  // CPU local atomic required here:
> 	  if (!test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
> 	     	// Non-atomic check is sufficient as this can race
> 		// against an interrupt, which revokes the grant
> 		//
> 		// If not set, then the request was either cleared by the kernel
> 		// without grant or the grant was revoked.
> 		//
> 		// If set, tell the kernel that the critical section is done
> 		// so it can reschedule
> 	  	if (rseq->slice_ctrl & GRANTED)
> 			rseq_slice_yield();

I wonder if we could achieve this without the cpu-local atomic, and
just rely on simple relaxed-atomic or volatile loads/stores and compiler
barriers in userspace. Let's say we have:

union {
	u16 slice_ctrl;
	struct {
		u8 rseq->slice_request;
		u8 rseq->slice_grant;
	};
};

With userspace doing:

rseq->slice_request = true;  /* WRITE_ONCE() */
barrier();
critical_section();
barrier();
rseq->slice_request = false; /* WRITE_ONCE() */
if (rseq->slice_grant)       /* READ_ONCE() */
   rseq_slice_yield();

In the kernel interrupt return path, if the kernel observes
"rseq->slice_request" set and "rseq->slice_grant" cleared,
it grants the extension and sets "rseq->slice_grant".

rseq_slice_yield() clears rseq->slice_grant.


> 	  }
> 
> The other details, which differ from earlier attempts and the PoC, are:
> 
>      - A separate syscall for terminating the extension to avoid side
>        effects and overloading of the already ill defined sched_yield(2)
> 
>      - A separate per CPU timer, which again does not inflict side effects
>        on the scheduler internal hrtick timer. The hrtick timer can be
>        disabled at run-time and an expiry can cause interesting problems in
>        the scheduler code when it is unexpectedly invoked.
> 
>      - Tight integration into the rseq exit to user mode code. It utilizes
>        the path when TIF_RESQ is not set at the end of exit_to_user_mode()

TIF_RSEQ

>        to arm the timer if an extension was granted. TIF_RSEQ indicates that
>        the task was scheduled and therefore would revoke the grant anyway.
> 
>      - A futile attempt to make this "work" on the PREEMPT_LAZY preemption
>        model which is utilized by PREEMPT_RT.

Can you clarify why this attempt is "futile" ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [patch 00/12] rseq: Implement time slice extension mechanism
Posted by Thomas Gleixner 4 months, 4 weeks ago
On Thu, Sep 11 2025 at 11:27, Mathieu Desnoyers wrote:
> On 2025-09-08 18:59, Thomas Gleixner wrote:
>> If it is interrupted and the interrupt return path in the kernel observes a
>> rescheduling request, then the kernel can grant a time slice extension. The
>> kernel clears the REQUEST bit and sets the GRANTED bit with a simple
>> non-atomic store operation. If it does not grant the extension only the
>> REQUEST bit is cleared.
>> 
>> If user space observes the REQUEST bit cleared, when it finished the
>> critical section, then it has to check the GRANTED bit. If that is set,
>> then it has to invoke the rseq_slice_yield() syscall to terminate the
>
> Does it "have" to ? What is the consequence of misbehaving ?

It receives SIGSEGV because that means that it did not follow the rules
and stuck an arbitrary syscall into the critical section.

> I wonder if we could achieve this without the cpu-local atomic, and
> just rely on simple relaxed-atomic or volatile loads/stores and compiler
> barriers in userspace. Let's say we have:
>
> union {
> 	u16 slice_ctrl;
> 	struct {
> 		u8 rseq->slice_request;
> 		u8 rseq->slice_grant;

Interesting way to define a struct member :)

> 	};
> };
>
> With userspace doing:
>
> rseq->slice_request = true;  /* WRITE_ONCE() */
> barrier();
> critical_section();
> barrier();
> rseq->slice_request = false; /* WRITE_ONCE() */
> if (rseq->slice_grant)       /* READ_ONCE() */
>    rseq_slice_yield();

That should work as it's strictly CPU local. Good point, now that you
said it, it's obvious :)

Let me rework it accordingly.

> In the kernel interrupt return path, if the kernel observes
> "rseq->slice_request" set and "rseq->slice_grant" cleared,
> it grants the extension and sets "rseq->slice_grant".

They can't be both set. If they are then user space fiddled with the
bits.

>>      - A futile attempt to make this "work" on the PREEMPT_LAZY preemption
>>        model which is utilized by PREEMPT_RT.
>
> Can you clarify why this attempt is "futile" ?

Because on RT, interrupts usually end up with TIF_PREEMPT set, either due
to softirqs or interrupt threads. And no, we don't want to
overcomplicate things right now to make it "work" for real-time tasks in
the first place, as that's just going to result in either endless
discussions or subtle latency problems or both. For now, allowing it for
the 'LAZY' case is good enough.

With the non-RT LAZY model that's not really a good idea either, because
when TIF_PREEMPT is set, then either the preempting task is in a RT
class or the to-be-preempted task has already overrun the LAZY granted
computation time and the scheduler sets TIF_PREEMPT to whack it over the
head.

Thanks,

        tglx
Re: [patch 00/12] rseq: Implement time slice extension mechanism
Posted by Mathieu Desnoyers 4 months, 4 weeks ago
On 2025-09-11 16:18, Thomas Gleixner wrote:
> On Thu, Sep 11 2025 at 11:27, Mathieu Desnoyers wrote:
>> On 2025-09-08 18:59, Thomas Gleixner wrote:
[...]
>> Does it "have" to ? What is the consequence of misbehaving ?
> 
> It receives SIGSEGV because that means that it did not follow the rules
> and stuck an arbitrary syscall into the critical section.

Not following the rules could also be done by just looping for a long
time in userspace within or after the critical section, in which case
the timer should catch it.

> 
>> I wonder if we could achieve this without the cpu-local atomic, and
>> just rely on simple relaxed-atomic or volatile loads/stores and compiler
>> barriers in userspace. Let's say we have:
>>
>> union {
>> 	u16 slice_ctrl;
>> 	struct {
>> 		u8 rseq->slice_request;
>> 		u8 rseq->slice_grant;
> 
> Interesting way to define a struct member :)

This goes with the usual warning "this code has never even been
remotely close to a compiler, so handle with care" ;-)

> 
>> 	};
>> };
>>
>> With userspace doing:
>>
>> rseq->slice_request = true;  /* WRITE_ONCE() */
>> barrier();
>> critical_section();
>> barrier();
>> rseq->slice_request = false; /* WRITE_ONCE() */
>> if (rseq->slice_grant)       /* READ_ONCE() */
>>     rseq_slice_yield();
> 
> That should work as it's strictly CPU local. Good point, now that you
> said it it's obvious :)
> 
> Let me rework it accordingly.

I have a few questions wrt the ABI here:

1) Do we expect the slice requests to be done from C and higher level
    languages or only from assembly ?

2) Slice requests are a good fit for locking. Locking typically
    has nesting ability.

    We should consider making the slice request ABI an 8-bit
    or 16-bit nesting counter to allow nesting of its users.

3) Slice requests are also a good fit for rseq critical sections.
    Of course someone could explicitly increment/decrement the
    slice request counter before/after the rseq critical sections, but
    I think we could do better there and integrate this directly within
    the struct rseq_cs as a new critical section flag. Basically, a
    critical section with this new RSEQ_CS_SLICE_REQUEST flag (or
    better name) set within its descriptor flags would behave as if
    the slice request counter is non-zero when preempted without
    requiring any extra instruction on the fast path. The only
    added overhead would be a check of the rseq->slice_grant flag
    when exiting the critical section to conditionally issue
    rseq_slice_yield().

    This point (3) is an optimization that could come as a future step
    if the overhead of incrementing the slice_request proves to be a
    bottleneck for rseq critical sections.

> 
>> In the kernel interrupt return path, if the kernel observes
>> "rseq->slice_request" set and "rseq->slice_grant" cleared,
>> it grants the extension and sets "rseq->slice_grant".
> 
> They can't be both set. If they are then user space fiddled with the
> bits.

Ah, yes, that's true if the kernel clears the slice_request when setting
the slice_grant.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [patch 00/12] rseq: Implement time slice extension mechanism
Posted by Thomas Gleixner 4 months, 4 weeks ago
On Fri, Sep 12 2025 at 08:33, Mathieu Desnoyers wrote:
> On 2025-09-11 16:18, Thomas Gleixner wrote:
>> It receives SIGSEGV because that means that it did not follow the rules
>> and stuck an arbitrary syscall into the critical section.
>
> Not following the rules could also be done by just looping for a long
> time in userspace within or after the critical section, in which case
> the timer should catch it.

It's pretty much impossible for the kernel to tell, without more
overhead, whether that's actually a violation of the rules or not.

The operation after the grant can be interrupted (without resulting in
scheduling), which is out of control of the task which got the extension
granted.

The timer is there to ensure that there is an upper bound to the grant
independent of the actual reason.

Going through a different syscall is an obvious deviation from the rule.

As far as I understood the earlier discussions, scheduler folks want to
enforce that because of PREEMPT_NONE semantics, where a randomly chosen
syscall might not result in an immediate reschedule because the work
which needs to be done takes arbitrary time to complete.

Though that's arguably not much different from

       syscall()
                -> tick -> NEED_RESCHED
        do_tons_of_work();
       exit_to_user()
          schedule();

except that in the slice extension case, the latency increases by the
slice extension time.

If we allow arbitrary syscalls to terminate the grant, then we need to
stick an immediate schedule() into the syscall entry work function. We'd
still need the separate yield() syscall to provide a side effect free
way of termination.

I have no strong opinions either way. Peter?

>>> rseq->slice_request = true;  /* WRITE_ONCE() */
>>> barrier();
>>> critical_section();
>>> barrier();
>>> rseq->slice_request = false; /* WRITE_ONCE() */
>>> if (rseq->slice_grant)       /* READ_ONCE() */
>>>     rseq_slice_yield();
>> 
>> That should work as it's strictly CPU local. Good point, now that you
>> said it it's obvious :)
>> 
>> Let me rework it accordingly.
>
> I have two questions wrt ABI here:
>
> 1) Do we expect the slice requests to be done from C and higher level
>     languages or only from assembly ?

It doesn't matter as long as the ordering is guaranteed.

> 2) Slice requests are a good fit for locking. Locking typically
>     has nesting ability.
>
>     We should consider making the slice request ABI a 8-bit
>     or 16-bit nesting counter to allow nesting of its users.

Making request a counter requires keeping request set when the
extension is granted. So the states would be:

     request    granted
     0          0               Neutral
     >0         0               Requested
     >=0        1               Granted

That should work.

Though I'm not really convinced that unconditionally embedding it into
random locking primitives is the right thing to do.

The extension only makes sense when the actual critical section is
small and likely to complete within the extension time, which is usually
only true for highly optimized code and not for general usage, where the
lock held section is arbitrarily long and might even result in syscalls,
even if the critical section itself does not have an obvious explicit
syscall embedded:

     lock(a)
        lock(b) <- Contention results in syscall

Same applies for library functions within a critical section.

That then immediately conflicts with the yield mechanism rules, because
the extension could have been granted _before_ the syscall happens, so
we'd have to remove that restriction too.

That said, we can make the ABI a counter and split the slice control
word into two u16. So the decision function would be:

     get_usr(ctrl);
     if (!ctrl.request)
     	return;
     ....
     ctrl.granted = 1;
     put_usr(ctrl);

Along with documentation explaining why nesting this should only be done
when you know what you are doing.

> 3) Slice requests are also a good fit for rseq critical sections.
>     Of course someone could explicitly increment/decrement the
>     slice request counter before/after the rseq critical sections, but
>     I think we could do better there and integrate this directly within
>     the struct rseq_cs as a new critical section flag. Basically, a
>     critical section with this new RSEQ_CS_SLICE_REQUEST flag (or
>     better name) set within its descriptor flags would behave as if
>     the slice request counter is non-zero when preempted without
>     requiring any extra instruction on the fast path. The only
>     added overhead would be a check of the rseq->slice_grant flag
>     when exiting the critical section to conditionally issue
>     rseq_slice_yield().

Plus checking first whether rseq->slice.request is actually zero,
i.e. whether the rseq critical section was the outermost one. If not,
you cannot invoke the yield even if granted is true, right?

But mixing state spaces is not really a good idea at all. Let's not go
there.

Also you'd make checking of rseq_cs unconditional, which means extra
work in the grant decision function as it would then have to do:

         if (!usr->slice.ctrl.request) {
            if (!usr->rseq_cs)
               return;
            if (!valid_ptr(usr->rseq_cs))
               goto die;
            if (!within(regs->ip, usr->rseq_cs.start_ip, usr->rseq_cs.offset))
               return;
            if (!(usr->rseq_cs.flags & REQUEST))
               return;
         }

IOW, we'd copy half of the rseq cs handling into that code.

Can we please keep it independent and simple?

Thanks,

        tglx
Re: [patch 00/12] rseq: Implement time slice extension mechanism
Posted by Mathieu Desnoyers 4 months, 3 weeks ago
[ For those just CC'd on this thread, the discussion is about time slice
   extension for userspace critical sections. We are specifically
   discussing the kernel ABI we plan to expose to userspace. ]

On 2025-09-12 12:31, Thomas Gleixner wrote:
> On Fri, Sep 12 2025 at 08:33, Mathieu Desnoyers wrote:
>> On 2025-09-11 16:18, Thomas Gleixner wrote:
>>> It receives SIGSEGV because that means that it did not follow the rules
>>> and stuck an arbitrary syscall into the critical section.
>>
>> Not following the rules could also be done by just looping for a long
>> time in userspace within or after the critical section, in which case
>> the timer should catch it.
> 
> It's pretty much impossible to tell for the kernel without more
> overhead, whether that's actually a violation of the rules or not.
> 
> The operation after the grant can be interrupted (without resulting in
> scheduling), which is out of control of the task which got the extension
> granted.
> 
> The timer is there to ensure that there is an upper bound to the grant
> independent of the actual reason.

If the worst side-effect of this feature is that the slice extension
is not granted when users misbehave, IMHO this would increase the
likelihood of adoption compared to failure modes that end up killing the
offending processes.

> 
> Going through a different syscall is an obvious deviation from the rule.

AFAIU, the grant is cleared when a signal handler is delivered, which
makes it OK for signals to issue system calls even if they are nested
on top of a granted extension critical section.

> 
> As far I understood the earlier discussions, scheduler folks want to
> enforce that because of PREEMPT_NONE semantics, where a randomly chosen
> syscall might not result in an immediate reschedule because the work,
> which needs to be done takes arbitrary time to complete.
> 
> Though that's arguably not much different from
> 
>         syscall()
>                  -> tick -> NEED_RESCHED
>          do_tons_of_work();
>         exit_to_user()
>            schedule();
> 
> except that in the slice extension case, the latency increases by the
> slice extension time.
> 
> If we allow arbitrary syscalls to terminate the grant, then we need to
> stick an immediate schedule() into the syscall entry work function. We'd
> still need the separate yield() syscall to provide a side effect free
> way of termination.
> 
> I have no strong opinions either way. Peter?

If it happens to not be too bothersome to allow arbitrary system calls
to act as implicit rseq_slice_yield() rather than result in a
segmentation fault, I think it would make this feature more widely
adopted.

Another scenario I have in mind is a userspace critical section that
would typically benefit from slice extension, but seldom needs to
issue a system call. In C and higher level languages, that can be
very much outside of the user's control: for example, accessing a
global-dynamic TLS variable located within a global-dynamic shared
object can trigger memory allocation under the hood on first
access.

Handling a syscall within a granted extension by killing the process
will likely reserve this feature to niche use-cases.

> 
>>>> rseq->slice_request = true;  /* WRITE_ONCE() */
>>>> barrier();
>>>> critical_section();
>>>> barrier();
>>>> rseq->slice_request = false; /* WRITE_ONCE() */
>>>> if (rseq->slice_grant)       /* READ_ONCE() */
>>>>      rseq_slice_yield();
>>>
>>> That should work as it's strictly CPU local. Good point, now that you
>>> said it it's obvious :)
>>>
>>> Let me rework it accordingly.
>>
>> I have two questions wrt ABI here:
>>
>> 1) Do we expect the slice requests to be done from C and higher level
>>      languages or only from assembly ?
> 
> It doesn't matter as long as the ordering is guaranteed.

OK, so I understand that you intend to target higher level languages
as well, which makes my second point (nesting) relevant.

> 
>> 2) Slice requests are a good fit for locking. Locking typically
>>      has nesting ability.
>>
>>      We should consider making the slice request ABI a 8-bit
>>      or 16-bit nesting counter to allow nesting of its users.
> 
> Making request a counter requires to keep request set when the
> extension is granted. So the states would be:
> 
>       request    granted
>       0          0               Neutral
>       >0         0               Requested
>       >=0        1               Granted

Yes.

> 
> That should work.
> 
> Though I'm not really convinced that unconditionally embedding it into
> random locking primitives is the right thing to do.

Me neither. I wonder what would be a good approach to integrate this
with locking APIs. Here are a few ideas, some worse than others:

- Extend pthread_mutexattr_t to set whether the mutex should be
   slice-extended. Downside: if a mutex has some long and some
   short critical sections, it's really a one-size-fits-all decision
   for all critical sections for that mutex.

- Extend the pthread_mutex_lock/trylock with new APIs to allow
   specifying whether slice-extension is needed for the upcoming critical
   section.

- Just let the pthread_mutex_lock caller explicitly request the
   slice extension *after* grabbing the lock. Downside: this opens
   a window of a few instructions where preemption can happen
   and slice extension would have been useful. Should we care ?
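
The third option could be wrapped in a tiny helper pair. This sketch
models the rseq slice fields with plain thread-local flags; all the
slice_* names are made up for illustration:

```c
#include <pthread.h>
#include <stdbool.h>

/*
 * Sketch of the "request after grabbing the lock" option. The slice_*
 * helpers are hypothetical; they model the stores to the rseq slice
 * control fields with thread-local flags instead of real rseq state.
 */
static _Thread_local bool slice_requested;
static _Thread_local bool slice_granted;

static void slice_request(void)
{
	slice_requested = true;		/* WRITE_ONCE(rseq->slice.request, 1) */
}

static void slice_release(void)
{
	slice_requested = false;	/* WRITE_ONCE(rseq->slice.request, 0) */
	if (slice_granted)		/* READ_ONCE(rseq->slice.granted)    */
		slice_granted = false;	/* rseq_slice_yield() in real code   */
}

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_counter;

static void short_critical_section(void)
{
	pthread_mutex_lock(&lock);
	slice_request();	/* <- preemption window between lock and here */
	shared_counter++;	/* short critical section */
	slice_release();
	pthread_mutex_unlock(&lock);
}
```

The gap between pthread_mutex_lock() and slice_request() is exactly the
downside mentioned in the last bullet: a preemption landing in those few
instructions still takes the unextended path.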

> 
> The extension only makes sense when the actual critical section is
> small and likely to complete within the extension time. That is usually
> only true for highly optimized code and not for general usage, where the
> lock-held section is arbitrarily long and might even result in syscalls
> even if the critical section itself does not have an obvious explicit
> syscall embedded:
> 
>       lock(a)
>          lock(b) <- Contention results in syscall

Nested locking is another scenario where we'd _typically_ want the
slice extension for the outer lock if it is expected to be a short
critical section. We might sometimes hit futex while the extension is
granted; in that case the grant should be cleared without killing the
process.

> 
> Same applies for library functions within a critical section.

Yes.

> 
> That then immediately conflicts with the yield mechanism rules, because
> the extension could have been granted _before_ the syscall happens, so
> we'd have to remove that restriction too.

Yes.

> 
> That said, we can make the ABI a counter and split the slice control
> word into two u16. So the decision function would be:
> 
>       get_usr(ctrl);
>       if (!ctrl.request)
>       	return;
>       ....
>       ctrl.granted = 1;
>       put_usr(ctrl);
> 
> Along with documentation why this should only be used nested when you
> know what you are doing.

Yes.

This would turn the end of critical section into a
decrement-and-test-for-zero. It's only when the request counter
decrements back to zero that userspace should handle the granted
flag and yield.

> 
>> 3) Slice requests are also a good fit for rseq critical sections.
>>      Of course someone could explicitly increment/decrement the
>>      slice request counter before/after the rseq critical sections, but
>>      I think we could do better there and integrate this directly within
>>      the struct rseq_cs as a new critical section flag. Basically, a
>>      critical section with this new RSEQ_CS_SLICE_REQUEST flag (or
>>      better name) set within its descriptor flags would behave as if
>>      the slice request counter is non-zero when preempted without
>>      requiring any extra instruction on the fast path. The only
>>      added overhead would be a check of the rseq->slice_grant flag
>>      when exiting the critical section to conditionally issue
>>      rseq_slice_yield().
> 
> Plus checking first whether rseq->slice.request is actually zero,
> i.e. whether the rseq critical section was the outermost one. If not,
> you cannot invoke the yield even if granted is true, right?

Right.

> 
> But mixing state spaces is not really a good idea at all. Let's not go
> there.

I agree, let's keep this (3) for later if there is a strong use-case
justifying the complexity.

What is important for right now though is to figure out the behavior
with respect to an ongoing rseq critical section when a time slice
extension is granted: is the rseq critical section aborted or does
it keep going on return to userspace ?

> 
> Also you'd make checking of rseq_cs unconditional, which means extra
> work in the grant decision function as it would then have to do:
> 
>           if (!usr->slice.ctrl.request) {
>              if (!usr->rseq_cs)
>                 return;
>              if (!valid_ptr(usr->rseq_cs))
>                 goto die;
>              if (!within(regs->ip, usr->rseq_cs.start_ip, usr->rseq_cs.offset))
>                 return;
>              if (!(usr->rseq_cs.flags & REQUEST))
>                 return;
>           }
> 
> IOW, we'd copy half of the rseq cs handling into that code.
> 
> Can we please keep it independent and simple?

Of course.

So in summary, here is my current understanding:

- It would be good to support nested slice-extension requests,

- It would be preferable to allow arbitrary system calls to
   cancel an ongoing slice-extension grant rather than kill the
   process if we want the slice-extension feature to be useful
   outside of niche use-cases.

Thoughts ?

Thanks,

Mathieu


> 
> Thanks,
> 
>          tglx


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [patch 00/12] rseq: Implement time slice extension mechanism
Posted by Thomas Gleixner 4 months, 3 weeks ago
On Fri, Sep 12 2025 at 15:26, Mathieu Desnoyers wrote:
> On 2025-09-12 12:31, Thomas Gleixner wrote:
>>> 2) Slice requests are a good fit for locking. Locking typically
>>>      has nesting ability.
>>>
>>>      We should consider making the slice request ABI a 8-bit
>>>      or 16-bit nesting counter to allow nesting of its users.
>> 
>> Making request a counter requires to keep request set when the
>> extension is granted. So the states would be:
>> 
>>       request    granted
>>       0          0               Neutral
>>       >0         0               Requested
>>       >=0        1               Granted
>

Second thoughts on this.

Such a scheme means that slice_ctrl.request must be read-only for the
kernel, because otherwise the user space decrement would need to be an
atomic dec_if_not_zero(). We just argued the one atomic operation away. :)

That means, the kernel can only set and clear Granted. That in turn
loses the information whether a slice extension was denied or revoked,
which was something the Oracle people wanted to have. I'm not sure
whether that was a functional or more an instrumentation feature.

But what's worse: this is a recipe for disaster, as it creates obviously
subtle and hard-to-debug ways to leak an increment, which means the
request would stay active forever, defeating the whole purpose.

And no, the kernel cannot keep track of the counter and observe whether
it became zero at some point or not. You surely could come up with a
convoluted scheme to work around that in the form of sequence counters or
whatever, but that just creates extra complexity for very dubious
value.

The point is that the time slice extension is just providing an
opportunistic priority ceiling mechanism with low overhead and without
guarantees.

Once a request is denied or a grant is revoked, the performance of that
particular operation goes south no matter what. Nesting does not help
there at all, which is a strong argument for using KISS as the primary
engineering principle here.

The boolean request/granted pair is simple and very well
defined. It does not suffer from any of those problems.
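
For reference, the whole userspace side of that boolean protocol fits in
a few lines. This is a self-contained model: the flag names and the
yield stand-in are placeholders for the real struct rseq fields and the
rseq_slice_yield() syscall, not the actual ABI:

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the simple boolean request/granted protocol. */
static volatile bool slice_request;	/* userspace-owned */
static volatile bool slice_granted;	/* kernel-owned    */
static unsigned int yields;

/* Stand-in for the side-effect free yield syscall. */
static void rseq_slice_yield(void)
{
	yields++;
	slice_granted = false;
}

static void critical_section_begin(void)
{
	slice_request = true;	/* WRITE_ONCE() + barrier() in real code */
}

static void critical_section_end(void)
{
	slice_request = false;	/* barrier() + WRITE_ONCE() */
	if (slice_granted)	/* READ_ONCE() */
		rseq_slice_yield();
}
```

Everything is strictly CPU-local, so plain ordered stores suffice; no
atomics are needed on either side.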

If user space wants nesting, then it can do so on its own without
creating an ill defined and fragile kernel/user ABI. We created enough
of them in the past and all of them resulted in long term headaches.

> Handling syscall within granted extension by killing the process

I'm absolutely not opposed to lifting the syscall restriction to make
things easier, but this is the wrong argument for it:

> will likely reserve this feature to the niche use-cases.

Having this used only by people who know what they are doing is
actually the preferred outcome.

We've seen it over and over that supposedly "easy" features result in
mindless overutilization because everyone and his dog thinks they need
them just because and for the very wrong reasons. The unconditional
usage of the most power-hungry floating point extensions just because
they are available is only one example of many.

Thanks,

        tglx
Re: [patch 00/12] rseq: Implement time slice extension mechanism
Posted by Prakash Sangappa 4 months, 3 weeks ago

> On Sep 13, 2025, at 6:02 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Fri, Sep 12 2025 at 15:26, Mathieu Desnoyers wrote:
>> On 2025-09-12 12:31, Thomas Gleixner wrote:
>>>> 2) Slice requests are a good fit for locking. Locking typically
>>>>     has nesting ability.
>>>> 
>>>>     We should consider making the slice request ABI a 8-bit
>>>>     or 16-bit nesting counter to allow nesting of its users.
>>> 
>>> Making request a counter requires to keep request set when the
>>> extension is granted. So the states would be:
>>> 
>>>      request    granted
>>>      0          0               Neutral
>>>      >0         0               Requested
>>>      >=0        1               Granted
>> 
> 
> Second thoughts on this.
> 
> Such a scheme means that slice_ctrl.request must be read only for the
> kernel because otherwise the user space decrement would need to be an
> atomic dec_if_not_zero(). We just argued the one atomic operation away. :)
> 
> That means, the kernel can only set and clear Granted. That in turn
> loses the information whether a slice extension was denied or revoked,
> which was something the Oracle people wanted to have. I'm not sure
> whether that was a functional or more a instrumentation feature.

The denied indication was mainly instrumentation for observability, to see
if a user application would attempt to set 'REQUEST' again without yielding.

> 
> But what's worse: this is a recipe for disaster, as it creates obviously
> subtle and hard-to-debug ways to leak an increment, which means the
> request would stay active forever, defeating the whole purpose.
> 
> And no, the kernel cannot keep track of the counter and observe whether
> it became zero at some point or not. You surely could come up with a
> convoluted scheme to work around that in form of sequence counters or
> whatever, but that just creates extra complexity for a very dubious
> value.
> 
> The point is that the time slice extension is just providing an
> opportunistic priority ceiling mechanism with low overhead and without
> guarantees.
> 
> Once a request is not granted or revoked, the performance of that
> particular operation goes south no matter what. Nesting does not help
> there at all, which is a strong argument for using KISS as the primary
> engineering principle here.
> 
> The simple boolean request/granted pair is simple and very well
> defined. It does not suffer from any of those problems.

Agree, I think keeping the API simple will be preferable. The request/granted
sequence makes sense. 


> 
> If user space wants nesting, then it can do so on its own without
> creating an ill defined and fragile kernel/user ABI. We created enough
> of them in the past and all of them resulted in long term headaches.

Guess user space should be able to handle nesting, possibly without the need for a counter?

AFAICS, can't the nested request to extend the slice be handled by checking
if both 'REQUEST' and 'GRANTED' bits are zero? If so, attempt to request the
slice extension. Otherwise, if either the REQUEST or GRANTED bit is set, then a
slice extension has already been requested or granted.

> 
>> Handling syscall within granted extension by killing the process
> 
> I'm absolutely not opposed to lift the syscall restriction to make
> things easier, but this is the wrong argument for it:

Killing the process seems drastic, and could deter use of this feature.
Can the consequence of making a system call be handled by calling schedule()
in the syscall entry path if an extension was granted, as you were implying?

Thanks
-Prakash

> 
>> will likely reserve this feature to the niche use-cases.
> 
> Having this used only by people who actually know what they are doing is
> actually the preferred outcome.
> 
> We've seen it over and over that supposedly "easy" features result in
> mindless overutilization because everyone and his dog thinks they need
> them just because and for the very wrong reasons. The unconditional
> usage of the most power hungry floating point extensions just because
> they are available, is only one example of many.
> 
> Thanks,
> 
>        tglx

Re: [patch 00/12] rseq: Implement time slice extension mechanism
Posted by Mathieu Desnoyers 4 months, 2 weeks ago
On 2025-09-19 13:30, Prakash Sangappa wrote:
> 
> 
>> On Sep 13, 2025, at 6:02 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> On Fri, Sep 12 2025 at 15:26, Mathieu Desnoyers wrote:
>>> On 2025-09-12 12:31, Thomas Gleixner wrote:
>>>>> 2) Slice requests are a good fit for locking. Locking typically
>>>>>      has nesting ability.
>>>>>
>>>>>      We should consider making the slice request ABI a 8-bit
>>>>>      or 16-bit nesting counter to allow nesting of its users.
>>>>
>>>> Making request a counter requires to keep request set when the
>>>> extension is granted. So the states would be:
>>>>
>>>>       request    granted
>>>>       0          0               Neutral
>>>>       >0         0               Requested
>>>>       >=0        1               Granted
>>>
>>
>> Second thoughts on this.
>>
[...]
> 
>>
>> If user space wants nesting, then it can do so on its own without
>> creating an ill defined and fragile kernel/user ABI. We created enough
>> of them in the past and all of them resulted in long term headaches.
> 
> Guess user space should be able to handle nesting, possibly without the need of a counter?
> 
> AFAICS can’t the nested request, to extend the slice, be handled by checking
> if both ‘REQUEST’ & ‘GRANTED’ bits are zero?  If so,  attempt to request
> slice extension.  Otherwise If either REQUEST or GRANTED bit Is set, then a slice
> extension has been already requested or granted.

I think you are onto something here. If we want independent pieces of
software (e.g. libc and application) to allow nesting of time slice
extension requests, without having to deal with a counter and the
inevitable unbalance bugs (leak and underflow), we could require
userspace to check the value of the request and granted flags. If both
are zero, then it can set the request.

Then when userspace exits its critical section, it needs to remember
whether it has set a request or not, so it does not clear a request
too early if the request was set by an outer context. This requires
handing over additional state (one bit) from "lock" to "unlock" though.
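
A minimal sketch of that scheme, with the rseq fields modeled as plain
flags and all helper names made up for illustration: the enter path
requests only when neither flag is set, and returns the one bit of
ownership state that the exit path consumes.

```c
#include <assert.h>
#include <stdbool.h>

static volatile bool slice_request;	/* stand-in for the rseq request flag */
static volatile bool slice_granted;	/* stand-in for the rseq granted flag */
static unsigned int yields;

/* Stand-in for the rseq_slice_yield() syscall. */
static void rseq_slice_yield(void)
{
	yields++;
	slice_granted = false;
}

/* Returns true iff this context set the request and must clear it. */
static bool slice_enter(void)
{
	if (slice_request || slice_granted)
		return false;		/* an outer context owns the request */
	slice_request = true;
	return true;
}

static void slice_exit(bool owner)
{
	if (!owner)
		return;			/* leave the outer request alone */
	slice_request = false;
	if (slice_granted)
		rseq_slice_yield();
}
```

The ownership bit naturally lives on the stack or in the lock guard
object, so no shared counter exists that could leak or underflow.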

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [patch 00/12] rseq: Implement time slice extension mechanism
Posted by Prakash Sangappa 4 months, 2 weeks ago

> On Sep 22, 2025, at 7:09 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> On 2025-09-19 13:30, Prakash Sangappa wrote:
>>> On Sep 13, 2025, at 6:02 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> 
>>> On Fri, Sep 12 2025 at 15:26, Mathieu Desnoyers wrote:
>>>> On 2025-09-12 12:31, Thomas Gleixner wrote:
>>>>>> 2) Slice requests are a good fit for locking. Locking typically
>>>>>>     has nesting ability.
>>>>>> 
>>>>>>     We should consider making the slice request ABI a 8-bit
>>>>>>     or 16-bit nesting counter to allow nesting of its users.
>>>>> 
>>>>> Making request a counter requires to keep request set when the
>>>>> extension is granted. So the states would be:
>>>>> 
>>>>>      request    granted
>>>>>      0          0               Neutral
>>>>>      >0         0               Requested
>>>>>      >=0        1               Granted
>>>> 
>>> 
>>> Second thoughts on this.
>>> 
> [...]
>>> 
>>> If user space wants nesting, then it can do so on its own without
>>> creating an ill defined and fragile kernel/user ABI. We created enough
>>> of them in the past and all of them resulted in long term headaches.
>> Guess user space should be able to handle nesting, possibly without the need of a counter?
>> AFAICS can’t the nested request, to extend the slice, be handled by checking
>> if both ‘REQUEST’ & ‘GRANTED’ bits are zero?  If so,  attempt to request
>> slice extension.  Otherwise If either REQUEST or GRANTED bit Is set, then a slice
>> extension has been already requested or granted.
> 
> I think you are onto something here. If we want independent pieces of
> software (e.g. libc and application) to allow nesting of time slice
> extension requests, without having to deal with a counter and the
> inevitable unbalance bugs (leak and underflow), we could require
> userspace to check the value of the request and granted flags. If both
> are zero, then it can set the request.
> 
> Then when userspace exits its critical section, it needs to remember
> whether it has set a request or not, so it does not clear a request
> too early if the request was set by an outer context. This requires
> handing over additional state (one bit) from "lock" to "unlock" though.

Yes that is correct. Additional state will be required to track if slice extension
was requested in that context. 

-Prakash

> 
> Thoughts ?
> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com

Re: [patch 00/12] rseq: Implement time slice extension mechanism
Posted by K Prateek Nayak 5 months ago
Hello Thomas,

On 9/9/2025 4:29 AM, Thomas Gleixner wrote:
> For your convenience all of it is also available as a conglomerate from
> git:
> 
>     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice

Apart from a couple of nit picks, I couldn't spot anything out of place
and the overall approach looks solid. Please feel free to include:

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

-- 
Thanks and Regards,
Prateek
Re: [patch 00/12] rseq: Implement time slice extension mechanism
Posted by Thomas Gleixner 5 months ago
On Wed, Sep 10 2025 at 16:58, K. Prateek Nayak wrote:
> On 9/9/2025 4:29 AM, Thomas Gleixner wrote:
>> For your convenience all of it is also available as a conglomerate from
>> git:
>> 
>>     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>
> Apart from a couple of nit picks, I couldn't spot anything out of place
> and the overall approach looks solid. Please feel free to include:
>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

Thanks a lot for going through it and testing.

Do you have a real workload or a mockup at hand, which benefits
from that slice extension functionality?

It would be really nice to have more than a pretty lame selftest.

thanks,

        tglx
Re: [patch 00/12] rseq: Implement time slice extension mechanism
Posted by K Prateek Nayak 4 months, 4 weeks ago
Hello Thomas,

On 9/10/2025 8:20 PM, Thomas Gleixner wrote:
> On Wed, Sep 10 2025 at 16:58, K. Prateek Nayak wrote:
>> On 9/9/2025 4:29 AM, Thomas Gleixner wrote:
>>> For your convenience all of it is also available as a conglomerate from
>>> git:
>>>
>>>     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>>
>> Apart from a couple of nit picks, I couldn't spot anything out of place
>> and the overall approach looks solid. Please feel free to include:
>>
>> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> 
> Thanks a lot for going through it and testing.
> 
> Do you have a real workload or a mockup at hand, which benefits
> from that slice extension functionality?

Not at the moment, but we did have some interest in this feature
internally. Give me a week and I'll let you know if they have found a
use-case / have a prototype to test this.

In the meantime, Prakash should have a test bench that he used to
test his early RFC
https://lore.kernel.org/lkml/20241113000126.967713-1-prakash.sangappa@oracle.com/

-- 
Thanks and Regards,
Prateek
Re: [patch 00/12] rseq: Implement time slice extension mechanism
Posted by Prakash Sangappa 4 months, 4 weeks ago

> On Sep 11, 2025, at 5:03 AM, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> 
> Hello Thomas,
> 
> On 9/10/2025 8:20 PM, Thomas Gleixner wrote:
>> On Wed, Sep 10 2025 at 16:58, K. Prateek Nayak wrote:
>>> On 9/9/2025 4:29 AM, Thomas Gleixner wrote:
>>>> For your convenience all of it is also available as a conglomerate from
>>>> git:
>>>> 
>>>>    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>>> 
>>> Apart from a couple of nit picks, I couldn't spot anything out of place
>>> and the overall approach looks solid. Please feel free to include:
>>> 
>>> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> 
>> Thanks a lot for going through it and testing.
>> 
>> Do you have a real workload or a mockup at hand, which benefits
>> from that slice extension functionality?
> 
> Not at the moment but we did have some interest for this feature
> internally. Give me a week and I'll let you know if they had found a
> use-case / have a prototype to test this.
> 
> In the meantime, Prakash should have a test bench that he used to
> test his early RFC
> https://lore.kernel.org/lkml/20241113000126.967713-1-prakash.sangappa@oracle.com/
> 

(Have been AFK, and will be for a few more days)

The above was with a database workload. Will coordinate with our database team to get it tested 
with the updated API from this patch series.

Thanks,
-Prakash

> -- 
> Thanks and Regards,
> Prateek
> 
Re: [patch 00/12] rseq: Implement time slice extension mechanism
Posted by Thomas Gleixner 5 months ago
On Tue, Sep 09 2025 at 00:59, Thomas Gleixner wrote:
> For your convenience all of it is also available as a conglomerate from
> git:
>
>     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice

Force pushed a new version into the branch, which addresses the initial
feedback and fallout.

Thanks,

        tglx
Re: [patch 00/12] rseq: Implement time slice extension mechanism
Posted by K Prateek Nayak 5 months ago
Hello Thomas,

On 9/9/2025 6:07 PM, Thomas Gleixner wrote:
> On Tue, Sep 09 2025 at 00:59, Thomas Gleixner wrote:
>> For your convenience all of it is also available as a conglomerate from
>> git:
>>
>>     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
> 
> Force pushed a new version into the branch, which addresses the initial
> feedback and fallout.

Everything builds fine now and the rseq selftests are happy too. Feel
free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

-- 
Thanks and Regards,
Prateek