[PATCH V4 0/6] Scheduler time slice extension

Prakash Sangappa posted 6 patches 7 months, 1 week ago
There is a newer version of this series
[PATCH V4 0/6] Scheduler time slice extension
Posted by Prakash Sangappa 7 months, 1 week ago
A user thread can get preempted in the middle of executing a critical
section in user space while holding locks, which can have an undesirable
effect on performance. A way for the thread to request additional execution
time on the cpu, so that it can complete the critical section, would be
useful in such a scenario. The request can be made by setting a bit in
mapped memory that the kernel can also access, so that it can check the
request and grant extra execution time on the cpu.

There have been a couple of proposals[1][2] for such a feature, which attempt
to address the above scenario by granting one extra tick of execution time.
The patch thread [1] posted by Steven Rostedt has ample discussion about the
need for this feature.

However, the concern has been that this can lead to abuse. One extra tick can
be a long time (about a millisecond or more). In response, Peter Zijlstra
posted a prototype solution[5], which grants only a 50us execution time
extension. This is achieved with the help of a timer started on that cpu at
the time of granting extra execution time. When the timer fires, the thread
will be preempted if it is still running.

This patchset implements the above solution as suggested, using the
restartable sequences (rseq) structure for the API. Refer to [3][4] for
further discussion.
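
To make the request mechanism concrete, here is a minimal user-space sketch
of setting and clearing the request bit around a critical section. It does
not use the real rseq area: 'struct sketch_rseq', the helper names and the
bit position are all assumptions made for illustration. Only the flag name
RSEQ_CS_FLAG_DELAY_RESCHED comes from this series, and its actual value is
whatever the patches define in include/uapi/linux/rseq.h.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit position standing in for RSEQ_CS_FLAG_DELAY_RESCHED. */
#define SKETCH_CS_FLAG_DELAY_RESCHED (1u << 3)

/* Stand-in for the per-thread area shared with the kernel (not struct rseq). */
struct sketch_rseq {
    _Atomic uint32_t flags;
};

/* Call before entering the critical section: ask for a slice extension. */
static void sketch_slice_ext_begin(struct sketch_rseq *rs)
{
    assert(rs != NULL);
    atomic_fetch_or_explicit(&rs->flags, SKETCH_CS_FLAG_DELAY_RESCHED,
                             memory_order_relaxed);
    /* Keep the compiler from moving critical-section accesses above this. */
    atomic_signal_fence(memory_order_seq_cst);
}

/* Call after leaving the critical section: withdraw the request. */
static void sketch_slice_ext_end(struct sketch_rseq *rs)
{
    assert(rs != NULL);
    /* Keep the compiler from moving critical-section accesses below this. */
    atomic_signal_fence(memory_order_seq_cst);
    atomic_fetch_and_explicit(&rs->flags, ~SKETCH_CS_FLAG_DELAY_RESCHED,
                              memory_order_relaxed);
}
```

A thread would call sketch_slice_ext_begin() right before taking the lock
and sketch_slice_ext_end() right after dropping it. Other bits in the flags
word are left untouched.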

v1: 
https://lore.kernel.org/all/20250215005414.224409-1-prakash.sangappa@oracle.com/

v2:
https://lore.kernel.org/all/20250418193410.2010058-1-prakash.sangappa@oracle.com/
- Based on discussions in [3], expecting the user application to call
  sched_yield() to yield the cpu at the end of the critical section may not
  be advisable, as pointed out by Linus.

  So added a check in the return path from a system call to reschedule if a
  time slice extension was granted to the thread. The check could as well be
  in the syscall entry path from user mode.
  This would allow the application thread to call any system call to yield
  the cpu. Which system call should be suggested? getppid(2) works.

  Do we still need the change in sched_yield() to reschedule when the thread
  has current->rseq_sched_delay set?

- Added a patch to introduce a sysctl tunable, 'sched_preempt_delay_us', to
  specify the duration of the time slice extension in microseconds (us).
  It can take a value in the range 0 to 100. Default is set to 50us.
  Setting this tunable to 0 disables the scheduler time slice extension feature.
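
As a rough illustration of the tunable semantics described for v2 (range 0
to 100, default 50us, 0 disables the feature), the validation could look
like the sketch below. This is not the code from the patch: the struct and
function names are made up, and the real sysctl handler may be structured
differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Limits from the v2 description; the macro names are hypothetical. */
#define SKETCH_DELAY_MAX_US     100u
#define SKETCH_DELAY_DEFAULT_US  50u

struct sketch_delay_cfg {
    unsigned int delay_us;      /* 0 means the feature is disabled */
};

/* Store a new value, rejecting anything outside the 0..100 range. */
static int sketch_set_delay_us(struct sketch_delay_cfg *cfg, unsigned int us)
{
    assert(cfg != NULL);
    if (us > SKETCH_DELAY_MAX_US)
        return -EINVAL;
    cfg->delay_us = us;
    return 0;
}

/* The extension is only granted while the tunable is non-zero. */
static bool sketch_delay_enabled(const struct sketch_delay_cfg *cfg)
{
    return cfg->delay_us != 0;
}

/* Duration to program into the preemption timer, in nanoseconds. */
static uint64_t sketch_delay_ns(const struct sketch_delay_cfg *cfg)
{
    return (uint64_t)cfg->delay_us * 1000u;
}
```

With the v2 default, sketch_delay_ns() yields 50000ns. An out-of-range
write leaves the previous value in place.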

v3:
https://lore.kernel.org/all/20250502015955.3146733-1-prakash.sangappa@oracle.com
- Addressed review comments by Sebastian and Prateek.
  * Renamed rseq_sched_delay -> sched_time_delay. Moved it in
    struct task_struct near other bits so it fits in an existing word.
  * Used IS_ENABLED(CONFIG_RSEQ) instead of #ifdef to access
    'sched_time_delay'.
  * Removed the rseq_delay_resched_tick() call from hrtick_clear().
  * Introduced a patch to add a tracepoint in exit_to_user_mode_loop(),
    as suggested by Sebastian.
  * Added comments to describe the RSEQ_CS_FLAG_DELAY_RESCHED flag.

v4:
- Changed the default time slice extension to 30us.
- Added a patch to indicate to userspace if the thread got preempted during
  the extended cpu time granted. Uses another bit in rseq cs flags for it.
  This should help the application check and avoid having to make a
  system call to yield the cpu, especially sched_yield(), as pointed out
  by Steven Rostedt.
- Moved the tracepoint call towards the end of exit_to_user_mode_loop().
- Added a pr_warn() message when the 'sched_preempt_delay_us' tunable is
  set higher than the default value of 30us.
- Added a patch with an API to query whether the sched time extension
  feature is supported. A new flag for the sys_rseq flags argument,
  'RSEQ_FLAG_QUERY_CS_FLAGS', is added, as suggested by Mathieu Desnoyers.
  It returns a bitmask of all the supported rseq cs flags in the
  rseq->flags field.
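
The two user-visible v4 additions boil down to bit tests on a flags word.
The sketch below shows how an application might interpret them. The bit
positions and helper names are assumptions, not taken from the patches, and
the "yield unless already rescheduled" policy is only one possible reading
of the description above. Issuing the actual query through sys_rseq with
RSEQ_FLAG_QUERY_CS_FLAGS is left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions; the real ones come from the uapi header. */
#define SKETCH_CS_FLAG_DELAY_RESCHED (1u << 3)  /* request an extension */
#define SKETCH_CS_FLAG_RESCHEDULED   (1u << 4)  /* kernel: was preempted */

static_assert(SKETCH_CS_FLAG_DELAY_RESCHED != SKETCH_CS_FLAG_RESCHEDULED,
              "request and report bits must be distinct");

/* supported_mask: bitmask read back from rseq->flags after the query. */
static bool sketch_slice_ext_supported(uint32_t supported_mask)
{
    return (supported_mask & SKETCH_CS_FLAG_DELAY_RESCHED) != 0;
}

/*
 * flags_after_cs: rseq cs flags sampled at the end of the critical section.
 * Assumed policy: if the kernel reports the thread was already rescheduled,
 * a yielding system call is unnecessary; otherwise the thread yields.
 */
static bool sketch_should_yield(uint32_t flags_after_cs)
{
    return (flags_after_cs & SKETCH_CS_FLAG_RESCHEDULED) == 0;
}
```

An application would call sketch_slice_ext_supported() once at startup and
fall back to plain locking when it returns false.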

Prakash Sangappa (6):
  Sched: Scheduler time slice extension
  Sched: Indicate if thread got rescheduled
  Sched: Tunable to specify duration of time slice extension
  Sched: Add scheduler stat for cpu time slice extension
  Sched: Add tracepoint for sched time slice extension
  Add API to query supported rseq cs flags

 include/linux/entry-common.h | 11 ++--
 include/linux/sched.h        | 23 +++++++++
 include/trace/events/sched.h | 28 +++++++++++
 include/uapi/linux/rseq.h    | 19 +++++++
 kernel/entry/common.c        | 27 ++++++++--
 kernel/rseq.c                | 97 ++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c          | 50 +++++++++++++++++++
 kernel/sched/debug.c         |  1 +
 kernel/sched/syscalls.c      |  5 ++
 9 files changed, 253 insertions(+), 8 deletions(-)

-- 
2.43.5
Re: [PATCH V4 0/6] Scheduler time slice extension
Posted by Prakash Sangappa 7 months ago

> On May 13, 2025, at 2:45 PM, Prakash Sangappa <prakash.sangappa@oracle.com> wrote:

I missed the references in the cover letter. Including them here:

[1] https://lore.kernel.org/lkml/20231025054219.1acaa3dd@gandalf.local.home/
[2] https://lore.kernel.org/lkml/1395767870-28053-1-git-send-email-khalid.aziz@oracle.com/
[3] https://lore.kernel.org/all/20250131225837.972218232@goodmis.org/
[4] https://lore.kernel.org/all/20241113000126.967713-1-prakash.sangappa@oracle.com/
[5] https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net/
[6] https://lore.kernel.org/all/1631147036-13597-1-git-send-email-prakash.sangappa@oracle.com/

