[PATCH V7 00/11] Scheduler time slice extension

Prakash Sangappa posted 11 patches 2 months, 1 week ago
[PATCH V7 00/11] Scheduler time slice extension
Posted by Prakash Sangappa 2 months, 1 week ago
Based on v6.16-rc3.

Patches 7-11 in this series attempt to implement the API/mechanism, as
suggested by Thomas Gleixner, that lets an RT thread indicate that its
scheduling must not be delayed when the thread running on the cpu requests
a time slice extension. This addresses the concern that, with the proposed
scheduler time slice extension feature, a normal thread can delay the
scheduling of an RT thread.

This requires a new TIF flag (TIF_NEED_RESCHED_NODELAY), which is set on
the running thread when such an RT thread is woken up and enqueued.
The API can be used only by RT (RR, FIFO) threads.

The TIF_NEED_RESCHED_NODELAY patches are implemented along the lines of
how TIF_NEED_RESCHED_LAZY was added. However, TIF_NEED_RESCHED_NODELAY
is effective only with the scheduler time slice extension feature
(i.e., when the CONFIG_RSEQ_RESCHED_DELAY config option is enabled).

The series introduces prctl APIs to set and get the sched_nodelay flag, and
adds a new 1-bit member (sched_nodelay) to struct task_struct to store it,
as there is no room left for a new PF_* flag. The flag is inherited
across fork and exec.

The API gives per-thread control over whether the thread can be delayed.
In addition, a kernel parameter is added to disable delaying the scheduling
of all RT threads, if necessary, when the scheduler time slice extension
feature is enabled.
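
The interaction described above can be modeled in user space. The sketch
below is a simplified illustration, not code from the patches: the MOCK_*
bit values, struct mock_task, the 'no_rt_delay' argument and both helper
names are hypothetical. Only TIF_NEED_RESCHED_NODELAY and sched_nodelay are
names taken from the series. It also assumes that the NODELAY flag simply
vetoes a pending extension request.

```c
#include <stdbool.h>

/* Stand-ins for TIF_NEED_RESCHED and the proposed TIF_NEED_RESCHED_NODELAY.
 * The bit values are hypothetical, chosen only for this model. */
#define MOCK_NEED_RESCHED          (1u << 0)
#define MOCK_NEED_RESCHED_NODELAY  (1u << 1)

/* Minimal stand-in for the parts of a task this model needs. */
struct mock_task {
	bool is_rt;          /* SCHED_FIFO or SCHED_RR */
	bool sched_nodelay;  /* per-thread flag, set via the proposed prctl() */
};

/*
 * Flags to set on the currently running thread when 'woken' is enqueued
 * and should preempt it. 'no_rt_delay' models the kernel parameter that
 * disables delaying all RT threads (the name is hypothetical).
 */
static unsigned int mock_resched_flags(const struct mock_task *woken,
				       bool no_rt_delay)
{
	unsigned int flags = MOCK_NEED_RESCHED;

	if (woken->is_rt && (woken->sched_nodelay || no_rt_delay))
		flags |= MOCK_NEED_RESCHED_NODELAY;
	return flags;
}

/*
 * On return to user space: may the running thread keep the cpu for the
 * short extension it requested?
 */
static bool mock_may_grant_extension(unsigned int ti_flags, bool requested)
{
	if (!requested || !(ti_flags & MOCK_NEED_RESCHED))
		return false;	/* no request, or nothing to delay */
	return !(ti_flags & MOCK_NEED_RESCHED_NODELAY);
}
```

In this model a wakeup of a normal thread, or of an RT thread that has not
set sched_nodelay, still allows the extension, while an RT thread with
sched_nodelay set (or the global override) forces an immediate reschedule.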

The above change is more of an RFC; I am looking for feedback on the
approach.

Patches 1-6 have been updated based on comments from the V6 patch series.

---------------- cover letter previously sent --------------------------------
A user thread can get preempted in the middle of executing a critical
section in user space while holding locks, which can have an undesirable
effect on performance. A way for the thread to request additional execution
time on the cpu, so that it can complete the critical section, would be
useful in such a scenario. The request can be made by setting a bit in
mapped memory that the kernel can also access, so that it can check the bit
and grant extra execution time on the cpu.

There have been a couple of proposals[1][2] for such a feature, which attempt
to address the above scenario by granting one extra tick of execution time.
In patch thread [1] posted by Steven Rostedt, there is ample discussion about
the need for this feature.

However, the concern has been that this can lead to abuse. One extra tick can
be a long time (about a millisecond or more). In response, Peter Zijlstra
posted a prototype solution[5] which grants an execution time extension of
only 50us. This is achieved with the help of a timer started on that cpu at
the time of granting extra execution time. When the timer fires, the thread
is preempted if it is still running.

This patchset implements the above solution as suggested, using the
restartable sequences (rseq) structure for the API. Refer to [3][4] for
further discussions.
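
The user-space side of such a request could look roughly like the sketch
below. It is a model under stated assumptions, not the API from the patches:
struct mock_rseq stands in for the thread's registered rseq area, the bit
values are hypothetical, MOCK_FLAG_RESCHEDULED is an invented name for the
"thread got rescheduled" bit, and the model assumes that the kernel signals
a grant by clearing the request bit (this cover letter does not spell that
out).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * RSEQ_CS_FLAG_DELAY_RESCHED is named in the changelog; its value here is
 * hypothetical. MOCK_FLAG_RESCHEDULED is an invented stand-in for the
 * "thread got rescheduled" bit.
 */
#define RSEQ_CS_FLAG_DELAY_RESCHED  (1u << 0)
#define MOCK_FLAG_RESCHEDULED       (1u << 1)

/* Stand-in for the flags word of the thread's registered struct rseq. */
struct mock_rseq {
	uint32_t flags;
};

/* Ask for a time slice extension before taking the lock. */
static void mock_critical_enter(struct mock_rseq *rs)
{
	rs->flags &= ~MOCK_FLAG_RESCHEDULED;
	rs->flags |= RSEQ_CS_FLAG_DELAY_RESCHED;
}

/*
 * Withdraw the request after dropping the lock. Returns true when the
 * thread is running on borrowed time and should enter the kernel (any
 * system call) to give up the cpu.
 */
static bool mock_critical_exit(struct mock_rseq *rs)
{
	/* Assumption: a cleared request bit means the kernel granted it. */
	bool granted = !(rs->flags & RSEQ_CS_FLAG_DELAY_RESCHED);
	bool rescheduled = (rs->flags & MOCK_FLAG_RESCHEDULED) != 0;

	rs->flags &= ~(RSEQ_CS_FLAG_DELAY_RESCHED | MOCK_FLAG_RESCHEDULED);
	return granted && !rescheduled;
}

/* Drive one critical section against a mocked kernel. */
static bool mock_should_yield(bool kernel_granted, bool kernel_rescheduled)
{
	struct mock_rseq rs = { 0 };

	mock_critical_enter(&rs);
	if (kernel_granted)
		rs.flags &= ~RSEQ_CS_FLAG_DELAY_RESCHED;  /* mock kernel: grant */
	if (kernel_rescheduled)
		rs.flags |= MOCK_FLAG_RESCHEDULED;        /* mock kernel: preempted */
	return mock_critical_exit(&rs);
}
```

A real implementation would have to update the flags word in a way that is
safe against concurrent updates by the kernel; the plain read-modify-write
above is only adequate for a single-threaded model.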


v7:
- Addressed comments & suggestions from Thomas Gleixner & Prateek Nayak.
  Renamed 'sched_time_delay' to 'rseq_delay_resched'. Made it a 2-bit 
  member to store 3 states NONE, PROBE & REQUESTED as suggested by
  Thomas Gleixner. Also refactored some code in patch 1.
- Renamed the config option to 'CONFIG_RSEQ_RESCHED_DELAY' and
  added it in patch 1. Added SCHED_HRTICK dependency.
- Patches 7-11 are an attempt to implement the API/mechanism
  Thomas suggested. They introduce a prctl() API which lets an RT thread
  indicate that its scheduling must not be delayed when some thread
  running on the cpu requests extending its time slice.

v6:
https://lore.kernel.org/all/20250701003749.50525-1-prakash.sangappa@oracle.com/
- Rebased onto v6.16-rc3. 
  syscall_exit_to_user_mode_prepare() & __syscall_exit_to_user_mode_work()
  routines have been deleted. Moved changes to the consolidated routine
  syscall_exit_to_user_mode_work()(patch 1).
- Introduced a new config option for scheduler time slice extension
  CONFIG_SCHED_PREEMPT_DELAY which is dependent on CONFIG_RSEQ.
  Enabled by default(new patch 7). Is this reasonable?
- Modified tracepoint to a conditional tracepoint(patch 5), as suggested
  by Steven Rostedt.
- Added kernel parameters documentation for the tunable
  'sysctl_sched_preempt_delay_us'(patch 3)

v5:
https://lore.kernel.org/all/20250603233654.1838967-1-prakash.sangappa@oracle.com/
- Added #ifdef CONFIG_RSEQ and CONFIG_PROC_SYSCTL for sysctl tunable
  changes(patch 3).
- Added #ifdef CONFIG_RSEQ for scheduler stat changes(patch 4).
- Removed deprecated flags from the supported flags returned, as
  pointed out by Mathieu Desnoyers(patch 6).
- Added IS_ENABLED(CONFIG_SCHED_HRTICK) check before returning supported
  delay resched flags.

v4:
https://lore.kernel.org/all/20250513214554.4160454-1-prakash.sangappa@oracle.com
- Changed default sched delay extension time to 30us
- Added patch to indicate to userspace if the thread got preempted in
  the extended cpu time granted. Uses another bit in rseq cs flags for it.
  This should help the application to check and avoid having to call a
  system call to yield cpu, especially sched_yield() as pointed out
  by Steven Rostedt.
- Moved tracepoint call towards end of exit_to_user_mode_loop().
- Added a pr_warn() message when the 'sched_preempt_delay_us' tunable is
  set higher than the default value of 30us.
- Patch to add an API to query if sched time extension feature is supported. 
  A new flag to sys_rseq flags argument called 'RSEQ_FLAG_QUERY_CS_FLAGS',
  is added, as suggested by Mathieu Desnoyers. 
  Returns bitmask of all the supported rseq cs flags, in rseq->flags field.

v3:
https://lore.kernel.org/all/20250502015955.3146733-1-prakash.sangappa@oracle.com
- Addressing review comments by Sebastian and Prateek.
  * Rename rseq_sched_delay -> sched_time_delay. Move its place in
    struct task_struct near other bits so it fits in existing word.
  * Use IS_ENABLED(CONFIG_RSEQ) instead of #ifdef to access
    'sched_time_delay'.
  * removed rseq_delay_resched_tick() call from hrtick_clear().
  * Introduced a patch to add a tracepoint in exit_to_user_mode_loop(),
    suggested by Sebastian.
  * Added comments to describe RSEQ_CS_FLAG_DELAY_RESCHED flag.

v2:
https://lore.kernel.org/all/20250418193410.2010058-1-prakash.sangappa@oracle.com/
- Based on discussions in [3], expecting user application to call sched_yield()
  to yield the cpu at the end of the critical section may not be advisable as
  pointed out by Linus.  

  So added a check in return path from a system call to reschedule if time
  slice extension was granted to the thread. The check could as well be in
  syscall enter path from user mode.
  This would allow application thread to call any system call to yield the cpu. 
  Which system call should be suggested? getppid(2) works.

  Do we still need the change in sched_yield() to reschedule when the thread
  has current->rseq_sched_delay set?

- Added patch to introduce a sysctl tunable parameter to specify duration of
  the time slice extension in microseconds (us), called 'sched_preempt_delay_us'.
  Can take a value in the range 0 to 100. Default is set to 50us.
  Setting this tunable to 0 disables the scheduler time slice extension feature.
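
The rules for the tunable given in the changelog (range 0 to 100, 0 disables
the feature, a warning when set above the default) can be modeled in user
space as follows. This is a sketch only: the function and macro names are
hypothetical, the 30us default is the one from the v4 notes (v2 used 50us),
and it assumes the 0 to 100 range from v2 still applies.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#define MOCK_DELAY_US_MAX      100  /* upper bound given in the v2 notes */
#define MOCK_DELAY_US_DEFAULT   30  /* v4 default; v2 used 50 */

/* Models the value of the 'sched_preempt_delay_us' tunable. */
static unsigned int mock_delay_us = MOCK_DELAY_US_DEFAULT;

/* Validate and store a new value; returns 0 on success or -EINVAL. */
static int mock_set_delay_us(long value)
{
	if (value < 0 || value > MOCK_DELAY_US_MAX)
		return -EINVAL;
	if (value > MOCK_DELAY_US_DEFAULT)
		fprintf(stderr,
			"sched_preempt_delay_us set above default of %dus\n",
			MOCK_DELAY_US_DEFAULT);
	mock_delay_us = (unsigned int)value;
	return 0;
}

/* A value of 0 turns the time slice extension feature off. */
static bool mock_extension_enabled(void)
{
	return mock_delay_us != 0;
}
```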

v1: 
https://lore.kernel.org/all/20250215005414.224409-1-prakash.sangappa@oracle.com/


[1] https://lore.kernel.org/lkml/20231025054219.1acaa3dd@gandalf.local.home/
[2] https://lore.kernel.org/lkml/1395767870-28053-1-git-send-email-khalid.aziz@oracle.com/
[3] https://lore.kernel.org/all/20250131225837.972218232@goodmis.org/
[4] https://lore.kernel.org/all/20241113000126.967713-1-prakash.sangappa@oracle.com/
[5] https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net/
[6] https://lore.kernel.org/all/1631147036-13597-1-git-send-email-prakash.sangappa@oracle.com/

Prakash Sangappa (11):
  sched: Scheduler time slice extension
  sched: Indicate if thread got rescheduled
  sched: Tunable to specify duration of time slice extension
  sched: Add scheduler stat for cpu time slice extension
  sched: Add tracepoint for sched time slice extension
  Add API to query supported rseq cs flags
  sched: Add API to indicate not to delay scheduling
  sched: Add TIF_NEED_RESCHED_NODELAY infrastructure
  sched: Add nodelay scheduling
  sched, x86: Enable nodelay scheduling
  sched: Add kernel parameter to enable delaying RT threads

 .../admin-guide/kernel-parameters.txt         |  8 ++
 Documentation/admin-guide/sysctl/kernel.rst   |  8 ++
 arch/x86/Kconfig                              |  1 +
 arch/x86/include/asm/thread_info.h            |  2 +
 include/linux/entry-common.h                  | 18 ++--
 include/linux/entry-kvm.h                     |  4 +-
 include/linux/sched.h                         | 47 +++++++++-
 include/linux/thread_info.h                   | 11 ++-
 include/trace/events/sched.h                  | 31 +++++++
 include/uapi/linux/prctl.h                    |  3 +
 include/uapi/linux/rseq.h                     | 19 ++++
 init/Kconfig                                  |  7 ++
 kernel/Kconfig.preempt                        |  3 +
 kernel/entry/common.c                         | 36 ++++++-
 kernel/entry/kvm.c                            |  3 +-
 kernel/rseq.c                                 | 71 ++++++++++++++
 kernel/sched/core.c                           | 93 ++++++++++++++++++-
 kernel/sched/debug.c                          |  4 +
 kernel/sched/rt.c                             | 10 +-
 kernel/sched/sched.h                          |  1 +
 kernel/sched/syscalls.c                       |  4 +
 kernel/sys.c                                  | 18 ++++
 22 files changed, 380 insertions(+), 22 deletions(-)

-- 
2.43.5
Re: [PATCH V7 00/11] Scheduler time slice extension
Posted by Prakash Sangappa 2 months ago
Any comments?
Thanks
-Prakash

> On Jul 24, 2025, at 9:16 AM, Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
> 
> [...]

Re: [PATCH V7 00/11] Scheduler time slice extension
Posted by Thomas Gleixner 2 months ago
On Wed, Aug 06 2025 at 16:03, Prakash Sangappa wrote:

Please don't top post and trim your replies. We all have this mail in
our inboxes.

> Any comments?

https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#merge-window

and people are on vacation ....
Re: [PATCH V7 00/11] Scheduler time slice extension
Posted by Thomas Gleixner 2 months ago
On Thu, Jul 24 2025 at 16:16, Prakash Sangappa wrote:
> Based on v6.16-rc3.

This is useless. At the point of posting, a massive amount of changes
had been queued for the 6.17 merge window. So why can't you submit
against the relevant tree (tip) as asked for in Documentation?

I'm going to look at it from a conceptual level nevertheless to spare
you the extra round.
Re: [PATCH V7 00/11] Scheduler time slice extension
Posted by Prakash Sangappa 1 month, 4 weeks ago

> On Aug 6, 2025, at 9:30 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Thu, Jul 24 2025 at 16:16, Prakash Sangappa wrote:
>> Based on v6.16-rc3.
> 
> This is useless. At the point of posting, a massive amount of changes
> had been queued for the 6.17 merge window. So why can't you submit
> against the relevant tree (tip) as asked for in Documentation?
> 
> I'm going to look at it from a conceptual level nevertheless to spare
> you the extra round.

Thanks,
-Prakash.