[RFC PATCH 0/4] Scheduler time slice extension
Posted by Prakash Sangappa 1 week, 3 days ago
A user thread can get preempted in the middle of executing a critical
section in user space while holding locks, which can have an undesirable
effect on performance. Having a way for the thread to request additional
execution time on the cpu, so that it can complete the critical section,
would be useful in such a scenario. The request can be made by setting a
bit in memory mapped between user space and the kernel, which the kernel
can check when deciding whether to grant extra execution time on the cpu.

There have been a couple of proposals[1][2] for such a feature, which attempt
to address the above scenario by granting one extra tick of execution time.
In the patch thread [1] posted by Steven Rostedt, there is ample discussion
about the need for this feature.

However, the concern has been that this can lead to abuse. One extra tick can
be a long time (about a millisecond or more). Peter Zijlstra, in response,
posted a prototype solution[3] which grants only a 50us execution time
extension. This is achieved with the help of a timer started on that cpu at
the time of granting the extra execution time. When the timer fires, the
thread is preempted if it is still running.
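
For illustration, here is a rough sketch of that kind of mechanism (this is
not Peter's actual patch; grant_extension()/extension_expired() are
hypothetical names, only the hrtimer and scheduler helpers are existing
kernel API):

	#include <linux/hrtimer.h>
	#include <linux/sched.h>
	#include <linux/preempt.h>

	/* Timer fires 50us after the extension was granted: if the task is
	 * still on the cpu, re-assert the reschedule that was deferred. */
	static enum hrtimer_restart extension_expired(struct hrtimer *t)
	{
		set_tsk_need_resched(current);
		set_preempt_need_resched();
		return HRTIMER_NORESTART;
	}

	/* Called instead of rescheduling when the task has requested a delay:
	 * arm a timer pinned to this cpu so the extension cannot be outrun. */
	static void grant_extension(struct hrtimer *t)
	{
		hrtimer_start(t, ns_to_ktime(50 * NSEC_PER_USEC),
			      HRTIMER_MODE_REL_PINNED_HARD);
	}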

This patch set implements the above-mentioned 50us extension time as posted
by Peter. But instead of using restartable sequences as the API to set the
flag requesting the extension, this patch set proposes a new API based on a
per-thread shared structure, described below. This shared structure is
accessible in both user space and the kernel. The user thread sets a flag in
this shared structure to request an execution time extension.

We tried this change with a database workload. Here are the results, comparing
runs with and without the 50us scheduler time slice extension. They clearly
show the benefit.

Test results:
=============
	Test system: 2-socket AMD Genoa

	Lock table test: a simple database test that grabs a table lock (spin lock).
	   Simulates sql query executions.
	   300 clients + 400 cpu hog tasks to generate load.

		Without extension : 182K SQL exec/sec
		With extension    : 262K SQL exec/sec
	   44% improvement.

	Swingbench - standard database benchmark 
	   Cached (database files on tmpfs) run, with 1000 clients.

		Without extension : 99K SQL exec/sec
		With extension    : 153K SQL exec/sec
	   55% improvement in throughput.


Shared structure mechanism:
==========================

A per-thread structure is allocated from a page shared-mapped between user
space and the kernel. This is useful for sharing thread-specific information
between user space and the kernel without the need to make system calls in
latency-sensitive code paths.

Implementation:

A new system call is added for a user thread to request use of a shared
structure. The kernel allocates page(s), shared-mapped with user space, in
which the per-thread shared structures are allocated. These structures are
padded to 128 bytes. Multiple such shared structures are allocated from each
page (up to 32 per 4KB page) to accommodate requests from multiple threads
in a process, thus avoiding the need to allocate one page per thread.
Additional pages are allocated as needed to accommodate more threads
requesting the shared structure. The number of pages required depends on the
number of threads of the process requesting/using the shared structure.
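
For reference, the slot layout implied by the numbers above looks roughly
like this (a sketch only; the macro and helper names are illustrative, not
taken from the patches):

	#define SHARED_SLOT_SIZE	128	/* per-thread structure, padded */
	#define SLOTS_PER_PAGE		(PAGE_SIZE / SHARED_SLOT_SIZE)	/* 32 with 4KB pages */

	/* Address of the i-th per-thread slot within a shared page. */
	static inline void *shared_slot(void *page_base, unsigned int i)
	{
		return (char *)page_base + i * SHARED_SLOT_SIZE;
	}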

These pages are pinned, so the kernel can access/update the shared
structures through the kernel address, without the need for
copy_from_user()/copy_to_user() calls.

The system call returns a pointer (user address) to the per-thread shared
structure. Application threads can save this per-thread pointer in a TLS
variable and reference it directly.

The scheduler time extension request described above is one use case of
this shared structure API. The user thread requests an execution time
extension by setting a flag in the shared structure.

Additional members can be appended to the shared structure to implement
new features that address other use cases, for example sharing a thread's
time spent off cpu and on the run queue (described in [4]). These would help
the user thread measure cpu time consumption across some operation without
having to call the getrusage() system call or read /proc/pid/schedstat
frequently.

Another use case is to share the user thread's 'on' or 'off' cpu state
through the shared structure, which can be useful in implementing adaptive
waits in user space (as discussed in [1]): the waiter thread keeps spinning
only while the owner of the resource (lock) is 'on' cpu, as sketched below.
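
A minimal sketch of such an adaptive wait, assuming a hypothetical 'on_cpu'
field in the shared structure; the struct, try_lock() and futex_wait_path()
names are illustrative, not part of the proposed UAPI:

	struct task_sharedinfo_ext {
		volatile unsigned short sched_delay;
		volatile unsigned short on_cpu;	/* hypothetical: nonzero while thread runs */
	};

	struct lock {
		volatile int held;
		struct task_sharedinfo_ext *owner_shared;	/* owner's shared structure */
	};

	extern int try_lock(struct lock *lock);		/* hypothetical */
	extern void futex_wait_path(struct lock *lock);	/* hypothetical: block until released */

	static void lock_acquire_adaptive(struct lock *lock)
	{
		while (!try_lock(lock)) {
			struct task_sharedinfo_ext *owner = lock->owner_shared;

			/* Spin only while the owner is on cpu; otherwise block. */
			if (!owner || !owner->on_cpu)
				futex_wait_path(lock);
		}
	}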

API:
===
	The system call 
	int task_getshared(int option, int flags, void __user *uaddr)

	Only the TASK_SHAREDINFO option is supported for now; 'flags' is not used.

	/* option */
	#define TASK_SHAREDINFO 1

	struct task_sharedinfo {
		volatile unsigned short sched_delay;    
	};

	#define TASK_PREEMPT_DELAY_REQ     1
	#define TASK_PREEMPT_DELAY_GRANTED 2
	#define TASK_PREEMPT_DELAY_DENIED  3


	The following call returns a pointer to the calling thread's shared structure in 'ts':
	__thread struct task_sharedinfo *ts;
	task_getshared(TASK_SHAREDINFO, 0, &ts);

	The user task sets the 'sched_delay' member to TASK_PREEMPT_DELAY_REQ to
	request a scheduler time extension. The kernel sets 'sched_delay' to
	TASK_PREEMPT_DELAY_GRANTED to indicate that the request for an execution
	time extension was granted, or to TASK_PREEMPT_DELAY_DENIED if denied.
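
	For illustration, a sketch of how an application might use this around a
	short critical section (error handling omitted; grab_lock(), do_work()
	and release_lock() are hypothetical, and task_getshared() would in
	practice be invoked via syscall(2) since no libc wrapper exists):

	#include <sched.h>
	#include <linux/task_shared.h>	/* uapi header added by the patches */

	__thread struct task_sharedinfo *ts;

	void thread_init(void)
	{
		/* Kernel returns the per-thread shared structure's address in 'ts'. */
		task_getshared(TASK_SHAREDINFO, 0, &ts);
	}

	void locked_update(void)
	{
		ts->sched_delay = TASK_PREEMPT_DELAY_REQ;	/* ask to defer preemption */
		grab_lock();
		do_work();					/* short critical section */
		release_lock();
		if (ts->sched_delay == TASK_PREEMPT_DELAY_GRANTED)
			sched_yield();				/* give back the granted extension */
		ts->sched_delay = 0;
	}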

[1] https://lore.kernel.org/lkml/20231025054219.1acaa3dd@gandalf.local.home/
[2] https://lore.kernel.org/lkml/1395767870-28053-1-git-send-email-khalid.aziz@oracle.com/
[3] https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net/
[4] https://lore.kernel.org/all/1631147036-13597-1-git-send-email-prakash.sangappa@oracle.com/

Prakash Sangappa (4):
  Introduce per thread user-kernel shared structure
  Scheduler time extension
  Indicate if scheduler preemption delay request is granted
  Add scheduler preemption delay granted stats

 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/entry-common.h           |  10 +-
 include/linux/mm_types.h               |   4 +
 include/linux/sched.h                  |  30 ++
 include/linux/syscalls.h               |   2 +
 include/linux/task_shared.h            |  63 +++++
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/task_shared.h       |  29 ++
 init/Kconfig                           |  10 +
 kernel/entry/common.c                  |  15 +-
 kernel/fork.c                          |  12 +
 kernel/sched/core.c                    |  28 ++
 kernel/sched/debug.c                   |   4 +
 kernel/sched/syscalls.c                |   7 +
 kernel/sys_ni.c                        |   2 +
 mm/Makefile                            |   1 +
 mm/mmap.c                              |  13 +
 mm/task_shared.c                       | 366 +++++++++++++++++++++++++
 19 files changed, 593 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/task_shared.h
 create mode 100644 include/uapi/linux/task_shared.h
 create mode 100644 mm/task_shared.c

-- 
2.43.5
Re: [RFC PATCH 0/4] Scheduler time slice extension
Posted by Peter Zijlstra 1 week, 2 days ago
On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:

> This patch set implements the above mentioned 50us extension time as posted
> by Peter. But instead of using restartable sequences as API to set the flag
> to request the extension, this patch proposes a new API with use of a per
> thread shared structure implementation described below. This shared structure
> is accessible in both users pace and kernel. The user thread will set the
> flag in this shared structure to request execution time extension.

But why -- we already have rseq, glibc uses it by default. Why add yet
another thing?
Re: [RFC PATCH 0/4] Scheduler time slice extension
Posted by Prakash Sangappa 1 week, 2 days ago

> On Nov 13, 2024, at 10:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
> 
>> This patch set implements the above mentioned 50us extension time as posted
>> by Peter. But instead of using restartable sequences as API to set the flag
>> to request the extension, this patch proposes a new API with use of a per
>> thread shared structure implementation described below. This shared structure
>> is accessible in both users pace and kernel. The user thread will set the
>> flag in this shared structure to request execution time extension.
> 
> But why -- we already have rseq, glibc uses it by default. Why add yet
> another thing?

Mainly, this provides pinned memory, which the kernel can access without the concern of taking a page fault. There could be other use cases in the future where updating user memory in kernel context cannot afford to take page faults. For example, a use case like publishing the thread state needs to be done in the context switch path, when the thread is going off cpu.

-Prakash
Re: [RFC PATCH 0/4] Scheduler time slice extension
Posted by Mathieu Desnoyers 1 week, 2 days ago
On 2024-11-13 13:50, Peter Zijlstra wrote:
> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
> 
>> This patch set implements the above mentioned 50us extension time as posted
>> by Peter. But instead of using restartable sequences as API to set the flag
>> to request the extension, this patch proposes a new API with use of a per
>> thread shared structure implementation described below. This shared structure
>> is accessible in both users pace and kernel. The user thread will set the
>> flag in this shared structure to request execution time extension.
> 
> But why -- we already have rseq, glibc uses it by default. Why add yet
> another thing?

Indeed, what I'm not seeing in this RFC patch series cover letter is an
explanation that justifies adding yet another per-thread memory area
shared between kernel and userspace when we have extensible rseq
already.

Peter, was there anything fundamentally wrong with your approach based
on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net

The main thing I wonder is whether loading the rseq delay resched flag
on return to userspace is too late in your patch. Also, I'm not sure it is
realistic to require that no system calls should be done within time extension
slice. If we have this scenario:

A) userspace grabs lock
    - set rseq delay resched flag
B) syscall
    - reschedule
     [...]
    - return to userspace, load rseq delay-resched flag from userspace (too late)

I would have thought loading the delay resched flag should be attempted much
earlier in the scheduler code. Perhaps we could do this from a page fault
disable critical section, and accept that this hint may be a no-op if the
rseq page happens to be swapped out (which is really unlikely). This is
similar to the "on_cpu" sched state rseq extension RFC I posted a while back,
which needed to be accessed from the scheduler:

   https://lore.kernel.org/lkml/20230517152654.7193-1-mathieu.desnoyers@efficios.com/
   https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/

And we'd leave the delay-resched load in place on return to userspace, so
in the unlikely scenario where it is swapped out, at least it gets paged
back at that point.

Feel free to let me know if I'm missing an important point and/or saying
nonsense here.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [RFC PATCH 0/4] Scheduler time slice extension
Posted by Peter Zijlstra 1 week, 1 day ago
On Wed, Nov 13, 2024 at 02:36:58PM -0500, Mathieu Desnoyers wrote:
> On 2024-11-13 13:50, Peter Zijlstra wrote:
> > On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
> > 
> > > This patch set implements the above mentioned 50us extension time as posted
> > > by Peter. But instead of using restartable sequences as API to set the flag
> > > to request the extension, this patch proposes a new API with use of a per
> > > thread shared structure implementation described below. This shared structure
> > > is accessible in both users pace and kernel. The user thread will set the
> > > flag in this shared structure to request execution time extension.
> > 
> > But why -- we already have rseq, glibc uses it by default. Why add yet
> > another thing?
> 
> Indeed, what I'm not seeing in this RFC patch series cover letter is an
> explanation that justifies adding yet another per-thread memory area
> shared between kernel and userspace when we have extensible rseq
> already.
> 
> Peter, was there anything fundamentally wrong with your approach based
> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net

Not that I can remember, but it's a long time ago :-)

> The main thing I wonder is whether loading the rseq delay resched flag
> on return to userspace is too late in your patch.

Too late how? It only loads it at the point we would've called
schedule() -- no point in looking at it otherwise, right?

> Also, I'm not sure it is
> realistic to require that no system calls should be done within time extension
> slice. If we have this scenario:

Well, the whole premise is that syscalls are too expensive. If they are
not, then you shouldn't be using this in the first place.
Re: [RFC PATCH 0/4] Scheduler time slice extension
Posted by Mathieu Desnoyers 1 week ago
On 2024-11-14 05:14, Peter Zijlstra wrote:
> On Wed, Nov 13, 2024 at 02:36:58PM -0500, Mathieu Desnoyers wrote:
>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>
>>>> This patch set implements the above mentioned 50us extension time as posted
>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>> to request the extension, this patch proposes a new API with use of a per
>>>> thread shared structure implementation described below. This shared structure
>>>> is accessible in both users pace and kernel. The user thread will set the
>>>> flag in this shared structure to request execution time extension.
>>>
>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>> another thing?
>>
>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>> explanation that justifies adding yet another per-thread memory area
>> shared between kernel and userspace when we have extensible rseq
>> already.
>>
>> Peter, was there anything fundamentally wrong with your approach based
>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
> 
> Not that I can remember, but it's a long time ago :-)
> 
>> The main thing I wonder is whether loading the rseq delay resched flag
>> on return to userspace is too late in your patch.
> 
> Too late how? It only loads it at the point we would've called
> schedule() -- no point in looking at it otherwise, right?

[...]

For the specific return-to-userspace path, I think where you've placed
the delay-resched flag check is fine.

I'm concerned about other code paths that invoke schedule() besides
return-to-userspace. For instance:

raw_irqentry_exit_cond_resched():

         if (!preempt_count()) {
[...]
                 if (need_resched())
                         preempt_schedule_irq();
         }

AFAIU, this could be triggered by an interrupt handler exit when nested
over a page fault handler, exception handler, or system call.

We may decide that we cannot care less about those scenarios, and just
ignore the delay-resched flag, but it's relevant to take those into
consideration and clearly document the rationale behind our decision.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [RFC PATCH 0/4] Scheduler time slice extension
Posted by Prakash Sangappa 1 week ago

> On Nov 15, 2024, at 6:41 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> On 2024-11-14 05:14, Peter Zijlstra wrote:
>> On Wed, Nov 13, 2024 at 02:36:58PM -0500, Mathieu Desnoyers wrote:
>>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>> 
>>>>> This patch set implements the above mentioned 50us extension time as posted
>>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>>> to request the extension, this patch proposes a new API with use of a per
>>>>> thread shared structure implementation described below. This shared structure
>>>>> is accessible in both users pace and kernel. The user thread will set the
>>>>> flag in this shared structure to request execution time extension.
>>>> 
>>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>>> another thing?
>>> 
>>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>>> explanation that justifies adding yet another per-thread memory area
>>> shared between kernel and userspace when we have extensible rseq
>>> already.
>>> 
>>> Peter, was there anything fundamentally wrong with your approach based
>>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>> Not that I can remember, but it's a long time ago :-)
>>> The main thing I wonder is whether loading the rseq delay resched flag
>>> on return to userspace is too late in your patch.
>> Too late how? It only loads it at the point we would've called
>> schedule() -- no point in looking at it otherwise, right?
> 
> [...]
> 
> For the specific return-to-userspace path, I think where you've placed
> the delay-resched flag check is fine.
> 
> I'm concerned about other code paths that invoke schedule() besides
> return-to-userspace. For instance:
> 
> raw_irqentry_exit_cond_resched():
> 
>        if (!preempt_count()) {
> [...]
>                if (need_resched())
>                        preempt_schedule_irq();
>        }
> 
> AFAIU, this could be triggered by an interrupt handler exit when nested
> over a page fault handler, exception handler, or system call.
> 
> We may decide that we cannot care less about those scenarios, and just
> ignore the delay-resched flag, but it's relevant to take those into
> consideration and clearly document the rationale behind our decision.

I don't think delay-resched will address all scenarios where preemption can
occur while in a critical section. We could aim to address the frequent paths
where a task can get preempted. Initially the intent was to prevent preemption
mainly at the end of the time slice, if the thread is in a critical section in
user space and has requested delaying the reschedule.

Another path to consider is wakeups occurring on a different cpu, which could
enqueue a thread and attempt to preempt this thread while it is running in the
critical section. Should resched_curr() check whether the running thread has
been granted extra time, i.e. 'taskshrd_sched_delay' is set, and avoid setting
TIF_NEED_RESCHED and sending the IPI, like lazy preemption does? If
'taskshrd_sched_delay' is set, we know the thread will get preempted soon due
to the timer, or the task will call sched_yield() (if it is a well behaving
application).

> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
> 

Re: [RFC PATCH 0/4] Scheduler time slice extension
Posted by Prakash Sangappa 1 week, 2 days ago

> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> On 2024-11-13 13:50, Peter Zijlstra wrote:
>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>> This patch set implements the above mentioned 50us extension time as posted
>>> by Peter. But instead of using restartable sequences as API to set the flag
>>> to request the extension, this patch proposes a new API with use of a per
>>> thread shared structure implementation described below. This shared structure
>>> is accessible in both users pace and kernel. The user thread will set the
>>> flag in this shared structure to request execution time extension.
>> But why -- we already have rseq, glibc uses it by default. Why add yet
>> another thing?
> 
> Indeed, what I'm not seeing in this RFC patch series cover letter is an
> explanation that justifies adding yet another per-thread memory area
> shared between kernel and userspace when we have extensible rseq
> already.

It mainly provides pinned memory, which can be useful for future use cases where updating user memory in kernel context needs to be fast or must avoid page faults.

> 
> Peter, was there anything fundamentally wrong with your approach based
> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
> 
> The main thing I wonder is whether loading the rseq delay resched flag
> on return to userspace is too late in your patch. Also, I'm not sure it is
> realistic to require that no system calls should be done within time extension
> slice. If we have this scenario:

I am also not sure if we need to prevent system calls in this scenario.
Was that restriction mainly because the restartable sequence API implements it?

-Prakash

> 
> A) userspace grabs lock
>   - set rseq delay resched flag
> B) syscall
>   - reschedule
>    [...]
>   - return to userspace, load rseq delay-resched flag from userspace (too late)
> 
> I would have thought loading the delay resched flag should be attempted much
> earlier in the scheduler code. Perhaps we could do this from a page fault
> disable critical section, and accept that this hint may be a no-op if the
> rseq page happens to be swapped out (which is really unlikely). This is
> similar to the "on_cpu" sched state rseq extension RFC I posted a while back,
> which needed to be accessed from the scheduler:
> 
>  https://lore.kernel.org/lkml/20230517152654.7193-1-mathieu.desnoyers@efficios.com/
>  https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/
> 
> And we'd leave the delay-resched load in place on return to userspace, so
> in the unlikely scenario where it is swapped out, at least it gets paged
> back at that point.
> 
> Feel free to let me know if I'm missing an important point and/or saying
> nonsense here.
> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
> 

Re: [RFC PATCH 0/4] Scheduler time slice extension
Posted by Peter Zijlstra 1 week, 1 day ago
On Wed, Nov 13, 2024 at 08:10:52PM +0000, Prakash Sangappa wrote:
> 
> 
> > On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > 
> > On 2024-11-13 13:50, Peter Zijlstra wrote:
> >> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
> >>> This patch set implements the above mentioned 50us extension time as posted
> >>> by Peter. But instead of using restartable sequences as API to set the flag
> >>> to request the extension, this patch proposes a new API with use of a per
> >>> thread shared structure implementation described below. This shared structure
> >>> is accessible in both users pace and kernel. The user thread will set the
> >>> flag in this shared structure to request execution time extension.
> >> But why -- we already have rseq, glibc uses it by default. Why add yet
> >> another thing?
> > 
> > Indeed, what I'm not seeing in this RFC patch series cover letter is an
> > explanation that justifies adding yet another per-thread memory area
> > shared between kernel and userspace when we have extensible rseq
> > already.
> 
> It mainly provides pinned memory, can be useful for  future use cases
> where updating user memory in kernel context can be fast or needs to
> avoid pagefaults.

'might be useful' is not good enough a justification. Also, I don't
think you actually need this.

See:

  https://lkml.kernel.org/r/20220113233940.3608440-4-posk@google.com

for a more elaborate scheme.

> > Peter, was there anything fundamentally wrong with your approach based
> > on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
> > 
> > The main thing I wonder is whether loading the rseq delay resched flag
> > on return to userspace is too late in your patch. Also, I'm not sure it is
> > realistic to require that no system calls should be done within time extension
> > slice. If we have this scenario:
> 
> I am also not sure if we need to prevent system calls in this scenario. 
> Was that restriction mainly because of restartable sequence API implements it?

No, the whole premise of delaying resched was because people think that
syscalls are too slow. If you do not think this, then you shouldn't be
using this.
Re: [RFC PATCH 0/4] Scheduler time slice extension
Posted by Prakash Sangappa 1 week, 1 day ago

> On Nov 14, 2024, at 2:28 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Nov 13, 2024 at 08:10:52PM +0000, Prakash Sangappa wrote:
>> 
>> 
>>> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>> 
>>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>>> This patch set implements the above mentioned 50us extension time as posted
>>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>>> to request the extension, this patch proposes a new API with use of a per
>>>>> thread shared structure implementation described below. This shared structure
>>>>> is accessible in both users pace and kernel. The user thread will set the
>>>>> flag in this shared structure to request execution time extension.
>>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>>> another thing?
>>> 
>>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>>> explanation that justifies adding yet another per-thread memory area
>>> shared between kernel and userspace when we have extensible rseq
>>> already.
>> 
>> It mainly provides pinned memory, can be useful for  future use cases
>> where updating user memory in kernel context can be fast or needs to
>> avoid pagefaults.
> 
> 'might be useful' it not good enough a justification. Also, I don't
> think you actually need this.

Will get back with database benchmark results using rseq API for scheduler time extension.

> 
> See:
> 
>  https://lkml.kernel.org/r/20220113233940.3608440-4-posk@google.com
> 
> for a more elaborate scheme.
> 
>>> Peter, was there anything fundamentally wrong with your approach based
>>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>>> 
>>> The main thing I wonder is whether loading the rseq delay resched flag
>>> on return to userspace is too late in your patch. Also, I'm not sure it is
>>> realistic to require that no system calls should be done within time extension
>>> slice. If we have this scenario:
>> 
>> I am also not sure if we need to prevent system calls in this scenario. 
>> Was that restriction mainly because of restartable sequence API implements it?
> 
> No, the whole premise of delaying resched was because people think that
> syscalls are too slow. If you do not think this, then you shouldn't be
> using this.

Agree.

Thanks,
-Prakash

Re: [RFC PATCH 0/4] Scheduler time slice extension
Posted by Mathieu Desnoyers 1 week ago
On 2024-11-14 14:42, Prakash Sangappa wrote:
> 
> 
>> On Nov 14, 2024, at 2:28 AM, Peter Zijlstra <peterz@infradead.org> wrote:

[...]

>>
>> See:
>>
>>   https://lkml.kernel.org/r/20220113233940.3608440-4-posk@google.com
>>
>> for a more elaborate scheme.
>>
>>>> Peter, was there anything fundamentally wrong with your approach based
>>>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>>>>
>>>> The main thing I wonder is whether loading the rseq delay resched flag
>>>> on return to userspace is too late in your patch. Also, I'm not sure it is
>>>> realistic to require that no system calls should be done within time extension
>>>> slice. If we have this scenario:
>>>
>>> I am also not sure if we need to prevent system calls in this scenario.
>>> Was that restriction mainly because of restartable sequence API implements it?
>>
>> No, the whole premise of delaying resched was because people think that
>> syscalls are too slow. If you do not think this, then you shouldn't be
>> using this.
> 
> Agree.

I only partially agree with Peter here. I agree that we don't want to
add system calls on the delay-resched critical section fast path,
because this would have a significant performance hit.

But there are scenarios where issuing system calls from within that
critical section would be needed, even though those would not belong
to the fast path:

1) If a signal handler nests over a delay-resched critical section.
    That signal handler is allowed to issue system calls.

2) If the critical section fast-path is calling GNU C library API and/or
    a vDSO, which is typically fast, but can end up calling a system call
    as fallback. e.g. clock_gettime, sched_getcpu. Preventing use of a
    system call by killing the application punches a hole in the
    abstractions meant to be provided by GNU libc and vDSO.

I would recommend that we allow issuing system calls while the
delay-resched bit is set. However, we may not strictly need to honor
the delay-resched hint from a system call context, as those would
be expected to be either infrequent or a portability fallback,
which means the enhanced performance provided by delay-resched
really won't matter.

Another scenario to keep in mind is page faults happening within a
delay-resched critical section. This is a scenario where page fault
handling can explicitly reschedule. If this happens, I suspect we
really don't care about the delay-resched hint, but we should consider
whether this hint should be left as-is or cleared.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [RFC PATCH 0/4] Scheduler time slice extension
Posted by Prakash Sangappa 1 week ago

> On Nov 15, 2024, at 6:13 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> On 2024-11-14 14:42, Prakash Sangappa wrote:
>>> On Nov 14, 2024, at 2:28 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> [...]
> 
>>> 
>>> See:
>>> 
>>>  https://lkml.kernel.org/r/20220113233940.3608440-4-posk@google.com
>>> 
>>> for a more elaborate scheme.
>>> 
>>>>> Peter, was there anything fundamentally wrong with your approach based
>>>>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>>>>> 
>>>>> The main thing I wonder is whether loading the rseq delay resched flag
>>>>> on return to userspace is too late in your patch. Also, I'm not sure it is
>>>>> realistic to require that no system calls should be done within time extension
>>>>> slice. If we have this scenario:
>>>> 
>>>> I am also not sure if we need to prevent system calls in this scenario.
>>>> Was that restriction mainly because of restartable sequence API implements it?
>>> 
>>> No, the whole premise of delaying resched was because people think that
>>> syscalls are too slow. If you do not think this, then you shouldn't be
>>> using this.
>> Agree.
> 
> I only partially agree with Peter here. I agree that we don't want to
> add system calls on the delay-resched critical section fast path,
> because this would have a significant performance hit.
> 
> But there are scenarios where issuing system calls from within that
> critical section would be needed, even though those would not belong
> to the fast path:
> 
> 1) If a signal handler nests over a delay-resched critical section.
>   That signal handler is allowed to issue system calls.
> 
> 2) If the critical section fast-path is calling GNU C library API and/or
>   a vDSO, which is typically fast, but can end up calling a system call
>   as fallback. e.g. clock_gettime, sched_getcpu. Preventing use of a
>   system call by killing the application punches a hole in the
>   abstractions meant to be provided by GNU libc and vDSO.
> 
> I would recommend that we allow issuing system calls while the
> delay-resched bit is set. However, we may not strictly need to honor
> the delay-resched hint from a system call context, as those would
> be expected to be either infrequent or a portability fallback,
> which means the enhanced performance provided by delay-resched
> really won't matter.
> 
> Another scenario to keep in mind are page faults happening within a
> delay-resched critical section. This is a scenario where page fault
> handling can explicitly reschedule. If this happens, I suspect we
> really don't care about the delay-resched hint, but we should consider
> whether this hint should be left as-is or cleared.
> 

There are no explicit checks right now that a system call occurred in the critical section,
except under DEBUG_RSEQ, which probably will not apply to the delay-resched case.

But if the question is whether such checks should be implemented, probably not,
due to the scenarios you mentioned above.


> Thoughts ?
> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
> 

Re: [RFC PATCH 0/4] Scheduler time slice extension
Posted by Mathieu Desnoyers 1 week, 2 days ago
On 2024-11-13 15:10, Prakash Sangappa wrote:
> 
> 
>> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>
>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>> This patch set implements the above mentioned 50us extension time as posted
>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>> to request the extension, this patch proposes a new API with use of a per
>>>> thread shared structure implementation described below. This shared structure
>>>> is accessible in both users pace and kernel. The user thread will set the
>>>> flag in this shared structure to request execution time extension.
>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>> another thing?
>>
>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>> explanation that justifies adding yet another per-thread memory area
>> shared between kernel and userspace when we have extensible rseq
>> already.
> 
> It mainly provides pinned memory, can be useful for  future use cases where updating user memory in kernel context can be fast or needs to avoid pagefaults.

Does the targeted use-case (scheduler time slice extension) require
pinned memory, or just future use-cases ?

Does having a missing time slice extension hint for a short while in
case of high memory pressure (rseq page swapped out) have any measurable
impact compared to the overhead of the page faults which will be
happening in case of the high memory pressure required to trigger this
scenario ?

> 
>>
>> Peter, was there anything fundamentally wrong with your approach based
>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>>
>> The main thing I wonder is whether loading the rseq delay resched flag
>> on return to userspace is too late in your patch. Also, I'm not sure it is
>> realistic to require that no system calls should be done within time extension
>> slice. If we have this scenario:
> 
> I am also not sure if we need to prevent system calls in this scenario.

I suspect that if we prohibit system calls from being issued from a
delay-resched userspace critical section, then loading the delay-resched
rseq flag on return to userspace is always fine, because the kernel only
reschedules on return from interrupt or trap.

But I see this no-syscall restriction as being very cumbersome for
userspace.

> Was that restriction mainly because of restartable sequence API implements it?

I suspect that this restriction is just to avoid loading the
delay-resched flag from the scheduler when reschedule is called
from an interrupt handler nested over a system call for preemptible
kernels, but Peter could tell us more.

One open question here is whether we want to pin memory for
each thread in the system to hold this shared data between
userspace and the scheduler. AFAIU, this is a trade-off between
slice extension accuracy in high memory usage scenarios and
pinned memory footprint impact. If the tradeoff goes towards
making this memory pinned, then we may want to consider pinning
the per-thread rseq area on rseq registration.

Another option to consider is to use rseq to index a userspace
per-cpu data structure, which will be used as shared memory
between kernel and userspace. Userspace could store this
delay-resched flag into the current CPU's shared data, and
the scheduler could load it from there. If pinning per-cpu
data is more acceptable than pinning per-thread data, then
it could be an improvement.

This could be a new ABI between kernel and userspace, e.g.:

struct rseq_percpu_area {
     __u32 sched_state;	/* includes time slice extension flag. */
     char end[];
};

Registered to the kernel with the following parameters:

- Address of rseq_percpu_area for CPU 0,
- The stride of the per-cpu indexing (see librseq mempool per-cpu
   allocator [1]),
- offsetof(struct rseq_percpu_area, end) to have the feature size
   for extensibility.
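
For example, a sketch of how userspace might then locate its current cpu's
area under such a registration (assuming 'percpu_base' and 'stride' are
whatever was registered, and using librseq's rseq_current_cpu_raw() to read
the current cpu number):

static inline struct rseq_percpu_area *my_percpu_area(void *percpu_base,
                                                      size_t stride)
{
	int cpu = rseq_current_cpu_raw();

	return (struct rseq_percpu_area *)((char *)percpu_base + cpu * stride);
}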

Thanks,

Mathieu

[1] https://lpc.events/event/18/contributions/1720/attachments/1572/3268/presentation-lpc2024-rseq-mempool.pdf

> 
> -Prakash
> 
>>
>> A) userspace grabs lock
>>    - set rseq delay resched flag
>> B) syscall
>>    - reschedule
>>     [...]
>>    - return to userspace, load rseq delay-resched flag from userspace (too late)
>>
>> I would have thought loading the delay resched flag should be attempted much
>> earlier in the scheduler code. Perhaps we could do this from a page fault
>> disable critical section, and accept that this hint may be a no-op if the
>> rseq page happens to be swapped out (which is really unlikely). This is
>> similar to the "on_cpu" sched state rseq extension RFC I posted a while back,
>> which needed to be accessed from the scheduler:
>>
>>   https://lore.kernel.org/lkml/20230517152654.7193-1-mathieu.desnoyers@efficios.com/
>>   https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/
>>
>> And we'd leave the delay-resched load in place on return to userspace, so
>> in the unlikely scenario where it is swapped out, at least it gets paged
>> back at that point.
>>
>> Feel free to let me know if I'm missing an important point and/or saying
>> nonsense here.
>>
>> Thanks,
>>
>> Mathieu
>>
>> -- 
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [RFC PATCH 0/4] Scheduler time slice extension
Posted by Prakash Sangappa 1 week, 2 days ago

> On Nov 13, 2024, at 12:57 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> On 2024-11-13 15:10, Prakash Sangappa wrote:
>>> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>> 
>>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>>> This patch set implements the above mentioned 50us extension time as posted
>>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>>> to request the extension, this patch proposes a new API with use of a per
>>>>> thread shared structure implementation described below. This shared structure
>>>>> is accessible in both users pace and kernel. The user thread will set the
>>>>> flag in this shared structure to request execution time extension.
>>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>>> another thing?
>>> 
>>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>>> explanation that justifies adding yet another per-thread memory area
>>> shared between kernel and userspace when we have extensible rseq
>>> already.
>> It mainly provides pinned memory, can be useful for  future use cases where updating user memory in kernel context can be fast or needs to avoid pagefaults.
> 
> Does the targeted use-case (scheduler time slice extension) require
> pinned memory, or just future use-cases ?

Probably not for the sched time slice extension, although we have not run the database workload using restartable sequences. We can try that and get back.

What about publishing thread state in the shared structure? It would require updating user memory in the context switch code path when the thread is going off the cpu. Here, having pinned memory would help. Thread state is the other requirement we are interested in.

> 
> Does having a missing time slice extension hint for a short while in
> case of high memory pressure (rseq page swapped out) have any measurable

Yes, but the restartable sequence based implementation does copy_from_user(), which would go through the page fault path. It may be ok here. Perhaps it would require disabling page faults in this case? If the page is not present, then the hint would most likely not matter and the kernel can skip it.

> impact compared to the overhead of the page faults which will be
> happening in case of the high memory pressure required to trigger this
> scenario ?

We would have to test that. 

> 
>>> 
>>> Peter, was there anything fundamentally wrong with your approach based
>>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>>> 
>>> The main thing I wonder is whether loading the rseq delay resched flag
>>> on return to userspace is too late in your patch. Also, I'm not sure it is
>>> realistic to require that no system calls should be done within time extension
>>> slice. If we have this scenario:
>> I am also not sure if we need to prevent system calls in this scenario.
> 
> I suspect that if we prohibit system calls from being issued from a
> delay-resched userspace critical section, then loading the delay-resched
> rseq flag on return to userspace is always fine, because the kernel only
> reschedules on return from interrupt or trap.
> 
> But I see this no-syscall restriction as being very cumbersome for
> userspace.
> 
>> Was that restriction mainly because of restartable sequence API implements it?
> 
> I suspect that this restriction is just to avoid loading the
> delay-resched flag from the scheduler when reschedule is called
> from an interrupt handler nested over a system call for preemptible
> kernels, but Peter could tell us more.

Hoping Peter can comment on this.

> One open question here is whether we want to pin memory for
> each thread in the system to hold this shared data between
> userspace and the scheduler. AFAIU, this is a trade-off between
> slice extension accuracy in high memory usage scenarios and
> pinned memory footprint impact. If the tradeoff goes towards
> making this memory pinned, then we may want to consider pinning
> the per-thread rseq area on rseq registration.
> 
> Another option to consider is to use rseq to index a userspace
> per-cpu data structure, which will be used as shared memory
> between kernel and userspace. Userspace could store this
> delay-resched flag into the current CPU's shared data, and
> the scheduler could load it from there. If pinning per-cpu
> data is more acceptable than pinning per-thread data, then
> it could be an improvement.

Interesting. Having per-cpu shared memory may not work for all use cases.

Thanks,
-Prakash

> 
> This could be a new ABI between kernel and userspace, e.g.:
> 
> struct rseq_percpu_area {
>    __u32 sched_state; /* includes time slice extension flag. */
>    char end[];
> };
> 
> Registered to the kernel with the following parameters:
> 
> - Address of rseq_percpu_area for CPU 0,
> - The stride of the per-cpu indexing (see librseq mempool per-cpu
>  allocator [1]),
> - offsetof(struct rseq_percpu_area, end) to have the feature size
>  for extensibility.
> 
> Thanks,
> 
> Mathieu
> 
> [1] https://lpc.events/event/18/contributions/1720/attachments/1572/3268/presentation-lpc2024-rseq-mempool.pdf
> 
>> -Prakash
>>> 
>>> A) userspace grabs lock
>>>   - set rseq delay resched flag
>>> B) syscall
>>>   - reschedule
>>>    [...]
>>>   - return to userspace, load rseq delay-resched flag from userspace (too late)
>>> 
>>> I would have thought loading the delay resched flag should be attempted much
>>> earlier in the scheduler code. Perhaps we could do this from a page fault
>>> disable critical section, and accept that this hint may be a no-op if the
>>> rseq page happens to be swapped out (which is really unlikely). This is
>>> similar to the "on_cpu" sched state rseq extension RFC I posted a while back,
>>> which needed to be accessed from the scheduler:
>>> 
>>>  https://lore.kernel.org/lkml/20230517152654.7193-1-mathieu.desnoyers@efficios.com/
>>>  https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/
>>> 
>>> And we'd leave the delay-resched load in place on return to userspace, so
>>> in the unlikely scenario where it is swapped out, at least it gets paged
>>> back at that point.
>>> 
>>> Feel free to let me know if I'm missing an important point and/or saying
>>> nonsense here.
>>> 
>>> Thanks,
>>> 
>>> Mathieu
>>> 
>>> -- 
>>> Mathieu Desnoyers
>>> EfficiOS Inc.
>>> https://www.efficios.com
>>> 
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com


Re: [RFC PATCH 0/4] Scheduler time slice extension
Posted by K Prateek Nayak 1 week, 3 days ago
Hello Prakash,

Few questions around the benchmarks.

On 11/13/2024 5:31 AM, Prakash Sangappa wrote:
> [..snip..] 
> 
> Test results:
> =============
> 	Test system 2 socket AMD Genoa
> 
> 	Lock table test:- a simple database test to grab table lock(spin lock).
> 	   Simulates sql query executions.
> 	   300 clients + 400 cpu hog tasks to generate load.

Have you tried running the 300 clients with a nice value of -20 and 400
CPU hogs with the default nice value / nice 19? Does that help this
particular case?

> 
> 		Without extension : 182K SQL exec/sec
> 		With extension    : 262K SQL exec/sec
> 	   44% improvement.
> 
> 	Swingbench - standard database benchmark
> 	   Cached(database files on tmpfs) run, with 1000 clients.

In this case, how does the performance fare when running the clients
under SCHED_BATCH? What does the "TASK_PREEMPT_DELAY_REQ" count vs
"TASK_PREEMPT_DELAY_GRANTED" count look like for the benchmark run?

I'm trying to understand what the performance looks like when using
existing features that inhibit preemption vs putting forward the
preemption when the userspace is holding a lock. Feel free to quote
the latency comparisons too if using the existing features lead to
unacceptable avg/tail latencies.

> 
> 		Without extension : 99K SQL exec/sec
> 		with extension    : 153K SQL exec/sec
> 	   55% improvement in throughput.
> 
> [..snip..]
> 

-- 
Thanks and Regards,
Prateek
Re: [RFC PATCH 0/4] Scheduler time slice extension
Posted by Prakash Sangappa 1 week, 2 days ago

> On Nov 12, 2024, at 9:43 PM, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> 
> Hello Prakash,
> 
> Few questions around the benchmarks.
> 
> On 11/13/2024 5:31 AM, Prakash Sangappa wrote:
>> [..snip..] Test results:
>> =============
>> Test system 2 socket AMD Genoa
>> Lock table test:- a simple database test to grab table lock(spin lock).
>>   Simulates sql query executions.
>>   300 clients + 400 cpu hog tasks to generate load.
> 
> Have you tried running the 300 clients with a nice value of -20 and 400
> CPU hogs with the default nice value / nice 19? Does that help this
> particular case?

Have not tried this with the database. Will have to try it.


> 
>> Without extension : 182K SQL exec/sec
>> With extension    : 262K SQL exec/sec
>>   44% improvement.
>> Swingbench - standard database benchmark
>>   Cached(database files on tmpfs) run, with 1000 clients.
> 
> In this case, how does the performance fare when running the clients
> under SCHED_BATCH? What does the "TASK_PREEMPT_DELAY_REQ" count vs
> "TASK_PREEMPT_DELAY_GRANTED" count look like for the benchmark run?

Not tried SCHED_BATCH. 

With this run, there were on average about 166 TASK_PREEMPT_DELAY_GRANTED grants per task, collected from the scheduler stats captured at the end of the run. The test runs for about 5 minutes. We don't have a count of how many times the preempt delay was requested. If the task completes the critical section, it clears the TASK_PREEMPT_DELAY_REQ flag, so the kernel would not see it in many cases, as the request may not coincide with the end of the time slice. We would have to capture that count in the application.


> 
> I'm trying to understand what the performance looks like when using
> existing features that inhibit preemption vs putting forward the
> preemption when the userspace is holding a lock. Feel free to quote
> the latency comparisons too if using the existing features lead to
> unacceptable avg/tail latencies.
> 
>> Without extension : 99K SQL exec/sec
>> with extension    : 153K SQL exec/sec
>>   55% improvement in throughput.
>> [..snip..]
> 
> -- 
> Thanks and Regards,
> Prateek