A user thread can get preempted in the middle of executing a critical
section in user space while holding locks, which can have an undesirable
effect on performance. Having a way for the thread to request additional
execution time on the cpu, so that it can complete the critical section,
is useful in such a scenario. The request can be made by setting a bit in
mapped memory that the kernel can also access, to check and grant extra
execution time on the cpu.

There have been a couple of proposals[1][2] for such a feature, which
attempt to address the above scenario by granting one extra tick of
execution time. In the patch thread [1] posted by Steven Rostedt, there is
ample discussion about the need for this feature. However, the concern has
been that this can lead to abuse; one extra tick can be a long time (about
a millisecond or more). Peter Zijlstra in response posted a prototype
solution[3], which grants only a 50us execution time extension. This is
achieved with the help of a timer started on that cpu at the time of
granting the extra execution time. When the timer fires, the thread will
be preempted if it is still running.

This patch set implements the above mentioned 50us extension time as
posted by Peter. But instead of using restartable sequences as the API to
set the flag requesting the extension, this patch set proposes a new API
based on a per thread shared structure, described below. This shared
structure is accessible from both user space and the kernel. The user
thread sets a flag in this shared structure to request an execution time
extension.

We tried this change with a database workload. Here are the results,
comparing runs with and without use of the 50us scheduler time slice
extension. This clearly shows there is a benefit.

Test results:
=============
Test system: 2 socket AMD Genoa

Lock table test: a simple database test to grab a table lock (spin lock).
Simulates sql query executions.
300 clients + 400 cpu hog tasks to generate load.

  Without extension : 182K SQL exec/sec
  With extension    : 262K SQL exec/sec
  44% improvement.

Swingbench - standard database benchmark
Cached (database files on tmpfs) run, with 1000 clients.

  Without extension : 99K SQL exec/sec
  With extension    : 153K SQL exec/sec
  55% improvement in throughput.

Shared structure mechanism:
==========================
A per thread structure is allocated from a page shared mapped between user
space and the kernel. This is useful for sharing thread specific
information between user space and the kernel without the need for system
calls in latency sensitive code paths.

Implementation:
A new system call is added so that a user thread can request use of a
shared structure. The kernel allocates page(s), shared mapped with user
space, in which the per-thread shared structures are allocated. These
structures are padded to 128 bytes. Multiple such shared structures are
allocated from each page (up to 32 per 4k page) to accommodate requests
from multiple threads in a process, thus avoiding the need to allocate one
page per thread. Additional pages are allocated as needed to accommodate
more threads requesting the shared structure; the number of pages required
depends on the number of threads of the process requesting/using the
shared structure. These pages are pinned, so the kernel can access/update
the shared structure through the kernel address without needing
copy_from_user()/copy_to_user() calls. The system call returns a pointer
(user address) to the per thread shared structure.
Application threads can save this per thread pointer in a TLS variable and
reference it. The scheduler time extension request described above is one
use case of this shared structure API: the user thread requests an
execution time extension by setting a flag in the shared structure.

Additional members can be appended to the shared structure to implement
new features that address other use cases. For example, sharing a thread's
time spent off cpu and on the run queue (described in [4]) would help the
user thread measure cpu time consumption across some operation, without
having to call the getrusage() system call or read /proc/pid/schedstat
frequently. Another use case is to share the user thread's state, 'on' or
'off' cpu, through the shared structure, which can be useful in
implementing adaptive waits in user space (as discussed in [1]): the
waiter thread checks whether the owner of the resource (lock) is 'on' cpu
before continuing to spin.

API:
====
The system call:

  int task_getshared(int option, int flags, void __user *uaddr)

Only the TASK_SHAREDINFO option is supported for now; 'flags' is not used.

  /* option */
  #define TASK_SHAREDINFO 1

  struct task_sharedinfo {
          volatile unsigned short sched_delay;
  };

  #define TASK_PREEMPT_DELAY_REQ     1
  #define TASK_PREEMPT_DELAY_GRANTED 2
  #define TASK_PREEMPT_DELAY_DENIED  3

The following call:

  __thread struct task_sharedinfo *ts;
  task_getshared(TASK_SHAREDINFO, 0, &ts);

The user task sets the 'sched_delay' member to TASK_PREEMPT_DELAY_REQ to
request a scheduler time extension. The kernel sets 'sched_delay' to
TASK_PREEMPT_DELAY_GRANTED if the request for an execution time extension
was granted, or to TASK_PREEMPT_DELAY_DENIED if it was denied.

[1] https://lore.kernel.org/lkml/20231025054219.1acaa3dd@gandalf.local.home/
[2] https://lore.kernel.org/lkml/1395767870-28053-1-git-send-email-khalid.aziz@oracle.com/
[3] https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net/
[4] https://lore.kernel.org/all/1631147036-13597-1-git-send-email-prakash.sangappa@oracle.com/

Prakash Sangappa (4):
  Introduce per thread user-kernel shared structure
  Scheduler time extension
  Indicate if scheduler preemption delay request is granted
  Add scheduler preemption delay granted stats

 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/entry-common.h           |  10 +-
 include/linux/mm_types.h               |   4 +
 include/linux/sched.h                  |  30 ++
 include/linux/syscalls.h               |   2 +
 include/linux/task_shared.h            |  63 +++++
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/task_shared.h       |  29 ++
 init/Kconfig                           |  10 +
 kernel/entry/common.c                  |  15 +-
 kernel/fork.c                          |  12 +
 kernel/sched/core.c                    |  28 ++
 kernel/sched/debug.c                   |   4 +
 kernel/sched/syscalls.c                |   7 +
 kernel/sys_ni.c                        |   2 +
 mm/Makefile                            |   1 +
 mm/mmap.c                              |  13 +
 mm/task_shared.c                       | 366 +++++++++++++++++++++++++
 19 files changed, 593 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/task_shared.h
 create mode 100644 include/uapi/linux/task_shared.h
 create mode 100644 mm/task_shared.c

-- 
2.43.5
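For reference, a minimal user space usage sketch of the API above. This is
only an illustration: the __NR_task_getshared number shown is a placeholder
(no libc wrapper exists, so a raw syscall(2) is assumed), and the constants
are taken from the new uapi header added by this series.

  #include <sched.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/task_shared.h>  /* TASK_SHAREDINFO, struct task_sharedinfo */

  #ifndef __NR_task_getshared
  #define __NR_task_getshared 463 /* placeholder syscall number */
  #endif

  static __thread struct task_sharedinfo *ts;

  static void thread_register(void)
  {
          /* Once per thread: the kernel returns a pointer into the shared page. */
          syscall(__NR_task_getshared, TASK_SHAREDINFO, 0, &ts);
  }

  static void locked_region(void)
  {
          /* Ask for up to 50us of extra time before being preempted. */
          ts->sched_delay = TASK_PREEMPT_DELAY_REQ;

          /* ... take the user space lock, run the critical section, unlock ... */

          /* Clear the request; if extra time was granted, yield the cpu. */
          if (__atomic_exchange_n(&ts->sched_delay, 0, __ATOMIC_RELAXED) ==
              TASK_PREEMPT_DELAY_GRANTED)
                  sched_yield();
  }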
On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:

> This patch set implements the above mentioned 50us extension time as posted
> by Peter. But instead of using restartable sequences as API to set the flag
> to request the extension, this patch proposes a new API with use of a per
> thread shared structure implementation described below. This shared structure
> is accessible in both users pace and kernel. The user thread will set the
> flag in this shared structure to request execution time extension.

But why -- we already have rseq, glibc uses it by default. Why add yet
another thing?
> On Nov 13, 2024, at 10:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>
>> This patch set implements the above mentioned 50us extension time as posted
>> by Peter. But instead of using restartable sequences as API to set the flag
>> to request the extension, this patch proposes a new API with use of a per
>> thread shared structure implementation described below. This shared structure
>> is accessible in both users pace and kernel. The user thread will set the
>> flag in this shared structure to request execution time extension.
>
> But why -- we already have rseq, glibc uses it by default. Why add yet
> another thing?

Mainly this provides pinned memory, which the kernel can access without the
concern of taking a page fault. There could be other use cases in the future
where updating user memory in kernel context cannot afford to take page
faults. For example, implementing a use case like providing the thread state
needs to be done in the context switch path when the thread is going off the
cpu.

-Prakash
On 2024-11-13 13:50, Peter Zijlstra wrote:
> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>
>> This patch set implements the above mentioned 50us extension time as posted
>> by Peter. But instead of using restartable sequences as API to set the flag
>> to request the extension, this patch proposes a new API with use of a per
>> thread shared structure implementation described below. This shared structure
>> is accessible in both users pace and kernel. The user thread will set the
>> flag in this shared structure to request execution time extension.
>
> But why -- we already have rseq, glibc uses it by default. Why add yet
> another thing?

Indeed, what I'm not seeing in this RFC patch series cover letter is an
explanation that justifies adding yet another per-thread memory area
shared between kernel and userspace when we have extensible rseq
already.

Peter, was there anything fundamentally wrong with your approach based
on rseq?
https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net

The main thing I wonder is whether loading the rseq delay resched flag
on return to userspace is too late in your patch. Also, I'm not sure it is
realistic to require that no system calls be done within the time extension
slice. If we have this scenario:

A) userspace grabs lock
   - set rseq delay resched flag
B) syscall
   - reschedule
   [...]
   - return to userspace, load rseq delay-resched flag from userspace (too late)

I would have thought loading the delay resched flag should be attempted much
earlier in the scheduler code. Perhaps we could do this from a page fault
disable critical section, and accept that this hint may be a no-op if the
rseq page happens to be swapped out (which is really unlikely). This is
similar to the "on_cpu" sched state rseq extension RFC I posted a while back,
which needed to be accessed from the scheduler:

https://lore.kernel.org/lkml/20230517152654.7193-1-mathieu.desnoyers@efficios.com/
https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/

And we'd leave the delay-resched load in place on return to userspace, so
in the unlikely scenario where it is swapped out, at least it gets paged
back in at that point.

Feel free to let me know if I'm missing an important point and/or saying
nonsense here.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Wed, Nov 13, 2024 at 02:36:58PM -0500, Mathieu Desnoyers wrote:
> On 2024-11-13 13:50, Peter Zijlstra wrote:
> > On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
> >
> > > This patch set implements the above mentioned 50us extension time as posted
> > > by Peter. But instead of using restartable sequences as API to set the flag
> > > to request the extension, this patch proposes a new API with use of a per
> > > thread shared structure implementation described below. This shared structure
> > > is accessible in both users pace and kernel. The user thread will set the
> > > flag in this shared structure to request execution time extension.
> >
> > But why -- we already have rseq, glibc uses it by default. Why add yet
> > another thing?
>
> Indeed, what I'm not seeing in this RFC patch series cover letter is an
> explanation that justifies adding yet another per-thread memory area
> shared between kernel and userspace when we have extensible rseq
> already.
>
> Peter, was there anything fundamentally wrong with your approach based
> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net

Not that I can remember, but it's a long time ago :-)

> The main thing I wonder is whether loading the rseq delay resched flag
> on return to userspace is too late in your patch.

Too late how? It only loads it at the point we would've called
schedule() -- no point in looking at it otherwise, right?

> Also, I'm not sure it is
> realistic to require that no system calls should be done within time extension
> slice. If we have this scenario:

Well, the whole premise is that syscalls are too expensive. If they are
not, then you shouldn't be using this in the first place.
On 2024-11-14 05:14, Peter Zijlstra wrote:
> On Wed, Nov 13, 2024 at 02:36:58PM -0500, Mathieu Desnoyers wrote:
>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>
>>>> This patch set implements the above mentioned 50us extension time as posted
>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>> to request the extension, this patch proposes a new API with use of a per
>>>> thread shared structure implementation described below. This shared structure
>>>> is accessible in both users pace and kernel. The user thread will set the
>>>> flag in this shared structure to request execution time extension.
>>>
>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>> another thing?
>>
>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>> explanation that justifies adding yet another per-thread memory area
>> shared between kernel and userspace when we have extensible rseq
>> already.
>>
>> Peter, was there anything fundamentally wrong with your approach based
>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>
> Not that I can remember, but it's a long time ago :-)
>
>> The main thing I wonder is whether loading the rseq delay resched flag
>> on return to userspace is too late in your patch.
>
> Too late how? It only loads it at the point we would've called
> schedule() -- no point in looking at it otherwise, right?

[...]

For the specific return-to-userspace path, I think where you've placed
the delay-resched flag check is fine.

I'm concerned about other code paths that invoke schedule() besides
return-to-userspace. For instance:

raw_irqentry_exit_cond_resched():

        if (!preempt_count()) {
                [...]
                if (need_resched())
                        preempt_schedule_irq();
        }

AFAIU, this could be triggered by an interrupt handler exit when nested
over a page fault handler, exception handler, or system call.

We may decide that we cannot care less about those scenarios, and just
ignore the delay-resched flag, but it's relevant to take those into
consideration and clearly document the rationale behind our decision.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
> On Nov 15, 2024, at 6:41 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> On 2024-11-14 05:14, Peter Zijlstra wrote:
>> On Wed, Nov 13, 2024 at 02:36:58PM -0500, Mathieu Desnoyers wrote:
>>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>>
>>>>> This patch set implements the above mentioned 50us extension time as posted
>>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>>> to request the extension, this patch proposes a new API with use of a per
>>>>> thread shared structure implementation described below. This shared structure
>>>>> is accessible in both users pace and kernel. The user thread will set the
>>>>> flag in this shared structure to request execution time extension.
>>>>
>>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>>> another thing?
>>>
>>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>>> explanation that justifies adding yet another per-thread memory area
>>> shared between kernel and userspace when we have extensible rseq
>>> already.
>>>
>>> Peter, was there anything fundamentally wrong with your approach based
>>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>> Not that I can remember, but it's a long time ago :-)
>>> The main thing I wonder is whether loading the rseq delay resched flag
>>> on return to userspace is too late in your patch.
>> Too late how? It only loads it at the point we would've called
>> schedule() -- no point in looking at it otherwise, right?
>
> [...]
>
> For the specific return-to-userspace path, I think where you've placed
> the delay-resched flag check is fine.
>
> I'm concerned about other code paths that invoke schedule() besides
> return-to-userspace. For instance:
>
> raw_irqentry_exit_cond_resched():
>
>         if (!preempt_count()) {
>                 [...]
>                 if (need_resched())
>                         preempt_schedule_irq();
>         }
>
> AFAIU, this could be triggered by an interrupt handler exit when nested
> over a page fault handler, exception handler, or system call.
>
> We may decide that we cannot care less about those scenarios, and just
> ignore the delay-resched flag, but it's relevant to take those into
> consideration and clearly document the rationale behind our decision.

I don't think delay-resched will address all scenarios where preemption can
occur while in a critical section. We could aim to address the frequent
paths where a task can get preempted. Initially the intent was to prevent
preemption mainly at the end of the time slice, if the thread is in a
critical section in user space and has requested delaying the reschedule.

Another path to consider is a wakeup occurring on a different cpu, which
could enqueue a thread and attempt to preempt this thread while it is
running in the critical section. Should resched_curr() check whether the
running thread has been granted extra time, i.e. 'taskshrd_sched_delay' is
set, and avoid setting TIF_NEED_RESCHED and sending the IPI, like lazy
preemption? If 'taskshrd_sched_delay' is set, we know the thread will get
preempted by the timer soon, or the task will call sched_yield() (if it is
a well behaved application).

>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
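To make the suggestion above concrete, a rough sketch of what such a check
in resched_curr() could look like. This is an illustration only, not code
from the series: the 'taskshrd_sched_delay' field name is taken from the
discussion above, and whether TIF_NEED_RESCHED_LAZY can be used depends on
the lazy-preemption work.

  /* kernel/sched/core.c -- sketch only */
  void resched_curr(struct rq *rq)
  {
          struct task_struct *curr = rq->curr;

          /*
           * If the running task was already granted a slice extension, do
           * not force an immediate preemption with TIF_NEED_RESCHED plus an
           * IPI; mark the resched lazily and let the 50us extension timer
           * (or the task's own sched_yield()) end the extension.
           */
          if (curr->taskshrd_sched_delay) {
                  set_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY);
                  return;
          }

          /* ... existing resched_curr() logic: set TIF_NEED_RESCHED and kick the cpu ... */
  }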
> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> On 2024-11-13 13:50, Peter Zijlstra wrote:
>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>> This patch set implements the above mentioned 50us extension time as posted
>>> by Peter. But instead of using restartable sequences as API to set the flag
>>> to request the extension, this patch proposes a new API with use of a per
>>> thread shared structure implementation described below. This shared structure
>>> is accessible in both users pace and kernel. The user thread will set the
>>> flag in this shared structure to request execution time extension.
>> But why -- we already have rseq, glibc uses it by default. Why add yet
>> another thing?
>
> Indeed, what I'm not seeing in this RFC patch series cover letter is an
> explanation that justifies adding yet another per-thread memory area
> shared between kernel and userspace when we have extensible rseq
> already.

It mainly provides pinned memory, which can be useful for future use cases
where updating user memory from kernel context needs to be fast or has to
avoid page faults.

>
> Peter, was there anything fundamentally wrong with your approach based
> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>
> The main thing I wonder is whether loading the rseq delay resched flag
> on return to userspace is too late in your patch. Also, I'm not sure it is
> realistic to require that no system calls should be done within time extension
> slice. If we have this scenario:

I am also not sure if we need to prevent system calls in this scenario.
Was that restriction mainly because the restartable sequence API implements it?

-Prakash

>
> A) userspace grabs lock
> - set rseq delay resched flag
> B) syscall
> - reschedule
> [...]
> - return to userspace, load rseq delay-resched flag from userspace (too late)
>
> I would have thought loading the delay resched flag should be attempted much
> earlier in the scheduler code. Perhaps we could do this from a page fault
> disable critical section, and accept that this hint may be a no-op if the
> rseq page happens to be swapped out (which is really unlikely). This is
> similar to the "on_cpu" sched state rseq extension RFC I posted a while back,
> which needed to be accessed from the scheduler:
>
> https://lore.kernel.org/lkml/20230517152654.7193-1-mathieu.desnoyers@efficios.com/
> https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/
>
> And we'd leave the delay-resched load in place on return to userspace, so
> in the unlikely scenario where it is swapped out, at least it gets paged
> back at that point.
>
> Feel free to let me know if I'm missing an important point and/or saying
> nonsense here.
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
On Wed, Nov 13, 2024 at 08:10:52PM +0000, Prakash Sangappa wrote:
>
>
> > On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> >
> > On 2024-11-13 13:50, Peter Zijlstra wrote:
> >> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
> >>> This patch set implements the above mentioned 50us extension time as posted
> >>> by Peter. But instead of using restartable sequences as API to set the flag
> >>> to request the extension, this patch proposes a new API with use of a per
> >>> thread shared structure implementation described below. This shared structure
> >>> is accessible in both users pace and kernel. The user thread will set the
> >>> flag in this shared structure to request execution time extension.
> >> But why -- we already have rseq, glibc uses it by default. Why add yet
> >> another thing?
> >
> > Indeed, what I'm not seeing in this RFC patch series cover letter is an
> > explanation that justifies adding yet another per-thread memory area
> > shared between kernel and userspace when we have extensible rseq
> > already.
>
> It mainly provides pinned memory, can be useful for future use cases
> where updating user memory in kernel context can be fast or needs to
> avoid pagefaults.

'might be useful' is not good enough a justification. Also, I don't
think you actually need this.

See:

https://lkml.kernel.org/r/20220113233940.3608440-4-posk@google.com

for a more elaborate scheme.

> > Peter, was there anything fundamentally wrong with your approach based
> > on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
> >
> > The main thing I wonder is whether loading the rseq delay resched flag
> > on return to userspace is too late in your patch. Also, I'm not sure it is
> > realistic to require that no system calls should be done within time extension
> > slice. If we have this scenario:
>
> I am also not sure if we need to prevent system calls in this scenario.
> Was that restriction mainly because of restartable sequence API implements it?

No, the whole premise of delaying resched was because people think that
syscalls are too slow. If you do not think this, then you shouldn't be
using this.
> On Nov 14, 2024, at 2:28 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Nov 13, 2024 at 08:10:52PM +0000, Prakash Sangappa wrote: >> >> >>> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: >>> >>> On 2024-11-13 13:50, Peter Zijlstra wrote: >>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote: >>>>> This patch set implements the above mentioned 50us extension time as posted >>>>> by Peter. But instead of using restartable sequences as API to set the flag >>>>> to request the extension, this patch proposes a new API with use of a per >>>>> thread shared structure implementation described below. This shared structure >>>>> is accessible in both users pace and kernel. The user thread will set the >>>>> flag in this shared structure to request execution time extension. >>>> But why -- we already have rseq, glibc uses it by default. Why add yet >>>> another thing? >>> >>> Indeed, what I'm not seeing in this RFC patch series cover letter is an >>> explanation that justifies adding yet another per-thread memory area >>> shared between kernel and userspace when we have extensible rseq >>> already. >> >> It mainly provides pinned memory, can be useful for future use cases >> where updating user memory in kernel context can be fast or needs to >> avoid pagefaults. > > 'might be useful' it not good enough a justification. Also, I don't > think you actually need this. Will get back with database benchmark results using rseq API for scheduler time extension. > > See: > > https://lkml.kernel.org/r/20220113233940.3608440-4-posk@google.com > > for a more elaborate scheme. > >>> Peter, was there anything fundamentally wrong with your approach based >>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net >>> >>> The main thing I wonder is whether loading the rseq delay resched flag >>> on return to userspace is too late in your patch. Also, I'm not sure it is >>> realistic to require that no system calls should be done within time extension >>> slice. If we have this scenario: >> >> I am also not sure if we need to prevent system calls in this scenario. >> Was that restriction mainly because of restartable sequence API implements it? > > No, the whole premise of delaying resched was because people think that > syscalls are too slow. If you do not think this, then you shouldn't be > using this. Agree. Thanks, -Prakash
On 2024-11-14 14:42, Prakash Sangappa wrote: > > >> On Nov 14, 2024, at 2:28 AM, Peter Zijlstra <peterz@infradead.org> wrote: [...] >> >> See: >> >> https://lkml.kernel.org/r/20220113233940.3608440-4-posk@google.com >> >> for a more elaborate scheme. >> >>>> Peter, was there anything fundamentally wrong with your approach based >>>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net >>>> >>>> The main thing I wonder is whether loading the rseq delay resched flag >>>> on return to userspace is too late in your patch. Also, I'm not sure it is >>>> realistic to require that no system calls should be done within time extension >>>> slice. If we have this scenario: >>> >>> I am also not sure if we need to prevent system calls in this scenario. >>> Was that restriction mainly because of restartable sequence API implements it? >> >> No, the whole premise of delaying resched was because people think that >> syscalls are too slow. If you do not think this, then you shouldn't be >> using this. > > Agree. I only partially agree with Peter here. I agree that we don't want to add system calls on the delay-resched critical section fast path, because this would have a significant performance hit. But there are scenarios where issuing system calls from within that critical section would be needed, even though those would not belong to the fast path: 1) If a signal handler nests over a delay-resched critical section. That signal handler is allowed to issue system calls. 2) If the critical section fast-path is calling GNU C library API and/or a vDSO, which is typically fast, but can end up calling a system call as fallback. e.g. clock_gettime, sched_getcpu. Preventing use of a system call by killing the application punches a hole in the abstractions meant to be provided by GNU libc and vDSO. I would recommend that we allow issuing system calls while the delay-resched bit is set. However, we may not strictly need to honor the delay-resched hint from a system call context, as those would be expected to be either infrequent or a portability fallback, which means the enhanced performance provided by delay-resched really won't matter. Another scenario to keep in mind are page faults happening within a delay-resched critical section. This is a scenario where page fault handling can explicitly reschedule. If this happens, I suspect we really don't care about the delay-resched hint, but we should consider whether this hint should be left as-is or cleared. Thoughts ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
> On Nov 15, 2024, at 6:13 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> On 2024-11-14 14:42, Prakash Sangappa wrote:
>>> On Nov 14, 2024, at 2:28 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> [...]
>
>>>
>>> See:
>>>
>>> https://lkml.kernel.org/r/20220113233940.3608440-4-posk@google.com
>>>
>>> for a more elaborate scheme.
>>>
>>>>> Peter, was there anything fundamentally wrong with your approach based
>>>>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>>>>>
>>>>> The main thing I wonder is whether loading the rseq delay resched flag
>>>>> on return to userspace is too late in your patch. Also, I'm not sure it is
>>>>> realistic to require that no system calls should be done within time extension
>>>>> slice. If we have this scenario:
>>>>
>>>> I am also not sure if we need to prevent system calls in this scenario.
>>>> Was that restriction mainly because of restartable sequence API implements it?
>>>
>>> No, the whole premise of delaying resched was because people think that
>>> syscalls are too slow. If you do not think this, then you shouldn't be
>>> using this.
>> Agree.
>
> I only partially agree with Peter here. I agree that we don't want to
> add system calls on the delay-resched critical section fast path,
> because this would have a significant performance hit.
>
> But there are scenarios where issuing system calls from within that
> critical section would be needed, even though those would not belong
> to the fast path:
>
> 1) If a signal handler nests over a delay-resched critical section.
>    That signal handler is allowed to issue system calls.
>
> 2) If the critical section fast-path is calling GNU C library API and/or
>    a vDSO, which is typically fast, but can end up calling a system call
>    as fallback. e.g. clock_gettime, sched_getcpu. Preventing use of a
>    system call by killing the application punches a hole in the
>    abstractions meant to be provided by GNU libc and vDSO.
>
> I would recommend that we allow issuing system calls while the
> delay-resched bit is set. However, we may not strictly need to honor
> the delay-resched hint from a system call context, as those would
> be expected to be either infrequent or a portability fallback,
> which means the enhanced performance provided by delay-resched
> really won't matter.
>
> Another scenario to keep in mind are page faults happening within a
> delay-resched critical section. This is a scenario where page fault
> handling can explicitly reschedule. If this happens, I suspect we
> really don't care about the delay-resched hint, but we should consider
> whether this hint should be left as-is or cleared.
>

There are no explicit checks right now that a system call occurred in the
critical section, except under DEBUG_RSEQ, which probably will not apply
to the delay-resched case. But the question is whether such checks should
be implemented; probably not, given the scenarios you mentioned above.

> Thoughts ?
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
On 2024-11-13 15:10, Prakash Sangappa wrote: > > >> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: >> >> On 2024-11-13 13:50, Peter Zijlstra wrote: >>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote: >>>> This patch set implements the above mentioned 50us extension time as posted >>>> by Peter. But instead of using restartable sequences as API to set the flag >>>> to request the extension, this patch proposes a new API with use of a per >>>> thread shared structure implementation described below. This shared structure >>>> is accessible in both users pace and kernel. The user thread will set the >>>> flag in this shared structure to request execution time extension. >>> But why -- we already have rseq, glibc uses it by default. Why add yet >>> another thing? >> >> Indeed, what I'm not seeing in this RFC patch series cover letter is an >> explanation that justifies adding yet another per-thread memory area >> shared between kernel and userspace when we have extensible rseq >> already. > > It mainly provides pinned memory, can be useful for future use cases where updating user memory in kernel context can be fast or needs to avoid pagefaults. Does the targeted use-case (scheduler time slice extension) require pinned memory, or just future use-cases ? Does having a missing time slice extension hint for a short while in case of high memory pressure (rseq page swapped out) have any measurable impact compared to the overhead of the page faults which will be happening in case of the high memory pressure required to trigger this scenario ? > >> >> Peter, was there anything fundamentally wrong with your approach based >> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net >> >> The main thing I wonder is whether loading the rseq delay resched flag >> on return to userspace is too late in your patch. Also, I'm not sure it is >> realistic to require that no system calls should be done within time extension >> slice. If we have this scenario: > > I am also not sure if we need to prevent system calls in this scenario. I suspect that if we prohibit system calls from being issued from a delay-resched userspace critical section, then loading the delay-resched rseq flag on return to userspace is always fine, because the kernel only reschedules on return from interrupt or trap. But I see this no-syscall restriction as being very cumbersome for userspace. > Was that restriction mainly because of restartable sequence API implements it? I suspect that this restriction is just to avoid loading the delay-resched flag from the scheduler when reschedule is called from an interrupt handler nested over a system call for preemptible kernels, but Peter could tell us more. One open question here is whether we want to pin memory for each thread in the system to hold this shared data between userspace and the scheduler. AFAIU, this is a trade-off between slice extension accuracy in high memory usage scenarios and pinned memory footprint impact. If the tradeoff goes towards making this memory pinned, then we may want to consider pinning the per-thread rseq area on rseq registration. Another option to consider is to use rseq to index a userspace per-cpu data structure, which will be used as shared memory between kernel and userspace. Userspace could store this delay-resched flag into the current CPU's shared data, and the scheduler could load it from there. 
If pinning per-cpu data is more acceptable than pinning per-thread data, then it could be an improvement. This could be a new ABI between kernel and userspace, e.g.: struct rseq_percpu_area { __u32 sched_state; /* includes time slice extension flag. */ char end[]; }; Registered to the kernel with the following parameters: - Address of rseq_percpu_area for CPU 0, - The stride of the per-cpu indexing (see librseq mempool per-cpu allocator [1]), - offsetof(struct rseq_percpu_area, end) to have the feature size for extensibility. Thanks, Mathieu [1] https://lpc.events/event/18/contributions/1720/attachments/1572/3268/presentation-lpc2024-rseq-mempool.pdf > > -Prakash > >> >> A) userspace grabs lock >> - set rseq delay resched flag >> B) syscall >> - reschedule >> [...] >> - return to userspace, load rseq delay-resched flag from userspace (too late) >> >> I would have thought loading the delay resched flag should be attempted much >> earlier in the scheduler code. Perhaps we could do this from a page fault >> disable critical section, and accept that this hint may be a no-op if the >> rseq page happens to be swapped out (which is really unlikely). This is >> similar to the "on_cpu" sched state rseq extension RFC I posted a while back, >> which needed to be accessed from the scheduler: >> >> https://lore.kernel.org/lkml/20230517152654.7193-1-mathieu.desnoyers@efficios.com/ >> https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/ >> >> And we'd leave the delay-resched load in place on return to userspace, so >> in the unlikely scenario where it is swapped out, at least it gets paged >> back at that point. >> >> Feel free to let me know if I'm missing an important point and/or saying >> nonsense here. >> >> Thanks, >> >> Mathieu >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> https://www.efficios.com >> > -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
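A tiny illustration of how userspace could index such a per-cpu area, under
the registration parameters sketched above. This is hypothetical, not an
existing ABI: the base and stride are whatever would have been registered,
and the cpu number would normally come from the rseq cpu_id field
(sched_getcpu() is used here only for brevity).

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stddef.h>
  #include <linux/types.h>

  struct rseq_percpu_area {
          __u32 sched_state;      /* includes time slice extension flag. */
          char end[];
  };

  static struct rseq_percpu_area *percpu_base;    /* address registered for CPU 0 */
  static size_t percpu_stride;                    /* stride of the per-cpu indexing */

  static struct rseq_percpu_area *this_cpu_area(void)
  {
          int cpu = sched_getcpu();

          return (struct rseq_percpu_area *)
                  ((char *)percpu_base + cpu * percpu_stride);
  }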
> On Nov 13, 2024, at 12:57 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> On 2024-11-13 15:10, Prakash Sangappa wrote:
>>> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>>
>>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>>> This patch set implements the above mentioned 50us extension time as posted
>>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>>> to request the extension, this patch proposes a new API with use of a per
>>>>> thread shared structure implementation described below. This shared structure
>>>>> is accessible in both users pace and kernel. The user thread will set the
>>>>> flag in this shared structure to request execution time extension.
>>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>>> another thing?
>>>
>>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>>> explanation that justifies adding yet another per-thread memory area
>>> shared between kernel and userspace when we have extensible rseq
>>> already.
>> It mainly provides pinned memory, can be useful for future use cases where updating user memory in kernel context can be fast or needs to avoid pagefaults.
>
> Does the targeted use-case (scheduler time slice extension) require
> pinned memory, or just future use-cases ?

Probably not for the sched time slice extension, although we have not run
the database workload using restartable sequences. We can try that and get
back. What about publishing the thread state in the shared structure? It
would require updating user memory in the context switch code path when the
thread is going off the cpu. Here having pinned memory would help. Thread
state is the other requirement we are interested in.

>
> Does having a missing time slice extension hint for a short while in
> case of high memory pressure (rseq page swapped out) have any measurable

Yes, but the restartable sequence based implementation does a
copy_from_user(), which would go through the page fault. It may be ok here.
Perhaps it would require disabling page faults in this case? If the page is
not present, then the hint would most likely not matter and the kernel can
skip it.

> impact compared to the overhead of the page faults which will be
> happening in case of the high memory pressure required to trigger this
> scenario ?

We would have to test that.

>
>>>
>>> Peter, was there anything fundamentally wrong with your approach based
>>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>>>
>>> The main thing I wonder is whether loading the rseq delay resched flag
>>> on return to userspace is too late in your patch. Also, I'm not sure it is
>>> realistic to require that no system calls should be done within time extension
>>> slice. If we have this scenario:
>> I am also not sure if we need to prevent system calls in this scenario.
>
> I suspect that if we prohibit system calls from being issued from a
> delay-resched userspace critical section, then loading the delay-resched
> rseq flag on return to userspace is always fine, because the kernel only
> reschedules on return from interrupt or trap.
>
> But I see this no-syscall restriction as being very cumbersome for
> userspace.
>
>> Was that restriction mainly because of restartable sequence API implements it?
>
> I suspect that this restriction is just to avoid loading the
> delay-resched flag from the scheduler when reschedule is called
> from an interrupt handler nested over a system call for preemptible
> kernels, but Peter could tell us more.

Hoping Peter can comment on this.

> One open question here is whether we want to pin memory for
> each thread in the system to hold this shared data between
> userspace and the scheduler. AFAIU, this is a trade-off between
> slice extension accuracy in high memory usage scenarios and
> pinned memory footprint impact. If the tradeoff goes towards
> making this memory pinned, then we may want to consider pinning
> the per-thread rseq area on rseq registration.
>
> Another option to consider is to use rseq to index a userspace
> per-cpu data structure, which will be used as shared memory
> between kernel and userspace. Userspace could store this
> delay-resched flag into the current CPU's shared data, and
> the scheduler could load it from there. If pinning per-cpu
> data is more acceptable than pinning per-thread data, then
> it could be an improvement.

Interesting. Having per cpu shared memory may not work for all use cases.

Thanks,
-Prakash

>
> This could be a new ABI between kernel and userspace, e.g.:
>
> struct rseq_percpu_area {
>         __u32 sched_state; /* includes time slice extension flag. */
>         char end[];
> };
>
> Registered to the kernel with the following parameters:
>
> - Address of rseq_percpu_area for CPU 0,
> - The stride of the per-cpu indexing (see librseq mempool per-cpu
>   allocator [1]),
> - offsetof(struct rseq_percpu_area, end) to have the feature size
>   for extensibility.
>
> Thanks,
>
> Mathieu
>
> [1] https://lpc.events/event/18/contributions/1720/attachments/1572/3268/presentation-lpc2024-rseq-mempool.pdf
>
>> -Prakash
>>>
>>> A) userspace grabs lock
>>> - set rseq delay resched flag
>>> B) syscall
>>> - reschedule
>>> [...]
>>> - return to userspace, load rseq delay-resched flag from userspace (too late)
>>>
>>> I would have thought loading the delay resched flag should be attempted much
>>> earlier in the scheduler code. Perhaps we could do this from a page fault
>>> disable critical section, and accept that this hint may be a no-op if the
>>> rseq page happens to be swapped out (which is really unlikely). This is
>>> similar to the "on_cpu" sched state rseq extension RFC I posted a while back,
>>> which needed to be accessed from the scheduler:
>>>
>>> https://lore.kernel.org/lkml/20230517152654.7193-1-mathieu.desnoyers@efficios.com/
>>> https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/
>>>
>>> And we'd leave the delay-resched load in place on return to userspace, so
>>> in the unlikely scenario where it is swapped out, at least it gets paged
>>> back at that point.
>>>
>>> Feel free to let me know if I'm missing an important point and/or saying
>>> nonsense here.
>>>
>>> Thanks,
>>>
>>> Mathieu
>>>
>>> --
>>> Mathieu Desnoyers
>>> EfficiOS Inc.
>>> https://www.efficios.com
>>>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
Hello Prakash, Few questions around the benchmarks. On 11/13/2024 5:31 AM, Prakash Sangappa wrote: > [..snip..] > > Test results: > ============= > Test system 2 socket AMD Genoa > > Lock table test:- a simple database test to grab table lock(spin lock). > Simulates sql query executions. > 300 clients + 400 cpu hog tasks to generate load. Have you tried running the 300 clients with a nice value of -20 and 400 CPU hogs with the default nice value / nice 19? Does that help this particular case? > > Without extension : 182K SQL exec/sec > With extension : 262K SQL exec/sec > 44% improvement. > > Swingbench - standard database benchmark > Cached(database files on tmpfs) run, with 1000 clients. In this case, how does the performance fare when running the clients under SCHED_BATCH? What does the "TASK_PREEMPT_DELAY_REQ" count vs "TASK_PREEMPT_DELAY_GRANTED" count look like for the benchmark run? I'm trying to understand what the performance looks like when using existing features that inhibit preemption vs putting forward the preemption when the userspace is holding a lock. Feel free to quote the latency comparisons too if using the existing features lead to unacceptable avg/tail latencies. > > Without extension : 99K SQL exec/sec > with extension : 153K SQL exec/sec > 55% improvement in throughput. > > [..snip..] > -- Thanks and Regards, Prateek
> On Nov 12, 2024, at 9:43 PM, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Prakash,
>
> Few questions around the benchmarks.
>
> On 11/13/2024 5:31 AM, Prakash Sangappa wrote:
>> [..snip..] Test results:
>> =============
>> Test system 2 socket AMD Genoa
>> Lock table test:- a simple database test to grab table lock(spin lock).
>> Simulates sql query executions.
>> 300 clients + 400 cpu hog tasks to generate load.
>
> Have you tried running the 300 clients with a nice value of -20 and 400
> CPU hogs with the default nice value / nice 19? Does that help this
> particular case?

We have not tried this with the database. Will have to try it.

>
>> Without extension : 182K SQL exec/sec
>> With extension : 262K SQL exec/sec
>> 44% improvement.
>> Swingbench - standard database benchmark
>> Cached(database files on tmpfs) run, with 1000 clients.
>
> In this case, how does the performance fare when running the clients
> under SCHED_BATCH? What does the "TASK_PREEMPT_DELAY_REQ" count vs
> "TASK_PREEMPT_DELAY_GRANTED" count look like for the benchmark run?

We have not tried SCHED_BATCH. With this run, there were on average about
166 TASK_PREEMPT_DELAY_GRANTED grants per task, collected from the
scheduler stats captured at the end of the run. The test runs for about
5 min. We don't have the count of how many times the preempt delay was
requested: if the task completes the critical section, it clears the
TASK_PREEMPT_DELAY_REQ flag, so the kernel would not see it in many cases,
as this may not be near the end of the time slice. We would have to
capture the count in the application.

>
> I'm trying to understand what the performance looks like when using
> existing features that inhibit preemption vs putting forward the
> preemption when the userspace is holding a lock. Feel free to quote
> the latency comparisons too if using the existing features lead to
> unacceptable avg/tail latencies.
>
>> Without extension : 99K SQL exec/sec
>> with extension : 153K SQL exec/sec
>> 55% improvement in throughput.
>> [..snip..]
>
> --
> Thanks and Regards,
> Prateek